Contents
List of Figures
List of Tables
Introduction

Chapter 1: Lexical innovations: neologisms and nonce words
1. Introduction
2. Nonce formation
  2.1. Formal structure: complex and simplex words
  2.2. Productivity vs. creativity
    2.2.1. Introduction
    2.2.2. Exemplification
    2.2.3. Summary and conclusion
  2.3. Intentionality vs. unintentionality
  2.4. Other characteristics proposed in the literature
    2.4.1. Hohenhaus’s (1998) scalar definition
    2.4.2. Context-dependence and non-lexicalizability
    2.4.3. Evaluation
  2.5. Definition of nonce-formation revisited
  2.6. Conclusion
3. Neologism
  3.1. Disambiguating the term neologism
  3.2. Institutionalization
    3.2.1. Neologisms as part of the community-dependent norm
    3.2.2. Factors conditioning the chances of institutionalization
    3.2.3. Indicators of advanced/complete institutionalization
  3.3. Degrees of lexical currency
  3.4. The ‘nonce-word – neologism – institutionalized word’ cline
  3.5. The lexicographical approach: criteria for entry
4. Conclusion: nonce-words and neologisms

Chapter 2: Language corpora and corpus linguistics
1. What is a corpus?
  1.1. Corpus issues: authenticity
  1.2. Corpus issues: representativeness, sampling, corpus and sample size
2. Historical perspective
3. The corpus approach – characteristics, advantages and applications

Chapter 3: Linguistic variability and register variation
1. Introduction
2. Linguistic variation
3. Sociolinguistics
4. Variationist sociolinguistics
  4.1. Orderly heterogeneity
  4.2. Variable rules
5. Register and register variation
6. Multi-dimensional analysis of register variation: Biber (1988)
7. English nominalizations in Biber (1988)

Chapter 4: A register-sensitive study of English nominalizations
1. Introduction
2. Aims and research questions
3. Methodology
  3.1. The BNC genres and super-genres
  3.2. Mark Davies’s online BNC interface
  3.3. The data
  3.4. Procedure
4. Results and discussion
  4.1. Register variation among -ness, -ity, -ion and -ment nominalizations
  4.2. Register variation among -ance/-ence, -ship, -(c)y, -hood, -age, -dom, -ery and -al nominalizations
  4.3. Considerations of morphological structure
    4.3.1. Affix ordering
    4.3.2. Register variation among -ness, -ity, -ion, -ment and -(c)y nominalizations: structural effects
  4.4. Morphological productivity and lexical innovations

Conclusions
Appendix
References
List of Figures
Chapter 3
1 One-dimensional plot of four genres: nominalizations and passives
2 One-dimensional plot of four genres: 1st and 2nd person pronouns and contractions
3 One-dimensional plot of four genres: 3rd person pronouns and past tense verbs
4 Mean scores for Dimension 1 for each of the genres
5 Mean scores for Dimension 3 for each of the genres

Chapter 4
1 Normalized joint token frequencies of -ness, -ity, -ion and -ment across registers (per 1 million words)
2 Normalized joint token frequencies of -ness, -ity, -ion and -ment across registers
3 Normalized token frequencies of -ness, -ity, -ion and -ment across registers (organized by registers)
4 Normalized token frequencies of -ness, -ity, -ion and -ment in Spoken
5 Normalized token frequencies of -ness, -ity, -ion and -ment in Fiction
6 Normalized token frequencies of -ness, -ity, -ion and -ment in News
7 Normalized token frequencies of -ness, -ity, -ion and -ment in Academic
8 Normalized token frequencies of -ness, -ity, -ion and -ment in Non-academic
9 Normalized token frequencies of -ness, -ity, -ion and -ment in Pop
10 Normalized token frequencies of the suffixes across registers (organized by suffixes)
11 Normalized token frequencies of -ness across registers
12 Normalized token frequencies of -ity across registers
13 Normalized token frequencies of -ion across registers
14 Normalized token frequencies of -ment across registers
15 Normalized joint token frequencies of the eight suffixes across registers
16 Normalized joint token frequencies of the twelve suffixes across registers
17 Normalized token frequencies of suffixes across registers
18 Normalized token frequencies of suffixes across registers
19 Normalized token frequencies in Spoken and Fiction, Group 2
20 Normalized token frequencies in News and Pop
21 Normalized token frequencies in Academic and Non-academic
22 Normalized token frequencies of denominal/de-adjectival -ery across registers
23 Normalized token frequencies of deverbal -ery across registers
24 Normalized token frequencies of -ship across registers
25 Normalized token frequencies of -(c)y across registers
26 Raw word type counts across registers
27 Raw joint types of the twelve suffixes across registers
28 Raw type count of -ness across registers
29 Raw type count of -ity across registers
30 Raw type count of -ion across registers
31 Raw type count of -ment across registers
32 Raw types and normalized tokens in -ment
33 Raw type count of -(c)y across registers
34 Raw type count of de-adjectival -ance/-ence across registers
35 Normalized token frequencies of de-adjectival -ance/-ence across registers
36 Raw type count of deverbal -ance/-ence across registers
37 Normalized token frequencies of deverbal -ance/-ence across registers
38 Raw type count of -dom across registers
39 Normalized token frequencies of -dom across registers
40 Raw type count of denominal/de-adjectival -ery across registers
41 Raw type count of deverbal -ery across registers
42 Raw type count of -hood across registers
43 Normalized token frequencies of -hood across registers
44 Raw type count of -ship across registers
List of Tables
Chapter 3
1 /t/, /d/-deletion rate: the following segment effect
2 /t,d/-deletion rate: the morphological effect
3 Frequency counts for texts (6)–(8)
4 The co-occurrence patterns underlying the five major dimensions of English

Chapter 4
1 Type count cut-off points at X million word tokens
2 Rate of word type increase for newspapers
3 Rate of word type increase for non-academic prose
4 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment across registers
5 Joint token frequencies of the suffixes -ness, -ity and -ion across registers
6 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment across registers
7 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment in News
8 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment in Non-Acad and Pop
9 Frequency ratios of suffixes per register
10 Normalized token frequencies of -ness, -ity, -ion and -ment in Fiction, Pop and News
11 The frequency ratios of registers
12 Frequencies of -ness word tokens per 1 million tokens of text
13 Register-to-register ratios for -ness overall
14 Register-to-register ratios per suffix combination of -ness
15 Frequencies of -ity word tokens per 1 million tokens of text
16 Register-to-register ratios for -ity overall
17 Register-to-register ratios per suffix combination of -ity
18 Frequencies of -ion word tokens per 1 million tokens of text
19 Register-to-register ratios for -ion overall
20 Register-to-register ratios per suffix combination of -ion
21 Frequencies of -ment word tokens per 1 million tokens of text
22 Frequencies of -(c)y word tokens per 1 million tokens of text
23 [simplex root+ness] tokens and types across registers
24 [simplex root+y+ness] tokens and types across registers
25 [-ful+ness] tokens and types across registers
26 [-ish+ness] tokens and types across registers
27 [-ous+ness] tokens and types across registers
28 [-ed+ness] tokens and types across registers
29 [-ive+ness] tokens and types across registers
30 [-less+ness] tokens and types across registers
31 [-ing+ness] tokens and types across registers
32 Totals of new word types in -ness across registers
33 Totals of new word types in -ness across types of base form
34 New word types in -ness across types of base form and across registers
35 [simplex root+ity] tokens and types across registers
36 [-able+ity] tokens and types across registers
37 [-al+ity] tokens and types across registers
38 [-ous+ity] tokens and types across registers
39 [-ile+ity] tokens and types across registers
40 [-ic+ity] tokens and types across registers
41 [-ive+ity] tokens and types across registers
42 Totals of new word types in -ity across registers
43 Totals of new word types in -ity across types of base form
44 New word types in -ity across types of base form and across registers
45 [unsuffixed root+ation] tokens and types across registers
46 [-ate+ion] tokens and types across registers
47 [-ize+ation] tokens and types across registers
48 [-ify+cation] tokens and types across registers
49 [unsuffixed root+(it)ion] tokens and types across registers
50 Totals of new word types in -ion across registers
51 Totals of new word types in -ion across types of base form
52 New word types in -ion across types of base form and across registers
53 [root+ment] tokens and types across registers
54 [en-root+ment] tokens and types across registers
55 Totals of new word types in -ment across types of base form
56 Totals of new word types in -ment across registers
57 [-ant+(c)y] tokens and types across registers
58 [-ate+(c)y] tokens and types across registers
59 [noun+(c)y] tokens and types across registers
60 Totals of new word types in -(c)y across types of base form
61 Totals of new word types in -(c)y across registers
62 Totals of new word types in de-adjectival -ance/-ence across registers
63 Totals of new word types in deverbal -ance/-ence across registers
64 Totals of new word types in -dom across registers
65 Totals of new word types in denominal/de-adjectival -ery across registers
66 Totals of new word types in deverbal -ery across registers
67 Totals of new word types in -hood across registers
68 Totals of new word types in -ship across registers
Introduction
Nominalizations are a well-researched area of English word formation. In fact, the initial impetus for the advent of generative morphology in the 1970s came from Chomsky’s (1970) criticism of the transformational account of derived nominals like destruction, transmission and refusal. Chomsky noted that such nominalizations are too idiosyncratic to be generated via syntactic rules from underlying sentential structure (as in Lees 1960) and instead require lexicalist treatment. The ensuing theoretical debate inevitably went beyond the domain of nominalizations, but they have remained of interest to linguists and have been studied from many different perspectives. Notably, much of the discussion concerning deverbal nominalizations (ending in -ion, -ment, -al, etc.) has concentrated on semantic non-compositionality (e.g. Chomsky 1970), argument structure (e.g. Anderson 1979) and morphological productivity (e.g. Plag 1999). De-adjectival nominals, on the other hand, especially those in -ness and -ity, have often been the object of investigations concerned with affix ordering and selectional restrictions (e.g. Selkirk 1982) and with the significance of the Latinate vs. native distinction in English morphology (e.g. Aronoff 1976).
With the more recent growth of interest in the study of language use, as opposed to linguistic structure, nominalizations have also found their way into explorations of register variation. In a seminal analysis of systematic differences between language varieties (i.e. the multi-dimensional analysis developed by Biber 1988), nominalizations are one of several dozen linguistic features that define so-called dimensions of variation along which registers can be contrasted (see Chapter 3). In this way, linguistically defined features pertaining to formal structure, such as derived nominals, inform the study of language use, which is conditioned by contextual and situational factors.
Biber’s analysis recognizes the role of nominalizations, albeit somewhat indiscriminately: they are considered as a unified category without distinguishing between distinct types of the rightmost suffixes, let alone the varied structure of the base form. 1 Consequently, any potential significance of morphological make-up goes unnoticed.
1 Biber (1988) considers nominalizations as a whole. Biber (1998) and Biber et al. (1999) indicate in very general terms the varied distribution of some suffixes. No mention is made of base-internal complexity. See Chapter 4.
The present study sets out to fill this research gap by looking deeper into the morphological complexity of English abstract nominalizations 2 and considering its relevance for the distribution of nominalizations across registers. With this aim in mind, both quantitative and qualitative analyses of corpus data are carried out: the former is based on frequency of occurrence and the latter draws on information pertaining to morphological status and identity. Specifically, root–suffix and suffix–suffix combinations are distinguished and shown to have different effects on the productivity and distribution of the rightmost suffix. Similarly, in suffix–suffix combinations, the identity of the penultimate affix may be a significant factor.
Admittedly, this work does not aspire to present a comprehensive multi-dimensional analysis of linguistic variability encompassing a variety of linguistic features which define several dimensions of variation, as envisaged by Biber and his followers. Instead, we narrow the scope of analysis to suffixal nominalizations so as to explore this single linguistic feature across registers in more detail. Additionally, while looking at these formations in the British National Corpus, we will further refine our focus in order to retrieve and examine innovative coinages derived by means of the same nominalizing suffixes (see Appendix for a complete list of these words). These innovations, again, will be given a register-sensitive and structure-oriented account. Overall, we look at abstract nominalizations from three different perspectives: register variation, lexical innovation (productivity) and structural complexity of the base form.
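To make the two kinds of evidence concrete, the following Python sketch shows one way token counts, type counts and base types could be tallied for a single suffix. It is only an illustration under stated assumptions, not the procedure used in the study: the function names, the toy word list and the purely orthographic matching are all hypothetical, and the list of penultimate suffixes is a subset of those named in the table titles for -ness.

```python
from collections import Counter

# Illustrative subset of base-final suffixes listed for -ness in the table
# titles; an assumption for this sketch, not the study's full inventory.
PENULTIMATE = ("ful", "ish", "ous", "ive", "less")


def base_type(word):
    """Label a -ness word by its base: a penultimate suffix or a simplex root.

    The match is purely orthographic, so it would misfile words such as
    'witness'; real corpus data would need manual checking.
    """
    if not word.endswith("ness"):
        return None
    base = word[: -len("ness")]
    for suffix in PENULTIMATE:
        if base.endswith(suffix):
            return f"-{suffix}+ness"
    return "simplex root+ness"


def per_million(count, corpus_size):
    """Normalize a raw token count to occurrences per 1 million words."""
    return count * 1_000_000 / corpus_size


# Invented toy sample standing in for the tokens of one register.
tokens = ["darkness", "usefulness", "carelessness", "usefulness", "table"]

# Token frequency per base type (every occurrence counts).
counts = Counter(t for t in map(base_type, tokens) if t is not None)

# Type count (each distinct word counts once).
types = {w for w in tokens if base_type(w) is not None}
```

Keeping token counts and type counts separate matters here because the former reflect how often a pattern is used, whereas the latter are closer to how many different words a suffix has formed.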
In simple terms, we will investigate a number of nominalizing suffixes as regards:
• their distribution across language varieties known as registers (both established and innovative forms)
• their productivity in the formation of new words
• structural considerations: whether the suffixes show any preferences for different types of base forms
Needless to say, the three planes of analysis will overlap naturally and our findings will ultimately make reference to each plane as we proceed. For example, the distribution of nominalizations across registers depends on the identity of the rightmost suffix, but we also establish that it depends on the internal complexity of the base form (whether morphologically simplex or complex) and on the type of base-final suffix. Further, the extent to which a suffix gives rise to new words varies across registers. As we find out, it varies still more across types of base form within a given register.
2 The nominalizations to be investigated are action-denoting Nomina Actionis and Nomina Qualitatis, which denote properties. Overall, twelve derivational suffixes are considered. See Chapter 4 for further details.
Below is an overview of the four chapters included in this dissertation.
Chapter 1 prepares the ground for a discussion of lexical innovation: the terms nonce formation and neologism are two obvious key words associated with new vocabulary. Although they both have a secure place in language studies, they are nonetheless notoriously ambiguous in the literature. For example, nonce formations have traditionally been considered to be playful, ephemeral new coinages (used ‘for the nonce’) or, instead, they have been associated with structural oddness. Alternatively, nonce formations have been viewed as momentary lexical creations dying an instant death which are otherwise perfectly regular outcomes of legitimate word-formational processes. Likewise, the word neologism has received multiple interpretations. The aim of this chapter is to establish a coherent picture of what nonce formations and neologisms are and to indicate a connection between them. It is argued that the two concepts may be systematically related through the process of institutionalization and that, as such, they are integral components of the phenomenon of lexical innovation.
In preparation for our analysis of corpus data in the final chapter, Chapter 2 is a review of the corpus-based approach in linguistics. It starts off with a general discussion of several corpus issues; then, there is a brief historical account of the emergence of modern corpus linguistics and the tensions between theoretical linguistics and empirically-oriented linguistic enquiry. Finally, the advantages and possible applications of corpora are presented.
Chapter 3 investigates recent advances in the study of register variation as well as linguistic variability in general. The discussion will begin with an overview of the treatment of linguistic variation in several theoretical paradigms.
We will focus on those models that have contributed the most to our understanding of variability and offered a maximally accurate description of its workings. Finally, the discussion will narrow down to consider systematic variation across language varieties. We will review state-of-the-art analytic descriptions of registers and comparisons between registers.
Central as it is to this study, Chapter 4 is the most extensive. English nominalizations of twelve pre-set structural types are investigated on the basis of the British National Corpus in order to address a series of research questions and aims hitherto overlooked in the literature. These questions and aims are put forward at the outset of the discussion. The methodology adopted is discussed afterwards, and the results obtained are presented and discussed in the remainder of the chapter. Finally, the research questions posed at the beginning of the chapter are addressed again in Conclusions.
This book is a revised version of my doctoral dissertation. I would like to take this opportunity to thank all the people who have been involved in this project in various ways. In particular, my thanks are due first and foremost to Professor Bogdan Szymanek for his invaluable support and guidance. I am also very grateful to the reviewers of the original manuscript, Professor Anna Malicka-Kleparska and Professor Edmund Gussmann, who have kindly suggested a number of improvements.
Chapter 1 Lexical innovations: neologisms and nonce formations
1. Introduction
Investigations of new words unavoidably need to start by asking some fundamental questions about what exactly we mean by ‘new words’. One concept that is immediately associated with lexical innovations is that of neologism, which in turn needs clarifying too. In order to define the concept of neologism with maximum accuracy and detail, it is necessary to contrast it with another related term that is highly prominent in the literature, namely that of nonce formation (nonce word). The juxtaposition of the two offers a clearer view of how words of relatively recent origin, as a result of a diachronic transition, come to be recognized as neologisms, genuine and legitimate lexical items.
A standard view on the terminological distinction between neologisms and nonce formations is that the two notions are to be distinguished on the basis of the degree to which a new word has joined the common vocabulary of a speech community. Whereas nonce formations, on their emergence, are said merely to have the potential to be spread, accepted and used by the speakers 1, neologisms are considered to have already gained enough of a foothold in usage to be part of the working vocabulary of a substantial number of language users. The two concepts are closely connected with respect to diachrony. It is another commonly agreed-upon fact that neologisms are ex-nonce formations in that the latter, if they survive by being disseminated and thus gain a degree of currency, go on to become neologisms. The exact mechanism of this transformation goes beyond the scope of this section and will be dealt with in section 3.2.
1 Nonce words have this potential to a lesser or greater extent, depending on an array of factors to be discussed in section 3.2.2.
2. Nonce formation
Having established, in the most provisional terms for the time being, the close diachronic relation of the two concepts, let us now consider some of the most widely cited criteria of ‘nonceness’ as found in Hohenhaus (2005). Nonce formation 2 is the first stage in the life of a new word just upon its production by the language user. The most fundamental condition is that “the formation is ‘new’ – more precisely: ‘new’ in a psycholinguistic sense, i.e. formed actively (by whatever means) by a speaker – as opposed to retrieved ready-made from his/her storage of already existing listemes in the lexicon” (Hohenhaus 2005: 364). The existence of nonce formations is “typically maximally short-lived: limited to a single occurrence only” (Hohenhaus 2005: 364-365), the purpose of which is to fill a lexical gap. The typical once-only usage of nonce words in effect means that the majority of these creations die out as a result of failure to spread among the speakers of a language community (for reasons to be discussed in section 3.2.2; cf. also Bauer 1983). For this reason, the word ephemeral is another label widely attached to nonce formations (e.g. Bauer 2004: 78; Crystal 2000).
So far we have noted several representative properties of nonce formations, listed by Hohenhaus (2005) and commonly put forward by virtually all authorities in the field. Taken collectively, the following characteristics constitute our working definition of nonce formations, on which to build in subsequent sections:
• they are novel lexical creations
• they are not (yet) part of the lexicon
• they are coined for a particular purpose in order to meet a lexical need
• they may (sometimes) be used only once and never catch on with other speakers
Uncontroversial as they are, these criteria need refinement and elaboration. Moreover, none of them is in itself sufficiently defining, nor are they quite exhaustive when applied in combination; they only outline an incomplete and superficial definition of nonce formations. Although it is possible to identify other categorial characteristics, any further descriptions vary from author to author and, indeed, occasionally from study to study by one and the same scholar.
We will now discuss some of these more debatable properties of nonce words. With a view to revising the working definition outlined above, the three questions below will now be investigated with respect to the coverage they have received in the literature: 1) Is there any differentiation in the formal structure of nonce formations? If so, is this fact reflected explicitly in any further subclassification? 2) Is there any distinction to be drawn between the outputs of well-established, productive rule-governed processes on the one hand and creative, purposefully
2 Referred to as Okkasionalismus (Hohenhaus 1996) and okazjonalizm (Smółkowa 2001) in German and Polish linguistics respectively.
Neologisms and nonce words
deviant, playful formations on the other? Can the former type be a source of nonce words at all? 3) Is it relevant whether the coining of a nonce formation is intentional/conscious 3 or unintentional/unconscious? Again, is this fact reflected explicitly in terms of taxonomy?

2.1. Formal structure: complex and simplex words

Bauer (1983: 45) defines a nonce word as “a new complex word coined by a speaker/writer on the spur of the moment to cover some immediate need”. It is unfortunate that he excludes all new simplex words from ever appearing as nonce words, as it is not difficult to conceive of a potential word created from scratch or ex nihilo (i.e. by the use of no existing word-formational formatives, lexical or derivational), precisely to meet the need for a new/better word. Then the only restrictions in such word-manufacture are the language-specific phonotactic limitations. Such coinages, known as root-creations, have long been recognized by morphologists 4, although, admittedly, they are said to be the least productive way of contributing new additions to the lexical stock (cf. Algeo 1991: 4). Some examples cited in Algeo (1980) after the Barnhart Dictionary of New English since 1963 (1973) include: haroosh, cowabunga, hincty and pizzazz; other more familiar root creations are xerox, Teflon, Kleenex, aspirin, nylon (Yule 1996), gunge, goo, sham, blob and blurb (Baldi and Dawar 2000). As all these and other root creations must have once been coined for the first time and taken some time before they established themselves to a lesser or greater extent, it seems sensible to conclude that there is not much ground for excluding simplex words from the status of nonce formations. This position is also supported by Fischer’s (1998: 5) explicit claim that “words with a simple morphological structure are also nonce formations”.
It might be argued that root-creations, by virtue of their mere structural oddness 5, are to be distinguished from ‘ordinary’ simplex words, but even then the nonce status of both groups remains unaffected. Although nonce formations enjoy a relative currency in morphological investigations, this particular aspect of their simplex/complex word composition is mostly left out. Although, to the best of our knowledge, no similar overt omission of simplex words is made elsewhere in the literature, the manner in which various definitions of nonce words are phrased may exhibit a certain bias towards the exclusion of simplex words, or at least the author’s unintentional marginalisation of the matter. For example, Hohenhaus (2005: 363) assumes the meaning of nonce formations to cover “both perfectly regular outputs of productive rules as well as stylistically (or otherwise) more marked, creative, even deviant ‘playful’ formations”. The implication, it seems, is that possibly these are the only two groups of eligible candidates for nonceness. In view of the above discussion, this study will assume structural make-up to be irrelevant in deciding on the nonceness of a coinage. Considerations of morphological structure, although useful for descriptive and typological purposes, will play no part in discriminating between nonce words and neologisms.

3 Both designations are found in the literature.
4 For example, see the discussion of word-manufacturing in Marchand (1969: 451-454), Bauer (1983: 239), McArthur (1992).
5 Root creations are the only lexical items that “lack an etymology in the traditional sense” (Baldi et al. 2000: 966).

2.2. Productivity vs. creativity

2.2.1. Introduction

This aspect of lexical innovation has seen some more coverage in morphological debate. Most authors (e.g. Schultink 1961) recognize the need to keep apart the outcomes of those regular, rule-governed processes that are conventionally applied in word-formation on the one hand, and those marked, irregular creations 6 that, in one way or another, stretch or exceed the capacity of the language system on the other. This distinction traditionally corresponds to the dichotomy of productive and creative morphology, respectively. 7 The latter type is also referred to as rule-changing creativity (e.g. Breckle 1978: 75, Hohenhaus 1998: 258), rule-creating creativity (van Marle 1990), extragrammatical morphology (Dressler and Barbaresi 1994) or expressive morphology (Zwicky and Pullum 1987). Traditionally, it is accepted that the markedness of the creative coinages consists in their being structurally atypical and/or their conveying a special pragmatic or stylistic effect. As nonce words are often associated 8 precisely with that kind of “quirky stylistic ‘novelties’” (Hohenhaus 2005: 363), this preconception is tested in the next section.

2.2.2. Exemplification

Let us illustrate the matter in hand with some relevant data.
The following are excerpts containing authentic nonce formations as found on the Internet by means of a simple Google search. Phrases such as “if that is a word”, “if you’ll excuse the term” and “if I may coin a word” were typed in the search box on the assumption that the search engine would retrieve newly-coined, transient lexical creations immediately preceding these phrases. 9 The original spelling and punctuation is preserved throughout.

6 Also referred to as “oddities” (Aronoff 1976: 20), “unpredictable” (Bauer 1983: 232), “deviant” (Hohenhaus 1998, 2005), “playful” (Bauer 2000), “typologically marked” (Baldi et al. 2000), “odd, amusing, repulsive, or otherwise remarkable” (Lieber 1992: 3).
7 Typically the former subsumes affixation and compounding and the latter comprises blending, clipping, acronymization and root-creation.
8 For instance, Hohenhaus (1998) maintains that ‘deviation’ is one of the characteristic properties of nonce formations; see also Marchand (1969) and Lieber (1988).
9 Exceptionally, items (b) and (e) are included despite the lack of such prompt phrases.

(1)
(a) At this price, the DVD is a bargain, and if you can purchase "The Brontes of Haworth" on DVD along with the books - "The Heretic" by Stevie Davies as well as "All Alone: The Life and Private History of Emily Jane Bronte" by Romer Wilson - the latter comprise a start to de-mythicizing (if that is a word…) Charlotte in order to bring the truly brightest light of Emily Jane's genius out of the darkness at last.
(b) After we accepted Adam Levin into our program, I got what I remember as a 25-page email from him. It was a kind of e-masterpiece, that presaged what we'd see from him during his time at Syracuse: manic, articulate, full of passion and self-effacing humour, courageous in its form, and very funny.
(c) I loved the show,but it defintly jumped when Donna died her hair blonde. There are endless hot blondes on tv,and she is not hot enough to be one of them. However, she was one of the hottest redheads on tv. She lost her uniqueism (if that is a word)
(d) And I think your article in SI [Sceptical Inquirer] gets it exactly right with regard to both the definition of free will and the attempt to salvage libertarian free will by recourse to quantum spookery, If I may coin a term.
(e) It's my contention that the entire prosecution of the Iraq war should be viewed as a year long penis ad. The entire exercise was one of base appeal to the part of the male (and female) psyche that believes such blatant idiocy. The lizard brain that simply reacts on the level of survival and safety. Big == safe. Firm == strong. Virility == moral righteousness. It's like a Freud fest-o-thon.
(f) A glaring example of this incompetence is the attack on the corps commander, Karachi. While it is certainly the case that attackers invariably choose the target, and the time and place of the attack there are (or should be!!) standard operating procedures (SOPs) for dealing with it when it comes. Particularly when the attackee, if I may coin a word, is a very senior officer of the army, no less.
(g) While biodiversity is declining more rapidly than ever before, thanks to us, “bibliodiversity” (my word, not his) stands at about a million new titles per year, and growing.
(h) When I first got here in 1999, engineering was all Linux and business was all Windows. Eventually, as engineering became more biznified (if I may coin a word), Windows machines worked their way in.
(i) Quine is fond of the formula that while sentences are either true or false, a predicate is either true or false of something. For Frege, we remember, the predicative ‘is’ is merely a clumsily disguised ‘of’. Ofness, if I may coin a word, thus plays a crucial role in both systems.
(j) "By my chemical knowledge, merely," replied Holmes. "A merely worldly vessel leaves a phosphorescent bubble in its wake. That one we have just discovered is not so, but sulphurescent, if I may coin a word which it seems to me the English language is very much in need of.”
(k) If you think that tobacco is an aberration, try researching the prohibitionist movement in the late 19th and early 20th Centuries. It followed the same path as tobacco, and the Belly Bolsheviks are copying both the prohibitionists and the tobaccohibitionists (if I may coin a word.)
(l) The capitalization was done out of shear ad-hoc'ishness (if that is a word :-) We couldn't find a proper wordlist to start with and had to generate our own
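The elicitation procedure behind (1) can be stated mechanically: locate a prompt phrase and take the word immediately preceding it as the candidate nonce formation. The Python sketch below is offered for illustration only; it is not the procedure used in this study, where the excerpts were retrieved by means of a Google search. The function name candidate_before_prompt, the set of prompt variants and the punctuation set are all our own assumptions.

```python
import re

# Prompt phrases of the kind typed into the search box (hypothetical variant list).
PROMPT = re.compile(
    r"if that(?: is|['’]s) a word"
    r"|if I may coin a (?:word|term)"
    r"|if you['’]ll excuse the term",
    re.IGNORECASE,
)

# Characters stripped from the edges of a token (assumed, not exhaustive).
PUNCTUATION = "()[]{}.,;:!?\"'“”‘’"


def candidate_before_prompt(text):
    """Return the last word preceding the first prompt phrase, or None."""
    match = PROMPT.search(text)
    if match is None:
        return None  # no prompt phrase, as in items (1b) and (1e)
    preceding = text[:match.start()]
    # Strip edge punctuation from every token and drop tokens that become empty,
    # so that a stray opening bracket before the prompt is ignored.
    tokens = [token.strip(PUNCTUATION) for token in preceding.split()]
    tokens = [token for token in tokens if token]
    return tokens[-1] if tokens else None
```

Applied to the last sentence of (1c), the function picks out uniqueism. Two limitations follow directly from the design: only a single word is returned, so a two-word coinage such as quantum spookery in (1d) is truncated to spookery, and excerpts without a prompt phrase yield no candidate at all.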
Each word in bold type in each excerpt illustrates the phenomenon of nonce formation; they are ephemeral, coined on the spot, sometimes intended to be witty and expressive. Of all the highlighted words, the following belong to productive word-formation: de-mythicizing, uniqueism, quantum spookery, attackee. The words de-mythicizing, uniqueism and attackee represent word-formation at its most productive and regular, i.e. affixation, whereas quantum spookery is a representative of compounding, the second most prolific word-formational mechanism in English (cf. Bauer 1994); hence the term productive morphology, prototypically understood as including these two processes. 10

10 Conversion (zero derivation) and back-formation (back-derivation) are also included in this category by some, although not with the same degree of prototypical categorial membership.

However, although in quantitative terms, affixation and compounding generate the most derived words, in qualitative terms, some of these constructions, especially compound words, are not as predictable and transparent as may be expected from the designation productive, regular, rule-governed morphology. This is so because compounds, and nonce compounds especially, are notoriously unpredictable as to their exact meaning (see Downing 1977, Štekauer 2005). In practice, most of them either have one determinate meaning that the speakers have come to associate them with, e.g. doorstep, earring, couch potato, or their semantics can be inferred from the context (e.g. apple juice seat analyzed by Hohenhaus (1998)). Yet when it comes to the semantics of novel compound words, the matter is open to multiple interpretations. Such is the case with quantum spookery, whose exact meaning will remain vague until and unless it just so happens that it starts to be used often enough for one particular meaning from among all its potential meanings to be established (see section 3.2 for a detailed discussion of institutionalization of meaning). It should be noted with regard to this set of items, i.e. de-mythicizing, uniqueism, quantum spookery and attackee, that, although they can all be associated with productive morphology as defined above, the nonce word uniqueism exhibits a feature that sets it apart from the rest. Namely, uniqueism is coined in place of the more usual uniqueness, which is listed in dictionaries. Whether such word (or affix) substitutions are to be regarded as speaker creativity or language errors (perhaps slips of the tongue) is a matter that might be resolved on the grounds of intentionality, which will be discussed in section 2.3. For the time being, however, what is noteworthy is that whatever makes the speaker/writer coin a nonce word and use it in place of an established word, even if it is just a trivial cause like a temporary memory lapse (cf.
Bauer 2000: 832), the very fact of the coinage illustrates the usefulness and potential of nonce formations, which the speaker/writer can fall back on at any time as need arises. A similar type of word substitution can be observed in the nonce formations highlighted in the excerpts below:
(2)
(a) I must say that i don't eat a huge amount, mainly because i can never deconcentrate (if that is a word) from the task at hand. I always drink lots and lots of fluids however - usually from sweating it out chasing people around i guess :lol:
(b) Q: What are the requirements to invest? Is there a certain level of net worth? What is the minimum net worth? A: It is $1 million […] Q: Does it include the value of a person’s home? A: It does. And there are some who will raise the bar a little bit and say that you have to have $1 million in investibles, if that is a word. Or you could say assets, but it doesn't count someone's house.
(c) The Chrysalids is about how there are these kids who are telepathic and if someone found out they were they would be banished because they would be considered blasphemic. (if that is a word.)
Blasphemic is analogous to uniqueism in that it is a marked construction employing a non-standard suffix and filling in for the usual lexeme (cf. the usual blasphemous). However, the other two forms are replacements of a different kind. The word investibles was used to mean assets, which was actually used by the speaker to correct him/herself. Similarly, deconcentrate does not merely utilize a non-standard prefix, but rather the entire formation is new (at least in this particular sense; OED lists deconcentrate to mean “transfer (authority) from central to local government”), and normally the same concept would have been expressed by take my mind off the task at hand, unwind, etc. Note that deconcentrate is based on the same morphological and semantic pattern as demotivate (actual word) and that there is nothing except for the condition of attestation and currency/listedness that sets these words apart as nonce word and actual word – no compositional or pragmatic markedness is involved. Thus far we have seen that the class of nonce formations is not exclusively composed of unusual, typologically and/or pragmatically marked coinages, but may also be represented by stylistically neutral conventional-looking ones. The same procedure to elicit nonce formations that was deployed in (1) produced more such ‘plain’ examples:
(3)
(a) Rather than promoting discussion of difficult topics, we resort to emotionalism and iconolization (if I may coin a word) which not only hampers rational discussion but which actually creates a sense of aversion in those who might be most inclined to discuss the topics.
(b) But for the more cruisy, sedate students such as myself, life just got a whole lot more 'routiney' (if that’s a word). It's all good though. My assignments for my four subjects seem to be evenly spread in their due dates so hopefully not too many late nights studying and cramming.
(c) The result is a very close to literal version of the text, allowing for colloquialisms and language patterns to emerge pertinent to the culture at the time that may have otherwise been “culturalized” If I may coin a word.
(d) I am poor and have bad hardware or maybe just inconfigurable (if that is a word) and want some suggestions to affordable solutions.
The remaining nonce words in (1), i.e. e-masterpiece, fest-o-thon, bibliodiversity, biznified, ofness, sulphurescent, tobaccohibitionists and ad-hoc'ishness will here be assumed to illustrate creative morphology and the reasons for this assumption are twofold. Firstly, their structural make-up is to varying degrees typologically marked in English word-formation; secondly, the formations are presumably meant to convey pragmatic effect of some sort: that of wit, humor, modernity, oddness, expressivity. 11 Arguably, all of them were coined to be eyecatching and therefore draw attention to themselves. In fact, it is typical of creative morphology to perform this particular function of ‘attention-grabbing’ (see Adams 2001: 140, Lehrer 2003, 2007) as well as to supply the language user with a set of formal tools, however vaguely specified, for achieving this pragmatic effect. One way of ensuring attention, as mentioned above, is unusual, irregular word structure – unusual by the standards of the given language of course. And so, with respect to the eight nonce formations in question, ofness must be considered atypical even if considered from a very liberal perspective of what is permissible in English word-formation as prepositions are not customarily used as bases for derivation. To a less obvious degree, ad-hoc'ishness may be regarded as structurally unusual. Even though adjectives are eligible bases for the ish(ness) derivation, ad-hoc seems incongruous in this and presumably any other derivative construction, possibly due to its ‘fixedness’ and, perhaps more importantly, overtly ‘alien’ phonological form (along the same lines, other phrasal borrowings would resist derivation, e.g. per se, a priori). 12 Thus the underlying effect of ad-hoc'ishness is one of linguistic playfulness. It should be noted that irregular internal composition as discussed here is not to be equated with structural opacity understood as a lack of clear morphemic divisibility or transparency. 
The latter is not exclusively associated with either productive or creative phenomena and should rather be investigated as occurring in individual cases. Ofness and ad-hoc'ishness are therefore part of creative, irregular morphology, although their morpheme boundaries are as transparent as they could possibly be.

E-masterpiece is a representative of the multitude of prefixed forms comprising what seems to be a voguish new prefix deriving complex words related to the Internet (the e- being a clipped form of electronic). However, semantically speaking, e-masterpiece must be considered highly context-dependent and thus severely restricted in usage, if not a one-off nonce word. In a similar vein, fest-o-thon exemplifies unconventional suffixes (“folkmorphs” (Baldi et al. 2000: 967)) used “to create appealing names for certain types of jargony expressions [...] restricted to individual semantic domains” (Baldi et al. 2000: 967).

Biznified is unconventional in that its base form biz- is a clipped form and thus an outcome of creative morphology in its own right (cf. the transformation of show business into showbiz, which subsequently is fore-clipped to biz with accompanying semantic broadening). The reason clipping is subsumed under creative word-formation is that it is considered non-rule-governed and arbitrary with respect to, for example, the class of eligible input forms as well as their morphological complexity and syllable structure. In fact, the only constraint on the processes of clipping is that the outcome be phonologically well-formed and up to two syllables long (Szymanek 1989: 97).

The two nonce blends sulphurescent and tobaccohibitionist were coined by blending respectively sulphur (or sulphureous) + phosphorescent and tobacco (or tobacconist) + prohibitionist with only partial reproduction of the second constituents. It is noteworthy that when sulphureous, and not sulphur, is taken to be the input form in sulphurescent, the two input forms sulphureous and phosphorescent overlap phonologically in the [-res-] part. Such phonological resemblance or exact overlap (haplology) has been indicated (Bauer 1983: 96, Adams 1973: 150) to facilitate blending and the coining of analogical formations (illustrated by plagiarhythm < plagiarism, shopgrifting < shoplifting, freegan < vegan 13). This additional phonological motivation, unusual in word-formation, might presumably be used as yet another proof of the extrasystemic nature of blending.

11 These two criteria are commonly accepted in the literature as defining (e.g. Baldi et al. 2000).
12 Ad-hocness, however, seems acceptable.

The exact identity of the input forms involved in bibliodiversity is not obvious either in that two possible readings can be envisaged. The first is biblio- + diversity, the other is biblio- + biodiversity.
On the first interpretation bibliodiversity is a compound containing a bound combining form (in McCarthy’s (2002) terminology) or a stem compound (in Adams’s (2001) terminology). On the second interpretation, which is supported by the fact that biodiversity appears in close proximity and thus may have been used as a model word, bibliodiversity is the product of blending. The operation is, as in the case of clipping, traditionally considered to be peculiar, irregular and marginal in English word-formation (e.g. Marchand 1969), and its products are termed by Marchand (1969: 452) “artificial new words”. 2.2.3. Summary and conclusion Creative morphology, then, has always been and still is regarded as distinct and indeed separate from productive morphology. Functionally, the former has been associated with a non-literal pragmatic effect, and so terms such as “expressive”, 13
From Macmillan’s on-line Word of the Week.
“ostentatious”, “whimsical” (Zwicky et al. 1987: 7), “fanciful” and “playful” are commonplace with reference to blending, clipping, acronymization and root creation. 14 While not every single outcome of creative processes can be shown to be stylistically marked (see the discussion above), this tendency is certainly noticeable. And, conversely, some conventional productive morphology may occasionally convey the idea of playfulness and humour, as in get unlost (Lehrer 1996b: 70), unmurder (Kastovsky 1978: 358), kissee, murderee. Structurally speaking, the relevant formations (i.e. blends, clipped forms, acronyms and root creations) are customarily considered to deviate from the productive patterns (see examples (b), (e), (g), (h), (i), (j), (k), in (1)). Non-literal meaning and connotations, and unusual composition may influence each other and contribute to the general markedness. For example, occasionally, words can be “felt by speakers to have a pragmatic effect because of the whimsical and ostentatious affixes they contain” (Baldi et al. 2000: 964). As an example, compare the stylistic effect of biznified in (1h) and lack thereof in business-driven. However, it should be stressed again that these two defining criteria, functional and structural, do not necessarily operate in tandem and in equal participation. And so, it is feasible that individual forms may exhibit varied degrees of structural markedness with a complementary degree of expressivity, or that one of the two elements may be lacking altogether (see above). This claim could be further illustrated by the expressive hypocoristic suffix -y/-ie as in Franky and doggie, whose composition appears virtually conventional and transparent, reminiscent of that of productive formations. Another case in point is expressive diminutivization in Polish, which triggers predictable, non-atypical formal change and is applied with an across-the-board regularity. 
Taking the above into consideration, it may be sensible to see the distinction between productive and creative processes as a cline allowing for a degree of overlap rather than a clear-cut dichotomy (however, cf. Zwicky et al. (1987: 9) for the opposing view). Another problem is the usual correlation of productive word-formation with the formal mechanisms of derivation and compounding (optionally including conversion and back-derivation) on the one hand, and the usual association of creative word-formation with the processes of blending, clipping, acronymization and root-creation on the other. The correspondences cannot be so clear-cut, as evidenced by derivatives that must be perceived as exceeding the bounds of normal rules (cf. ofness, biznified, ad-hoc’ishness). It has been said above that productive and creative word-formation are considered distinct but also separate. Claims to the effect that the two are in fact 14
See Hohenhaus (2007) for a review of other pragmatic and communicative functions served by nonce formations.
separate (possibly independent?) have some important theoretical implications. Zwicky et al. (1987: 1) write: Not every regularity in the use of language is a matter of grammar. There are many which incorporate or build upon aspects of grammatical organization (including phonology, morphology, syntax, and semantics), but which can be seen as grammatical rules only by stretching the idea of a grammatical rule beyond all recognition; […]. We claim, then, not that rules accounting for such phenomena [expressive morphology – WG] are marginal in their grammar, as some analysts have said, but that the definition of the phenomena in question lies in a domain orthogonal to the grammar. They constitute a linguistic phenomenon that is not within the province of the theory of grammar as ordinarily understood, though it is certainly within the broader sphere of human linguistic abilities.
Thus the proposal that is made in the above quotation is that expressive morphology is a phenomenon allocated to a domain outside grammar (hence the term ‘extragrammatical’ used by Dressler et al. 1994). Conceivably, one such language area to which the authors allude might be performance, as opposed to competence. This is supported by Szymanek (1989: 96-97): The essential feature of word-manufacturing [here understood as comprising clipping, blending, acronymization and analogical formation – WG] is that the new lexemes it produces originate outside the morphological component of a grammar, i.e. their creation is not governed, in any significant way, by the word-formation competence of a speaker. […] [W]ord-manufacturing is now often treated as a phenomenon of language-performance.
Szymanek bases his claim on the grounds that whatever limitations exist on creative word-formation, they are often of extragrammatical nature, such as factors of sociolinguistic acceptance, euphony, tendency towards analogy, economy and originality of expression (blending patterns based on phonological resemblance, mentioned above, would be another one). All these are “hard to encode as grammatical constraints” (Szymanek 1989: 96-97). Additional support comes from Bauer’s (2005: 329) declaration: [W]e can take the creative in creativity literally, and use this term to refer to the less automatic creations, those which are clearly deliberate and independent of the system. […] This leaves the term ‘productivity’ for use with those formations which are clearly part of the system, namely those parts of word-formation which are rule-governed.
The phrases ‘less automatic’ and ‘clearly part of the system’ implicitly presuppose the existence of other creations which may be ‘more or less automatic’ and ‘not so clearly part of the system’, thus corroborating our earlier suggestion that the borderline between creativity and productivity is, in fact, more of a cline. Indeed, some creativity (i.e. some unconventional morphology) may be said to be productive (e.g. Baldi et al. 2000, Lehrer 1996b) in the sense of being readily available in the coining of new words, rather than in the sense of being governed by a set of specific rules (recall e- prefixation, as in e-masterpiece, which may be seen as only loosely governed in its operation, for example, with respect to its eligible base forms). All in all, the distinction under discussion is valid and useful for descriptive purposes in individual cases, but may be accused of overgeneralization on the theoretical level. With specific reference to nonce formations, the question of creativity is usually brought up due to the misconceived view that these coinages are (invariably) aberrant in form and stylistically marked. As shown above, this is not necessarily the case. 2.3. Intentionality vs. unintentionality Schultink’s (1961) extremely influential definition of productivity presupposes that genuinely productive word-formation is spontaneous and unintentional, i.e. the new words that are the result of productive patterns are coined without the speaker’s awareness that he/she is doing so. Presumably this also entails a similar unawareness on the part of the hearer in that such coinages will go unnoticed. According to Schultink, this is in contrast to the creative employment of wordformation to intentionally coin new words on an unproductive pattern, such that in principle will draw attention to themselves, in order to deliberately produce a stylistic or pragmatic effect. 
On this account, the now familiar division into productive and creative morphology is reinforced with the addition of yet another distinguishing feature. 15 The idea has been embraced by some scholars – if not entirely in their definition of productivity and creativity (van Marle 1985, Baayen et al. 1991, Lieber 1992), then partially (functionally), for example, in the treatment of blends and acronyms (Bauer 2000: 836). Marchand (1969: 451) claims that “[b]lending can be considered relevant to word-formation only insofar as it is an intentional process of word-coining” and that the process “has no grammatical, but stylistic status”. Similarly, although not quite as restrictedly, blending is perceived by Carstairs-McCarthy (2002: 66) as “more or less self-conscious”. 15
Items differing on the scale of the productivity-creativity continuum are also considered to be characteristic of distinct registers (e.g. Baayen and Renouf 1996, Cowie 2000).
There are, however, problems with this point of view. Plag (1999) argues that the very notion of intentionality is vague and disputable on two accounts. Firstly, speakers vary as to their language-awareness and “what goes unnoticed by one speaker may strike the next as unusual” (Plag 1999: 14). Secondly, some coinages on productive patterns must be considered intentional, e.g. when applied to designate new concepts in science, in which case the purposeful intention behind the coining of the new name is for it to stay in the language. Another argument might be added to those listed by Plag, namely that the speakers’ judgement of what is retrieved from their lexicons and what is produced anew literally as they speak may be inconsistent from speaker to speaker. It would therefore be difficult to ascertain precisely when a new word is being coined not only for linguists, but also for speakers themselves. Bauer (2001) does not consider (un)intentionality a tenable criterion for the distinction either. The reasons he cites are that conscious and unconscious (Bauer’s designations apparently equivalent to intentional/unintentional 16 ) coinages exhibit no necessary differences of phonetic (although an intonational difference can be possible), syntactic, semantic or stylistic nature, “although such factors might be deemed to be relevant on occasions” (Bauer 2001:68). With respect to nonce formations specifically, Crystal (2000) maintains that they are deliberate on the part of the speaker but coined on the spur of the moment without careful planning. For the same reasons given above, it would have been a rather bold claim to argue that nonce words, by definition, are deliberate/intentional (Bauer’s term conscious introduces a slightly different dimension into the discussion). Instead, intentionality has often been cited as a quality of those nonce formations which clearly depart structurally from the standard patterns (e.g. Bauer 2005: 329). 
In such cases intentionality is used descriptively rather than classificationally. Hohenhaus (2005: 363) speaks of “deliberate deviations” in the sense of intentionally deviating from rules rather than the intentionality of the psycholinguistic act of coinage as discussed above. In the light of the above, the present study will consider (un)intentionality non-distinctive at the level of the definition and typology of (but less so in the description of individual) nonce words and neologisms, although some of both may well have been more or less clearly deliberate upon coinage (cf. explicit indications of intentionality in the text surrounding sulphurescent in (1j) as well as the use of inverted commas and quotation marks in (1g), (3b) and (3c)). Note that such phrases as “coined on the spur of the moment” or “coined spontaneously”, which are often used in defining nonce word-formation, should not be
16 We will assume the two pairs of terms to be interchangeable, although it may be argued that, for example, an unintentional coinage, upon its production, may still be consciously identified as such, that is, unintentional but conscious.
Neologisms and nonce words
interpreted as bearing on the (un)intentionality criterion, but rather as descriptive devices of the immediate here-and-now context for coinage (in the sense that ‘on the spur of the moment’ does not necessarily equal ‘unconsciously’). Following Bauer (2005), it is also stressed that the noun coinage and the phrase coin a word are not to be understood in this work as ‘usually implying deliberate purpose’ (The Oxford English Dictionary) but rather as devoid of any implications concerning the criterion of (un)intentionality or permanence (note that Hohenhaus (1998: 239) takes the opposite stand).
2.4. Other characteristics proposed in the literature
2.4.1. Hohenhaus’s (1998) scalar definition
This section discusses and evaluates the validity of other characteristics of nonce formations as put forward in Hohenhaus (1996, 1998), both of which are some of the most in-depth analyses of the phenomenon in focus. His definition of nonceness is based on a scale of four co-defining features the presence of which ranks nonce formations from basic, meeting one fundamental criterion, through gradually more typical ones displaying more than one feature, to, ultimately, prototypical ones exhibiting all the features. The four criteria envisaged by Hohenhaus (1998) are given below 17:
1) newness – the formation has to be formed anew rather than retrieved ready-made from the mental lexicon
2) context-dependency – “nonce formations are typically interpretable, or indeed usable, only with contextual support” (Hohenhaus 1998: 239-240).
3) deviation – “many nonce formations must also be considered to be deviant, i.e. not conforming to the language’s word-formation rules or well-formedness conditions” (Hohenhaus 1998: 240).
4) non-lexicalizability – as a result of context-dependency and deviation, many nonce formations cannot be lexicalized 18, i.e. cannot be listed as part of the mental lexicon
Two of the above categorial criteria, i.e. newness and deviation, have already been dealt with, and so will be omitted. The next section will examine the two features of context-dependence and non-lexicalizability.
17 The author maintains that the features are “structured in a more or less hierarchical fashion” and the criterion of newness is said to be the “basic” one. Non-lexicalizability is listed as number 4. However, two pages later, non-lexicalizability is claimed to be “the most characteristic feature of NFs” (Hohenhaus 1998: 238-240).
18 Here Hohenhaus uses the term ‘lexicalized’ in the sense of ‘listed’.
2.4.2. Context-dependence and non-lexicalizability
The author illustrates the problem with Downing’s (1977) ‘deictic compound’ apple-juice seat, whose full interpretation depends entirely on extralinguistic context 19 (in this case a seat in front of which a glass of apple juice had been placed). It is argued that “the lexicalization potential of such a compound seems to be quite low, since it is based on a relationship of a very temporary, fortuitous nature, and such states are generally not considered name-worthy by the community […]” (Downing (1977: 822) quoted in Hohenhaus ibid). Therefore, Hohenhaus dubbed this and other similar nonce compounds non-lexicalizable in the sense that, due to their deficient interpretability, they cannot be given an entry in the mental lexicon and become a legitimate part of the vocabulary (of either individual or ideal speaker) and are thus ‘exiled’ to the realm of performance. The argument is that such nonce formations are denied listedness because it does not seem worthwhile to list an item of such low usability, an item that cannot be stored as a generic label for future use. Along the same lines of argument he cites a series of ‘dummy compounds’ with the words thing and business acting as semantically empty (and thus non-lexicalizable) head constituents: this Ross thing, that phone business, etc.
2.4.3. Evaluation
While Hohenhaus (1998) makes a number of insightful observations concerning the working of some nonce word-formation, two of the features he presents, context-dependence and non-lexicalizability, do not seem to be usable with all the data eligible for analysis. Deictic and dummy compounds may neatly lend themselves to the author’s hypothesis but clearly there are numerous counterexamples. Consider inconfigurable, culturalized and routiney in (3), all of which are immediately interpretable, even when taken out of their context.
Given the inadequacy of this criterion, which simultaneously is the basis for claims of non-lexicalizability, it must be inaccurate to assume that the same set of data could never be listed as part of common vocabulary only because they are nonce formations. Indeed, these words are intuitively felt to be potential candidates for listedness (i.e. within a full-entry approach to the mental lexicon) – they are formally and semantically transparent, felt to be highly ‘learnable’ and of potentially frequent usage, if only given enough exposure. 20 In consequence, the two features postulated by Hohenhaus (1998) must be considered untenable.
19 The term contextuals was introduced by Clark and Clark (1979) to refer to such items.
2.5. Definition of nonce formation revisited
We have so far identified the following defining characteristics of nonce formations (those mentioned on page 18 have been revised and repeated here for convenience):
1) They are novel lexical creations (novel in the sense of being put together anew by the speaker rather than retrieved ready-made from the mental lexicon).
2) They are coined (consciously or not) for a particular purpose, namely to fill a lexical gap (genuine or perceived – as when a more usual form is temporarily forgotten).
3) They are not part of the core vocabulary: they may be used once or several times by the same speaker (or independently by several speakers) but will never gain currency with many speakers (if they do, they may be on the way to becoming a neologism and only then part of the societal norm).
4) They are typically short-lived: the vast majority will die the moment they are coined and used in spoken production; some others will get recorded (possibly only once) if they happen to be coined in a written medium; still other coinages may survive by catching on in the language, and continue their life as neologisms.
5) They may be simplex or complex words whose composition may represent virtually all word-formational processes, i.e. affixation, compounding, blending, back-derivation, conversion, clipping, acronymization and root-creation. Similarly, nonce words may represent both so-called productive (i.e. regular, rule-governed) and creative (structurally and pragmatically/stylistically marked) morphology without affecting their status of nonceness.
2.6. Conclusion
Hohenhaus (2005: 364) observes that nonce words are theoretically interesting and troublesome in that they are intermediate between possible words and actual words.
The argument is that once they are coined, they are no longer merely possible but are not part of the lexicon either. This is probably due to the fact
20 For instance, compare two computing-related items, the nonce word inconfigurable from (3) with unroutable (see 3.2.1), which is listed in The Oxford Dictionary of New Words.
that their attestation is at best dubious (especially in spoken language) and their currency virtually none. There is no denying, however, that such an intermediate position should be of great interest and use to morphologists, given that “the simplest task of a morphology, the least we demand of it, is the enumeration of the class of possible words of a language” (Aronoff 1976: 17-18). Nonce formations are actual instantiations of possible words, also in the sense that was of concern to Aronoff, that is, transparent results of regular morphology, before they have a chance to “persist and change” (Aronoff 1976: 18). Notwithstanding the theoretical challenge, the study of non-established vocabulary has tended to be marginalized (e.g. Marchand 1969: 9-10) and neglected in traditional word-formation even though, occasionally, observations (or mere declarations) are made of how very frequent some of the phenomena are in general use. 21 For example, with respect to compounds specifically, Bauer (2001: 36-37) reports two experiments (one by himself, the other by Thiel (1973)) whose results demonstrate that the number of words in both English and German that are attested, but not listed in dictionaries, is much higher than the number of established words. Specifically, the German sample showed that 62.1 per cent of 1,331 compounds found in a single issue of the magazine Die Zeit were not listed in dictionaries. The English sample of 148 compounds extracted from a five-page excerpt of Time magazine yielded 67 items that were not listed in The Oxford English Dictionary (2nd edition). Similar claims are increasingly made in more recent publications, to the effect that, although ephemeral in many cases, “from the point of view of the linguistic system (langue), nonce formations are regular, structurally transparent, predictable naming units generated by (potentially) productive word-formation rules” (Štekauer 2002: 98). 
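The proportions reported for the two samples can be put on a common footing with a short calculation. The following is a minimal sketch in Python; the variable names are illustrative assumptions, and the only inputs are the figures quoted above from Bauer (2001: 36-37). It estimates the number of unlisted German compounds from the reported percentage and expresses the English count as a share of its sample.

```python
# Figures as quoted in the text from Bauer (2001: 36-37).
# Variable names are illustrative and not taken from the source.
german_total = 1331            # compounds in a single issue of Die Zeit
german_unlisted_share = 0.621  # reported share not listed in dictionaries
english_total = 148            # compounds in a five-page excerpt of Time
english_unlisted = 67          # items not listed in the OED (2nd edition)

# Approximate number of unlisted German compounds implied by the percentage.
german_unlisted_estimate = round(german_total * german_unlisted_share)

# Share of unlisted items in the English sample.
english_unlisted_share = english_unlisted / english_total

print(german_unlisted_estimate)          # 827
print(f"{english_unlisted_share:.1%}")   # 45.3%
```

The German figure is an estimate recovered from a rounded percentage, so it may differ by a unit from the count in the original experiment.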
Nonce formations are transient and thus are not individually recorded or listed in reference works. This, however, does not prevent
21 Consider the following remarks: “One fact about nonce formations that is not sufficiently appreciated is how large a proportion of complex words that are heard everyday are nonce formation,” Bauer (1983: 46); “[M]any kinds of new words and neologisms, blends, for example have been considered marginal. This attitude is unfortunate, since blending has become a truly productive process in contemporary word-formation, not only in English, but in French, German, and other languages as well. Since many blends are ephemeral and do not remain in the language (Cannon 1986, to appear), it is important to study the processes for blend formation and interpretation,” Lehrer (1996a: 359-360); “Both productivity and creativity give rise to large numbers of neologisms, but in what follows it is only rule-governed innovation, that is, productivity, which will be discussed,” Bauer (1983: 63-64). See also Crystal (2000), Lehrer (1996b), Pyles and Algeo (1993: 277-279).
the processes and patterns (as emphasized by Adams 1973: 146 and Lehrer 1996a: 359-360 and 1996b) that stand behind their production from being some of the most representative of the synchronic word-formational capacity of the language system and the formal tools thereof (e.g. Hohenhaus 1998: 238, Bauer 2000: 834). Adams (1973: 3), in her discussion of ephemeral formations, argues that “from an inspection of a range of established and transient coinages, we may gain some idea of the various forces at work in English word-formation[.]” Whatever is transient in language suffers the disadvantage of not being documented and researched properly. Hence, it seems worthwhile investigating non-established vocabulary items especially since the fact that dictionaries by their very nature list (and indeed prefer) lexicalized material means that the words they list probably do not give an accurate picture of the synchronic possibilities of word-formation. […] Nonce formations can provide just as much evidence about the synchronic system[.] (Bauer 2000: 838)
3. Neologism
We have already laid out the rudimentary assumption behind the distinction of nonce words and neologisms and agreed that the two are set apart according to the degree to which an item has entered the common vocabulary of a speech community (see the discussion of ‘common vocabulary’ in Barnhart (1973)). On such a scalar interpretation as envisaged above, it is inevitable that a certain amount of taxonomical fuzziness is to be allowed for. 22 This fuzziness will be reflected both in the terminological distinctions to be drawn below and in our treatment and classification of particular items. Such an approach is further supported by the commonly-held view that, diachronically speaking, neologisms are ex-nonce formations in that the latter continue to function as neologisms if they survive by being spread and accepted and thus gain ground among speakers. Bearing this in mind, it is not surprising that this gradual process, at any particular point in time, must render some items completely transformed, but leave others half-way through the transition (see 3.2). The process itself is, of course, only a more or less likely option for nonce words as most of them die an immediate death upon production (the factors conditioning their survival will be discussed in section 3.2.2). However, before we attempt any gradation-based subclassification within the category of neologisms, it is necessary to lay the ground for further discussion. The following issues will be dealt with:
22 This is reminiscent of the ‘squishy categories’ approach as well as the prototype theory (cf. Rosch 1978).
1) We will disambiguate the term neologism itself as several usages are available in the literature (3.1).
2) We will discuss the process of the nonce word-to-neologism transition (3.2).
3.1. Disambiguating the term ‘neologism’
Neologism in the sense used in this dissertation is to be distinguished from the following:
1) The meaning of the same term as employed especially in psychiatry and psycholinguistics: that of a meaningless word coined by a psychotic or aphasic. For example, Aitchison (1994) uses neologism in the sense of a nonword coined by an aphasic as well as with the meaning of a morphologically new word.
2) Clearly literary and poetic creations 23 (Dressler 1981).
3) The broad meaning of the term that does not discriminate between nonce formations and neologisms, that is, the variant of the term that is popularly (and ambiguously) used to refer to a ‘new word’.
Another observation is in order here regarding point (3). This meaning of neologism disregards the diachronic distinction according to which neologisms are not totally new in the absolute sense in which nonce formations are. Compared to nonce formations, neologisms are new merely with respect to their inclusion in the lexicon and thus may aptly be described as “young listemes” (Hohenhaus 2005: 364). Acknowledging this distinction seems vital on three counts: firstly, for reasons of precision and accuracy of linguistic description/theory; secondly, for reasons of lexicography, where the distinction between ‘transient’ and ‘established’ is of the essence (see section 3.5); thirdly, in order to be able to systematically account for the fact that neologisms do evolve from nonce formations.
There have been attempts to introduce an umbrella term that would be neutral with regard to this diachronic evolution (that of ‘coinage’ (Bauer 2001) or ‘new formation’ (Hohenhaus 2005)) and that would subsume both nonce formations and neologisms at the point of their birth, but it is not clear how such a term might be beneficial to the theory. Neologisms have been identified above as new lexical items that start out as nonce formations but, with time, come to be used and recognized as item-familiar. We now turn to investigate in detail what happens at the transition from nonce word to neologism.
23 In the tradition of Polish linguistics also referred to as indywidualizmy ‘individualisms’, neologizmy autorskie ‘authorial neologisms’, neologizmy artystyczne ‘artistic neologisms’, neologizmy poetyckie ‘poetic neologisms’ (Smółkowa 2001: 17; my translation – WG)
3.2. Institutionalization
The transition in question is known as the process of institutionalization, a notion which was introduced by Bauer (1983), who envisages the term as representing a stage in the development of a word that is characterised by the following:
1) The formation begins to be accepted by other speakers as a familiar lexical item (Bauer 1983: 48)
2) Its potential for multiple semantic readings is typically reduced to a few or one meaning which is accepted and used by the speakers (Bauer 1983: 48)
3) The formation is “not consciously analyzed by the speaker-hearer” (Bauer 2000: 837)
4) “The words concerned are accepted by speaker-hearers as normal words of their language” (Bauer 2000: 837)
This is the phase in the life of an ex-nonce word (now a neologism) when it begins to be disseminated and picked up in the usage of other speakers. Presumably, it is not significant exactly how large a percentage of a speech community has to be involved in order for the process to occur 24 – what matters is that the formation is increasingly recognised as item-familiar and used in this community. This indeterminacy in the number of speakers involved presupposes, in turn, a degree of indeterminacy in the duration 25 as well as the commencement of institutionalization – note that this is congruent with the prevalent fuzziness of linguistic demarcation mentioned in section 3. It must be stressed that the institutionalization of a neologism in its early stage does not produce a fully institutionalized word. Again the degree of currency and recognizability is at the core of the matter – once a coinage begins to spread, it is gradually institutionalized in the sense of the four points given above.
It is with the passage of time that it becomes a fully institutionalized word in the sense of the institutionalization of words defined by Bauer (1988: 246) as “their coming into general use in the society and so being listed in dictionaries”. On such a definition a degree of gradualness has to be allowed for if the theory is not to be counterintuitive and counter-commonsensical. 26 Another aspect of fully institutionalized words that sets them apart from newly-institutionalized ones (in the senses of the four points above) is that the former also consist of non-recent lexical items that “could in principle still be formed by synchronic rules of word-formation” (Bauer 2000: 837) as opposed to lexicalized words, i.e. those that the above-mentioned rules could no longer produce (through language change). On this approach, words like worker, complexity and bookshop are included as institutionalized words, and these, by no standards, can be assumed to be recent. It is therefore crucial when investigating neologisms to bear in mind the systematic ambiguity of “institutionalized”: the term is used as alternatively meaning either “having started to undergo the process” or “fully institutionalized”. It is not merely a matter of time (or diachrony) for all newly-institutionalized words to complete the process, as some of them may drop out of use on their way to full institutionalization, that is, be deinstitutionalized. In view of the above discussion, this study recognizes the necessity of seeing institutionalization as a gradual process marking its products with distinct degrees of ‘their coming into general use.’
24 In the same context Bauer (2001: 212) speaks of “a large section of the relevant speech community”.
25 Fischer (1998: 69) offers some indication of completion of institutionalization; namely, in corpus-based research, stability or recession in frequency following a peak may indicate completion.
Lipka (2002, 2004) builds on Bauer’s definition and stresses the sociolinguistic aspect of the process: “We define institutionalization as the process of being accepted in the lexicon of a specific speech community (Americans, doctors and medical people, computer freaks, linguists, etc.)” (Lipka 2004: 11). Lipka’s explicit specificity of the sort of community involved is here juxtaposed with Bauer’s (1988: 246) “coming into general use in the society” or even “in a speech community”, which in any case tends to be interpreted as some sort of global community.
This new approach has sparked some discussion (e.g. Hohenhaus 2005) of a number of new perspectives from which to observe institutionalization at work. Lipka (2004: 8) argues:
Institutionalization in particular […] depends on different regional, social, ‘stylistic’ and other varieties of a language. It is a matter of smaller or larger speech communities within the national standards of a language such as British and American English, or Swiss, Austrian, and High German.
26 This, in fact, has seen some confusingly contradictory treatment in the literature. For example, Bauer (2001: 46) speaks of “the immediacy with which coinages become institutionalized”, while Hohenhaus (2005) regards a neologism as a not yet fully institutionalized word. Lipka (2004: 8) mentions institutionalization as a gradual phenomenon but only in the form of a declarative statement.
This claim introduces even more gradation and relativism into the concept of institutionalization – now it is not only the question of “institutionalized to what degree?” but also “institutionalized to whom?” The author stresses the fact that a lexical innovation spreads within a specified community to which the given innovation may be limited before it has (though not necessarily) an opportunity to spread further. The shared knowledge of this innovation among the members of the community (whatever its size) is to be accounted for by the notion of norm (adopted by Lipka from Coseriu (1967, 1975) as a third level of language, in between Saussure’s levels of langue and parole), which is “a collective realization of the language system” (Lipka 2004: 8). The norm represents the subpart of the system that is put into actual use and thus disregards the extra possibilities of the system (Coseriu ibid). Strictly in accord with Lipka’s understanding of the concept, one should perhaps say: the norm of a given speech community represents the subpart of the system that is put to actual use within that community. The emphasis on discussing linguistic norms within the frame of a specific community and the implicit recognition of language variation at large is contrasted here with the idealised conception of homogeneous lexical competence common to all speakers. 27 Such is the stance taken by Bauer (2001: 39) when he states that “a neologism is a word which becomes part of the norm of the language”. It seems unwise to uphold such a vision of a unified norm, and evidence to the contrary is not hard to come by: what is an institutionalized part of the lexicon for one speaker may well be a nonce word for another, especially across but also within speech communities. 28 A new word may be item-familiar to specific individuals even though it has not secured its position among other members of the same speech community.
Therefore it appears necessary, once more, to acknowledge a certain amount of gradation in yet another aspect of neology and postulate that recent lexical additions of varied currency and spread are characterized by various degrees of being part of the norm. This in turn undermines the clear-cut distinction between neologisms and nonce formations based on the notion of norm. Lipka’s community-based definition of the norm is a useful construct for capturing certain linguistic facts about institutionalization. For example, the collective realization of the language system within a couple (i.e. the smallest setting of a speech community (Hohenhaus 2005)), family or a group of work colleagues may differ significantly from the idealistic approximation to the collective realization of the system within the societal community at large. Within
27 See Chapter 3 for a discussion of language variation.
28 The same idea is implied in Hohenhaus (1998: 239): “[A neologism] has entered the language’s lexicon not too long ago, but it has – though perhaps not the individual lexicon of every member of the speech community”.
such small and individualistic speech communities, certain peculiar lexical innovations may occur that substantiate the claim of the heterogeneous and community-based nature of institutionalization. Hohenhaus (2005: 361) cites examples of lexical idiosyncrasies that have arisen and become established within a couple (jocular singular form shoop from sheep), among family members (mice bible for a copy of the Bible showing the teeth marks of mice), and among specialist group members (the jargon of linguists, Internet users, members of music bands, etc.). This is precisely what Lipka (ibid) refers to as the sociolinguistic aspect of institutionalization.
3.2.1. Neologisms as part of the community-dependent norm
The aspect of Coseriu’s (ibid) norm that relates this notion to the subset of possibilities of the system that are exploited in actual use, as well as Lipka’s (ibid) understanding of the concept (community-dependence), neatly lend themselves, in combination, to a theory of neologisms. It seems only natural and self-evident that some (most?) neologisms will arise, spread and be institutionalized first within specified language communities of varied background and size before they can be fostered beyond and into the general vocabulary of the language. Some exceptions are possible, such as that of media-publicized creations which may be spread to and accepted by the general public virtually instantly, but – assuming that this is still innovation within a certain community or language genre, that of journalism – the point remains largely unaffected. In such cases it is only the enormous speed with which the new word is disseminated that is dissimilar from cases of the traditional word-of-mouth spread of a word. This community-based origin of many neologisms has its reflection in lexicography. Dictionaries of neologisms often explicitly comment on the subject areas and their respective communities of speakers that are notably prolific word-coiners.
For example, Ayto (1989) in his Introduction to The Longman Register of New Words (henceforth LRNW) writes:
Reflecting its continuing vigour, the financial sector remains a prodigal coiner of neologisms, both sober and fanciful. The lay person trying to navigate the City’s treacherous waters has had a Sargasso sea of new jargon to cope with […] Not far behind wealth as a word-creator comes computing. […] In the wider scientific sphere, the search for a theory of everything has given us concepts easier to name than to grasp […] The medical world at large has introduced us over the last two or three years not so much to new illnesses as to ones we have at best been only dimly aware of before […] Crime is as innovative as ever […]
Other trends emerging in Ayto (1989) and identified therein as such consist of vocabulary contributed by the world of journalism, industry, politics, pop culture, lifestyle, education and charity. Similarly, The Oxford Dictionary of New Words (henceforth ODNW) employs graphic icons marking each entry as relating to the following eleven subject fields: Art and Music, Computing, Environment, Business World, Health & Fitness, Lifestyle & Leisure, Politics, Popular Culture, People & Society, Sports, Science & Technology. With reference to these subject areas a quantitative analysis was carried out in order to establish the percentage of those ODNW entries that must have been coined within a certain specialist genre. A sample of 178 entries was compiled by deriving the first entry from every second page of the dictionary. These were subsequently grouped according to their icon annotation. 33 out of the 178 entries sampled were items related to Computing 29 , a subject area whose lexical innovation must be considered originally a product of a specific jargon, which in turn is characteristic of a certain speech community (in Lipka’s sense). In conclusion, the field of Computing alone takes up 18.5% of the sampled items; thus the coining of a substantial percentage of the total number of entries in ODNW must be assumed to have involved a certain specialist speech community. To sum up, neologisms, unlike nonce formations, are often considered to be part of the norm 30 (Hohenhaus 1998, 2005; Bauer 2000, 2001), which is roughly equivalent to being part of the “common vocabulary” (Barnhart 1973: 13) unless we specifically narrow down the concept of norm to Lipka’s (2002, 2004) norm of a specific speech community. This is in contrast with the common ambiguity of ‘the norm’ as referring to either an unspecified speech community or, perhaps more often than not, the societal norm of the entire language in its idealised
29 These are: antialiasing, Arpanet, BBS, -bot (suffix), BTW, bulletin board, business process re-engineering, data warehouse, electronic, fax-on-demand, High Sierra, jaggies, MailMerge, Mb, Michelangelo, MiniDisc, mouse (v), netizen, newsgroup, open system, peer-to-peer, personal digital assistant, Photo CD, private key, re-engineering, software agent, superscalar, techno- (combining form), telnet, tile, unroutable, -ware (combining form), warez.
30 Depending on their conception of terminology authors may differ on this point. For example, Baayen et al. (1991: 812) write: “[A] formation that has been recently created and that has not found its way into the established vocabulary of the speech community is a ‘new’ type. We will refer to such new types as neologisms. […] [T]hese hapaxes will also be new to the speech community, that is, they will be neologisms”. This divergence in the definition of neologism is probably due to the authors’ interpretation of ‘speech community’ as the community at large.
conception. 31 Additionally, it has been noted that neologisms are in practical terms difficult to distinguish from nonce words on the grounds of their belonging to the norm. Consequently, we have stressed the necessity of gradation, in functional terms – in deciding what is (or is likely to be) and what is not part of the norm – but also in theoretical terms, in delimiting neology and nonce formation relative to the concept of norm.
3.2.2. Factors conditioning the chances of institutionalization
Having discussed the basic mechanisms of institutionalization which underpin the transition of nonce formations into neologisms, in this section we will identify the factors conditioning the chances of this process occurring. Lehrer (1996b) calculated the percentage of neologisms that were still current five years after the publication of the dictionary of neologisms in which they were originally listed (Algeo 1991). The study reports that of 239 randomly selected entries, a third were still used in 1996. Although this figure is by no means conclusive of the exact number of neologisms that survive beyond the period of their initial popularity, it is certainly true that not all neologisms stay in the language for good. And such is also the case with the survival of nonce formations – there are factors that decrease as well as boost the likelihood of a nonce formation turning into a neologism. Adams (1973) lists the following: Firstly, speakers’ preferences of an aesthetic nature may play a role in the acceptance of new lexical items into the language. “Innovations in vocabulary are capable of arousing quite strong feelings in people who may otherwise not be in the habit of thinking very much about language” (Adams 1973: 1-2).
Following Quirk (1968), Adams (1973: 2-3) quotes letters sent to newspaper editors to protest against “horrible jargon” (such as break-down of figures), “vile words” (transportation), “tasteless innovations” (handbook) and other innovative “atrocities” (lay-by). It is only natural for new words to be met with suspicion, reluctance or even disgust when they are on the verge of being accepted into the language, “[b]ut to protest against lexical innovations is very often to appear ridiculous to later generations: who today would wince at aviation (now that we are thoroughly used to it), about which The Daily Chronicle commented in 1909: ‘You could hardly think of a worse word’” (Adams 1973: 2). Speakers’ negative response is especially common when the word in question is in some way unusual, this being the next factor governing lexical acceptance. Structural oddity may foster negative attitudes towards words – triphibian was once objected to due to the haphazard reanalysis of amphibian that split the 31
E.g. Bauer (2001: 40): A new word “may become part of the norm of the language and turn out to have been a neologism, or it may not, and remain a nonce word” [my emphasis – WG].
Neologisms and nonce words
monomorphemic amphi- into two elements. The formation persisted nonetheless, presumably fostered into use through the social status of its creator – Winston Churchill. The status of the coiner is yet another factor that conditions the (un)favourable reception of a word among the public (Adams 1973, Bauer 1983). ‘Hybrid’ words combining Greek and Latin elements, or classical and native elements, may also be frowned upon, at least by those aware of such etymological differences (Adams 1973: 2). However, strong convictions of misconceived etymology may cause prejudice where none would be expected – the author mentions the case of patrial (< L. patrialis, legal term attested as early as the 17th century, meaning ‘of or belonging to one’s country’), denounced by a professor of law as ‘barbarous’ immediately upon its re-introduction in the 1970s. The same criterion of structural acceptability is discussed by Fischer (1998: 180), who argues that sequences of phonemes that are more typical of English than other sequences will support the institutionalization of the word in question. This is exemplified with a set of relatively recent combining forms techno-, cyber-, info-, and docu-, of which the last one is the least typical of English word-formation in general. The final -u does not correspond to the common pattern of initial combining forms, as these typically end in -o; the same epenthetic vowel is also added if the first component is a free morpheme ending in a consonant, e.g. filmography, kissogram. On the other hand, structural simplicity (e.g. the shortness and unmarked phonotactics of a word, such as that illustrated above with the four combining forms) and transparency are said to positively influence institutionalization. However, the lack of morphological transparency and complex structure of a word may be compensated for by phonological, graphic, semantic and stylistic motivation, to use Fischer’s (1998) nomenclature. 
A phonologically motivated item utilizes a certain preferable sequence of sound, e.g. INSET (in-service-training) is preferred to *IST, *INST or *ISET. Graphic motivation concerns spelling conventions that may facilitate comprehension (capital letters, hyphens, mixing capital and lower case letters). Semantic motivation may be observed in words that share a common formal element (cf. info, infotainment, infomania below). Another kind of semantic motivation is displayed by reverse acronyms 32 , i.e. those acronyms whose form is (intentionally) based on another free-occurring item in order that language users can form associations between the two, thus facilitating word retrieval. Examples include HELP for Haulage Emergency Link Protective, MAD for Microwave Acoustic Delay and ACHE for Analogue Computer of Health Expenditure. Stylistic motivation involves the use of sound patterning devices such as alliteration 32
Reverse acronym is a term used by Stockwell and Minkova (2001: 9), also cited in Szymanek (2005: 435).
(back-to-basics), assonance (nine-to-five) and consonance (born-again) (Fischer 1998: 13-14). Fischer (1998) stresses the need to study new coinages against the sociohistorical background of the events that brought them to life and govern their distribution. Thus, it is conceivable, and indeed frequent, that new words may come and go along with such extra-linguistic context and thus leave behind a trace of transient currency or vogue that will be connected strictly to a certain period of time. The author illustrates this with supergun, whose high frequency in The Guardian across several months in 1991 was due to the extensive press coverage of the military actions of the Gulf war. In order to set apart such periodic popularity from genuine integration into the lexical core of the language, Fischer distinguishes between topicality and institutionalization respectively. The former is to be understood as the use of a word “in connection with current affairs for a short period of time” (Fischer 1998: 16). Implied in this discussion is the next factor facilitating institutionalization – that of prolonged exposure. This, in turn, is normally triggered by topicality in that when a new word is continually ‘in the news’ as a result of current affairs, it is more likely to establish itself in the language. It may be argued, then, that topicality constitutes a crucial determinant feeding institutionalization. 33 With regard to the social background of lexical innovation mentioned above, let us offer an example of coinage whose institutionalization progressed rapidly due to the introduction of a novel technological feature. According to Fischer’s (1998: 175) corpus-based study, the ‘lexical phrase’ (Fischer’s terminology) pay-per-view exhibited frequency values below 28 per year from 1990 to 1995. Subsequently, due to the introduction of pay-per-view television channels in the UK, the frequency values reached 183 within the span of a year. 
Such a dramatic increase in frequency numbers is indicative of a period of topicality, thus gradually reinforcing institutionalization. Another factor discussed by Fischer is the competition between two rival expressions impeding the spread of both. The table below is an illustration of such synonymous competitors:
33
Topicality is a notion associated with frequency of occurrence and used in reference to an item being (temporarily) ‘in the news’ and thus receiving a period of exposure. Similarly, although couched within a different context of linguistic enquiry, word token counts are also associated with lexical frequency and can be indicative, among other things, of increased/decreased exposure (see Chapter 4).
(4)
less frequent                                      more frequent
GUI                                                graphical user interface
faction, drama-doc, docudrama, documentary drama   drama documentary
right-to-life                                      pro-life
mortgage-to-rent                                   mortgage rescue
pay TV, PPV                                        pay-per-view
VR, virtual                                        virtual reality
infobahn                                           information superhighway
Fischer (1998: 178)
The phenomenon is akin to blocking except for the fact that here the dominance of one form over the other may fluctuate as both are very recent (for example, Fischer notes that pay TV is currently catching on and may soon outpace pay-per-view). The author argues that in such competitive pairs the selection of the preferred item is, by and large, determined by the following factors: length, pronounceability, structural complexity, associations (homonyms, sound similarity), authority (of the coiner) and fashion. Establishing which one of these factors takes priority is nevertheless an unrewarding task, and it is not resolved by Fischer. By way of illustration, graphical user interface is preferred to GUI on grounds of its transparency, even though it is lengthy. Yet DAT is favoured over digital audio tape, which shows the converse pattern. Similarly, information superhighway, despite its length, has been successfully propagated into use with the help of the authority of the Clinton administration (Fischer 1998: 178). Presumably, as repeatedly mentioned by the author regarding institutionalization in general, also in this case all the conditioning factors need to be considered together and balanced in order to account for speakers’ preferences.
Furthermore, institutionalization is positively influenced if more new words are being coined on the same pattern as the item in question. Fischer (1998: 140) illustrates this with the series of coinages based on the pay-per-view pattern: pay-per-book (on-line publishing), pay-per-word ads, pay-per-use basis, pay-per-cable (for a telephone service) and pay-per-song. In a similar fashion, the recognizability, and hence also institutionalization, of novel combining forms is enhanced by the spread of pre-existing words whose structure incorporates the same form (Fischer 1998). 
For example, information and the clipped info have motivated infotainment and these in turn supported infomercial, infomania, infopreneur, infofile, etc. The same observation has led linguists to note word-formational patterns based on analogy, with this term used as a handy label for the origin of such coinages (cf. Van Marle 1990). For example, Szymanek (2005) cites the following as examples of analogical formations: earwitness (from eyewitness), chaindrink (from chainsmoke) and whitelist (from blacklist).
There is another factor that may be significant. A new coinage may be of limited usability to the speakers (cf. apple-juice seat cited above) and hence be redundant in the lexicon. Put another way, there may be no genuine need for many already attested new coinages. Consider a few examples that stand little chance of being institutionalized on account of their limited usability (quoted after WordSpy.com):
(5)
gurgitator n. a person who competes in eating contests
Kodak courage n. the greater-than-usual level of courage exhibited by people who are being photographed or filmed
geocaching n. a type of scavenger hunt in which participants are given the geographical coordinates of a cache of items and they use the Global Positioning System to locate the cache
The converse is a situation where some new coinages seem to have no choice but to stay in the language. Cases in point are the newly-coined names of those new inventions, discoveries, etc. that are certain to affect society at large and hence establish themselves in the language (e.g. laptop, DVD, etc.). In conclusion, a variety of contributing factors need to be weighed up in order to estimate the probability that a given item will undergo institutionalization. These factors range from purely linguistic ones (transparency, rival formations, productivity, motivation) through sociolinguistic ones (topicality, exposure, distribution and topical range) to some less determinate ones such as speakers’ preferences based on aesthetic considerations and fashions, or the status and authority of the coiner.
3.2.3. Indicators of advanced/complete institutionalization
Fischer (1998) also offers an insightful survey of possible indicators of when a word has undergone advanced/complete institutionalization. One of them is the lack of clues to meaning. It is typical of written media to supply hints and clues as to the meaning of any new word that the reader may find difficult to interpret. Fischer (1998) argues that this is achieved by providing paraphrases, synonyms and near-synonyms, antonyms, hyponyms or blatant definitions of the word in question. The context obviously provides important clues as well, as do collocates of the word and area-specific vocabulary. Additionally, Bauer (2000: 836) briefly mentions the tendency for new coinages to “occur in close textual proximity either to their base words or to another derivative from the same base”.
This clearly can help the reader, providing him with an additional ‘foothold’ for interpretation. Let us illustrate some of these techniques for supplying clues with authentic data 34 :
(6)
(i) The title and context provide the base forms and other derivatives (underlined) related to blogebrity:
His Los Angeles-based site, PerezHilton.com, has shaken and stirred the gossip crowd since its debut last year. Hilton, 28, posts tabloid photos of celebs and adds cheeky captions and rudimentary doodles. ... It’s been quite a turnaround for Hilton, who said that before becoming a blogebrity he had been in a deep depression and last year filed for bankruptcy and was fired from a job at a celebrity weekly magazine. (Erin Carlson, “Love him or loathe him, celebrity blogger Perez Hilton says he’ll always be an outsider”, The Associated Press, November 6, 2006)
(ii) Outright definition and detailed description of wikigroaning:
There’s a new sport on the Internet: competing to come up with the best examples of how Wikipedia, the Web’s home-grown reference source, is skewed towards pop-culture topics. For instance, the West Wing of the White House merits a 1,100-word entry on Wikipedia, while “The West Wing,” the Aaron Sorkin TV drama, has a 6,800-word write-up. This game already has a name: “Wikigroaning”. (Jamin Brophy-Warren, “Oh, That John Locke”, The Wall Street Journal, 16 June 2007)
(iii) Area-specific vocabulary, the base forms (underlined) and a paraphrase of gorno:
And it made me ponder: Has extreme horror gotten, well, too extreme? Will the prevalence and popularity of torture porn – a.k.a. gorno – warp our views on mayhem and murder, inure us, seep into our consciousness in creepy, vestigial ways? Is this stuff – the S & M gear, the leather and vinyl butcher couture, the power tools – being glamorized, fetishized? (Heck, yes!) (Steven Rea, “When gory movies are torture to watch”, Philadelphia Inquirer, June 12, 2007)
(iv) Synonyms of pickle-stabbers and area-specific vocabulary (underlined):
The high-heeled, pointy-toed boot was the trendsetting footwear on the designer runways this season, but that doesn’t mean those skittish about pickle-stabbers need worry. (Deborah Fulsang, “Boot it up”, The Globe and Mail, September 15, 2001)
34 All examples quoted after WordSpy.com.
Regardless of the exact status of the highlighted words, the point here is that they all may indeed cause difficulty to the reader and thus need special treatment on the part of the writer. Therefore, in each case the obscure words are unambiguously ‘unpacked’ one way or another. This is the typical modus operandi when dealing with unfamiliar new words. However, it may be argued that those words which have been around for some time and may be said to have been adopted into the language no longer need to be explained for the reader. Fischer (1998) maintains that this is indeed the case. Complete or advanced institutionalization may be reflected in the absence of clues to the meaning of new words. Another indicator of advanced/complete institutionalization is a decrease in frequency after a certain frequency peak value has been reached (Fischer 1998). This curving pattern of the process may be represented graphically as follows (Fischer 1998: 174): (7)
Fischer (1998: 174) elaborates: At the beginning, the item is hardly known among the speakers of a language community. It may only be used within certain fields or varieties. Due to topicality, it is suddenly or steadily increasingly used. A gap shows between text frequency and total frequency. The concept the item refers to becomes the theme
of articles. The item may also be used in headlines. The process of institutionalization is initiated and the degree of institutionalization of the item gradually increases. Then a peak or a kind of saturation point occurs, where the topicality reaches its climax. After that, a slow decrease follows. Text and total frequency converge eventually. The level of topicality may also remain the same for a while, but after a time it will also show a decline. The process of institutionalization comes to an end.
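The rise, peak and decline described in this passage can be checked mechanically on a series of yearly frequency counts. The following Python fragment is only a minimal sketch under stated assumptions, not Fischer's own procedure: the function name topicality_profile is hypothetical, and the yearly counts are invented for illustration rather than taken from any corpus.

```python
def topicality_profile(yearly_counts):
    """Locate the frequency peak in a {year: count} mapping and report
    whether the counts rise up to the peak and fall after it."""
    years = sorted(yearly_counts)
    # Year with the highest count (the first such year if there is a tie).
    peak_year = max(years, key=lambda y: yearly_counts[y])
    before = [yearly_counts[y] for y in years if y <= peak_year]
    after = [yearly_counts[y] for y in years if y >= peak_year]
    # Non-decreasing up to the peak, non-increasing from the peak onwards.
    rises = all(a <= b for a, b in zip(before, before[1:]))
    falls = all(a >= b for a, b in zip(after, after[1:]))
    return {
        "peak_year": peak_year,
        "rises_to_peak": rises,
        "declines_after_peak": falls,
    }


# Hypothetical yearly counts, invented for illustration only.
counts = {1990: 3, 1991: 12, 1992: 47, 1993: 95, 1994: 60, 1995: 41}
profile = topicality_profile(counts)
```

A series that rises to a single peak and then declines matches the curve outlined in the quotation; whether the post-peak level is high enough to count as institutionalization is a separate question that this sketch does not address.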
Despite the decreased frequency, an institutionalized item is supposed to maintain a certain level of frequency over an extended period of time. 35 Such permanence or acceptance of a novel word in the lexicon is best considered only relative as its actual presence in use will be largely fuelled by the fluctuation of topicality. Fischer (1998) is also concerned with the range of texts in which a word appears. 36 The argument is that the distribution of a word in a variety of texts and topics is indicative of its spread and permanent acceptance. Indeed, the two criteria cited above – that is, permanent frequency over an extended period of time and a multi-topical distribution in a variety of texts – are considered by the author prerequisites for the status of neologism (Fischer 1998: 4).
3.3. Degrees of lexical currency
It has been mentioned above that the study of nonce formation and neology needs to take into account a considerable degree of inter-categorial overlap and fuzziness. So far we have noted the following types of indeterminacy:
1) in the boundary between productive and creative morphology (cf. 2.2)
2) in the boundary between intentional and unintentional coining (2.3)
3) in socio-linguistic terms, in the ambiguity of the term speech community (3.2.1)
4) between various degrees of institutionalization (3.2)
Points (1) and (2) were reflected in our definition of nonce formations. Point (3) entails practical indeterminacy in systematically setting apart some nonce words and neologisms across speech communities in that a lexical item institutionalized in one speech community may be perceived as a nonce word by outsiders.
35
According to Fischer (1998: 172), in a corpus-based study a word can be considered institutionalized if it occurs at least 40 times in a corpus of about 25,000,000 words within at least a few years. 36 The study investigates creative neologisms (blends, clippings, acronyms, lexical phrases and combining forms) in the written form of the journalistic register exclusively.
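The threshold cited from Fischer (1998: 172) in the footnote above amounts to 40 / 25 = 1.6 occurrences per million words. A minimal Python sketch of such a check follows. It rests on assumptions that are not Fischer's: the function names are hypothetical, the threshold is scaled proportionally to corpora of other sizes, and “at least a few years” is read, arbitrarily, as attestation in at least three distinct years.

```python
def per_million(occurrences, corpus_size):
    """Normalize a raw count to occurrences per million words."""
    return occurrences * 1_000_000 / corpus_size


def meets_threshold(occurrences_by_year, corpus_size,
                    min_total=40, base_size=25_000_000, min_years=3):
    """Check a {year: count} mapping against the 40-in-25-million
    threshold, scaled to corpus_size (an assumption made here), and
    against an assumed minimum number of years with attestations."""
    total = sum(occurrences_by_year.values())
    years_attested = sum(1 for n in occurrences_by_year.values() if n > 0)
    required_rate = per_million(min_total, base_size)  # 1.6 per million
    return (per_million(total, corpus_size) >= required_rate
            and years_attested >= min_years)


# Hypothetical counts, invented for illustration only.
spread_out = meets_threshold({1991: 10, 1992: 20, 1993: 15}, 25_000_000)
one_year_only = meets_threshold({1991: 45}, 25_000_000)
```

On these invented figures the first item satisfies both conditions, while the second reaches the required total but is attested in a single year only, which is the kind of short-lived topicality that the criterion is meant to filter out.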
Most importantly for this section, point (4) entails various degrees of stabilization of neologisms. At any given point in the development of a language it is possible to point to new lexical items that are more institutionalized than others. In such cases one may speak of stable and unstable neologisms. 37 These of course are notational terms representing a cline rather than a distinction based on dichotomy. Additionally, note that the two terms do not represent two extremes on the continuum of institutionalization but two phases in the life of a word when it is no longer a nonce word and not yet a fully institutionalised word either. In other words, stable and unstable neologisms are cover terms subsuming the qualities of being institutionalised to a small or large degree or, in non-technical terms, having a small or large degree of currency as well as being more/less likely to be retained in future use. A similar distinction is made by Smółkowa (2001) in her investigation of neologisms in Polish. 38 For instance, spam (‘junk e-mail messages’) has reached a significant audience and may be deemed to be a stable neologism. Of course, the term is familiar mainly to computer-literate speakers and may not have gained full mainstream acceptance, but this is exactly what we have come to expect from community-dependent institutionalization. On the other hand, granny dumping (‘the abandonment of elderly relatives by their carers’, listed in ODNW) has not been in as wide a circulation and may therefore be called unstable, also in the sense that it may eventually disappear altogether. Some such neologisms may even be listed in reference works, as granny dumping is, and thus seem to have achieved considerable recognition and permanence. However, it is often the case that dictionaries of neologisms list items that merely “enjoyed a vogue in the given period” (ODNW: iii) rather than permanently came into widespread use. 
In such cases considerations of stabilization need to be taken into account and the two labels we put forward are convenient notational indicators of varied stabilization.
3.4. The ‘nonce word – neologism – institutionalized word’ cline
Just as nonce words stand midway between possible words and actual words, neologisms may be argued to be actual words that are an intermediate stage between nonce words and institutionalized words. As noted above, due to the gradualness and community-dependence of institutionalization as well as the varied degrees of neologistic currency or stabilization, this continuum-based development is also characterised by a degree of clinal merging of one category
37
The Wiktionary-based terms protologism, diffused neologism and stable neologism are used to make similar distinctions in currency. 38 Smółkowa employs the two terms ustabilizowany (‘stable, stabilized’) and nieustabilizowany (‘unstable, unstabilized’).
into another. To illustrate the point let us bring together and compare the now familiar words below: (8) gurgitator > blogebrity > granny dumping > spam > e-mail
With varied degrees of categorial membership, the words above may be said to be unstable and stable neologisms at the current stage of development of English. The five items vary in their level of usability (i.e. practical usefulness), institutionalization and frequency. Still other criteria conditioning their lexical currency are such notions as formal and semantic transparency (gurgitator vs. blogebrity, spam vs. e-mail), stylistic markedness (granny dumping vs. e-mail) and topicality (granny dumping as a phenomenon resurfaces in the press periodically as it occurs). In the next section we will look at neologisms from the perspective of a lexicographer and see whether and how the prevailing gradation of their stabilization is reflected in the making of dictionaries. 3.5. The lexicographical approach: criteria for entry The decision as to which neologisms get recorded in reference works is always a matter of the lexicographer’s conscious choice. One such criterion for entry at its simplest is perhaps best observed in Algeo (1991) as well as in the Among New Words column of American Speech, where a word is said to be ‘new’ if it “does not appear in general dictionaries at the time it is included in the column”. This however poses the problem of including attested but insignificant words. Accordingly, most new word dictionaries consistently refer to their choice of entries as incorporating those high-profile words that “came into popular use or enjoyed a vogue in the given period” (ODNW: iii). It is thus down to essentially the same set of criteria that dictionary compilers go by in their decision on the inclusion of a novel lexical item. However, what is considered ‘popular use’ or ‘common vocabulary’ may often be a highly relative notion. Let us review a few examples. Barnhart et al. (1973) has as its aim to include “new terms and meanings which have become a part of the English common or working vocabulary 39 between 1963 and 1972”. 
Despite this assertion, some of its entries appear, by any 39
Common vocabulary, after OED, is understood in Barnhart et al. (1973: 13) as the “nucleus or central mass of many thousand words whose ‘Anglicity’ is unquestioned[.]” It is “the greater part of the vocabulary of each man, which will be immensely more than the whole vocabulary of any one […], with a well defined centre but no discernable circumference”.
standards, far too exotic, technical or specialised in any other way to be considered ‘common vocabulary’. Compare the entries for dukawallah (‘a shopkeeper in Kenya’), Duchenne dystrophy (‘a form of muscular dystrophy’), druzhinnik (‘a civilian auxiliary policeman in the Soviet Union’) and DSRV (‘Deep Submerge Rescue Vehicle’), all extracted from one and the same double-page. One of Green’s (1991) criteria for inclusion is that “the word or usage has entered the language in the last thirty years” and that it “has entered the mainstream”. 40 Additionally, the author claims to have made an effort to exclude the “linguistic ephemera” of nonce words and one-off coinages and to have tried to concentrate on the “prime candidates, those that have achieved, at least for a reasonable period, the patina of regular use”. Unlike Barnhart et al. (1973), Green’s choice of entries does seem to adhere more to everyday English but whether all the entries have remained part of the language is doubtful. By contrast, the aim of Ayto (1990: introduction) is to chart the latest course of “the frontiers of language [that] advance more precipitously in vocabulary than in any other area” (my emphasis – WG). Of even more interest to us is the declaration that “some of the words it records for the first time will no doubt turn out to have been ephemeral, fashions of the moment, yet equally certainly many will have a long and distinguished career”. Thus the author overtly notes the impossibility of predicting the future of his entries and makes no pretence of listing vocabulary that is or will be in its entirety at the core of the language. Algeo (1993) reports that of 3,565 words recorded as newly entering the language between 1944 and 1976, 58% were not listed in contemporary dictionaries (also cited in Isaacson 1997). As Algeo says: “successful coinages are the exception; unsuccessful ones the rule”. 
It seems inevitable then that dictionaries of current neologisms, regardless of their editors’ painstaking efforts, will indiscriminately list not only stable and unstable neologisms but also transient coinages and ‘flavours of the month’ whose future fate is largely unpredictable. It is the very nature of neology to be time-specific and subject to diachronic change. It is necessary therefore that we regard these dictionaries as records of lexical innovation as it occurred and gained currency in a given period of time only, not a collection of new lexical additions that have entered the language permanently and irreversibly.
4. Conclusion: nonce words and neologisms
The following words are cited in Crystal (2006) as products of a word coinage competition:
40
Admittedly, the author explicitly notes his awareness of the fact that “one person’s mainstream is another’s marginalia” but nevertheless intends to “concentrate on the essentials” (Green 1991: vii).
(9) blinksync n. the guarantee that, in any group photo, there will always be at least one person whose eyes are closed. circumtreeviation n. the tendency of a dog on a leash to want to walk past poles and trees on the opposite side to its owner. hicgap n. the time that elapses between when hiccups go away and when you suddenly realize it’s happened. kellogulation n. what happens to your breakfast cereal when you are called away by a fifteen-minute phone call just after you have poured milk on it. potspot n. that part of the toilet seat which causes the phone to ring the moment you sit on it.
These certainly must be deemed deliberate and ephemeral as well as stylistically (circumtreeviation and kellogulation also structurally) marked for a humorous effect. They represent one extreme of nonce formation. At the other extreme are nonce words based on productive recognized patterns such as inconfigurable and culturalized, which barely differ from actual and established words, the only difference being in the level of institutionalization (cf. Hohenhaus 2005, Bauer 1983, 2000, 2001). In theory, any item from both extremes has the potential for legitimate membership in the lexicon, although we have identified certain qualities of new words that predispose them to the status of an institutionalized word (3.2.2). However, even overtly deviant items such as circumtreeviation cannot in principle be excluded with absolute certainty. Nonce formation then represents a cline of formations which could be further subcategorised into respective subgroupings based on various criteria (the degree of structural or stylistic markedness, probability of institutionalization, usefulness/usability, etc.). They are nevertheless all nonce formations in the broad sense advocated in this work. 41 For their nonceness status it does not matter how many times each of them has been actually produced by the same or different speakers independently. What does matter, and potentially introduce a terminological distinction, is when the item spreads from speaker to speaker so that they start to use it as an item they now ‘know’, that is, have heard before and picked up in their own use. This means that as of that moment the item is retrieved from their lexicon in language production. This is in contrast to coining a word afresh to fill a lexical need. Let us suppose that circumtreeviation is coined by a speaker to refer to his or her experiences with their pet dog. Upon production the item is a nonce word. It 41
Similarly, Bauer’s (2000) examples of nonce words are regularly derived on recognisable patterns, cf. postponeable, electronification, republicanisation, francophonise, attentionist, desectetized.
is still a nonce word when it is reproduced by the speaker many times. 42 In global terms, but perhaps not in terms of a speech community-dependent lexical norm, the item is still a nonce word when it establishes itself in the household of that speaker. In an attempt to investigate a lexical norm that is as general in a given language as possible, we still assume the item to be a nonce word. The further the word spreads and the more extensive the speech communities that embrace it are, the greater its chances of becoming an unstable neologism at first and then part of the common vocabulary, that is, a stable neologism. The spread of the word is, again, conditioned by the factors mentioned in 3.2.2 and, in the specific case of circumtreeviation, must be considered highly unlikely. What happens next is a question of whether the word will stay in the language permanently or whether it will fall out of use. This, however, is not a peculiarity of neologisms exclusively, as well-established words too may become obsolete as a fact of language (see Algeo 1993, Crystal 2006: Chapter IV).
42
We choose to reject the frame of idiolect as an instantiation of a lexical norm relevant to the study of neologisms.
Chapter 2 Language corpora and corpus linguistics
1. What is a corpus? The word corpus can be generally understood as a collection of texts in two different although partially overlapping senses. Aston and Burnard (1998: 4-5) discuss these senses by citing two Oxford English Dictionary definitions. According to one of the meanings, a corpus is to be understood as “a body or complete collection of writings or the like; the whole body of literature on any subject”. On this interpretation, one may speak of the “Shakespearean corpus” referring collectively to all the works by Shakespeare. Literary anthologies are also such collections with a non-linguistic purpose of use. In the other meaning, the one that is of interest to this study and corpus linguistics in general, a corpus is “the body of written or spoken material upon which a linguistic analysis is based”. Aston and Burnard (1998: 4-5) acknowledge the potential degree of overlap of the two senses, as when a corpus studied by the linguist consists of texts by the same author for the specific purpose of examining some aspect of the author’s language. They nonetheless stress the essential difference in corpus composition: the linguist’s corpus “is an object designed for the purpose of linguistic analysis, rather than an object defined by accidents of authorship or history” (Aston and Burnard 1998: 4). The exact criteria according to which texts are included in a corpus will vary along with the intended purpose of the entire collection. Similarly, depending on the linguist’s area of interest, he will turn to different corpora consisting of appropriate text samples. 1 By way of illustration, historical corpora may be designed to enable the study of language change (e.g. A Representative Corpus of Historical English Registers (ARCHER), Helsinki Corpus). 
1 For example, in relation to the variety of intended uses, Hunston (2002: 14-16) discusses the applications of the following corpus types: general corpora, specialised corpora, comparable corpora, parallel corpora, learner corpora, pedagogic corpora, historical corpora and monitor corpora.

Others represent varieties based on geographical (Wellington Corpus of Written/Spoken New Zealand English, the International Corpus of English (ICE)) as well as sociological (gender, social class, age) and register-related differences (written vs. spoken, editorials vs. reportage, conversational spoken vs. scripted spoken). There exist highly specialized corpora intended to allow empirical study of learner English (International Corpus of Learner English (ICLE)), children’s spoken language (Polytechnic of Wales (POW)), London teenagers’ speech (Bergen Corpus of London Teenage English) and spoken British English (London-Lund). On the other hand, some of the most popular corpora yet (such as the Brown Corpus and the British National Corpus) were meant to cover the full range of language use in their respective national varieties, i.e. American and British. Even though they each concern themselves with one geographical variety of English, and additionally the Brown Corpus represents written English only, the two are said to be “general corpora” (Hunston 2002), “general purpose corpora” (Leech 1991, Tognini-Bonelli 2001: 9), “multi-purpose corpora” (Meyer 2004: 36) or “balanced corpora” (Meyer 2004: xii, Aston and Burnard 1998: 5) in the sense that they are intended for a variety of investigations, ranging from lexical or syntactic studies, through studies of register variation, to dialectal or national differences.

Still other corpora are custom-built to fit the specific research aims of particular studies. For example, Aston (1995) is a study of the phrase thank you as used in service encounters between customers and shop assistants: for the purpose of his research, the author compiled his own corpus of appropriate conversational exchanges recorded in this particular situational context. Therefore, bearing in mind the intended function of each corpus, the second definition of corpus cited above after the OED still needs revising. Tognini-Bonelli (2001: 2) takes up this point as well as a few others that usually surface in discussions of corpus composition and sampling. She argues:

A corpus can be defined as a collection of texts assumed to be representative of a given language put together so it can be used for linguistic analysis.
Usually the assumption is that the language stored in a corpus is naturally-occurring, that is, it is gathered according to explicit design criteria, with a specific purpose in mind, and with a claim to represent larger chunks of language selected according to a specific typology.
Added to the OED definition here is the notion of explicit design criteria serving the specialized functional purpose of any particular corpus. We have already addressed this point with relevant examples. Further, as underlined in many definitions of corpus by many researchers 2, the contents of corpora should ideally be maximally representative of the language in question, i.e. not necessarily representative of a language in its entirety (as would be expected of balanced corpora), but perhaps of the target domain of a language that a corpus is designed to represent (e.g. spoken language, broadsheet newspapers, etc.). In the corpus-based approach to linguistics this is referred to as the representativeness of corpora. Biber et al. (1998: 246) stress its significance in both corpus use and corpus design insofar as the representativeness of a corpus determines the exact shape of research questions to be investigated and the reliability of research findings. For example, a corpus consisting of one type of conversation – such as conversations between teenagers – could not be reliably used in a study of conversational language in general (see below for further discussion of representativeness).

2 For example, Francis (1992: 17): “A corpus is a collection of texts assumed to be representative of a given language, dialect or other subset of language to be used for linguistic analysis”, Sinclair (1991: 171): “A corpus is a collection of naturally-occurring text, chosen to characterize a state or variety of a language”, Baker, Hardie and McEnery (2006: 48): “A corpus is different from an archive in that often (but not always) the texts have been selected so that they can be said to be representative of a particular variety or genre, therefore acting as standard reference”.

Another crucial characteristic, and indeed advantage, of corpora, as mentioned by Tognini-Bonelli (2002: 2) and as is usually stressed in the literature, is the use of authentic, naturally-occurring language (discussed further below). Still other authorities in corpus linguistics point to other typical features of corpora that may be considered definitional. For example, Aarts (1991: 45), also quoted by Tognini-Bonelli (2002: 53), states that “a corpus is understood to be a collection of samples of running text [my emphasis – WG]”. Most corpora do in fact lay stress on samples of running text as opposed to, for example, random collections of sentence-level excerpts or word-lists. This is vital to those studies investigating the structure of larger chunks of language. These may include studies in discourse analysis, text cohesion, and different types of reference (e.g. anaphoric and exophoric) over the course of a text.
The requirement that texts be complete or at least constitute coherent units with their own beginnings, middles and ends will also be of significance to such studies. 3 Following this criterion and according to a definition thus delimited, Tognini-Bonelli (2002: 54) notes that collections of non-running text such as the citation sections of the OED or collections of proverbs cannot be regarded as corpora. Opposed to this stance is Bauer (2002), who distinguishes between corpora consisting of textual material and those consisting of word-lists. 4 He is intentionally inclusive in his definition of a corpus (he prefers to speak of “a body of language data” rather than “a body of text”) so as to include “word lists and the like” (Bauer 2002: 97), i.e. dictionaries, thesauruses, etc. Bauer (2002: 99-100) argues:

Although most electronic corpora are made up of texts, these word-lists deserve the title of ‘corpora’ (1) functionally, in that they allow comparisons of language types along several different dimensions and (2) formally since they are bodies of data created for one which may nevertheless be exploited for other purposes ‘for linguistic analysis and description’.

3 Unfortunately, large texts, such as books, are rarely included in their entirety for reasons such as space and copyright limitations. See also Meyer (2004: 38-40) for a more detailed discussion.
4 See also Hunston (2002: 32-37) for a discussion of corpus design with respect to the type of data included as well as various uses of these data.
The exact nature of the language data included in a corpus will vary with individual perceptions of how to define a corpus. Cited above is the inclusive stance of Bauer (2002) opposing the viewpoint of Tognini-Bonelli (2002), who seems to support the ‘running text view’. Still other linguists (such as Sinclair 1991 cited above) speak of text without specific reference to the size or textual integrity of the data. Variations in definition such as these are especially noticeable, as Tognini-Bonelli (2002: 53) notes, when linguists dispute the status of unusual collections (corpora?) of language data such as compilations of proverbs. On the other hand, the nature of the data included in a corpus with respect to its physical integrity and size will also depend on how a corpus is investigated and for what purpose (cf. Hunston 2002: 32-37). If the corpus is to be used to examine noun phrases or relative clauses, it may well consist of excerpts of text rather than complete texts. Conversely, the features of discourse analysis will best be considered with complete texts (Meyer 2004: 30). On the other hand, corpora consisting of large and complete samples may well be used by linguists studying individual words in context, e.g. the collocational patterns of a single word, although, for such research, the corpus used might as well consist solely of all occurrences of that particular word and its immediate context. Therefore, in practical and definitional terms, it may be wise to claim that any kind of language data may constitute a corpus. In this respect, the overriding principle is that the aim of analysis largely determines the kind of corpus to be employed so that the type of language to be studied is well represented by a given corpus (see below for discussion of corpus representativeness). 
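The idea of a corpus reduced to all occurrences of a single word together with its immediate context can be made concrete with a small keyword-in-context sketch. The Python fragment below is only an illustration under stated assumptions: the function name kwic, the context width and the crude regular-expression tokenizer are hypothetical choices made here, and the third toy sentence is invented for the example (the first two are the any examples cited later in this chapter).

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, keyword, right context) for each occurrence.

    Tokenization is deliberately naive (lower-cased letters and
    apostrophes only); a real study would use a proper tokenizer.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    hits = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, token, right))
    return hits

# Toy data: the third sentence is invented for illustration.
sample = "I haven't any matches. Did you see any eagles? Any child can do it."

for left, kw, right in kwic(sample, "any", width=2):
    print(f"{left:>20} | {kw} | {right}")
```

Note that this naive version lets the context window run across sentence boundaries; whether that is acceptable depends on the research question, which is the point made above about the aim of analysis determining the kind of corpus employed.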
Today, corpora are invariably accessed and processed with the aid of a computer and so they are now synonymous with electronic corpora incorporating “a body of text made available in computer-readable form” (Meyer 2004: xii). The reason for the computerization of corpora is simple enough: given the enormous amounts of data, which are the core feature and advantage of all corpus work, computers significantly reduce the time and effort required for processing and analysis. More to the point, in pre-electronic corpus-based work many investigations were simply impossible, due either to the insufficient amount of data available or to the lack of means to handle large databases and keep track of the multitude of linguistic features (see Biber et al. 1998: 112). Kennedy (1998: 5) states some of the advantages of computerized corpora:
Corpus linguistics is thus now inextricably linked to the computer, which has introduced incredible speed, total accountability, accurate replicability, statistical reliability and the ability to handle huge amounts of data […] in addition to greatly increased reliability in such basic tasks as searching, counting and sorting linguistic items, computers can show accurately the probability of occurrence of linguistic items in text.
Accordingly, Tognini-Bonelli (2002: 54) refines her definition of a corpus in that it is a computerized collection of authentic texts, amenable to automatic or semiautomatic processing or analysis. The texts are selected according to explicit criteria in order to capture the regularities of a language, a language variety or a sublanguage.
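The basic tasks Kennedy mentions (searching, counting and sorting linguistic items, and estimating their probability of occurrence) can be sketched in a few lines. The following Python fragment is a minimal illustration, not a description of any particular corpus tool: the function name frequency_list, the naive tokenizer and the toy text are all assumptions introduced here, and relative frequency is used as the simplest possible estimate of probability of occurrence.

```python
import re
from collections import Counter

def frequency_list(text):
    """Return (word, count, relative frequency) rows, most frequent first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    # Sort by descending count, then alphabetically for a stable order.
    rows = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, n, n / total) for word, n in rows]

# Invented toy text standing in for a corpus.
toy_corpus = "the cat saw the dog and the dog saw the cat"

for word, n, rel in frequency_list(toy_corpus):
    print(f"{word:<5} {n:>2} {rel:.3f}")
```

Because the count is exhaustive and deterministic, rerunning it on the same text always gives the same list, which is one simple sense of the accountability and replicability stressed in the quotation.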
So far we have noted the following information as to what corpora are in the sense of modern corpus linguistics:
• they are computer-processed collections of texts / bodies of language data (written and/or spoken) used for linguistic analysis
• they can be tailored to suit different uses and scopes of investigation, e.g. general purpose or specialized corpora (e.g. learner corpora)
• they are assumed to be representative of the (variety of) language in question
• they are composed of authentic, naturally-occurring language
• the texts to be included in a corpus are not randomly chosen but are selected according to specific criteria established by the corpus-builders (see below for discussion of sampling)

Below we discuss in more detail the issues of authenticity and representativeness as well as another one relevant to corpus composition – that of sampling (see Tognini-Bonelli 2002: 55-61).

1.1. Corpus issues: authenticity

All the material to be found in corpora consists of genuine language communication, written or spoken, as produced and received by people in natural contexts. The obvious advantage of this fact is that linguists have the opportunity to study natural language of whatever variety they may be interested in. Whether the corpus under study consists of nineteenth century short stories, Bible translations or interviews with psychiatric patients, the linguist expects to be confronted with authentic language as is found in its natural production environment. Thus the
starting point of any corpus-based work in linguistics is the authentic data. Although such a practice may seem the obvious modus operandi, reliance on solid evidence has not always been considered of particular importance (see Historical perspective below). Tognini-Bonelli (2002: 56) claims that the centrality of data has not necessarily been recognized even by those linguists who apparently considered it central to their theory. She cites Stubbs (1993:8-9) in reporting “several linguistic milestones, where in spite of occasional lip service being paid to the reliance on evidence to justify theoretical statements, data is hardly referred to or indeed discussed” (Tognini-Bonelli 2002: 56). For example, Stubbs (1993: 9) has this to say about Quirk et al. (1985), based on corpus data of the Survey of English Usage: “This relation between corpus, example sentences and description is not discussed at all in the introduction to Quirk et al. (1985), and the accountability to data of description and theory is therefore undefined”. Both Tognini-Bonelli (2002) and Stubbs (1993) underline the lack of clear correlation between data, description and theoretical claims in theories unconcerned with data-based evidence. 5 In relation to this point Tognini-Bonelli (2002: 56) goes on to make an important claim: The reason behind this lack of systematic correlation between data, description and theoretical statement in most theories not informed by corpus evidence may well be that the evidence from the corpus – if untampered with and respected in its integrity – is distinctly likely to challenge existing linguistic theories with unprecedented insights into the language, obliging the profession to reconsider every aspect of theory and description. We are not here talking about occasional gaffes or deliberate mistakes, but about the core organization of the language. Some examples of this problem were discussed […] in connection with language teaching. 
If corpus evidence was allowed to be the basis for grammatical statements, many pedagogical grammars would have to revise their prescriptive rules quite drastically.
5 A similar point is made by Joybrato Mukherjee at the Linguist List website in his review of The Cambridge Grammar of the English Language by Huddleston and Pullum (2002): “Although information obtained from corpus-based dictionaries and grammars have been taken into consideration in various regards […], the use of corpus data remains unsystematic because there is no discussion of how the data […] are related to the grammatical description. Additionally, the reader is kept in the dark about which of the example sentences are invented and which ones are authentic”.

The upshot of the above argument is that, occasionally, authoritative claims on the part of linguists – and here pedagogic grammars are specifically mentioned – tend to misrepresent linguistic facts. This can be due to the writer’s more or less conscious prescriptive approach to language description or to the fact that the claims are based on the author’s intuitive assumptions, which may depart from the picture emerging from corpus findings. Another kind of misrepresentation arises when dictionaries, grammars and reference works that are not backed up by corpus findings do not necessarily give wrong information but “are likely to provide only part of the story” (Mahlberg 2005: 15). This problem can be illustrated with one of Tognini-Bonelli’s (2002: 15) examples from language teaching, where explicit instruction on the part of the grammar textbook clashes with what the corpus indicates. When learners are introduced to the usage of any, it is usually through its contrasting pattern with that of some. Typically three sentence structures are said to involve the use of any: 6
• negative sentences, as in I haven’t any matches; or when introduced by hardly, barely, scarcely
• questions, as in Did you see any eagles?
• after if/whether, and in expressions of doubt
Although the generalizations are in themselves valid, corpus-based research shows “a far wider degree of variation with respect to the prescribed structures” (Tognini-Bonelli 2002: 15). In fact, in approximately 50 per cent of its occurrences, any is found in affirmative sentences. In conclusion to this section, the authenticity of data to be found in corpora facilitates greater accuracy in language description.

1.2. Corpus issues: representativeness, sampling, corpus and sample size

Crucial to the corpus-based approach to linguistics is the analyst’s assumption that findings based on a particular database can be generalized as valid to a larger sample of the same type of language, speech community or the language represented in that corpus as a whole. In other words, corpora are expected to yield insights beyond their own contents (see Tognini-Bonelli 2002: 53-54, 57-59; Ooi 1998: 52-55). As corpus work rests on this assumption and on the assumed reliability of the corpus, great care needs to be taken in corpus composition in order to ensure the accuracy of findings.
Corpus representativeness – the extent to which “the findings based on its contents can be generalized to a larger hypothetical corpus” (Leech 1991: 27) – is usually associated with several issues in corpus design which ensure that a corpus is maximally representative. 7 These are sampling issues such as corpus and sample size.

6 Tognini-Bonelli (2002) cites Thomson and Martinet (1984: 24) in her formulation of the usage rules and her use of the example sentences.
7 Representativeness is an idealized notion in that it cannot be evaluated objectively, and absolute representativeness is perhaps impossible to ascertain (cf. Römer 2005: 40, Tognini-Bonelli 2002: 57). Biber et al. (1998: 246) write: “It is important to realize up front that representing a language – or even part of a language – is a problematic task. We do not know the full extent of variation in languages or all the contextual variables that need to be covered in order to capture all variation in texts”. Leech (1991: 27) is even less optimistic as to the extent of genuine representativeness: “At present, the assumption of representativeness must be regarded largely as an act of faith. In the future we may hope that statistical or other models of what makes a corpus representative of a large population of texts will be developed and will be applied on existing corpora”. Nonetheless, efforts are made to offer an objective basis for representativeness and one such attempt is made by Biber (1993).

Once the purpose of a corpus is established, appropriate sampling procedures are applied by corpus-builders to achieve the highest possible representativeness. This concerns the corpus-builder’s decisions as to the choice of texts to be selected, the choice of subject matter, the range of registers and their relative proportion 8, the number and length of text samples, and the demographic characteristics of the language users (cf. Biber et al. 1998: 246-249, Hunston 2002: 25-29). All these factors of composition will have a direct effect on corpus findings and should be made explicit to corpus users so they can relate the information derived from the corpus to the typology of texts included in it (Tognini-Bonelli 2002: 59).

8 Considerations of register, such as the range of registers to be represented, are especially relevant to general-purpose corpora such as the BNC, but even in corpora intended for more specialized analyses finer distinctions of language varieties are largely unavoidable and indeed welcome. For example, the Michigan Corpus of Academic Spoken English (MICASE), although focused exclusively on spoken English in an academic setting, makes fine-grained register distinctions between class lectures, class discussions, student presentations, tutoring sessions, dissertation defenses, etc.

As regards corpus size, it is generally agreed that the larger the corpus the better (Meyer 2004: 50). If too few texts are included, individual samples may exert an undue influence on findings. In contrast, a large number of texts will average out the idiosyncrasies of individual speakers or writers of the same language variety and thus make for a more representative analysis. However, the nature of particular research projects also determines whether or not a very large corpus is a necessity (see Hunston 2002: 25-26 for further discussion). Generally, it is agreed that common linguistic structures, such as frequently occurring grammatical phenomena, can be reliably studied by means of relatively small corpora, whereas for those constructions that occur infrequently, for example in lexical and lexicographical matters, large corpora will be necessary (cf. Dura 2006, Biber et al. 1998: 25f, Meyer 2004: 12-13, 32-34). For example, Biber (1993: 253-254) reports that as few as 59.8 text samples are required to obtain valid results in the study of frequent linguistic features such as nouns (assuming a sample length of 2,000 words). Far more texts, as many as 1,190 according to Biber, are necessary for representative findings of less frequent phenomena such as conditional clauses. Similarly, in lexical studies concerned with the search for hapax legomena, large corpora are a prerequisite. If a hapax legomenon is assumed to be a rare, obsolete or new word, the linguist will benefit from the use of the largest corpus available in, for example, establishing levels of morphological productivity (see Baayen et al. 1991, Baayen et al. 1996). By way of further illustration, Meyer (2004: 12-13) cites two studies of English modals which yielded similar results although they were based on corpora of very different sizes (1.7 and 80 million words respectively). While acknowledging the obvious advantage of the ever-increasing size of corpora, corpus analysts stress that in all kinds of research size cannot compensate for a corpus’s lack of diversity in its range of genres or for structural faults that relate to sampling decisions such as the number and length of text samples (Biber et al. 1998: 249, Meyer 2004: 34). 9

It has been mentioned earlier that text samples included in corpora are rarely complete, unless they themselves are shorter than the agreed-upon sample length (e.g. personal letters). This is due to a variety of reasons that may be connected with copyright and space limitations, insufficient funding, annotation facilities, the process of computerization, etc. (Meyer 2004: Chapter 2). Beyond these limitations, another reason for this practice is that corpus compilers, for reasons of representativeness, prefer to include more texts of shorter length representing more speakers and writers rather than fewer texts of extensive length representing fewer language users.
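As a brief aside on the hapax legomena mentioned above: a hapax legomenon is a word type that occurs exactly once in a corpus, so extracting hapaxes amounts to a frequency count followed by a filter. The Python sketch below illustrates only this step, under stated assumptions: the function name hapax_legomena, the naive tokenizer and the toy text are hypothetical choices made here, and no attempt is made to reproduce the productivity measures of the studies cited.

```python
import re
from collections import Counter

def hapax_legomena(text):
    """Return (sorted hapaxes, number of types, number of tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    # A hapax legomenon is a type with a corpus frequency of exactly one.
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return hapaxes, len(counts), len(tokens)

# Invented toy text standing in for a corpus.
toy_corpus = "a rare word and a new word and a word"

hapaxes, n_types, n_tokens = hapax_legomena(toy_corpus)
print(hapaxes, n_types, n_tokens)
```

In a text this small almost any word can be a hapax by accident, which is one way of seeing why such studies call for the largest corpus available.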
Ideally, at least for some types of corpus research, complete texts would have been preferred to text excerpts for the benefit of studying aspects of these texts in the context of the entire piece. As more storage space and resources become available for the construction of corpora, and as more efficient analytical tools are developed, sample size is continually allowed to grow. Whereas early corpora such as the Brown Corpus and the LOB (Lancaster–Oslo–Bergen) Corpus consisted of 2,000-word samples, and the London-Lund Corpus consisted of 5,000-word samples (Meyer 2004: 38), more recent corpora range much further in sample size, up to 40,000-word samples in the BNC.

Sample size thus has a direct influence on how complete a section of a text each sample constitutes. Other than that, the number of words in each sample is also important for reliable quantitative counts of linguistic features (Biber 1990, also summarized in Meyer 2004: 39). Here, again, the size of the sample will have to increase if the phenomena to be studied are rather infrequent. 10 Apart from sample size, corpus-builders need to consider the number of samples from each text because “the characteristics of a text may vary dramatically internally” (Biber et al. 1998: 249), with different sections of the same text displaying systematic differences in patterns of language use (Biber et al. 1998: 166f illustrate discrepancies of this kind with the example of Introduction, Methods, Results and Discussion sections of research articles).

In conclusion to our discussion of representativeness and corpus composition, several observations have been made:
• Corpora are expected to yield findings that are representative of the language variety under study; this is achieved through prudent decisions of corpus composition, in particular:
• The larger the corpus, the better. However, the sheer size is not as important as the variety of texts and registers. Additionally, the frequency of occurrence of the studied item will determine the size of the corpus to be used.
• For practical reasons, text samples tend to be incomplete excerpts, which may or may not be of relevance depending on the nature of analysis (e.g. discourse analysis requires complete samples).
• Sample size and the number of samples also need to be considered by both corpus-builders and corpus users (again, depending on the nature and frequency of occurrence of the items studied).

9 An exception to this rule is the case of corpora whose representativeness increases through their sheer size rather than diversity of genres included or careful selection of texts. The COBUILD Bank of English is a case in point. This corpus is designed to be continually extended by adding newer texts (from 20 million words in 1987 to 525 million as of 2005), and its main purpose is to function as a monitor corpus, namely to monitor the occurrence of new linguistic features, such as new words or new word meanings (see Ooi 1998: 55-56).
10 Biber (1990) finds that, for frequent items, 1,000-word excerpts are long enough to yield reliable results about a particular genre. Lengthier samples will be necessary for less frequent items.

2. Historical perspective

In hindsight, it is clear that the history and development of corpus linguistics is intrinsically connected with the way in which observable data on the one hand and speculation as to the nature of the abstract language system on the other have alternately been of interest to various theoretical schools in different periods (Tognini-Bonelli 2001: 50-52). Broadly speaking, for the predominantly historical linguistics of the nineteenth century, the study of language was equated with the observation of available data. Starting with Saussure, the focus was then shifted away from the data-based approach and towards the language system, abstract par excellence, and defined as the new legitimate object of linguistic study. At that time in the United States, however, Bloomfieldian views prevailed, producing another theoretical framework in which, once again, linguists “became concerned to account for observable data, and there was little room for abstract speculation” (Tognini-Bonelli 2001: 51). Then again with Chomsky, corpus data were once more discredited and rejected as an inadequate means of description of the language faculty. Paradoxically, it was in this time of severe criticism from the generativists that linguistics saw the birth of the first computer corpus, the Brown Corpus, completed in 1964 by Nelson Francis and Henry Kucera and eventually leading to the modern-day revival of the data-based approach and the emergence of corpus linguistics proper. Below we will consider the early development of the corpus-based approach in some more detail. We will also review some of the criticism levelled against corpora by their opponents as well as counterarguments offered by corpus enthusiasts.

The term corpus linguistics is relatively recent and the development of this approach to language studies is currently at its most fruitful. However, assuming that any collection of text that is used as a basis for linguistic analysis may be referred to as a language corpus, corpus linguistics has a surprisingly long history. 11 McEnery and Wilson (2001) report on early corpus-based research and argue that the methodology of the structuralist tradition, as well as of all pre-Chomskyan linguists in the first half of the twentieth century, is predominantly corpus-based or corpus-like. As early as the late nineteenth century, studies in child language acquisition were conducted based on primitive corpora of children’s utterances recorded in parental diaries. Many of the findings and speculations derived from these early corpora are still used today as sources of normative data in language acquisition (McEnery and Wilson 2001: 3). An exceptionally large corpus by the standards of the nineteenth century was used by Kading (1879), who researched German spelling conventions by examining frequency distributions of letter sequences.
His database consisted of 11 million words. Second language pedagogy in the first half of the twentieth century, as McEnery and Wilson (2001: 3) maintain, also drew heavily on corpus data by establishing word counts and compiling corpus-derived vocabulary lists intended for foreign learners (e.g. Fries and Traver 1940, Bongers 1947, Thorndike 1921, Palmer 1933). Still in language teaching, pre-electronic corpus-based research into the most frequent words and grammar structures was also carried out in order to enhance teaching syllabuses (see Aston and Burnard 1998: 19-20).

11 Hunston (2002: 2) notes: “Linguists have always used the word corpus to describe a collection of naturally occurring examples of language, consisting of anything from a few sentences to a set of written texts or tape recordings, which have been collected for linguistic study”.

In still other areas of pre-generative linguistics, corpus data and information derived from corpora such as word frequency were used in studies in comparative linguistics, semantics and descriptive grammar (McEnery and Wilson 2001: 3). Grammar reference works based on corpora were written as early as the 1950s (Fries 1952), two decades before the acclaimed A Grammar of Contemporary English and three decades before A Comprehensive Grammar of the English Language, both also based on corpora, by Quirk, Greenbaum, Leech and Svartvik (1972 and 1985 respectively). Meyer (2004: xii) also mentions Otto Jespersen’s (1909-1949) multi-volume A Modern English Grammar on Historical Principles, which could only be compiled thanks to a massive collection of literary texts used as a foundation of the author’s discussions. In terms of basing research on a body of text representing authentic, naturally-occurring language, works such as those mentioned above may thus be regarded as pioneering the corpus-based approach as we know it, although today a corpus is invariably associated with a computer-readable body of text.

Before the advent of electronic corpora, however, data collection was at times a haphazard business. Francis (1992) is a study of language corpora before the era of computers, and its author argues that many of the vast data collections for large dictionary and grammar projects of the time were based on citations, rather than texts, contributed by reader editors. This in turn meant that the collections were biased by the perspective of those who collected the material as they are, as Francis (1992: 28) argues, “inevitably skewed in the direction of unusual and interesting constructions that the readers encounter, at the expense of natural use of language”. 12 Attributable to human error, this problem was later eliminated with the introduction of computer corpora. Mahlberg (2005: 14) and Tognini-Bonelli (2001: 52) see the transition from pre-electronic corpus-based work to modern corpus linguistics in a project started in 1959 by Randolph Quirk.
The 30-year enterprise, called Survey of English Usage, was an attempt to describe the grammatical stock of an adult educated speaker of English. This in turn was to be based on a corpus which was “reasonably representative of the repertoire of educated professional men and women in their activities, public and private, at work and at leisure, writing and speaking” (Quirk 1974: 167). The 1 million words of contemporary English collected in this project were later used as the foundation of the 1985 A Comprehensive Grammar of the English Language. Typical corpus information such as word token frequency and grammatical context were originally stored on citation
12 Biber et al. (1998: 3) argue along the same lines: “[A]nalyses cannot rely on intuitions or anecdotal evidence. In many cases, humans tend to notice unusual occurrences more than typical occurrences, and therefore conclusions based on intuition can be unreliable”. This observation also foreshadows our discussion of the theoretical conflict between generative and corpus-based approaches.
Corpora and corpus linguistics
slips, only to be computerized later on. The spoken part of the corpus subsequently became what is now known as the London-Lund Corpus. Following the initial impetus for the development of corpora and empirical language study, there was then a period of disgrace and neglect that began in the late 1950s (Leech 1991: 8). The blame for this lies, McEnery and Wilson (2001: 4) claim, “almost exclusively with one man and his criticism of the corpus as a source of information. That man was Noam Chomsky”. As Chomsky’s views were profoundly influential at the time, the development of corpus-based linguistic analysis was severely hampered for several decades. Very briefly, we present the line of his argument below. It was with the onset of generative grammar that corpus work was discredited as being inadequate to the study of language (McEnery and Wilson 2001). It was one of the prevailing Chomskyan convictions of the time that the legitimate aim of enquiry for the linguist was to model competence, understood as the native speaker’s tacit knowledge of their language. Corpus material, Chomsky argued, was a merely physical, and therefore possibly imperfect, representation of competence, i.e. performance, and as such was dismissed as a basis of linguistic enquiry. The argument was that actual language production, especially in its spoken form, is subject to various performance factors affecting its form, often departing from what is considered correct. It was therefore advisable that the linguist concern himself exclusively with devising a model of linguistic competence. Data observation as such was virtually nonexistent and data analysis typically entailed the linguist’s examination of his/her own examples and grammaticality judgments, drawing heavily on his own intuitions, thus largely discarding genuine empirical research. 
Alternatively, when dealing with a foreign language, the investigator would elicit relevant forms from native speaker informants by asking “Can you say X?” and then comment on their structure (Milroy and Gordon 2003: 3). Rather than emerging from language as it is used naturally, “the data arise from an explicitly metalinguistic context, one in which the investigator and any informants are thinking about language” (Milroy and Gordon 2003: 3; my emphasis – WG). This has led to the now well-known distinction between corpus-based (descriptive, empirical) linguists and armchair (theoretical, generative) linguists. 13 Whereas the former stress the importance of authentic and empirically verifiable data as both the object of study and the source of conclusions and theories, the
13 The same distinction is reflected in the two directions that linguistics has alternately followed: Chomskyan rationalism (conscious introspective judgments aiming at modelling competence) and empiricism (studying language through naturally occurring data) (McEnery and Wilson 2001: 5).
latter base their discussions on data that may well be “contrived or made-up” by the linguist, “a common practice in linguistics that grew out of the Chomskyan revolution of the 1950s and 1960s with its emphasis on introspection” (Meyer 2004: xiii). These artificial data are typically the linguist’s reflections on their native language and theoretical claims based on those reflections. Chomsky (1984: 44) himself claims: “Maybe someday experiments will be useful, but right now if you sit and think for a few minutes you’re just flooded with relevant data”. As an example of Chomsky’s firm belief in the supposed power of introspection, Harris (1993: 97) quotes an extract from an interview with Chomsky conducted by Anna Granville Hatcher: CHOMSKY: The verb perform cannot be used with mass word objects: one can perform a task. But one cannot perform labour. HATCHER: How do you know, if you don’t use a corpus and have not studied the verb perform? CHOMSKY: How do I know? Because I am a native speaker of the English language. Relying heavily on native speaker intuition, Chomsky is nevertheless wrong, Harris (1993: 97) observes. One can perform magic, as can readily be verified with a simple check in a corpus. Corpora can thus offer means of checking any such generalized statements as well as complementing judgments based solely on intuition. For example, in this particular case, the nominal collocates of the verb perform can easily be identified with a query run in a corpus. Additionally, the relative strength of each collocation pairing can be gauged by means of information pertaining to frequency of occurrence. The larger the corpus involved, the more reliable the findings of such empirical investigations. In a counterargument to Chomsky’s criticism of empirical research, McEnery and Wilson (2001: 14-15) consider the shortcomings of introspectively based study. 
Firstly, compared to introspective judgments on the part of the informant or linguist himself, corpus data are more verifiable, and observable to anyone. Moreover, the contrived data used by the rationalist linguist are a major fault of the approach by virtue of being artificial and unauthentic. The authors cite Sampson (1992) in his finding that “the type of sentence typically analyzed by the introspective linguist is far away from the type of evidence we tend to see typically occurring in the corpus” (McEnery and Wilson 2001: 14). If unnatural material serves as the springboard of linguistic investigations, the findings as well as theoretical claims based on them are likely to be inaccurate. As McEnery and Wilson (2001: 14) argue:
By artificially manipulating the informant, we artificially manipulate the data itself. This leads to the classic response from the informant to the researcher seeking an introspective judgment on a sentence: ‘Yes I could say that – but I never would.’
The analysis of language offered by generative grammarians may thus be perceived as too far removed from actual language use – a major concern of corpus linguists. Another point in favour of corpora raised by McEnery and Wilson (2001: 15) stems from Chomsky’s criticism that corpus data are “skewed” in the sense that they may not include all the words, sentences or constructions that are possible in a language. This is another argument on which Chomsky bases his claim that the native speaker is superior to the corpus as a source of information. Namely, the corpus is finite, i.e. does not contain all the possible structures in a language, and thus is not an adequate tool to describe language, which is in itself non-finite, without the aid of native competence (and hence introspective judgment) enabling the generation of an infinite number of utterances. 14 While the argument is essentially valid, Chomsky overlooks an important aspect of such unrepresented items: if they are missing from the database, their absence is itself a significant indication of their low frequency. Further still, another point raised by Chomsky in an interview with Aarts (2000: 6) is that corpora do not say what is impossible in a language. Once more, competence is here required to distinguish between ungrammatical sentences and ones yet unattested or absent from the corpus. Admittedly, if we reject introspection and intuition altogether, it is difficult to identify ill-formed, dubious or ambiguous structures, which may well be present in a corpus. However, beyond inquiries about grammaticality judgments, whenever the linguist’s aim is the accurate analysis of a maximally representative language sample, rather than an informant’s subjective judgment, many areas of language structure and use are best studied empirically with the aid of large computer-based databases. 
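The corpus check on perform discussed earlier, together with the point about unattested items, can be illustrated with a minimal Python sketch. It is not a procedure taken from any of the works cited: the four-sentence corpus is invented for illustration, and right_collocates is a hypothetical helper that simply counts the words found within a short window to the right of a node word. A real study would query a large, lemmatized and part-of-speech-tagged corpus rather than raw strings.

```python
import re
from collections import Counter


def right_collocates(texts, node, window=2):
    """Count the words that occur within `window` tokens to the right of `node`."""
    counts = Counter()
    for text in texts:
        # Crude tokenization: lower-case alphabetic strings (apostrophes allowed).
        tokens = re.findall(r"[a-z']+", text.lower())
        for i, token in enumerate(tokens):
            if token == node:
                counts.update(tokens[i + 1 : i + 1 + window])
    return counts


# Invented toy corpus; it stands in for a real corpus query.
toy_corpus = [
    "The surgeon will perform the operation tomorrow.",
    "They perform magic at children's parties.",
    "She had to perform a task nobody wanted.",
    "Students perform the task in pairs.",
]

right = right_collocates(toy_corpus, "perform", window=2)
for word in ("task", "magic", "labour"):
    print(word, right[word])
```

In this sample task and magic are attested after perform, whereas labour has a count of zero. As argued above, a zero in so small a sample shows only that the pairing is unattested here, not that it is impossible; the larger the corpus, the more an absence tells us about frequency.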
McEnery and Wilson (2001: 14) argue that work with corpus data of sorts continued because certain topics, entirely worthy of describing as part of linguistics, could not be effectively studied in the artificial world of well-formedness judgments and idealized speaker-hearers created by Chomsky.
14 This was Chomsky’s reaction to the somewhat bold assumptions held by early corpus linguists such that the sentences of a language were finite and thus could be enumerated and collected, however difficult the task may be. These linguists’ hope was then to use a comprehensive corpus enabling complete coverage as “the sole source of evidence in the formation of linguistic theory” (McEnery and Wilson 2001: 7) or “the sole explicandum of linguistics” (Leech 1991: 8).
The unpopularity of the corpus-based approach in the generative era was also due to another source of tension between the theorists and the empirically oriented analysts: corpus material was considered a poor insight into another fundamental concept of the generativists, that of Universal Grammar. To the generative grammarian, the ultimate aim of his study was the investigation of those linguistic principles that are applicable to all natural languages. Those abstract principles are at the core of Universal Grammar, i.e. the universal mental blueprint of the language faculty all humans are born with. Meyer (2004: 4) makes a useful distinction in the goals of the two approaches: Of primary concern to the corpus linguist is an accurate description of language; of importance to the generative grammarian is a theoretical discussion of language that advances our knowledge of universal grammar.
Meyer (2004) argues that this is inextricably related to another fundamental goal that Chomsky had envisaged for generative grammar to attain: the highest level of adequacy of description – explanatory adequacy. Meyer (2004: 2-3) illustrates the concept with an example involving cross-linguistic differences in sentence structure. In English it is impossible to omit subject pronouns without compromising the well-formedness of a sentence, unlike in Spanish or Japanese, where the missing information is deducible from the inflected form of the verb (Spanish) or from the context (Japanese). At the level of explanatory adequacy, where principles of Universal Grammar apply, this fact should be expressed in the form of Chomsky’s theory of two-way parameters which can be set as positive or negative according to what is allowed in a particular language. Accordingly, children acquiring English set the pronoun dropping parameter to negative, whereas children acquiring Spanish set the same parameter to positive. Universal Grammar consists of a multitude of such parameters and it is by means of such descriptions that universal cross-linguistic generalizations can be made with a view to attaining the level of explanatory adequacy. In contrast, most corpus linguistics aims at descriptive adequacy (a lower level according to Chomsky) although an obvious advantage of this approach is that particular emphasis is laid on accuracy of description, sometimes lacking from generative accounts. The tension between the generativists and corpus linguists was hardly relieved by the advent of computer corpora. The Brown Corpus, the first electronic corpus, completed in 1964 and a milestone in the present-day revival of corpus linguistics, was “compiled in the face of massive indifference if not outright hostility from those who espoused the conventional wisdom of the new and increasingly dominant paradigm in US linguistics led by Noam Chomsky” (Kennedy 1998: 19). Mahlberg (2005: 15) relates, after Francis (1982: 7f), an anecdote illustrating the typical generative attitude towards corpora at the time. W. Nelson Francis, one of the creators of the Brown Corpus, met Robert Lees, a staunch Chomsky-ite, at a conference in 1962. On finding that Francis had received a grant to produce the Brown Corpus, Lees responded that this was “a useless and foolhardy enterprise” as well as “a complete waste of time and the government’s money”. The rationale behind his criticism was the same familiar conviction that searching through corpus data can and should be replaced with relying on the intuitions of a native speaker. Lees himself argued: “You are a native speaker of English; in ten minutes you can produce more illustrations of any point in English grammar than you will find in many millions of words of random text” (Mahlberg 2005: 15). Once again, Lees’s reasoning testifies to the broad tendency of generativists to ignore the fact that they base their linguistic claims on contrived, decontextualized data. In view of the discussion so far, it may seem that the two opposite sides of the conflict will never be reconciled. But not necessarily. Mahlberg (2005: 15) argues that with the continuous development of corpus-based study, the two extreme standpoints have given way to a more balanced view, one in which naturally occurring data and intuition go hand in hand in the sense that the information retrieved from the corpus has to be approached critically and interpreted appropriately by the human observer. More and more corpus linguists stress the utmost significance of functional or qualitative interpretations provided by the analyst on top of the statistical or quantitative information (e.g. Johansson 2004, Biber 1998, Meyer 2004; see below for the discussion of quantitative and qualitative descriptions). 
Similarly, Meyer (2004: xiv) argues that the corpus linguist and generative grammarian are often engaged in complementary, not contradictory, areas of study; while the goals of the corpus linguist and the generative grammarian are often different, there is an overlap between the two disciplines and, in many cases, the findings of the corpus linguist have much to offer to the theoretical linguist.
Later on Meyer (2004: 4) notes that it is not only generative theory in particular that may well benefit from corpus findings, but also linguistic theory in general. Because it is frequently suggested in criticism of corpus linguistics that it is only interested in quantitative description of specific areas of language rather than in contributing to linguistic theory, it is worthwhile outlining Meyer’s (2004: 4) example of how data analysis may add to the study of Universal Grammar. The author cites Haegeman (1987), who claims that parametric variation (exemplified above with cross-linguistic variation of the pro-drop rule) may be shown to
apply between dialects, varieties or genres of one and the same language, not necessarily between distinct languages exclusively. Specifically, Haegeman (1987) discovers that the genre of recipe language features one type of empty category, that of wh-traces, that is not found elsewhere in English. This particular study illustrates the point that although the primary concern of corpus linguists is descriptive adequacy, their findings may well add to linguistic theory at large and – in this case – help advance our understanding of the generative notion of Universal Grammar. This point brings our discussion to the next section, in which we consider in more detail the advantages and applications of corpora in language studies.

3. The corpus approach – characteristics, advantages and applications 15

As was mentioned in the previous section, corpus linguistics has its roots in the broad tradition of empirical studies, where definitive claims and generalizations are based on factual data. As Tognini-Bonelli (2001: 2) writes:

Corpus work can be seen as an empirical approach in that, like all types of scientific enquiry, the starting point is actual authentic data. The procedure to describe the data that makes use of a corpus is therefore inductive in that it is statements of a theoretical nature about the language or the culture which are arrived at from observations of actual instances. The observation of language facts leads to the formulation of a hypothesis to account for these facts; this in turn leads to a generalization based on the evidence of the repeated pattern in the concordance; the last step is the unification of these observations in a theoretical statement.
In this respect, the simple strength of corpora is that they are “excellent sources for verifying the falsifiability, completeness, simplicity, strength, and objectivity of any linguistic hypothesis” (Meyer 2004: 4). It may be concluded that corpus analysis is not in itself restricted to any particular theory or area of linguistics but can be readily used by any linguist seeking evidence for their hypotheses. Corpus linguistics is often referred to as the corpus-based approach to linguistics (e.g. Biber et al. 1998, Meyer 2004). Implicit in this interchangeable use of the two terms is the assumption that “corpus linguistics is more a way of doing linguistics […] than a separate paradigm within linguistics […] on a par with other paradigms within linguistics, such as sociolinguistics or psycholinguistics” (Meyer 2004: xi). The use of the corpus and access software is thus a methodology adopted by the corpus linguist in whatever investigations of language he may wish to engage (see below for the applications of corpora in various areas of linguistics).
15 In this section we only consider the applications of corpora in theoretical linguistics, as opposed to areas of applied linguistics such as lexicography and grammar textbooks, language teaching, translation, culture studies, stylistics and forensic linguistics. All these are covered, for example, by Hunston (2002: Chapter 5).
The range of potential uses of the corpus as a research resource must be acknowledged but certain characteristics of corpus work make it especially well suited for investigations of a particular kind. In most general terms, the corpus-based approach is particularly useful in studies of language use, as opposed to structure. Traditionally, linguistic descriptions have focused on language structure. This may have involved the identification of the structural units of language (e.g. morphemes, words, clauses) and the stipulations governing possible combinations of such component elements. Considerations of grammaticality or well-formedness of the resulting constructions were equally important. Furthermore, surface structure has often been distinguished from the underlying structure, and formal means of deriving the former from the latter have been proposed, contested and refuted. In contrast, corpora allow the linguist to study language from a different perspective and focus on language use. The two approaches are compared by Biber et al. (1998: 1) with an example of the following patterns of verb complementation in English: I hope that I can go. I hope to go. I hope I can go.
All three constructions are well-formed and offer optional ways of verb complementation for hope. Formal analysis of these sentences would describe the different types of clauses that are permitted to follow the verb hope, relative to other types of verb complementation for other English verbs. Additionally, the conditions in which that can be omitted would also be accounted for. However, as Biber et al. (1998: 1-3) argue, structural description ignores crucial questions that relate to this set of sentences: Why does language allow for multiple structures that are so similar in meaning and function? Is it possible that each one might be preferred in different linguistic or situational contexts? In this sense, studies of language use investigate how speakers and writers exploit their linguistic resources to achieve various communicative goals. Rather than simply provide a formal description of language and discuss what is structurally possible in it, the emphasis is laid on the analysis of authentic data with a view to providing a thorough functional description of language use. 16 Specifically, in the case of the three sentences above, Biber et al. (1998: 2) ask themselves questions such as “Do spoken versus written varieties have different preferences for one of the forms over others? Are the forms usually used with different verbs? Are the forms used preferentially for different specialized meanings?” and later on report findings that indeed show patterns of preferred use of that-clauses and to-clauses across registers (Biber et al. 1998: 73-76).
16 See Meyer (2004: 6-11) for an example of functional analysis of several types of elliptical coordination in spoken and written genres. In the analysis, some ellipsis types are found to be more common than others as well as disproportionately distributed across registers. The frequency distributions of the distinct types of ellipsis are next provided with principled functional explanations as serving different communicative goals.
The corpus-based approach thus lends itself to functional description of language use, based on quantitative information that has to do with frequency of occurrence, distribution and co-occurrence patterns. In this sense corpus analysis is a unique combination of quantitative findings and qualitative or functional interpretations of the results obtained. In other uses of corpora, quantitative information may be largely ignored and the object of interest may be purely qualitative, as when the collocates of a word need to be established or in studies of phraseology (see Hunston 2002: 137-157 for a discussion of corpora in phraseology). In still other cases, the information needed may be purely quantitative, as when the frequency of items (words or grammar constructions) needs to be ascertained for the purposes of lexicography (see Hunston 2002: 96-109), teaching syllabuses (Hunston 2002: Chapter 7), comparing registers and corpora, etc. (see Hunston 2002: 5-9 for a discussion of various uses of frequency lists). Most of the time, however, quantitative and qualitative descriptions will go hand in hand as complementary information. For example, in finding the collocates of a word, the analysis will benefit from identifying the most frequent items typically accompanying the word in question, thus enriching the description with information relating to frequency and probability of occurrence. 
Clauses of verb complementation such as that-clauses and to-clauses are examples of language constructions which can be studied individually or in comparison to one another by means of corpora. Other than that, studies of language use may also focus on text types, groups of texts by the same author, individual language varieties or their comparison, or particular groups of language users. In all these, the focus of attention would be to identify systematic patterns that may enhance our understanding of language use. If one’s aim is to find such patterns, it is necessary, Biber et al. (1998: 1-12) claim, to adopt the methodologies of the corpus-based approach. The authors cite the following reasons. Firstly, some of the advantages stem from the very characteristics of computers. They are reliable and consistent in their quantitative analysis – they do not overlook occurrences of relevant data, change their mind or grow tired. Humans, on the other hand, are error-prone. They tend to notice unusual occurrences and ignore typical ones. Secondly, for findings to be reliable, investigations of language use
require the analysis of large amounts of text – this is facilitated by the use of computers, which offer speed and efficiency. This is especially important in studies investigating patterns of high complexity: for example, a range of language features across registers and contextual factors conditioning the occurrence of these features. Additionally, custom-built analysis tools can be tailored to the specific needs of a particular project. Because of the methodological challenges mentioned above, studies of language use have largely been made possible through corpus analysis, especially the application of large electronic corpora. Biber et al. (1998: 5) argue that one particular type of study facilitated by corpora, and until recently unfeasible, is the identification and analysis of “complex ‘association patterns’: the systematic ways in which linguistic features are used in association with other linguistic and non-linguistic features”. Linguistic features are understood as either a word or grammatical construction. Non-linguistic features, on the other hand, include language varieties (registers, dialects, historical periods). As all these features may be studied in relation to one another, the combinatorial possibilities of various types of association are reproduced below after Biber et al. (1998: 6):

A. Investigating the use of a linguistic feature (lexical or grammatical)
   (i) Linguistic associations of the feature
       - lexical associations (associations with particular words)
       - grammatical associations (associations with particular grammatical constructions)
   (ii) Non-linguistic associations of the feature
       - distributions across registers
       - distributions across dialects
       - distributions across periods
B. Investigating varieties or texts (e.g., registers, dialects, historical periods)
   (i) linguistic association patterns
       - individual linguistic features or classes of features
       - co-occurrence patterns of linguistic features

In Part A, of special interest to this study is point (ii): non-linguistic associations of the feature – distributions across registers. In Chapter 4 we present a case study of this type of association, in which nominalizations are investigated against the background of a set of registers in the BNC. Part B of the diagram above will be the subject matter of Chapter 3, where we discuss the ways in which certain features systematically co-occur in particular genres and thus provide a basis for register distinctions. Below we only briefly discuss the importance of the corpus-based approach to the study of register variation.
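The type of association singled out here, the distribution of a feature across registers, can be sketched in a few lines of Python. The sketch is only an illustration under stated assumptions and is not the procedure followed in Chapter 4: the two "registers" are invented one-sentence samples, per_million is a hypothetical helper, and words ending in -tion are identified by a crude pattern match that would also catch non-derived words such as nation. Frequencies are normalized per million tokens so that samples of different sizes become comparable.

```python
import re


def per_million(texts_by_register, pattern):
    """Return the number of tokens matching `pattern` per million tokens, by register."""
    rates = {}
    for register, text in texts_by_register.items():
        tokens = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for token in tokens if re.fullmatch(pattern, token))
        # Normalizing makes registers of different sizes comparable.
        rates[register] = hits / len(tokens) * 1_000_000 if tokens else 0.0
    return rates


# Invented samples standing in for register sub-corpora.
toy_registers = {
    "academic": "the evaluation of the information requires careful consideration",
    "spoken": "well i think we can go now if you want to",
}

rates = per_million(toy_registers, r"[a-z]+tion")
for register, rate in rates.items():
    print(register, rate)
```

With real data the same normalized counts would be computed for each register of a corpus such as the BNC and then interpreted functionally, in line with the combination of quantitative and qualitative description discussed above.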
By virtue of their division into distinct genres, modern corpora are purposely designed to permit the study of one particular type of language usage, namely register/genre variation, i.e. how speakers and writers use their linguistic resources differently according to situational context or production circumstances. Until the arrival of electronic corpora, the analysis of register variation was both unfeasible and neglected. 17 Meyer (2004: 3) argues: [W]ith the notion of the ideal speaker/hearer firmly entrenched in generative grammar, there has been little concern for variation in a language, which traditionally has been given no consideration in the construction of generative theories of language. This trend has become especially evident in the most recent theory of generative grammar: minimalist theory.
Meyer (2004: 3) juxtaposes the minimalist notions of the “core” of language or “pure instantiations of Universal Grammar” on the one hand and the “periphery” or “marked exceptions” (Chomsky 1995: 19-20) on the other. The elements of the core are those worthy of the linguist’s attention. Variation is one of those elements banished to the periphery of language and not considered in minimalist theory. The minimalist view is thus one “seeking an ever more restrictive view of language” whereas the corpus linguist “embraces complexity” and “sees complexity and variation as inherent in language” (Meyer 2004: 3). Register variation is best studied in relation to patterns of co-occurrence of certain linguistic features. The most elaborate and influential theory of such co-occurrence patterns, so-called multi-dimensional analysis, has been developed by Biber (1988). We postpone a detailed discussion of this analysis until Chapter 3. Continuing our discussion of the corpus-based approach, Hunston (2002) points out that the main argument in favour of corpus analysis is that it offers more reliable insights into language use than is available to native speaker competence (cf. 2.2 Historical perspective above). The simple reason for this, the author argues, is that much of the speaker’s linguistic experience “remains hidden from introspection” (Hunston 2002: 20). In particular, she considers intuition to be a poor guide to the following aspects of language, which are best studied with corpora: collocation, frequency and phraseology. We illustrate each in turn after Hunston (2002: 20-22). Firstly, without corpus evidence, some collocations are difficult to isolate simply because native speakers may not be aware of them, let alone foreign
17 Methodological difficulties aside, linguistic variation at large was marginalized as a result of the prevailing theoretical assumptions: Chomsky (1965: 9) advocates that the subject of study should be “the ideal speaker-hearer in a homogeneous speech community”. For more detailed illustration, see Henry (2003) for a discussion of the neglect of language variation in syntactic theory.
learners, who would obviously benefit from knowledge of this kind. Simple corpus-based concordance searches will retrieve this information with accuracy and speed. Examples include adverb-adjective combinations such as deeply concerned, eminently respectable, broadly comparable, acutely aware. Language speakers are also unreliable in their judgment of frequency. Although they may be aware of broad differences in frequency between very frequent and infrequent lexical items (such as walk vs. trod), they cannot be expected to have such intuitions about other areas of language, such as grammatical constructions. When it comes to phraseology, Hunston (2002: 21) cites the following example, in which corpus information proves useful. The author contrasts two sentences from the Bank of English, in which the verb require is followed by a passive to-infinitive clause. Further experiments require to be done. These roses require to be pruned each spring.
While the second sentence seems well-formed, the first seems odd and the reason for this oddity is not quite clear at first. However, on closer inspection of phraseology in the Bank of English, do complements require to be only 3 times out of 302. Resort to corpus data reveals that the past participle following require to be is usually that of a verb with a specific meaning, such as prune, not a general verb like do. So far we have identified the following characteristics and applications of the corpus-based approach: • It is empirical in its analysis of natural language compiled in principled collections of text. • It is computer-based, relying on computer storage space and making use of access software. This facilitates analysis of enormous amounts of data and renders the analysis more reliable. • It is based on quantitative and qualitative information derived from the database, which can additionally be interpreted functionally by the linguist. • It is especially useful for studies of language use, addressing the problems of linguistic variation, association patterns, frequency, phraseology, collocation, discourse analysis, etc. Crucial to the present study is the applicability of corpora in the description of word formation. At their most basic empirical value, corpus data can be shown to question idealized language principles or indicate descriptive gaps. For instance, Bauer and Renouf (2001) report on patterns of compound formations,
Chapter 2
productively used in modern English, that are “not described in the major handbooks” or that “break principles laid down as absolute in some of the theoretical works” (Bauer and Renouf 2001: 101). Apart from mere fishing for new, descriptively troublesome or otherwise interesting items, another research opportunity offered here by the corpus approach is, as stated above, the study of principled patterns of use of linguistic features – in this case – of morphological elements. And so, to give but a few examples, Hay and Plag (2004) and Plag and Baayen (2008) investigate the principles and tendencies of suffix combinations and suffix ordering. Research into morphological productivity has progressed with the availability of corpora: Hay (2003) and Hay and Baayen (2003) examine the relation between frequency, parsing and morphological productivity; Baayen and Renouf (1996) study lexical innovations and morphological productivity in newspaper English; Renouf (2007) examines lexical productivity and creativity in broadsheet journalism and Rúa (2007) explores new vocabulary in electronic communication (text messages, e-mails). As regards comparative work in word-formational patterns across registers, sources are few and far between. Plag et al. (1999) examine a range of suffixes across the written and spoken sub-corpora of the BNC. Biber et al. (1998) study the distribution of four nominalizing suffixes across three registers – prose, academic and spoken language; Biber et al. (1999) make some small effort to add to these findings by considering a few more suffixes in the register of academic language only. The present work (see Chapter 4) is another contribution to research into register variation in English morphology. We now move on to the next chapter of this work, where our concern is the study of register variation. 
Of direct relevance to our analysis of nominalizations in the final part of this dissertation is an overview of the current state of affairs in this area of language use.
Chapter 3 Linguistic variability and register variation
1. Introduction
Differences of register are inextricably associated with linguistic variation at large. Contemporary views on how distinct registers exhibit systematic patterns of contrast have largely been determined by the advance in the study of linguistic variation in general. Therefore the discussion will begin with an overview of the treatment of linguistic variation in several approaches to linguistic enquiry. We will focus on those models that have contributed the most to our understanding of variability and offered a maximally accurate description of its workings. Finally, the discussion will narrow down to analyze state-of-the-art analytic descriptions of registers and comparisons between registers.
2. Linguistic variation
In its fundamental sense, linguistic variation may be seen as a special kind of form-meaning relation – one that departs from the ideal one-to-one mapping of form and meaning. With regard to the mapping of linguistic content onto phonological form, Anttila (2002: 209) contrasts two opposite deviations from the desired situation: one in which one meaning (M1) corresponds to several forms (F1, F2, …) (i.e. variation) and the other in which several meanings (M1, M2, …) correspond to one invariant form (F1) (i.e. ambiguity). They can be represented graphically as follows (after Anttila 2002: 209):
variation: M1 → F1, F2
ambiguity: M1, M2 → F1
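The two deviations from one-to-one mapping can also be stated procedurally. The following is a minimal sketch, not part of Anttila's account: the function name `classify_mappings` and the toy meaning and form labels are assumptions introduced purely for illustration. It groups (meaning, form) pairs and reports meanings with several forms (variation) and forms with several meanings (ambiguity).

```python
from collections import defaultdict


def classify_mappings(pairs):
    """Given (meaning, form) pairs, return (variation, ambiguity).

    variation: meanings expressed by more than one form
    ambiguity: forms expressing more than one meaning
    """
    forms_by_meaning = defaultdict(set)
    meanings_by_form = defaultdict(set)
    for meaning, form in pairs:
        forms_by_meaning[meaning].add(form)
        meanings_by_form[form].add(meaning)
    variation = {m: sorted(f) for m, f in forms_by_meaning.items() if len(f) > 1}
    ambiguity = {f: sorted(m) for f, m in meanings_by_form.items() if len(m) > 1}
    return variation, ambiguity


# Hypothetical toy lexicon: M1 has two forms (variation), F3 has two
# meanings (ambiguity), and M4 -> F4 is the ideal one-to-one mapping.
pairs = [("M1", "F1"), ("M1", "F2"), ("M2", "F3"), ("M3", "F3"), ("M4", "F4")]
variation, ambiguity = classify_mappings(pairs)
print(variation)  # {'M1': ['F1', 'F2']}
print(ambiguity)  # {'F3': ['M2', 'M3']}
```

The one-to-one pair (M4, F4) appears in neither result, which is the sense in which variation and ambiguity are departures from the ideal mapping.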
In simple terms, and in Labov’s informal definition, variation involves “different ways of saying the same thing” (also quoted in Guy 2007, Schilling-Estes 2002, Tagliamonte 2006). This is a very broad sense of variation, which may be instantiated in many different ways. For example, in traditional morphophonology, the negativizing prefix in- is represented in oral production by the distinct allomorphs [ɪn], [ɪŋ], [ɪm], [ɪl] and [ɪr], all of which are equal in meaning or function. Additionally, [ɪn] and [ɪŋ] can in practice be used interchangeably when preceding a velar consonant. At the level of the word, doing and doin’ are also synonymous, although the phonological contrast between them may indicate social distinctions of informality, style, register, dialect, etc. In other words, speakers’ choices of alternating variants may convey extralinguistic information that may be semantically neutral but socially significant (see below for more examples). Likewise, the two pronunciations of tomato, i.e. [təˈmeɪtəʊ] and [təˈmɑːtəʊ], as well as the regional and cultural contrast between elevator and lift may illustrate variability encoded at the level of the lexicon. Indeed, all the levels of linguistic representation are subject to variance. Tagliamonte (2006: 6) writes:
[L]inguistic variation also encompasses an entire continuum of choices ranging from the choice between English or French, for example, […] by bilingual or multilingual speakers […], to the choice between different constructions, different morphological affixes, right down to the minute microlinguistic level where there are subtle differences in the pronunciation of individual vowels and consonants. Importantly, this is the normal state of affairs.
Similarly, at the level of the sentence multiple variants may readily be shown to convey the same meaning, for example:
(1)
1) You got a big family?
2) Do you have a big family?
3) Have you (got) a big family?

1) There's three books on the table.
2) There are three books on the table. (Henry 2002: 268)

1) I ain’t gotta tell you nothing/anything.
2) I haven’t gotta tell you nothing/anything.
3) I don’t have to tell you nothing/anything. (Tagliamonte 2006: 9)
In the three sets of sentences, got varies with have (supported by do) and have (got) (as an auxiliary verb). There’s varies with there are, though only in its contracted form (Henry 2002: 268). Gotta varies with have gotta and have to, ain’t varies with not or don’t, and nothing varies with anything (Tagliamonte 2006). Obviously, each of these utterances will have “its own social value, ranging from highly vernacular to standard” (Tagliamonte 2006: 9) and some of these structures will be socially stigmatized but the fact remains that variant
constructions are all-pervasive in language and the conditioning factors behind the preferred choice of an alternant, be they social or purely linguistic, need to be accounted for by linguistic theory. Specifically, while the allomorphy of in- is phonologically conditioned, and like many other phenomena may thus be assumed to apply across the board, the kind of lexical and syntactic variation that is represented in our examples above is associated rather with factors residing outside linguistic structure. Linguistic (internal) factors may thus be distinguished from social, extralinguistic (external) factors and the two types may be shown to facilitate, trigger, and drive linguistic variation. More importantly, most of the time they can be correlated as cooperating in this process. The purpose of the next section is to discuss the recognition of this correlation by sociolinguists and their contribution to our understanding of variation. We will, however, first note another point that relates to the concept of variation – not to language itself, but its speakers. As noted above, variation is commonplace in language, at each of its structural levels, and across languages. Furthermore, individual speakers will display variation too: Different ways of saying more or less the same thing may occur at every level of grammar in a language, in every variety of a language, in every style, dialect or register of a language, in every speaker, often in the same sentence in the same discourse. In fact variation is everywhere, all the time (Tagliamonte 2006: 10).
Variation in language structure and use within the performance of a single speaker (intra-speaker variation), often within the same speech act, is illustrated by the author with the following excerpts from the York English Corpus (Tagliamonte 2006: 10-11, 73):
(2)
Phonology/morphology (t,d-deletion): I did a college course when I lefØ school actually, but I left it because it was business studies.
Morphology (neutralization of contrast – adverbial -ly): You go to Leeds and Castleford, they take it so much more seriously… They really are, they take it so seriousØ.
Tense/aspect (future temporal reference forms): ...I think she’s gonna be pretty cheeky. I think she’ll be cheeky.
Intensifiers: I gave him a right dirty look… and I gave him a really dirty look.
Phonology/morphology (neutralization of contrast – [n] and [ŋ]): We were having a good time out in what we were doin’.
Agreement: There was always kids that were gone missing.
Syntax (post-posing): I was terrible, really … Very selfish, I was!
One may infer from the examples above that if alternant forms are used by the same speaker not all variation is socially distinctive. Alternatively, it may be argued that intra-speaker variation involves style-switching, where individual speakers switch in and out of language varieties as necessary depending on situational context, at times perhaps within the same speech event. When no social distinction is to be made, the variationist linguist must try to account for variability otherwise, e.g. as linguistically conditioned. Specifically with the notion of intra-speaker variation in mind, Henry (2002: 278) defines variation as “systematic variability between different ways of saying the same thing within the competence of a single speaker”. This viewpoint brings us closer to the study of register variation, whose primary assumption is that every speaker switches in and out of distinct language varieties to accommodate different communicative purposes and to adapt to a particular situational and social context. In this sense, then, register variation has a strong footing in language in the social context, which in turn is the domain of sociolinguistics. 3. Sociolinguistics The inception of the modern study of linguistic variation dates back to the second half of the nineteenth century and is associated with dialectology, i.e. the study of regional speech variation (see Chambers 2002, Hazen 2007). Over time, as increasingly detailed research was conducted, the focus of study became detached from merely dialectal differences, and other factors related to social background aroused the interest of linguists. 
Hazen (2007: 71) discusses this shift of focus from pure dialectology to early sociolinguistic analyses and cites the work of McDavid (1948) with an explicit remark of his conscious awareness of the broadened focus: McDavid’s (1948) analysis of post-vocalic /-r/ in South Carolina and Georgia is an early sociolinguistic study: he surmises that, “A social analysis proved necessary for this particular linguistic feature, because the data proved too complicated to be explained by merely a geographical statement or a statement of settlement history”.
With an extended area of research, a multitude of phenomena to account for and the need to cover more accurately and more representatively various population samples, the conceptual field of linguistics was ripe for the growth of a compre-
hensive model that would integrate social and linguistic aspects of language. In 1963 at the annual meeting of the Linguistic Society of America, William Labov presented the first sociolinguistic research report on the social conditioning of language, thus marking the birth of sociolinguistics proper 1 and a turn in the history of linguistics towards the study of linguistic variation and change as correlated with social factors (Chambers 2002; see also Bayley 2002, Hazen 2007). In the same year, Labov’s influential “Social motivation of a sound change” was published. Essentially, the impetus of this work was the author’s evidence of how social forces – age in this particular study – may drive the occurrence of linguistic variation and change. Another important publication by Labov in the early stages of the approach was “The Social Stratification of English in New York City” (1966), this time emphasizing social class, another pivotal concept of sociolinguistics. Of foundational significance to the development of the approach, however, was yet another publication – Weinreich, Labov and Herzog (1968) – a manifesto of variationist sociolinguistic studies, published in the same year as Chomsky and Halle’s (1968) The Sound Pattern of English but spelling out a theoretical model advocating views quite different to those of the mainstream grammar theory of that time. Broadly speaking, sociolinguistics and linguistics differ in several ways. The first is the inclusion of the social element investigated by the former. If the task of linguistics is to account for the rules of language X, sociolinguistics is concerned with studying “any points at which these rules make contact with the society” (Hudson 1996: 3) – such as alternative means of expression chosen by different social groups. 
Given the wide range of ways in which social aspects may impinge on language, the areas of interest for sociolinguists vary widely, and so does the relative weight given to purely linguistic elements on the one hand and social considerations on the other (cf. Milroy and Gordon 2003). Still, it is an intrinsic property of all sociolinguistic investigations to concern themselves, even if to a minimal extent, with the social context of language structure and use. Depending on the balance between the internal (i.e. morphological, phonological, syntactic and lexical, cf. Anttila 2002) and extralinguistic factors (such as age, sex, social class and background, ethnicity, domicile, education, community identity, etc.) 2 as well as the purpose of research, the different orientations of sociolinguistic research fall into the two domains of sociolinguistics and sociology of language (Tagliamonte 2006: 3). The former puts emphasis on language in social context and the latter studies how language influences society, and thus places society in the center of its attention. 3 Secondly, because of the social bias of the model, sociolinguists pay special attention to the context of language: the speakers involved and their mutual relation, the nature of their involvement in a given speech event and the circumstantial situation and purpose of interaction. In fact, language is “dependent on the speaker who is using it, and dependent on where it is being used and why” (Tagliamonte 2006: 3). Traditional linguistic theory is not concerned with such information, or indeed language use in general, and is rather focused on language structure, concentrating on discovering, describing and explaining all of its complexity and subtlety. Tagliamonte (2006: 3) notes explicit declarations of such dismissals of non-structural considerations, even in studies of phenomena clearly influenced by society and culture, such as grammatical change. She quotes Roberts and Roussou (2003: 11):
Of course, many social, historical and cultural factors influence speech communities, and hence the transmission of changes […] From the perspective of linguistic theory, though, we abstract away from these factors and attempt, as far [sic] the historical record permits, to focus on change purely as a relation between grammatical systems.
1 The term sociolinguistics itself was coined a decade before by Haver C. Currie (Chambers 2002).
2 Sociolinguists tend to subsume register under the label of external factors affecting language structure and use (cf. Anttila 2002) even though it may be seen as not as strictly associated with the speaker as factors such as sex, ethnicity, etc. Instead, it may be further distinguished as an external factor pertaining to the variety of language in question. In other words, all external factors are non-linguistic (i.e. non-phonological, non-morphological, non-syntactic and non-lexical) but may be either inherent to the speaker or characterize the language variety in question. Additionally, style is also considered an external factor of intra-speaker performance and is juxtaposed in sociolinguistics with vernacular. The vernacular of a speaker is “the style in which the minimum attention is given to the monitoring of speech” (Tagliamonte 2006: 8), i.e. a speaker’s ‘normal language’. In contrast, the same speaker may switch into various styles depending on situational context.
3 Similarly, Hudson (1996: 4) defines sociolinguistics as “the study of language in relation to society” and the sociology of language as “the study of society in relation to language”.
4 An identical comparison was made in Chapter 2 between corpus linguistics and theoretical linguistics. As the discussion proceeds, readers will note several similarities in how corpus linguistics and sociolinguistics alike contrast with theoretical linguistics. In fact, Biber (1994: 3) enumerates many areas of linguistic exploration, corpus linguistics being one of them, as “having a footing or otherwise closely related to sociolinguistics”.
Sociolinguists, on the other hand, are intent on researching both varied structure that is linguistically and/or socially determined and language use with a particular interest in variation of expression and its social distribution. 4 More importantly, with a view to ensuring the authenticity of their data, sociolinguists will
insist on observation of language rather than elicitation. Tagliamonte (2006: 5) illustrates this difference with the kind of general research questions that the theoretical linguist and the sociolinguist are likely to pose: [I]nstead of asking the question: ‘How do you say X?’ as a linguist might, a sociolinguist is more likely not to ask a question at all. The sociolinguist will just let you talk about whatever you want […] and listen for all the ways you say X.
Thirdly, it transpires from the above that, given its goals of enquiry, sociolinguistics is an empirical enterprise depending on authentic data. Theoretical linguistics, especially in the generative tradition, is notoriously lacking in this respect, as was discussed at length in Chapter 2. Milroy and Gordon (2003: 2) argue: [C]ontemporary sociolinguistics comprises a great many different traditions of research which address correspondingly different sets of research questions. However, all sociolinguists share a common orientation to language data, believing that analyses of linguistic behaviour must be based on empirical data. By this we mean data collected through observation, broadly defined, as opposed to data constructed on the basis of introspection. The most commonly studied data among sociolinguists are those representing speakers’ performance – the way they actually use language.
There are more aspects of sociolinguistics that are best described in contrast with traditional theoretical linguistics. These, however, should rather be discussed as inherent to one particular research agenda’s paradigm within sociolinguistics known as variationist sociolinguistics. This is the topic of the next section. 4. Variationist sociolinguistics Within the sociolinguistic paradigm, the variationist programme is specifically interested in the variability of language, whatever the cause. Both internal and external factors are at work in this respect and they all are of interest to variationists. Schilling-Estes (2002: 203) states: Of all the subfields of sociolinguistics, the study of linguistic variation is perhaps the one with the strongest emphasis on the ‘linguistic’ side of ‘sociolinguistics’. While variationists are indeed concerned with understanding social structures and forces, […] they are also vitally interested in furthering the scientific understanding of language.
Furthermore, in their attempts to do so, their advantage over theoretical linguists, it seems, is that
[u]nlike theoretical linguists, who typically rely on idealized versions of homogeneous languages in their search for underlying structure, variationists maintain that any valid linguistic theory must give central place to the variation and change that pervade all human language.
Indeed, as the term itself implies, variationists make a point of attaching significant weight to the study of variation. In the past some linguists, such as Sapir, have recognized the pervasiveness of variation 5 (Milroy and Gordon 2003: 4). Nevertheless, for Saussure, Chomsky and many others in the structuralist and generative traditions, the legitimate object of study was “the ideal speaker-listener, in a homogeneous speech-community” (Chomsky 1965: 3) and thus homogeneity was a prerequisite of linguistic analysis. And so a typical reaction to the plain fact of variability has been to rule it out of consideration as “a methodological complication […] a kind of noise which obscures the important underlying invariance” 6 (Milroy and Gordon 2003: 4) in order to offer “a coherent and elegant descriptive and theoretical account” (Milroy and Gordon 2003: 4) – all at the expense of accuracy.
4.1. Orderly heterogeneity 7
Not only has variation been considered inconvenient, but it has also been largely dismissed as accidental and unstructured (Milroy and Gordon 2003: 5). Phonological variation is a notable exception here, as phonological theory has traditionally concerned itself with phonological and morphophonological variability for the simple reason that much phonological variation in all languages is overtly predictable and orderly. 8 In contrast, morphological and syntactic variability, especially when considered to involve substandard constructions, has not been thought worthy of attention. For example, alternative realizations such as was versus were following a plural subject have been described as “the outcome of dialect mixing, held to be a temporary situation of instability, or instances of free variation” (Milroy and Gordon 2003: 4-5).
5 See Hazen (2007) for an overview of historical precedents of ancient and pre-sociolinguistic scholars commenting on linguistic variation.
6 The peripherality of variation was largely a result of the assumed competence / performance dichotomy (see Chapter 2), whereby variation phenomena were ascribed to the latter, and thus deemed to be of little theoretical value.
7 “Orderly heterogeneity” is an oft-quoted expression coined by Weinreich et al. (1968: 100) and customarily used in the literature of variationist sociolinguistics.
8 Broadly speaking, variation has been a key term in historical linguistics, where diachronic change is detected and measured by comparing variation between two or more points in time (see Montgomery 2007).
Commenting on such
descriptions, the psychologist Fischer noted that “[f]ree variation is a label, not an explanation. It does not tell us where the variants came from nor why the speakers use them in differing proportions, but is rather a way of excluding such questions from the scope of immediate enquiry” (Fischer 1958: 47-8; cited by Milroy and Gordon 2003: 5). Guy (2007) argues that the reason phonological variation has received the most attention is that there exists a notion that phonology is the only domain in which linguists can speak of variation, arising from the assumption that variability at any other levels of linguistic structure may entail intentional differences in meaning. Hence, linguists will readily agree that running and runnin’ are distinct instantiations of the same underlying form, but hesitate whether the same is the case with Kyle got arrested and Kyle was arrested 9 (Guy 2007: 5). Admittedly, the two sentences may or may not be considered completely synonymous, but it is precisely the job of the linguist to abstract away from the structure and examine any patterns of usage that may emerge in connection with each sentence, e.g. what kind of speaker is likely to use which construction, in what circumstances, register and co-text, and why? Variationist sociolinguistics, on the other hand, from its very inception has recognized and emphasized language’s inherent variability and sought to discover its patterned occurrence. The initial source of that belief is largely to be traced to the two classic publications by Labov (1963 and 1966). 10 Specifically, in his 1963 study of Martha’s Vineyard – an island off the US east coast – Labov studied patterns of diphthongal alternants. He noted that /aI/ and /aU/ (as in mice and mouse) had raised and centralized variants [əI] and [əU], and he found that centralization correlated with age groups and ultimately with a sense of community identity. 
The rapidly changing social scene on the island allowed social divisions to drive differentiation of phonological variants. Labov concluded that sound change can be shown to be driven by social forces within a community (Hazen 2007). Furthermore, on a more general scale, he also observed that the workings of sound change can be inferred from synchronic variation. These findings, Hazen (2007: 73) claims, “were conceptual turning points in the scientific study of language”. In drawing the conclusions that he did, Labov argued directly against the important Saussurian structuralist dichotomy of synchrony and diachrony, whereby synchronic studies of language are to be kept separate from diachronic study of language change (Milroy and Gordon 2003: 2). 9
See Green (2007) for her discussion of the integration of syntactic variation in syntactic theory. 10 Another innovative feature of Labov’s (1966) study of New York City dialectal differences was that it was the first study of dialectal patterns on a large scale conducted in an urban area, an important switch from traditional research confined to rural dialectology (Hazen 2007).
Thus, an important claim that is commonly held by variationists is that language variability is systematic and speakers’ choices of variable linguistic forms are constrained by internal and/or external factors. A major goal in this area of study is therefore to specify these constraints, as demonstrated by Labov’s findings above. Two other examples are offered below. A further illustration of structured variability correlating with social factors is the influence of gender. Taking the example of the -ing ending again, Cheshire (2002: 426) reports that, within a given social class, men use a higher proportion of the alveolar /n/ variant than women and, conversely, women tend to use the velar /ŋ/ variant more often than men. Also, in more general terms, men use a higher frequency of non-standard forms than women. Women, in turn, favour new prestige forms more than men and are more likely to innovate. Another example comes from Tagliamonte (2006). The author argues that although language is primarily used to transmit verbal information, it simultaneously conveys non-linguistic information about the speaker, his age, sex, socioeconomic class, his relation to the hearer and the kind of speech event he considers himself to be engaged in. Distinctions of these kinds are only possible because of variation: speakers’ different choices of linguistic means communicate various extralinguistic information (Tagliamonte 2006: 7). Two corpus excerpts are offered by the author and the readers are invited to guess the relative age (i.e. young or old) of the two speakers:
(3)
I don’t know, it’s jus’ stuff that really annoys me. And I jus’ like stare at him and jus’ go … like, “huh”.
It was sort-of just grass steps down and where I dare say it had been flower beds and goodness-knows-what…
The decision is indeed fairly simple: the speakers are eighteen and seventy-nine years old respectively; both are female. 11
11 Further illustration of speech differences associated with age and sex is offered, for instance, in Holmes (1992: Chapter 7).
4.2. Variable rules
Crucial in variationist research is the observation of “orderly heterogeneity” (Weinreich et al. 1968: 100), but the exact nature of this orderliness requires elaboration. It is yet another fundamental tenet of variationist sociolinguistics that stands in opposition to traditional theoretical linguistics. Namely, structural alternations can be a result of variable rules that apply in different contexts with different probability rates. In other words, the distribution of alternating forms may not be as thoroughly universal as, for example, the distribution of the allomorphs /z/, /s/ and /ɪz/ of the plural -s ending in English or word-final devoicing of voiced obstruents in Polish, both of which apply across the board, but the application of a rule may well be more or less probable depending on an array of conditioning factors. Let us consider an example of a phonological alternation discussed in Guy (2007), where the dropping of a consonant is a categorical alternation (i.e. applying obligatorily when eligible) in one language, but a variable alternation (i.e. expressed probabilistically as more or less likely to occur) in another language. French liaison involves the pronunciation of a word-final consonant sound when followed by a vowel. Conversely, word-final consonants are not articulated when followed by a word beginning with a consonant:
(4)
les amis /lez ami/ ‘the friends’
les tapis /le tapi/ ‘the rugs’
The alternation at work here is exceptionless and completely predictable (categorical). Parallel alternations involving similar phonological effects are to be found in English (also Dutch), though with a more complex matrix of variable conditioning factors. English coronal stops are prone to dropping word-finally, but more often before a consonant than before a vowel, e.g.:
(5)
frequent, preferred: east end, eas’ side
possible, but rarer: eas’ end, east side
(Guy 2007: 7)
The alternation is not categorical but variable in the sense that the form eas’ may occur in any context but is much more common before a consonant. The distribution of the alternants is thus to be stated, in this particular case, in terms of a tendency rather than a rule. One may speak of preferred and disfavoured forms rather than correct and incorrect forms. Admittedly, both French and English strongly favour retention of consonants before vowels and deletion before consonants; however, in French the dispreferred cases are non-existent while in English they are possible though less frequent (Guy 2007: 7). Furthermore, in English, different consonants will exert varied effects on the probability of deletion, with stops being the most likely to trigger deletion. The following are percentages of frequency of occurrence for /t/, /d/-deletion for particular sound types (from Labov 1997, cited in Anttila 2002):
Following segment    Deletion rate
stop                 78%
/w/                  68%
fricative            65%
nasal                57%
/h/                  45%
/l/                  40%
pause                17%
vowel                6%
/r/                  7%
/y/                  5%
Table 1 /t/, /d/-deletion rate: the following segment effect; word-final position only
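The contrast between a categorical and a variable alternation can be made concrete by treating each rule as a set of application probabilities per context. The sketch below is illustrative only: the English figures are the rates from Table 1, the French values (1.0 and 0.0) are an idealization of the exceptionless pattern described in the text, and the names `is_categorical` and `rank_contexts` are assumptions introduced here rather than terms from the literature.

```python
# Deletion rates from Table 1, keyed by following segment (as proportions).
ENGLISH_TD_DELETION = {
    "stop": 0.78, "/w/": 0.68, "fricative": 0.65, "nasal": 0.57, "/h/": 0.45,
    "/l/": 0.40, "pause": 0.17, "vowel": 0.06, "/r/": 0.07, "/y/": 0.05,
}

# Idealized categorical pattern (assumption based on the description of
# French liaison): the final consonant is always dropped before a
# consonant and never before a vowel.
FRENCH_FINAL_C_DELETION = {"consonant": 1.0, "vowel": 0.0}


def is_categorical(rule):
    """A rule is categorical if it applies always or never in every context."""
    return all(p in (0.0, 1.0) for p in rule.values())


def rank_contexts(rule):
    """Contexts ordered from most to least favourable to rule application."""
    return sorted(rule, key=rule.get, reverse=True)


print(is_categorical(FRENCH_FINAL_C_DELETION))  # True
print(is_categorical(ENGLISH_TD_DELETION))      # False
print(rank_contexts(ENGLISH_TD_DELETION)[:3])   # ['stop', '/w/', 'fricative']
```

On this view a categorical rule is simply the limiting case of a variable rule, one whose probabilities happen to be 0 or 1 in every context.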
The findings in the table above are typically explained on the grounds of syllable structure (Anttila 2002): final /t/ or /d/ tend to be preserved when they can be resyllabified as part of the following onset. This is feasible preceding a vowel (lost.Anna → los.tAnna), but not preceding, for example, /l/ (lost.Larry → *los.tLarry) as tl- is not a possible onset in English. The resyllabification hypothesis correctly predicts that deletion before /l/ should be more common than before /r/ (40 vs. 7 per cent respectively), since tr-, unlike tl-, is a possible English onset. Alternatively, Bayley (2002) postulates that, in decreasing order of probability of application, the phonetic features of the conditioning segment can be stated as follows: obstruents > liquids > glides > vowels > pauses. To add further to the complexity of the pattern, it is also conditioned by the morphological status of the segment to be deleted (Anttila 2002). The following are percentages of frequency of occurrence for /t,d/-deletion per particular types of morphemic status (from Anttila 2002, based on Guy 1991 and Santa Ana 1992):
                         Guy (1991)    Santa Ana (1992)
monomorphemes (cost)     38.1%         57.9%
irregular past (lost)    33.9%         40.7%
regular past (tossed)    16.0%         25.7%
Table 2 /t,d/-deletion rate: the morphological effect
Notwithstanding the discrepancies in percentages between the two sources, there is a clear pattern: deletion is most frequent in monomorphemic words, least frequent in regular past forms and of intermediate frequency in irregular past forms (also confirmed in Guy 2002). As regards its functional explanation, the varied morphological effect of the deletion rule is said to be the result of homonymy
Linguistic variability and register variation
avoidance: deletion is dispreferred in regular past forms because past and present forms would otherwise be identical (Anttila 2002). Further still, in reference to the same variable rule, Bayley (2002) uncovers even more conditioning factors. As it is, /t,d/-deletion is also constrained by syllable stress, with the two segments being more prone to deletion in stressed syllables; by cluster length, with three-consonant clusters more prone to deletion than two-consonant ones; by the phonetic features of the preceding segment, yielding the order of obstruents > liquids > glides > vowels > pauses; and by voicing agreement of the segments preceding and following the variable (homovoiced > heterovoiced). Thus, linguistic factors of various sorts interact here to co-constrain the variability. Individually and taken collectively, they can be correlated with each of the variant forms realizing the variable. Bayley (2002: 117) observes: “With a large enough set of data, we are able to make statements about the likelihood of co-occurrence of a variable form and any one of the contextual features in which we are interested”. And yet extralinguistic factors need mentioning as well. Wolfram and Fasold (1974) found that the rate of /t,d/-deletion varied considerably with social class. Specifically, the authors noted that Detroit African-Americans consistently favoured some phonological and morphological environments over others in /t,d/-deletion, but their relative rates of rule application differed significantly. Variable linguistic forms are thus constrained by multiple internal and external factors co-determining the language user’s selection of a form. It follows that, with such a multitude of factors to be taken into consideration, any attempt to study variation must involve analysis of multiple variables at the same time (see section 6 below for a discussion of multi-dimensional analysis of register variation).
Linguistic alternations are therefore of a twofold nature: categorical and variable, and both are equally worthy of analysis. Indeed, the latter perhaps need special attention as they have only recently been attended to, given the typical traditional view of linguistic structure described by Milroy and Gordon (2003: 4) as follows:
[F]rom the earliest days of structural linguistics, analysts produced descriptions based on an underlying assumption that linguistic structure was fundamentally categorical. Following the Axiom of Categoricity, language is seen as operating with a kind of mathematical consistency.
As is evident from the varied occurrence of consonant deletion in French and English, language is not quite so ‘black-or-white’: description of language may involve statements of probability of occurrence, perhaps expressed in statistical form. Theoretical constructs that have been developed to capture language facts of this nature are the linguistic variable and variable rule. 12 The former is a structural unit parallel to structuralist and generative units of structure, e.g. phonemes and morphemes, in that it is an abstraction underlying physical realizations (in the example above, an underlying consonant that may or may not be articulated), while the latter is to be understood as a rule that operates not in terms of categorical use but frequency/probability of application. In the Variable Rule (VR) model developed by Labov (1969) and Cedergren and Sankoff (1974), rules of grammar are associated with an index of probability. The value of this probability is 1 for categorical rules and less than 1 for variable rules (Guy 2007: 20). Thus variationist theories mark the abandonment of the ‘axiom of categoricity’ (Chambers 1995) – central in traditional linguistic practice – that abstracts away from variation and seeks absolute generalizations that apply across the board. Interest in variable rules, it seems, is a welcome change – as Guy (2007: 23) writes:
The assumption of invariance, which has dominated linguistic theory since the Neogrammarians, has been useful in the history of linguistics as a debating stratagem in certain theoretical arguments, and as a heuristic device for driving the research agenda, but it is not a design principle of human language.
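The notion of a probability index attached to a rule can be made concrete with a minimal Python sketch. It is not the VR model itself, which estimates the contribution of many interacting factors; it only illustrates the contrast between a categorical rule (probability 1) and a variable rule (probability below 1). The names `Rule`, `applies` and `observed_rate` are hypothetical, and the figures 0.78 and 0.06 are the Table 1 deletion rates before a stop and before a vowel, used here merely as stand-in probabilities.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """A rule of grammar paired with an index of probability (illustrative only)."""
    name: str
    probability: float  # 1.0 = categorical rule, below 1.0 = variable rule

    def __post_init__(self):
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must lie between 0 and 1")

    @property
    def is_categorical(self) -> bool:
        return self.probability == 1.0

    def applies(self, rng: random.Random) -> bool:
        # rng.random() is drawn from [0, 1), so a rule with probability 1.0
        # applies on every single occasion.
        return rng.random() < self.probability


def observed_rate(rule: Rule, trials: int, rng: random.Random) -> float:
    """Share of simulated occasions on which the rule applied."""
    return sum(rule.applies(rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    rng = random.Random(2024)  # arbitrary seed, for repeatability only
    rules = [
        Rule("hypothetical categorical rule", 1.0),
        Rule("/t,d/-deletion before a stop", 0.78),   # rate taken from Table 1
        Rule("/t,d/-deletion before a vowel", 0.06),  # rate taken from Table 1
    ]
    for rule in rules:
        kind = "categorical" if rule.is_categorical else "variable"
        print(f"{rule.name}: {kind}, simulated rate {observed_rate(rule, 10_000, rng):.2f}")
```

On this view a categorical rule is simply the limiting case of a variable one, which is why a single formalism can accommodate both kinds of alternation.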
5. Register and register variation
Given the principal role of variability in sociolinguistics and the social bias of the paradigm, register variation is a closely associated area of study. A register is to be understood as a language variety “associated with a particular configuration of situational characteristics and purposes” 13 (Biber and Conrad 2001: 175). Thus individual registers are defined in terms of such situational and social parameters as: personal and group characteristics of the participants, relations among the participants, the level of formality, the channel of communication, the production and processing circumstances, the purpose of the communication and the subject matter (Biber and Conrad 2001: 175, Biber 1988: 30-31). Distinct registers may be differentiated by their various combinations of values for each of these parameters. Registers may be studied at different levels of specificity and while some can be very general and culturally well-known language varieties (e.g. lectures, novels, news reports, book reviews) others may be more narrowly defined and highly specific (for example, Biber and Conrad (2001) mention the register of methodology sections in experimental psychology articles). Thus defined with respect to its context of use, any register may be considered from a range of different angles, the choice of which will be determined by the purpose of analysis. Indeed, the definitional characteristics of registers are commonly described with very general labels, the interpretation of which is open to individual preferences, perhaps intentionally so. And so registers are alternatively and confusingly defined according to their use in a situational, contextual or social setting. 14 As need may be, the three adjectives may be interpreted as involving analyses of different scopes: investigating extralinguistic or linguistic context (or co-text), or the social dimension of interaction, including purely social characteristics of the speakers such as social class and age. 15 Bearing in mind this indeterminacy of definition, studies of register variation may be concerned with a range of research topics. Admittedly, within the sociolinguistic paradigm, the focus of inquiry may vary greatly. Attention may be paid to inter-speaker (i.e. across speakers) or intraspeaker (i.e. exhibited by the same speaker) variation, thus concentrating on the speaker as the source of variation. Ultimately, the objective here would be to determine a variable feature but the results of such studies would describe variability as observed against a social group determined by geography (i.e. dialects) or any of the non-regional factors such as sex, age, class, status, etc. (i.e. social dialects or sociolects). 16 Dialects and social dialects are referred to as varieties associated with users.
12 This is attributed to Labov (1969) and Cedergren and Sankoff (1974) and the quantified probabilistic model constructed therein.
13 Genre is also used with the same meaning, both in linguistic literature and in the present study (see Chapter 4). Both registers and genres can be contrasted with text types (as in Biber 1988) in that the two former terms refer to categorizations assigned on the basis of external (non-linguistic) criteria while text types are groupings of texts that are parallel by similarity of linguistic form, irrespective of genre categories. By way of illustration, Biber (1988: 70) argues that a science fiction text represents a genre of fiction (by virtue of the author’s purpose) but in its linguistic form it may represent an abstract and technical text type similar to academic prose.
14 Examples include Crystal’s (1991) Dictionary of Linguistics and Phonetics, where the social dimension is explicitly mentioned: “[register is] a variety of language defined according to its use in social situations, e.g. a register of scientific, religious, formal English” (Crystal 1991: 295). Indiscriminately, the author subsumes considerations of target audience (“scientific”), topic (“religious”) and level of formality (“formal English”) all under the same umbrella term of social situations.
15 Biber and Finegan (1994: 4) write in reference to the multifaceted nature of register variation: “There are many motivations for examining language varieties in their situations of use and many aspects of the social situation that might be the focus of particular register studies”. In this statement, the authors explicitly include the social context as a possible consideration in register variation studies. Also, on page 7 Biber et al. state: “Register analysts explore the link between linguistic expression and social situation, with a view toward explanation” [emphasis added in both quotations – WG].
16 See Hudson (1996: 41-48) for a discussion of dialects, social dialects and registers. Essentially, “[t]he term register is widely used in sociolinguistics to refer to ‘varieties according to use’, in contrast with dialects, defined as ‘varieties according to users’” (Hudson 1996: 45; the same distinction is made by Holmes (1992: Sections II and III) and Biber et al. (1994: 4)).
Alternatively, studies of variation may put emphasis on the language variety itself, as characterizing collectively all speakers who find themselves in a given situational and communicative setting, for example in the activity of writing a letter of application, attending a job interview or telling a joke. This is the case in studies of register variation proper, where registers, in contradistinction to (social) dialects, are perceived as varieties associated with uses. A central goal here is the identification of any linguistic features that are found to be typical of a given register – or more precisely – the patterning of features that co-occur to characterize this register. In this sense then the speaker is of secondary importance to the actual language he or she generates in that the register will not be described strictly against a certain type of social group but will be associated with its situation of use. In other words, register labels such as cooking recipes, sport commentary, etc., along with their descriptions, focus on the context of production and the characteristics of the language generated therein, rather than the identity of the speaker. Admittedly, depending on individual registers, they may appear to relate to a greater or lesser extent to a particular group of people involved in a given situation. Holmes (1992: 276) argues:
Journalese, baby-talk, legalese, the language of auctioneers, race-callers, and sports commentators, the language of airline pilots, criminals, financiers, politicians and disc jockeys, the language of the courtroom and the classroom, could all be examples of different registers. The term ‘register’ here describes the language of groups of people with common interests or jobs, or the language used in situations associated with such groups.
Indeed, at first sight it is rather difficult to discern whether Holmes’s examples are definitely varieties according to users or varieties according to uses and, if the former label seems better-suited, they may well be claimed to be social dialects. However, on further consideration, one might distinguish here between varieties obviously connected with situations of use on the one hand and what may seem social dialects on the other. For example, “the language of airline pilots” cited by Holmes is a register used only in a certain situation – while flying a plane and talking on the radio – and the same pilots will speak quite differently at home after work. The same applies to most of Holmes’s examples and the proclamation of a genuine social dialect is only a possibility in one case – the language of criminals. The registers are thus associated with their respective situational and occupational context more than with the speakers themselves. All registers, more or less prototypically independent of the speaker, may be distinguished from classic examples of social dialects – such as Received Pronunciation in Britain – as associated with such purely social speaker-specific criteria as education, class and status. At the other end of the continuum will be purely situational registers such as news reports and letters of complaint, in which the social identity of the speaker is irrelevant. Depending on the nature of the study and its exact goals, the social factors of a speech event will be relevant in determining the social context (e.g. the relative status of the speakers), just as non-social factors will be relevant as determining non-social circumstantial context (e.g. the mode of communication). For example, the British National Corpus includes only British English, thus restricting its coverage through the use of a socially defined criterion. Thus the distinction between varieties associated with users and varieties associated with uses is to be noted with care. If sociolinguistics and register variation are not clearly seen as sharing common ground, this is a simple and useful terminological differentiation that can be made. Hudson (1996: 45) comments on the division between social dialects and registers: “The distinction is needed because the same person may use very different linguistic items to express more or less the same meaning on different occasions, and the concept of ‘dialect’ cannot reasonably be extended to include such variation”. The distinction between (social) dialects as varieties according to users, and registers as varieties according to uses, distinguishes between sociolinguistic studies of different dispositions and implies that register variation is an area of study within sociolinguistics with a narrowed area of interest.
Conclusions drawn from research into register variation are crucial in linguistic inquiry for a simple reason: descriptive statements about the workings of a language cannot be generalized as valid for the language as a whole in all of its varieties. Instead, “characteristics of the textual environment interact with register differences, so that strong patterns of use in one register represent only weak patterns in other registers” (Biber et al. 2001: 176-177). State-of-the-art description of registers and comparison across registers is attributed to multidimensional analysis of register variation developed by Biber (1988), also extended to cross-linguistic comparison by Biber (1995), and adopted in numerous works that belong in the tradition of what may be characterized as ‘the Biber school’. Below we outline the tenets of the approach and underline in particular those of Biber’s (1988) findings that relate directly to our study of English nominalizations in Chapter 4.
6. Multi-dimensional analysis of register variation: Biber (1988)
Biber (1988) examines textual variation across 23 genres of two corpora – the Lancaster-Oslo-Bergen (LOB) corpus and the London-Lund corpus. The number of words sampled totaled approximately 960,000. The genres included in the study are the following (based on Biber 1988: 67):
Written
1. press reportage
2. editorials
3. press reviews
4. religion
5. skills and hobbies
6. popular lore
7. biographies
8. official documents
9. academic prose
10. general fiction
11. mystery fiction
12. science fiction
13. adventure fiction
14. romantic fiction
15. humor
16. personal letters
17. professional letters
Spoken
1. face-to-face conversation
2. telephone conversation
3. public conversations, debates and interviews
4. broadcast
5. spontaneous speeches
6. planned speeches
The author recognizes two levels at which genres can be compared. The first is the situational or functional perspective, where registers are compared along parameters such as formal/informal, interactive/non-interactive, literary/colloquial, involved/detached, abstract/concrete. The second is the linguistic or textual point of view, in which differences among registers are specified quantitatively in terms of strictly structural characterization, e.g. clausal complexity. As regards situational context, Biber distinguishes eight components of a speech event (‘speech situation’ in Biber’s nomenclature; based on Biber 1988: 30):
I. Participant roles and characteristics
   A. Communicative roles of participants
      1. addressor(s)
      2. addressee(s)
      3. audience
   B. Personal characteristics
      1. stable: personality, interests, beliefs, etc.
      2. temporary: mood, emotions, etc.
   C. Group characteristics
      1. social class, ethnic group, gender, age, occupation, education, etc.
II. Relations among the participants
   A. Social role relations: relative social power, status, etc.
   B. Personal relations: liking, respect, etc.
   C. Extent of shared knowledge
      1. cultural world knowledge
      2. specific personal knowledge
   D. ‘Plurality’ of participants
III. Setting
   A. Physical context
   B. Temporal context
   C. Superordinate activity type (what larger activity a speech event is part of)
   D. Extent to which space and time are shared by participants
IV. Topic
V. Purpose
VI. Social evaluation
   A. Evaluation of the communicative event
      1. values shared by whole culture 17
      2. values held by sub-cultures or individuals
   B. Speaker’s attitudes towards content
      1. feelings, judgment, attitudinal ‘stance’
      2. key: tone or manner of speech
      3. degree of commitment towards the content
VII. Relations of participants to the text (the ability of the writer/reader, but not speaker or hearer, to interact with the text: write as slowly, carefully, etc. as s(he) wishes)
VIII. Channel
   A. Primary channel: speech, writing, drums, sign language, etc.
   B. Number of sub-channels available (lexical/syntactic, prosodic, paralinguistic (gestures))
17 For example, the author claims that “in Western culture schooled language is more valued than non-schooled language, and writing tends to be valued more highly than speech. In traditional Somali culture, oral poetry is valued more highly than either schooled language or writing” (Biber 1988: 32).
The specification of the situational context of communication facilitates determining the combinatorial possibilities of situational parameters such as those above. Then, in turn, if different situational contexts can be paired systematically with differences at the purely linguistic level, the correlation can be interpreted in functional terms as performing specific communicative functions, appropriate in specific contexts. However, even before the components of a speech situation are linked with particular linguistic features and the communicative functions served by those features, particular registers can be descriptively contrasted with one another by means of functional interpretations based solely on various constellations of the speech situation components. Biber (1988: 9-12) offers an illustrative example, which we present below. Consider the following two texts:
(6) Conversation – comparing home-made beer to other brands
A: I had a bottle of ordinary Courage’s light ale, which I always used to like, and still don’t dislike, at Simon Hale’s the other day… Simply because I’m, mm, going through a lean period at the moment waiting for this next five gallons to be ready, you know.
B: mm
A: It’s just in the bottle stage. You saw it the other night.
B: yeah
A: and, mm I mean, when you get used to that beer, which at its best is simply, you know, superb, it really is.
B: mm
A: you know, I’ve really got it now, really, you know, got it to a T.
B: yeah
A: and mm, oh, there’s no, there’s no comparison. It tasted so watery, you know lifeless.
B: mm
(7) Scientific exposition
Evidence has been presented for a supposed randomness in the movement of plankton animals. If valid, this implies that migrations involve kineses rather than taxes (Chapter 10). However, the data cited in support of this idea comprise without exception observations made in the laboratory.
Texts (6) and (7) above can be compared by several situational or functional parameters such as common vs. specialized, unplanned vs. planned (and carefully structured), interactive vs. non-interactive, situation-dependent vs. situation-independent, displaying personal feelings emphatically vs. unemotional and impersonal. These are not dichotomies, however, defining extreme poles of variability. Instead, these parameters define continuums, along which various texts will differ by degrees of a certain characteristic, with a continuous gradation of intermediate positions. These continuums are referred to as dimensions of variation. In relation to the same texts, consider the text below.
(8) Panel discussion – discussing corporal punishment as a deterrent to crime
A: But Mr Nabarro, we know that you believe this.
B: quite
A: The strange fact is, that you still haven’t given us a reason for it. The only reason you’ve given for us is, if I may spell it out to you once more, is the following: the only crime for which this punishment was a punishment, after its abolition, decreased for eleven years. You base on this the inference that if it had been applied to crimes it never had been applied to, they wouldn’t have increased. Now this seems to me totally tortuous.
With regard to the dimensions indicated above, text (8) is consistently intermediate between texts (6) and (7). For example, it is relatively unplanned but more carefully organized than text (6); it is interactive, but not to the extent of text (6); it shows more dependence on the immediate situational context than text (7) but not as much as in text (6). Note that text (8) does not consistently resemble either text (6) or (7). Instead, in some respects it is more similar to text (6), in others – it is rather like text (7). This suggests, Biber maintains, that statements about how two texts or two registers are different should not be limited to a single dimension. Text samples and registers in general should be compared along a number of dimensions before attempting definitive statements about register differences (cf. below for illustration of multi-dimensional treatment of textual differences). Abstracting away from the three text samples above and looking at genres in general, Biber cites several examples of situational or functional register differences (1988: 70-71). Among the written genres, press addresses a more general audience than academic prose and it involves considerable effort to maintain a relationship with its audience. Contrary to academic texts, abstract information as well as temporal and physical aspects of the subject matter are of equal importance in press. Compared to press in general, editorial letters assume more shared background knowledge, for example, concerning specific social issues that are commented upon or previous issues of a periodical. Professional letters resemble academic prose structurally (often stating a thesis followed by supporting arguments) but are directed to individuals, enable more interaction between participants, involve an interpersonal relationship and rely heavily on shared background. 
Fiction is directed to a very broad audience but assumes a great deal of shared cultural knowledge or creates its own internal physical and temporal frame of reference to be shared by the readers. Finally, personal letters are highly informal in style and personal in subject matter, and they assume a high degree of shared background knowledge.
Of the spoken genres, public speeches permit little interaction and, compared to conversation, involve less dependence on shared knowledge. Spontaneous and planned speeches differ in the amount of time permitted for preparation and production. Interviews are different from speeches in that the former have a strictly interactional focus. Face-to-face conversation and, to a lesser extent, telephone conversation focus primarily on interaction to the point of dominating informational content. This is to be contrasted with many other genres, such as news broadcasts, which are tightly restricted to the content being reported. As noted above, genres may be distinguished by various criteria. The above examples may be described as relating to the type of audience or interaction involved, the purpose, level of formality, interrelation between participants (personal, lecturer–passive participant, business-related) and the amount of shared background knowledge. All these are external criteria. However, the bulk of Biber’s (1988) analysis relies on textual relations in English speech and writing, i.e. the systematic linguistic differences and similarities that hold among texts of different registers. However, recall that the two levels of analysis, external and linguistic, are not assumed to be completely independent of each other. Rather, if differences of situational parameters can be paired systematically with linguistic differences, the resulting correlation will facilitate systematic specification of important principles of language use. Ultimately, linguistic features, their communicative functions in language use, and the components of the speech situation come together in Biber’s analysis to offer a comprehensive view of basic, underlying patterns of language variation in English. Above we have seen simple illustration of such variability encoded in situational terms. We also noted Biber’s notion of multi-dimensional analysis. 
Below we proceed to discuss linguistically defined dimensions along which spoken and written registers differ.
Dimensions of variation can be considered from a strictly linguistic point of view. Turning back to texts (6) – (8), the following may be observed: text (6) is verbal rather than nominal (i.e. many verbs, few nouns) and it has a simple structure (little phrasal or clausal elaboration). Text (7), on the other hand, is notably nominal and structurally complex, whereas text (8), when viewed against the same structural criteria, is between the two (Biber 1988: 12). Other linguistic features may be used as well to establish further differences. Table 3 lists the occurrences of passives, nominalizations, 1st and 2nd person pronouns, and contractions as appearing in texts (6) – (8) (from Biber 1988: 15).
               passives   nomin.    1st & 2nd p. pronouns   contractions
conversation   0/0        1/0.84    12/10.2                 6/5.1
sci. prose     3/6.8      5/11.4    0/0                     0/0
panel disc.    2/2.2      4/4.3     10/10.8                 3/3.2
Table 3 Frequency counts for texts (6) – (8) (raw frequency count followed by normalized count per 100 words)
The conversational text and scientific text clearly contrast in their respective counts of all four features. Assuming for the time being that these two texts are representatives of their genres, it may be inferred that passives tend to co-occur with nominalizations and are far more common in academic prose; contractions, on the other hand, tend to co-occur with 1st and 2nd person pronouns and are much more common in conversation than they are in academic prose. 18 Additionally, in these two texts (and their respective genres), a notable presence of passives/nominalizations may be expected to be accompanied by few pronouns/contractions and vice versa. The two pairings of features are thus complementarily distributed. Frequency data such as these indicate that, in the same way as texts can be compared in terms of their situational characteristics, pairings of linguistic features may define dimensions of variation along which to differentiate between genres. And so, passives and nominalizations belong to the same linguistic dimension: passives are to be expected in a text that is rich in nominalizations; few passives are likely when there are few nominalizations; similarly, with the same pattern of occurrence, 1st and 2nd person pronouns and contractions (and probably other features as well) are both part of the same dimension. Although the illustration presented here is only a simplified outline of the dimensions actually found in English by Biber (see below), it is useful as a conceptual representation of the notion dimension. The author explains (ibid: 13):
[A] linguistic dimension is determined on the basis of a consistent co-occurrence pattern among features. That is, when a group of features consistently co-occur in texts, those features define a linguistic dimension […] This approach is based on the assumption that strong co-occurrence patterns of linguistic features mark underlying functional dimensions. Features do not randomly co-occur in texts. If certain features consistently co-occur, then it is reasonable to look for an underlying functional influence that encourages their use.
In this way, the functions are not posited on an a priori basis; rather they are required to account for the observed co-occurrence patterns among linguistic features.
18 Later on in his analysis, Biber confirms this preliminary inference as valid across the board.
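The normalized counts reported in Table 3 follow the usual rate-per-N-words computation, which makes texts of unequal length comparable. A minimal Python sketch is given below; the function name `normalize` and the 120-word sample are illustrative assumptions, since the lengths of texts (6) – (8) are not stated here.

```python
def normalize(raw_count: int, text_length: int, per: int = 100) -> float:
    """Frequency of a feature per `per` words of running text."""
    if text_length <= 0:
        raise ValueError("text_length must be a positive number of words")
    return raw_count * per / text_length


# Hypothetical sample: 6 contractions in a 120-word text.
rate = normalize(6, 120)
print(f"{rate} contractions per 100 words")
```

A raw count of 6 thus means different things in a short and in a long text, which is why the comparison of genres rests on the normalized figures rather than on the raw ones.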
The exact nature of the co-occurrence patterns discussed by Biber needs further elaboration. Still considering only the two texts (6) and (7), it appears that there is a co-occurrence pattern between the passive–nominalization dimension on the one hand and the pronoun–contraction dimension on the other. Namely, many passives and nominalizations entail markedly few pronouns and contractions and, conversely, many pronouns and contractions co-occur with relatively few passives and nominalizations. The two pairs of features complement each other. This might imply that the two dimensions, which so far introduce no contrast of their own in the comparison of genres, are in fact one and the same dimension. For these two texts all four features represent a unified dimension, which allows the researcher to predict a marked absence or a marked presence of any of the four features, based on the frequency of a single feature. Yet Biber notes that consideration of the third genre, panel discussion, suggests that the two dimensions should in fact be kept separate. Unlike the conversation text and the scientific text, panel discussion does not seem to have strong (negative) preferences and has a relatively high count of all four features. Text (8) shows that, in certain genres, a marked presence of passives/nominalizations does not preclude a high count of personal pronouns/contractions and vice versa. Thus the two co-occurrence distributions of passives–nominalizations and pronouns–contractions are not related consistently across all genres. In fact, a text may exhibit any quantitative combination of the four features, including cases where all four features have a low frequency count. Consider the text below (Biber 1988: 16):
(9) Fiction
She became aware that the pace was slackening: now the coach stopped. The moment had come. Upon the ensuing interview the future would depend. Outwardly she was calm, but her heart was beating fast, and the palms of her hands were damp.
In text (9), none of the four features occurs even once. We may still postulate the existence of the two co-occurrence distributions of passives–nominalizations and pronouns–contractions. However, we can no longer assume that the two patterns belong to the same dimension – they represent two independent dimensions. Their independence is represented graphically in Figures 1 and 2 (after Biber 1988: 17):
[Figure 1 One-dimensional plot of four genres: nominalizations and passives. From many to few passives and nominalizations: scientific text; panel discussion; conversation; fiction.]

[Figure 2 Plot of four genres: 1st and 2nd person pronouns and contractions. From many to few pronouns and contractions: conversation; panel discussion; scientific text and fiction.]
Figures 1 and 2 illustrate how genres can be compared along various dimensions of variation and how a single genre can be shown to be similar to various other genres depending on which dimension serves as the basis of comparison. By way of illustration, conversation is like fiction with regard to the passives– nominalizations dimension (Figure 1) but the two are maximally different from each other with respect to the pronouns–contractions dimension (Figure 2). A similar relation applies to comparison of scientific text and panel discussions, with a striking discrepancy between the scores across the two dimensions. We conclude that each dimension independently delivers its own variety of descriptions and comparisons, and these various kinds of comparison allow the analyst to relate genres to one another at multiple levels of enquiry. Thorough investigation of genre differences should obviously consider dimensions comprising other linguistic features. Another set of linguistic features that seem to co-occur in Texts (6) – (9) is the pairing of past tense verbs and 3rd
person personal pronouns (Biber 1988: 16). The academic text has no 3rd person pronouns and no past tense verbs, the conversation and panel discussion texts have a few past tense verbs and no 3rd person personal pronouns, and the fiction text includes frequent occurrences of both features. Below, Figure 3 plots the relevant co-occurrence pattern.

[Figure 3 One-dimensional plot of four genres: 3rd person pronouns and past tense verbs. From many to few pronouns and past tense verbs: fiction; panel discussion and conversation; scientific text.]
The co-occurrence pattern in Figure 3 represents yet another way in which the four registers can be compared. For instance, for the first time here, conversation and scientific prose come close to each other at the bottom of the continuum, both sharing the common characteristic of relatively few 3rd person personal pronouns and past tense verbs. If we were to generalize about the four genres, it would be superficial to say that, for example, conversation is more similar to fiction than academic prose is. Instead, one must specify that conversation and fiction are alike with respect to a particular dimension, and the same might not be the case across all dimensions.
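The point that similarity between genres is always relative to a dimension can be made concrete with a small sketch. The code below encodes the positions of the four genres in Figures 1 to 3 as three coarse bands; the band values (2, 1, 0) and the names `POSITIONS`, `gap` and `profile` are our own illustrative assumptions, not frequency counts and not part of Biber's analysis.

```python
# Coarse positions read off Figures 1-3: 2 = "many" end, 0 = "few" end,
# 1 = in between. These bands are an illustrative assumption, not counts.
POSITIONS = {
    "passives and nominalizations": {
        "scientific text": 2, "panel discussion": 1,
        "conversation": 0, "fiction": 0,
    },
    "1st/2nd person pronouns and contractions": {
        "conversation": 2, "panel discussion": 1,
        "scientific text": 0, "fiction": 0,
    },
    "3rd person pronouns and past tense verbs": {
        "fiction": 2, "panel discussion": 1,
        "conversation": 1, "scientific text": 0,
    },
}


def gap(dimension, genre_a, genre_b):
    """Distance between two genres on one dimension (0 = same band)."""
    bands = POSITIONS[dimension]
    return abs(bands[genre_a] - bands[genre_b])


def profile(genre_a, genre_b):
    """Per-dimension distances between two genres."""
    return {dimension: gap(dimension, genre_a, genre_b) for dimension in POSITIONS}


if __name__ == "__main__":
    # Conversation and fiction: alike on one dimension, far apart on another.
    for dimension, distance in profile("conversation", "fiction").items():
        print(f"{dimension}: {distance}")
```

No single number summarizes how similar conversation and fiction are; the answer is a profile with one value per dimension, which is the sense in which the comparison has to be multi-dimensional.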
We have observed above in our discussion of situational factors that statements concerning register variation should not be restricted to single dimensions; rather, multi-dimensional analysis is preferable. Here again, as evident from analysis of Figures 1 – 3, the same conclusion seems the only sensible option. Comparison of any two genres must entail comparison against the background of many independent dimensions (hence Biber’s label multi-dimensional analysis).

The examples discussed above were simplified and intended for illustrative purposes only. The text samples were far too short and too few to offer representative samples from which to draw reliable conclusions. Similarly, the co-occurrence patterns on which dimensions of variation were constructed were intentionally limited to merely two features. Below we build up the present discussion to incorporate Biber’s formulation of 5 major dimensions based on 67 linguistic features. We outline Biber’s findings of the complex co-occurrence patterns found in English and the linguistic dimensions of variation which they delimit.

Having selected his set of genres and text samples, and prior to any comparison, Biber – on the basis of previous research – identifies a list of linguistic features associated with different communicative functions and therefore potentially relevant in register variation. He identifies 67 features, which are used in his analysis, and which fall into 16 broad grammatical categories (Biber 1988: 73; a complete list of the features is given therein): (1) tense and aspect markers, (2) place and time adverbials, (3) pronouns and pro-verbs, (4) questions, (5) nominal forms¹⁹, (6) passives, (7) stative forms²⁰, (8) subordination features²¹, (9) prepositional phrases, adjectives and adverbs, (10) lexical specificity²², (11) lexical classes²³, (12) modals, (13) specialized verb classes²⁴, (14) reduced forms and dispreferred structures²⁵, (15) coordination and (16) negation.
Subsequently, computer programs are used to identify and count each occurrence of each feature, and a statistical procedure is implemented to identify co-occurrence patterns among these features. The groupings of features thus discovered define 5 major dimensions of variation, depicted below in Table 4.²⁶

¹⁹ In Biber’s analysis, nominal forms include (1) nominalizations (ending in -tion, -ment, -ness, -ity), (2) gerunds functioning as nouns and (3) the total of other nouns.
²⁰ be as main verb and existential there.
²¹ Wh- clauses, that verb complements, present participial clauses, etc.
²² Type/token ratio and mean word length.
²³ Conjuncts, downtoners, hedges, amplifiers, emphatics, discourse particles (e.g. sentence initial well), demonstratives.
²⁴ For instance, public verbs (assert, declare, etc.) vs. private verbs (doubt, believe, etc.)
²⁵ Contractions, that-deletion, stranded prepositions, split infinitives and split auxiliaries.
²⁶ Table 4 is based on Biber (1988: 89-90) and Biber et al. (1998: 148). Some of the original labels of the 1988 analysis were replaced in Biber et al. (1998) to better reflect the interpretation of the dimensions. Two minor dimensions of the total of seven indicated by Biber (1988) have been left out.

Dimension 1: “Involved vs. Informational Production”
  private verbs                               .96
  that-deletion                               .91
  contractions                                .90
  present tense verbs                         .86
  2nd person pronouns                         .86
  do as pro-verb                              .82
  analytic negation                           .78
  demonstrative pronouns                      .76
  general emphatics                           .74
  1st person pronouns                         .74
  pronoun it                                  .71
  be as main verb                             .71
  causative subordination                     .66
  discourse particles                         .66
  indefinite pronouns                         .62
  general hedges                              .58
  amplifiers                                  .56
  sentence relatives                          .55
  wh-questions                                .52
  possibility modals                          .50
  non-phrasal coordination                    .48
  wh-clauses                                  .47
  final prepositions                          .43
  adverbs                                     .42
  - - - - - - - - - - - - - - - - - - - - - - - -
  nouns                                      -.80
  word length                                -.58
  prepositions                               -.54
  type/token ratio                           -.54
  attributive adjectives                     -.47
  place adverbials                           -.42
  agentless passives                         -.39
  past participial postnominal clauses       -.38

Dimension 2: “Narrative vs. Non-Narrative Discourse”
  past tense verbs                            .90
  3rd person pronouns                         .73
  perfect aspect verbs                        .48
  public verbs                                .43
  synthetic negation                          .40
  present participial clauses                 .39
  - - - - - - - - - - - - - - - - - - - - - - - -
  present tense verbs                        -.47
  attributive adjectives                     -.41

Dimension 3: “Elaborated vs. Situation-Dependent Reference”
  wh-relative clauses on object positions     .63
  pied-piping constructions                   .61
  wh-relative clauses on subject positions    .45
  phrasal coordination                        .36
  nominalizations                             .36
  - - - - - - - - - - - - - - - - - - - - - - - -
  time adverbials                            -.60
  place adverbials                           -.49
  adverbs                                    -.46

Dimension 4: “Overt Expression of Argumentation”
  infinitives                                 .76
  prediction modals                           .54
  suasive verbs                               .49
  conditional subordination                   .47
  necessity modals                            .46
  split auxiliaries                           .44
  possibility modals                          .37
  - - - - - - - - - - - - - - - - - - - - - - - -
  [no negative features]

Dimension 5: “Impersonal vs. Non-Impersonal Style”
  conjuncts                                   .48
  agentless passives                          .43
  past participial adverbial clauses          .42
  by-passives                                 .41
  past participial postnominal clauses        .40
  other adverbial subordinators               .39
  - - - - - - - - - - - - - - - - - - - - - - - -
  [no negative features]

Table 4 The co-occurrence patterns underlying the five major dimensions of English
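To show how the positive and negative groupings of Table 4 can place a text on a dimension, we add a minimal sketch. It assumes an unweighted scheme in which each feature's frequency is standardized across the texts under comparison, and the standardized values of the negative features are subtracted from those of the positive features. This is a simplification that should be checked against Biber (1988) before being relied on. The feature names come from Dimension 2 in Table 4; the frequencies and the function names are invented for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """Convert raw frequencies to z-scores across the texts compared."""
    centre = mean(values)
    spread = pstdev(values)
    return [0.0 if spread == 0 else (value - centre) / spread for value in values]


def dimension_scores(frequencies, positive, negative):
    """Sum z-scores of positive features and subtract those of negative ones.

    `frequencies` maps a feature name to a list of normalized counts,
    one entry per text, in the same text order for every feature.
    """
    z_scores = {feature: standardize(counts) for feature, counts in frequencies.items()}
    n_texts = len(next(iter(frequencies.values())))
    scores = []
    for index in range(n_texts):
        plus = sum(z_scores[feature][index] for feature in positive)
        minus = sum(z_scores[feature][index] for feature in negative)
        scores.append(plus - minus)
    return scores


if __name__ == "__main__":
    # Hypothetical counts per 1,000 words for three texts (made up).
    frequencies = {
        "past tense verbs": [80, 20, 10],
        "3rd person pronouns": [60, 10, 5],
        "present tense verbs": [20, 70, 90],
    }
    scores = dimension_scores(
        frequencies,
        positive=["past tense verbs", "3rd person pronouns"],
        negative=["present tense verbs"],
    )
    print(scores)  # the first text lands closest to the narrative pole
```

Under this scheme a text rich in the positive features and poor in the negative ones receives a high score, and the scores of the texts compared sum to zero, so that each score is interpretable only relative to the other texts in the sample.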
Most of the dimensions consist of two broad groupings of features, which are distributed in texts in a complementary pattern (as were passives and nominalizations with respect to 1st and 2nd person pronouns and contractions in the examples above). The division is indicated by a dashed line, with the upper section comprising features with “positive weights” and the lower section comprising features with “negative weights” (Biber’s labels). In a text, a high frequency of the positive features will point to this text being closer to the upper extreme of the dimension continuum, and a high frequency of the negative features will place this text towards the lower extreme of the continuum. To take the example of the simplified dimension represented in Figure 3 above, which we assume to represent the functional dimension of narrative–non-narrative texts, the two features of 3rd person personal pronouns and past tense verbs constitute positive features characteristic of highly narrative texts. If it is possible to identify a feature characteristic of non-narrative texts, it would be a feature with a negative weight on that dimension (i.e. listed below the dashed line). A straightforward complementary relation may be observed: when a text has many occurrences of the positive features, it will also have notably few occurrences of the negative features and vice versa. Additionally, all the features defining a dimension are ranked in decreasing order of their relevance in the calculation of a text’s dimension score. The numbers (the ‘weights’ or ‘loadings’) appearing to the right of each feature correspond to how relevant a given feature is in the characterization of a text with respect to the given dimension. To give an example, private verbs have a greater positive weight (.96) on Dimension 1 than wh-clauses (.47) do. Both tend to occur more frequently in texts characterized by the upper end of the dimension (i.e. texts described as “involved” as opposed to “informational”), but private verbs are a more salient feature. Among the negative features, nouns (–.80) is a more prominent feature than agentless passives (–.39), although both are listed as typically appearing in information-oriented texts.²⁷

²⁷ We ignore the exact calculation of these values. Cf. Biber (1988) for a detailed discussion.

The co-occurrence patterns identified by Biber serve the author as a springboard for functional interpretations. Recall that his central assumption here is that “linguistic features co-occur frequently in texts because they are used for a shared set of communicative functions in those texts” (Biber 1988: 101). In his functional interpretations of textual differences, the author distinguishes seven major functions that can be served by linguistic features. Each of them is associated with a type of information that is marked in discourse. These seven functions are presented below (Biber 1988: 35):

I. Ideational functions
   A. Presentation of propositional meaning (referential content)
   B. Informational density
II. Textual functions
   A. Different ways of marking informational structure and prominence
   B. Different ways of marking cohesion
   C. The extent to which informational structure, prominence, and cohesion are marked
III. Personal functions
   A. To mark group membership of addressor
   B. To mark idiosyncratic characteristics of addressor
   C. To express attitudes towards the communicative event or content
IV. Interpersonal functions
   A. To mark role relations between participants
   B. To express attitudes towards particular participants
V. Contextual functions
   A. To mark physical or temporal setting
   B. To mark purpose
   C. To mark the psychological ‘scene’
VI. Processing functions: caused by or in consideration of the production and comprehension demands of the communicative event
VII. Aesthetic functions: personal and cultural attitudes towards form
   A. To conform to grammatical prescriptions
   B. To conform to ‘good style’

The five dimensions of variation thus receive a functional interpretation based on the functions specified above. The labels assigned by the author to each dimension are indicative of their functional interpretations:

Dimension 1: Involved vs. Informational Production marks “affective, interactional and involved” content versus “high informational density and exact informational content” (Biber 1988: 107). The contrast between the two may be related to differences of purpose and production circumstances: the former pole of this dimension represents texts (e.g. telephone conversations) with an interactional, involved and affective focus and associated with “strict real-time production and comprehension constraints” resulting in “generalized lexical choice and a generally fragmented presentation of information”²⁸ (Biber ibid.); the latter pole of the dimension represents discourse “carefully crafted and highly edited” with a highly informational focus, precise phrasing and lexical choice, and careful integration of information into the text (as in official documents).

²⁸ Cf. text (6) above.

Dimension 2: Narrative vs. Non-Narrative Discourse is self-explanatory. It distinguishes discourse of a primarily narrative focus (especially fiction) from discourse with non-narrative purposes (argumentative, expository, descriptive, conversational, or other). The most salient of the positive features – past tense verbs, 3rd person pronouns and perfect aspect verbs – clearly indicate the narrative focus. Reference to non-past time is markedly frequent in non-narrative discourse (Biber et al. 1998: 153).

Dimension 3: Elaborated vs. Situation-Dependent Reference distinguishes between highly explicit, elaborated and context-independent (endophoric) reference and nonspecific, situation-dependent (exophoric) reference. The most prominent positive features in this dimension – 3 types of wh-relative clauses – are an example of devices which allow explicit, elaborated identification of referents (as in academic prose). Furthermore, the use of nominalizations facilitates explicit integration of information. At the other pole of the dimension are texts (conversations) that rely on “nonspecific deictics and reference to an external situation for identification purposes” (Biber 1988: 115). Comprehension is thus largely dependent on “direct reference to, or extensive knowledge of, the physical and temporal situation of discourse production” (Biber 1988: 148).
Dimension 4: Overt Expression of Argumentation marks argumentative discourse: either mere presentation of a point of view or an intentional attempt to persuade the addressee. Through the use of various modals, this dimension marks the speaker’s assessment of likelihood, advisability and necessity.

Dimension 5: Impersonal versus Non-Impersonal Style distinguishes discourse that is abstract and technical in content, and impersonal and formal in style from other types of discourse.

As to individual features considered by Biber (1988), of particular interest to the present study are nominalizations (see Chapter 4). Below we outline Biber’s findings concerning their distribution across registers and functional interpretations.

7. English nominalizations in Biber (1988)

Abstract nominalizations in -tion, -ment, -ness, -ity are explicitly listed as one of the positive features (albeit with smaller weights) defining Dimension 3 Elaborated vs. Situation-Dependent Reference. Functionally, the use of nominalizations facilitates explicit identification of referents and economical integration of complex information into the minimum of words. This typically results in highly elaborated discourse, especially when considered in combination with the other positive features assigned to Dimension 3. Additionally, with respect to genre differences, frequent nominalizations indicate the prominence of informational content. In fact, all nominals in general are “the primary bearers of referential meaning in a text, and a high frequency of nouns thus indicates great density of information” (Biber 1988: 104). There are thus several functional characteristics that we note to be associated with nominalizations and which co-occur in texts: explicit reference, dense integration of information, informational focus and content.
When close association of these characteristics is acknowledged, one must also recognize a close association between Dimensions 1 and 3 (at least with respect to the use of nouns and nominalizations). The negative features of Dimension 1 mark discourse that is highly informational, and the positive features of Dimension 3 mark discourse that is elaborated and explicit in reference. It is very likely that this is where the two dimensions overlap. This is confirmed by Biber (1988: 110) when he concedes that “referentially explicit discourse also tends to be integrated and informational”. Although the features for Dimension 1 do not include nominalizations as such, they do include nouns (which include nominalizations) and word length. Similarly to nominalizations, word length marks high density of information and “precise lexical choice resulting in an exact presentation of informational content”. And, obviously, nominalizations add to extensive word length, as they do tend to be rather sizeable. We thus conclude that among the features of Dimensions 1 and 3, nominalizations are one of the primary markers of elaborated discourse, explicit reference, dense integration of information, as well as informational focus and content.

We now turn to consider which genres of those investigated by Biber (1988) match the description presented above along the lines of Dimensions 1 and 3. Below, Figures 4 and 5 plot mean scores of genres along Dimensions 1 and 3 respectively. Positive and negative values correspond to the two-pole distinction in each dimension. That is, in Dimension 1, positive values correspond to markedly involved discourse while negative values correspond to informational content. In Dimension 3, positive values correspond to markedly elaborated/explicit reference whereas negative values correspond to situation-dependent reference. As is clear from both figures, elaborated/explicit reference and informational content co-occur in several genres, notably in official documents, academic prose, press reviews and religion. With specific regard to nominalizations, one may infer that they will be an important part of the linguistic characterization of these genres. This matter, however, requires further confirmation in the form of exact quantitative counts. Biber (1988: Appendix II) cites the following mean frequency counts of nominalizations for the top five registers:

professional letters – 44.2 items per 1,000 words
official documents – 39.8 items per 1,000 words
academic prose – 35.8 items per 1,000 words
press editorials – 27.6 items per 1,000 words
religion – 26.8 items per 1,000 words

Overall, the ranking of registers according to their scores on Dimensions 1 and 3 in Figures 4 and 5 matches the ranking of registers according to the mean frequency counts of nominalizations.
Yet there is an important discrepancy: professional letters, not official documents, have the highest mean count of nominalizations per 1,000 words. Because professional letters score low on Dimension 1 in terms of informational content (approximately -2.5, compared to -20 for official documents), we conclude that Dimension 1 values may not accurately reflect the frequency of nominalizations in a register. This indicates that nominalizations are after all more representative of Dimension 3 than they are of Dimension 1. Biber’s work is an important contribution to what we know about the distribution of nominalizations across registers. Yet his 1988 analysis offers no insight into the extent to which particular nominalizing affixes are diversified across genres. This issue, as well as others related to nominalizing affixes, will be addressed in the next chapter.
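The counts cited above are normalized per 1,000 words and lump the four suffixes together. As a rough illustration of how a per-suffix breakdown could be obtained, the sketch below counts word forms ending in -tion, -ment, -ness and -ity and normalizes the total. It is a naive string match under our own assumptions: it ignores plurals and would also count non-derived words such as city or moment, so it is not a substitute for a tagged corpus. The function names are hypothetical.

```python
import re
from collections import Counter

# The four suffixes listed for nominalizations in Biber's feature set.
SUFFIXES = ("tion", "ment", "ness", "ity")


def nominalization_counts(text):
    """Return (counts per suffix, total number of words) for a text.

    Naive: any word longer than the suffix and ending in it is counted.
    """
    words = re.findall(r"[A-Za-z]+", text.lower())
    counts = Counter()
    for word in words:
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) > len(suffix):
                counts[suffix] += 1
                break
    return counts, len(words)


def per_thousand(count, total_words):
    """Normalize a raw count to a rate per 1,000 words."""
    return 1000.0 * count / total_words if total_words else 0.0


if __name__ == "__main__":
    sample = "Her happiness depended on the completion of the agreement."
    counts, total = nominalization_counts(sample)
    print(dict(counts), total)
    print(per_thousand(sum(counts.values()), total))
```

Keeping the counts separate for each suffix is what makes it possible to ask whether, say, -ness and -ity pattern alike across registers, a question the aggregate figures cannot answer.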
[Figure 4 Mean scores for Dimension 1 for each of the genres (Biber 1988: 128). Vertical scale from 35 (involved) to -20 (informational). Genre labels in the order in which they appear from the top of the scale to the bottom: telephone conversations, face-to-face conversations; personal letters, personal speeches, interviews; romantic fiction, prepared speeches; mystery and adventure fiction, general fiction, professional letters, broadcasts; science fiction, religion, humor, popular lore, editorial, hobbies, biographies, press reviews, academic prose, press reportage; official documents.]
[Figure 5 Mean scores for Dimension 3 for each of the genres (Biber 1988: 143). Vertical scale from 8 (elaborated reference) to -9 (situation-dependent reference). Genre labels in the order in which they appear from the top of the scale to the bottom: official documents; professional letters; press reviews, academic prose; religion; popular lore; editorials, biographies; spontaneous speeches; prepared speeches, hobbies; press reportage, interviews; humor; science fiction; general fiction; personal letters, mystery & adventure fiction; face-to-face conversation, romantic fiction; telephone conversations; broadcasts.]
Chapter 4: A register-sensitive study of English nominalizations

1. Introduction

Variation and variety are inherent in language. On different occasions, different forms are used to express the same idea and still different forms may be used in a similar situation by different language users. Speakers constantly switch from one language variety to another depending on situational settings, the kind of discourse they are involved in, the purpose of language production, the intended audience, and the nature of the speaker-hearer relationship. In general, it is commonly acknowledged that such non-linguistic, contextually defined factors underlie the differentiation of distinct language varieties or registers, where register is a “cover term for any variety associated with a particular configuration of situational characteristics and purposes”¹ (Biber and Conrad 2001: 175). Thus registers are defined in terms of non-linguistic factors. And yet these situational parameters of language use correlate with linguistically definable properties that are traceable to particular registers. The study of register variation thus focuses on systematic patterns of variation in language use that are instantiated in certain patterns of linguistic features (i.e. grammar constructions, vocabulary items, word-formational elements, etc.).² These features also include phenomena relevant to morphology and these will be discussed in this chapter.

¹ The term genre is also used with the same meaning by some authors and corpus compilers. Most linguists use one or the other, or both, interchangeably, on different occasions. However, there have been attempts to differentiate between the two. Lee (2001: 47) argues that “the term genre is used to describe groups of texts collected and compiled for corpora or corpus-based studies, and register has typically been used in a very uncritical fashion, to invoke ideas of ‘appropriateness’ and ‘expected norms’, as if situational parameters of language use have an unquestionable, natural association with certain linguistic features and that social evaluations of contextual usage are given rather than conventionalised and contested”. Biber (1994) gives a survey of how these and other related constructs (text types, styles, sublanguages) have been variously used and defined.
² For example, passive voice tends to co-occur with nominalizations to characterize more formal registers. At the same time, these registers will have markedly few pronouns and contractions (Biber 1988). The identification of such patterns of co-occurrence is mostly attributed to the work of Douglas Biber (especially 1988) – cf. Chapter 3 of the present work.
The content of this chapter is related to the following four subject areas:

• register variation, which is concerned with the identification of systematic differences between various language varieties, however narrowly defined, such as written and spoken language, fiction, academic prose, book reviews, lectures, conversation, etc.
• morphological productivity, which is interested in establishing the word-coining potential of, for example, affixes
• register-sensitive affix productivity, which is to be understood as the interplay of the first two subject fields
• corpus linguistics, which studies language patterns by means of corpus-based research

In relation to corpus-based study of register variation, we add to the research carried out hitherto by examining a number of nominalizing English suffixes and putting forward claims about their varied distribution across registers. It will be shown that the suffixes themselves exhibit preferences as to their occurrence in particular registers and that, in the case of the most common suffixes (-ness, -ity, -ion and -ment), the internal morphological make-up of their base forms may also have a significant bearing on their quantitative distributions.

As regards morphological productivity, we offer a corpus-and-dictionary-based analysis of the same suffixes with regard to the number of new coinages derived by the respective formatives. This is done by the joint utilization and cross-reference of the British National Corpus (BNC) and the Oxford English Dictionary (OED). As is to be expected, the suffixes will uncontroversially exhibit varied degrees of productivity. More interestingly, however, our analysis will show that the morphological constitution of the base form may still influence the probability of a new word coming into existence. Additionally, our findings on morphological productivity and neology will discriminate between particular registers and identify certain noteworthy patterns.
In historical perspective, the present study fills several research gaps. Firstly, little if anything is known from previous research about the influence of base-internal morphological structure on register variation: few analyses extend beyond the rightmost affix.³ As a consequence, nominalizations ending in a particular affix, or virtually all nominalizations (e.g. Chafe and Danielewicz 1987), tend to be somewhat superficially treated as uniformly more characteristic of one register than another. For example, Biber (1988: 17) compares four different
³ Base-internal complexity is ignored in, for example, Biber (1988), Biber et al. (1998 and 1999) and Plag et al. (1999). Suffix–suffix combinations are considered by Baayen and Renouf (1996) but their analysis is limited to newspaper English only.
registers (or genres) along a scale of the extent to which nominalizations and passives are used in each (see section 6 in Chapter 3). The two features are said to exhibit co-occurrence patterns (i.e. be jointly frequent or rare in a given language variety) and their joint increasing occurrence is said to set apart four genres as follows: fiction < conversation < panel discussion < scientific text. What is of interest to us is that all nominalizations are taken there as a unified group to contrast the four genres, or at least the two extremes of the plotted scale.⁴ Paying more attention to affixal identity, Biber et al. (1998: 63) assert that “[t]he -ness ending is more important in fiction than in either of the other two registers”⁵ (i.e. academic prose and speech). Claims to the same effect are made by Biber et al. (1999). On closer inspection, however, further investigations into the morphological make-up of the base form show that serious questions may hang over the accuracy of such statements. For example, the derivatives in -iveness from the BNC are found in this study to be almost nine times as common in academic prose as they are in fiction (see section 4.3.2). Similarly, Biber (2004: 21) makes another sweeping generalization in his claim that nominalizations, along with some other language features such as long words, abstract nouns and relative clauses, are “typical of written nonfictional registers intended for specialist audiences”. As both accounts, i.e. Biber et al. (1998) and Biber (2004), regard -ness words as nominalizations, the two quotations above must now appear confusingly contradictory. Thus, it is the kind of more detailed analysis that we have illustrated with -iveness derivatives, i.e. the scrutiny of the morphological structure of the base form, that will properly distinguish between registers. Our study of English nominalizing suffixes in the BNC indicates that their distribution is not as consistent as one might wish.

Plag et al.
(1999) is a study of morphologically-based differences between speech and writing and, as the authors claim, is “a first window on this aspect of register variation”. This work too suffers from research gaps that have yet to be attended to. Firstly, the authors only cover differences between spoken and written language, the former being additionally differentiated into two domains of ‘spoken context-governed’ and ‘spoken demographic’. In contrast, in our research we compare a wider range of registers from the same corpus (see Methodology). Secondly, no distinction is made by Plag et al. (1999) as to the internal make-up of base forms and so the results given lack sufficient detail and accuracy. Thirdly, the authors’ method of measuring morphological productivity
⁴ This represents only a fraction of Biber’s dimensions of register variation. See section 3.5 for more discussion.
⁵ Biber’s normalized counts of -ness nominalizations per million words are 1,430 in fiction, 890 in academic prose and 480 in the spoken corpus.
(developed by Baayen and Lieber 1991) may be questioned on several counts to be discussed below. For example, two factors influencing estimated productivity are the number of word tokens with a given affix and the number of hapax legomena with the same affix. The quotient of the two values, i.e. the number of hapax legomena divided by the number of tokens, equals the degree of productivity of an affix. This method has received some criticism in the literature since the relation in the formula is such that items of high frequency and, therefore, high token counts contribute to decreased values of productivity.⁶ For example, the high frequency of words such as awareness would have the effect of decreasing the measured value of the productivity of the suffix -ness despite the fact that the suffix is highly productive. The reason why established (even lexicalized) high frequency items should downplay the otherwise unlimited potential of an affix for word-coinage is not quite clear. Reliance on hapax legomena as indicators of productivity may also be questioned.⁷ It is true that most neologisms occur among hapax legomena (Baayen and Lieber 1991, Plag 1999) and it is indeed neologisms that are the least controversial index of productivity. Yet the hapax legomena of a corpus are not necessarily novel creations in every single case: they may be rare or obsolete words, or they may occur only once in the corpus simply due to its finiteness or limited coverage. The number of hapaxes will thus indicate a measure of an affix’s relative productivity (however imprecise, see footnote 2 below) rather than give a precise number of new words with that affix. As a result, a list of the affix’s hapaxes will certainly not equal a list of the affix’s new words. More to the point, lexical innovations occurring more than once in the corpus were inevitably ignored by Plag et al. by virtue of the rationale adopted.
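The objection concerning high-frequency items can be seen directly in a toy calculation of the measure described above, i.e. hapax legomena divided by tokens for a given affix. The sketch below is a minimal illustration under our own assumptions: the word lists are invented, the affix is identified by a plain string match, and the function name `productivity` is hypothetical.

```python
from collections import Counter


def productivity(tokens, suffix):
    """Hapax-based productivity: hapaxes with the suffix / tokens with the suffix."""
    counts = Counter(
        token.lower() for token in tokens if token.lower().endswith(suffix)
    )
    n_tokens = sum(counts.values())
    n_hapaxes = sum(1 for frequency in counts.values() if frequency == 1)
    return n_hapaxes / n_tokens if n_tokens else 0.0


if __name__ == "__main__":
    # Invented mini-samples: the same two hapaxes, but "awareness"
    # is used twice in the first sample and eight times in the second.
    low_use = ["awareness"] * 2 + ["happiness", "oddness"]
    high_use = ["awareness"] * 8 + ["happiness", "oddness"]
    print(productivity(low_use, "ness"))
    print(productivity(high_use, "ness"))
```

The number of hapaxes is identical in both samples, yet the measured value drops as soon as an established word is used more often, which is the effect questioned in the text.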
Although the majority of new words do occur among hapax legomena, there is no reason why at least some of them should not have frequency counts exceeding 1. By way of illustration, the words deniability and clonability appear in the BNC seven and five times respectively but are nonetheless considered lexical innovations in this study on the grounds of their absence from the OED. As an alternative, then, in this investigation we adopt a different approach to the identification of new words. First, items of low frequencies (below 20) are isolated in the BNC and then they are tested against the OED to confirm or disprove their novelty (see Methodology for discussion). Another reason for this procedure was that, all in all, the primary aim of this study was not to gauge the levels of morphological productivity on the basis of a more or less reliable formula (notably offered by Baayen and his collaborators 8) but to retrieve from the BNC new lexical formations and then, on the basis of the word lists compiled, to put forward claims about their distribution and morphological structure. Additionally, the sheer numbers of relevant formations will speak for themselves as to the productivity of particular processes. In fact, while we recognize some of the merits of Baayen’s hapax-conditioned measurements of productivity, his formulae could not have been used in this particular study. Baayen’s measures offer a rough estimate of productivity without any insight into base-internal structural complexities. The productivity of an affix may be estimated and it may be compared to the productivity of other affixes, but no distinction is made as to whether the affix is likely to be more or less productive when following a particular type of base form. Given the bias of the present study towards base-internal morphological complexity and its effect on the productivity of the rightmost affix (as well as register variation), we follow a more discerning course of analysis – one in which claims about productivity are made with reference to various types of base forms. Cowie (2000) is also relevant to this review of literature as it is in fact one of the few studies of English nominalizations biased towards register differences. And yet its focus is on something rather different from the goals of the present work.

6 Van Marle (1992: 156), quoted in Bauer (2001: 153), convincingly points out in relation to this measurement: “I do not see what kind of direct relationship there is between the chance that a given rule is put into action and the frequency with which the words that have already been produced by that rule are used. Once a word is coined, the frequency of the use of that word, it seems to me, is more or less irrelevant to the degree of productivity of that rule.”
7 For example, Plag (1999: 112) notes that in a random sample of 10 verbal hapaxes in -ate from the Cobuild corpus, 7 items turned out to be formations of the sixteenth, seventeenth and eighteenth centuries. The author points out that the proportion of genuine neologisms among hapax legomena can vary greatly between different word-formational rules, thus casting doubt on the validity of Baayen’s measure.
A register-sensitive study of English nominalizations
The author measures the productivity of the nominalizing -ion by counting new word types in the historical corpus of English (ARCHER). The results are then compared across registers and across time periods (fifty-year intervals from 1650 to 1950). Uncontroversially, the two registers of Science and Medical are found to have the highest frequencies of new types in -ion. Biber et al. (1999: 322-325) briefly compare nominalizations across the registers of Conversation, Fiction, News and Academic English. Their analysis is hardly exhaustive but, at least in terms of the range of registers covered, their account of these four varieties constitutes perhaps the most comprehensive study of register distinctions among nominalizations (see Results and Discussion). Nevertheless, their analysis concentrates on only four nominalizing suffixes (-ness, -ity, -ion and -ism) and fails to investigate further the morphological
8 The most well-known is Baayen and Lieber’s (1991) P ‘productivity of an affix in a given corpus’: P = n1/N, where n1 equals the number of hapaxes with a given affix in a corpus and N equals the number of all tokens with that affix in the same corpus.
Chapter 4
structure of nominalizations. In this work, -ism is replaced with -ment because of our criteria of data selection (-ment derives Nomina Actionis). Additionally, the present dissertation adds two registers to those studied by Biber et al. (1999). Other than the works cited above, morphological study has made little effort to look into nominalizations with the aim of establishing register-based differences, and indeed very little effort to give a register-sensitive account of word formation in general. The studies that are available include Cowie (2006), which discusses viewpoint adverbs in -wise across the registers of the BNC; Baayen and Renouf (1996), and Renouf and Baayen (1998), both of which investigate a wide range of affixes in a corpus of newspaper English (The Times and The Independent respectively); and Fischer (1998), which is a survey of so-called creative neologisms across the corpora of two American and British newspapers. Another question that is of relevance in corpus-based investigations is that of sample size. Here, again, the present study offers an advantage. Biber et al. (1998) offer generalizations and numerical statistics concerning the distribution of nominalizations based on samples of 3 million words each of academic prose and fiction (both from the Longman-Lancaster Corpus), and the 500,000 words of the London-Lund Corpus for spoken English. In this study, results are based on samples of approximately 15 and 16 million words of academic prose and fiction respectively, 16 million words of non-academic (non-fiction) sources, 10 million words of spoken and newspaper English each, and 7 million words of popular magazines. Naturally, research based on a larger sample is at an advantage in that it will produce results that are more representative of the language in question. Plag et al.
(1999) base their findings on the same corpus, the BNC, but, as mentioned above, restrict themselves to an investigation of the spoken–written contrast. With a shift of focus to neology, the present study aims to fill another research gap. Previous research has little if anything to say about register distinctions among lexical innovations. Although new words might intuitively be expected to follow the pattern of variation set by the more established words of a structurally similar kind, no empirical evidence is available to support such expectations.

2. Aims and research questions

The aims of the corpus study were to:

1) investigate register variation with regard to the distribution of twelve English nominalizing suffixes that fall into two broad categories: those deriving abstract nouns denoting actions (Nomina Actionis – ‘action of V-ing’) and those deriving abstract nouns denoting states/qualities (Nomina Qualitatis – ‘state or quality of being Adj/N’). 9 For this aim, both innovative and well-established words will be considered. The two groups are represented by the following suffixes:

Nomina Actionis              Nomina Qualitatis
-ion (conversion)            -ness (sadness)
-ment (judgement)            -ity (purity)
-age (assemblage)            -dom (stardom)
-al (referral)               -hood (adulthood)
-ance/-ence (conveyance)     -ance/-ence (reluctance)
-ery (mockery)               -ery (savagery)
                             -(c)y (delicacy) 10
                             -ship (ownership)

2) investigate the extent to which the suffixes are used in the coining of new lexical formations (i.e. their productivity) – both nonce words and neologisms as defined in Chapter 1

3) compile a list of new formations appearing in the corpus used, study their structural patterns, and investigate register variation with regard to the distribution of these lexical innovations

There were several questions to be answered as part of Aim 1):

• Are the findings of other authors concerning register differences between nominalizations largely borne out by the data in the British National Corpus?

9 The suffix -ing was excluded on the grounds that, given the size of the BNC, it would have been nearly impossible to isolate genuine nominalizations and ignore non-nominalizing uses, cf. Malicka-Kleparska (1988). The suffixes -ance/-ence and -ery are considered here in their dual formal and functional usage as deverbal action-denoting and denominal/de-adjectival quality-denoting nouns. Due to this duality the two variants of both suffixes will be considered separately for descriptive purposes. However, both instantiations will be treated as underlyingly representing the same suffix.
10 The exact formal identity of the suffix is arguable. In this work, derivation pairs like accurate – accuracy and redundant – redundancy are represented as involving the suffix -(c)y. Following authorities such as Chomsky and Halle (1968) and Rubach (1984), for whom the suffix is represented as -y, the parenthesized (c) indicates a result of spirantization [t] > [s]. Alternatively, and with no reference to spirantization, the suffix is also represented in the literature as -cy (e.g. Plag 1999, 2003; Hay 2003, Bauer 1983). Marchand (1969: 249) discusses the same type of derivation under -acy, and yet he later argues that “the final /t/ of the basis is dropped before the suffix /si/”, thus blurring his account of the suffix’s identity and offering an alternative treatment of the consonantal alternation based on a truncation process.
• Are nominalizations similarly distributed and thus do they constitute a functionally unified category or do the twelve suffixes exhibit preferences as to their distribution across the registers of the BNC?
• If preferences are detected, is the internal morphological composition of the base form a relevant factor?

With regard to Aim 2), the following questions were addressed:

• Do the suffixes exhibit varied degrees of productivity?
• Does the morphological constitution of the base form influence the probability of a new word coming into existence?

Aim 3) has this question to answer:

• Are there any patterns in the distribution of new abstract nominalizations across registers and, if so, can they be related to structural considerations?

All the above questions will be taken up in Results and Discussion after we have presented the results of the study.

Given the aims of this dissertation, the choice of the suffixes rests on the following rationale. Firstly, derivational nominalization is the only linguistic feature which is strictly word-formational and which has been identified as a differentiating factor in register variation (Biber 1988, Plag et al. 1999). Secondly, of all the nominalizing English affixes, the four suffixes -ness, -ion, -ity and -ment have thus far been the focus of attention in the study of register variation (Biber et al. 1999). Once more they are revisited in this dissertation, although with a notable bias towards any structural considerations that may shed new light on the patterns of behaviour of these formatives across language varieties. Indeed, it is precisely for this reason that nominalizations in -ness, -ion, -ity and -ment will be given most attention: they attach to already suffixed base forms and thus lend themselves to investigations of structural aspects of register variation, morphological productivity and lexical innovation. They are also the most frequent abstract-noun-forming suffixes in English and as such are especially worthy of attention. Thirdly, for the sake of completeness of coverage, the remaining suffixes (-age, -dom, -ery, -ship, -(c)y, -al, -ance/-ence, -hood) naturally complement -ness, -ion, -ity and -ment in that they all represent two functionally coherent word-formational classes denoting either actions of V-ing or states/qualities of being Adj/N. 11

11 Each group thus consists of what can be characterized as rival exponents of the same word-formational category. The abstract-noun-forming suffix -ure (e.g. closure) has been ignored due to its minimal spread – only a dozen word types and virtually no innovative ones.
3. Methodology

3.1. The BNC genres and super-genres

The results of our study are based on the 100-million-word BNC corpus (World Edition) of contemporary British English. Over 91% of the data collected are dated from 1984 to 1994 and the remaining part is no older than 1960 (Burnard 2000). The spoken component of the BNC constitutes approximately 10% of the total and the written texts make up 90%. As further division of the spoken component is irrelevant to this work (see below for reasons), we concentrate here on the composition of the written subcorpus. All the texts in the written part of the BNC are classified into distinct categories according to several criteria. The first classification is into imaginative (fictional literary texts) and informative texts (all others). Informative texts are additionally divided into domains or broad subject fields such as Leisure, Arts, Social Science, Commerce and Finance, Belief and Thought, World Affairs, etc. This classification scheme, however, is too broad and inexplicit to be used as a basis for investigations of register variation. Lee (2001) observes that, for example, academic and non-academic texts are not explicitly differentiated, instead being subsumed indiscriminately under the domains Applied Science, Arts, Pure/Natural Science, Social Science, etc. Instead of the BNC domains, then, we will employ another classification code, that of genres. 12 Each text in the corpus is annotated as belonging to a particular genre and super-genre, a level of text categorization that, compared to the domains, is far more insightful, descriptively adequate and explicit, and therefore more suited to the study of register variation (see Lee 2001). Below, the detailed classification of texts into genres and super-genres is given as coded in the BNC (Burnard 2000) along with the interpretation of the codes (interpretation after Lee 2001).
Written texts – 46 genres

W_ac_humanities_arts (academic prose: humanities)
W_ac_medicine (academic prose: medicine)
W_ac_nat_science (academic prose: natural sciences)
W_ac_polit_law_edu (academic prose: politics, law, education)
W_ac_soc_science (academic prose: social & behavioural sciences)
W_ac_tech_engin (academic prose: technology, computing, engineering)

12 The term genre is brought up here as employed by the BNC compilers in their text classification. The term super-genre is adopted after Lee (2001) as a more general level of text categorization (e.g. tabloid newspapers (sub-genre) > national newspapers (genre) > news texts (super-genre)). However, we will still use the term register to refer broadly to any language varieties relating to different production circumstances and purposes.
W_non_ac_humanities_arts (non-academic/non-fiction: humanities)
W_non_ac_medicine (non-academic/non-fiction: medical/health matters)
W_non_ac_nat_science (non-academic/non-fiction: natural sciences)
W_non_ac_polit_law_edu (non-academic/non-fiction: politics, law, education)
W_non_ac_soc_science (non-academic/non-fiction: social & behavioural sciences)
W_non_ac_tech_engin (non-academic/non-fiction: technology, computing, engineering)
W_fict_drama (fiction: drama)
W_fict_poetry (fiction: poetry)
W_fict_prose (fiction: novels)
W_news_script (TV autocue data)
W_newsp_brdsht_nat_arts (broadsheet national newspapers: arts)
W_newsp_brdsht_nat_commerce (broadsheet national newspapers: commerce & finance)
W_newsp_brdsht_nat_editorial (broadsheet national newspapers: personal & institutional editorials, & letters-to-the-editor)
W_newsp_brdsht_nat_misc (broadsheet national newspapers: miscellaneous material)
W_newsp_brdsht_nat_report (broadsheet national newspapers: home & foreign news reportage)
W_newsp_brdsht_nat_science (broadsheet national newspapers: science material)
W_newsp_brdsht_nat_social (broadsheet national newspapers: material on lifestyle, leisure, belief & thought)
W_newsp_brdsht_nat_sports (broadsheet national newspapers: sports material)
W_newsp_other_arts (regional and local newspapers: arts)
W_newsp_other_commerce (regional and local newspapers: commerce & finance)
W_newsp_other_report (regional and local newspapers: home & foreign news reportage)
W_newsp_other_science (regional and local newspapers: science material)
W_newsp_other_social (regional and local newspapers: material on lifestyle, leisure, belief & thought)
W_newsp_other_sports (regional and local newspapers: sports material)
W_newsp_tabloid (tabloid newspapers)
W_admin (administrative and regulatory texts, in-house use)
W_advert (print advertisements)
W_biography (biographies/autobiographies)
W_commerce (commerce & finance, economics)
W_email (e-mail sports discussion list)
W_essay_school (school essays)
W_essay_univ (university essays)
W_hansard (Hansard/parliamentary proceedings)
W_institut_doc (official/governmental documents/leaflets, company annual reports, etc.; excludes Hansard)
W_instructional (instructional texts/DIY)
W_letters_personal (personal letters)
W_letters_prof (professional/business letters)
W_misc (miscellaneous texts)
W_pop_lore (popular magazines)
W_religion (religious texts, excluding philosophy)

The written genres naturally fall into 5 different super-genres in the way indicated by the prefixes used by the BNC compilers: W_ac – Written Academic, W_non_ac – Written Non-Academic (Non-Fiction), W_fict – Written Fiction, W_newsp – Written Newspaper (additionally divided into broadsheet and other), W_ – Written Other. Due to the heterogeneous nature of the genres belonging to the super-genre Written Other, the only one of its components that will be taken into consideration in this study is that of Popular Magazines. The range and sample sizes of distinct varieties of English investigated here are thus as follows: 13

Spoken (taken as a whole) – approximately 10.33 million word tokens
Fiction – approximately 16.19 million word tokens
Academic – approximately 15.43 million word tokens
Newspapers – approximately 10.64 million word tokens
Non-Academic (Non-Fiction) – approximately 16.63 million word tokens
Popular Magazines – approximately 7.38 million word tokens

The reason behind the decision not to break down the Spoken super-genre (divided in the BNC into 24 genres) is that the nominalizing suffixes were found consistently to be least productive in the spoken subcorpus (see Results and Discussion) and so most attention in the study was directed to the written component of the BNC. On the other hand, the lack of differentiation between broadsheet newspapers and other newspapers was dictated partly by practical reasons, i.e. the fact that the infrastructure of the BNC interface used in this work was

13 The discrepancies in sample sizes are largely irrelevant. Frequency counts of word tokens are normalized to a common basis of 1 million words of text. For the treatment of word types see Procedure.
such that the two types were grouped together, implying a joint treatment of the two. Secondly, to separate them would mean having two smaller samples instead of one larger sample. In research into the productivity of a particular affix, for reasons of statistical accuracy, it is preferable to compare two samples of, for example, 15 million words (Academic) and 10 million words (Newspapers) rather than compare samples of 15 million words (Academic) and 2 million words (Broadsheet Newspapers) (see below for a discussion of sample sizes). Thirdly and most importantly, in a pilot study conducted by the author, the two types proved very similar in their frequency and distribution of the nominalizing suffixes. 14 The six super-genres, then, are our primary basis for cross-register comparisons. Still, whenever appropriate or necessary, individual genres will be isolated and examined to identify noteworthy patterns of variation.

3.2. Mark Davies’s online BNC interface

In this research project, the BNC was used with the Internet-based interface of Professor Mark Davies that allows the entire corpus to be searched. 15 This online service has several advantages over the SARA software that accompanies the BNC CD-ROM. Firstly, the user-friendly interface designed by Davies makes possible straightforward searches limited to individual genres and super-genres as well as the unrestricted combination of any (super-)genres in a single search. This feature facilitates simple comparisons across genres and super-genres. Secondly, affix-based searches (both prefix and suffix) are made possible by means of wildcards, thus allowing for various morphological investigations. Searches devoted to type and token frequencies can also be executed, thus facilitating both quantitative and qualitative analyses. Finally, the service allows its users to compile customized lists of words which are then stored on the server and are available for future use.
The word lists, which can be reviewed and modified as necessary, are available for queries limited to the items on an individual list. This last feature is especially useful in corpus-based morphological research. As is well known, the collection of data in electronic corpora needs great care and involves extensive manual editing. Data retrieved in a search need tedious manual sorting for a number of reasons. First, it is necessary to eliminate items that are not nominalizations at all but only match the requested string of letters (e.g. pity and city will be included in a wildcard query (*ity) for -ity
14 Similarly, Biber et al. (1999) in their corpus-based approach to English grammar at large subsume under NEWS (newspaper language) the three registers of broadsheet, regional and tabloid newspapers.
15 The service is available at .
nominalizations). Secondly, search results need revising for misspelt words, which are retrieved as separate word types in addition to the word spelt correctly. Thirdly, even more manual editing is necessary, for example, in a study focused on de-adjectival or non-prefixed nominalizations, where non-de-adjectival and prefixed items will have to be excluded. And such is the case in the present study, in which the majority of prefixed items have been excluded (see below). For all these reasons, the customized word list feature is valuable in establishing patterns of register variation which are meant to be valid for a specifically defined class of words.

3.3. The data

The following groups of complex words were the subject of this study: de-adjectival nominalizations in -ness, -ity, -ance/-ence, -(c)y, -dom, -ery, -hood; deverbal nominalizations in -ion, -ment, -age, -al, -ance/-ence, -ery; and denominal nominalizations in -(c)y, -dom, -hood, -ship. This excluded items departing formally from the above criteria (for example through affix generalization) such as denominal -ness derivatives (owlness, godness, Guinessness) and -ion words that are best treated as simplex (function, fiction). An additional criterion of selection adopted here was that the words qualifying for inclusion had to be clearly first-cycle derivatives of the suffixes in question and not just any instantiations of the suffixes, such as prefixed formations whose base forms end in one of the twelve suffixes. For example, semi-baldness and reunification were excluded, even though the suffixes -ness and -ion appear at the end of the words, because they are best seen as derivatives of semi- and re-prefixation. To count reunification as another word type of -ion would have been to artificially inflate the productivity of this suffix and distort the accuracy of our findings. That is why most prefixed words ending in the four suffixes were excluded.
However, exceptions to this rule were made in the following cases:

• When prefixation (or compounding) clearly precedes suffixation in the derivation of a word, e.g. outrageous#ness (not out#rageousness) and wide-awake#ness (not wide#awakeness), and therefore the result of the rule that applies last is a -ness nominalization. Such items were retained as illustrating the productivity of the suffixes in question.
• When postulating prefixation would have implied the existence of unattested forms that cannot stand on their own, e.g. unruli#ness (*ruliness) and law-abiding#ness (*abidingness).
• When a prefixed word departs semantically from the corresponding non-prefixed form; e.g. excommunication is retained as it is not regularly derivable from the non-prefixed form communication.
• When a prefixed form is an attested form, but neither the root morpheme nor the corresponding non-prefixed nominalization is, e.g. transmogrification (*mogrify, *mogrification).

Compound nouns were included only when they were eligible candidates meeting the above criteria, i.e. when they were clear cases of last-cycle nominalization (topsy-turviness, soft-spokenness) and when the first root morpheme of the compound could not stand on its own (*turviness, *spokenness). 16 Items with free-standing root morphemes that were best viewed as nominalizations-turned-compounds were excluded, e.g. paper-thinness, sword-sharpness. All in all, the overriding principle in the selection of data was to include only those items that clearly illustrated the productivity of the nominalizing suffixes. For this same reason blends were excluded altogether. Any unclear items that could not be traced to any possible base forms were also deleted (e.g. ennubelation, chasifness, rogation). One observation seems in order at this point. Because of the selection criteria as they stand above, it should be noted that the results of the research will not give absolute word frequencies for the twelve types of nominalizations in the BNC. Instead of merely counting all instances of the suffixes, we will isolate genuine examples of nominalization and focus on the suffixes’ distribution across the six sub-corpora. More interestingly, we will identify those novel lexemes that are indicative of the current word-formational potential of the twelve rules. The retrieval of both new and established nominalizations from the BNC is described below.

3.4. Procedure

A terminological and analytic distinction needs to be made at this point. In the study of lexical frequency, two distinct measures are customarily taken into consideration: word types and word tokens, also abbreviated to types and tokens. The number of word types in a text equals the number of different lexemes in this text.
The number of word tokens, on the other hand, is the total of all the words in a text, including the multiple appearances of those words that are repeated. Consider the following example:

Deprived of all but his pride, David set out on his journey home.

In the above sentence, the type count is 12, but the token count is 13, since the word his occurs twice. For the same reason, the token count of his is 2, but the token count of all the other words in the sentence is 1. The two measures can be
16 These happened to be mostly de-adjectival compound nouns in -ness (see Appendix).
interpreted in various ways in studies of frequency or topicality (high or low token count), lexical richness (type count or type-to-token ratio), lexical rarities (once-only tokens, i.e. hapax legomena), etc. In the present study too, reference will be made to type and token counts as the discussion proceeds. In order to retrieve all eligible nominalizations the entire BNC was searched via Davies’s online interface. The word lists thus obtained were manually edited, deleting irrelevant items, consulting the Oxford English Dictionary 17 and analyzing the words in their context in the BNC when necessary. This yielded 1,752 different word types in -ion, 1,700 word types in -ness, 1,007 in -ity, 312 in -ance/-ence, 302 in -ment, 194 in -ship, 190 in -(c)y, 73 in -hood, 65 in -age, 62 in -dom, 58 in -ery, and 50 in -al. Afterwards, all the items were grouped together in their respective customized word lists and further queries were run to investigate the distribution and frequency of the suffixes across the six sub-corpora. Plural nouns were collapsed under their respective singular forms for the purpose of token frequency measurements. The results of this phase of research are given below in sections 4.1 and 4.2. At the same time, another goal of the study was pursued. The morphological productivity of the suffixes was measured as a function of the numbers of word types overall as well as the number of innovative word types (4.4). These two measures were obtained by means of two different methods and we will describe each in turn below. Unlike token frequency measurements, which we will employ to discuss the absolute frequencies of each suffix across the sub-corpora, word type counts involve totaling of all distinct lexemes found in a sample. This permits the linguist to gain insight into the range of various word types formed by an affix, ignoring at the same time the fact that some of the lexemes happen to be of high frequency. 
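The type/token distinction and the per-million normalization of token counts can be sketched as follows. This is only an illustration under simplifying assumptions: tokens are taken to be lowercased alphabetic strings, so "types" here are distinct word forms rather than lemmatized lexemes, and the raw count of 5,320 used in the normalization example is hypothetical. The function names are not part of any corpus tool.

```python
import re
from collections import Counter


def count_types_and_tokens(text):
    """Return (type_count, token_count, hapaxes) for a text.

    Assumption: a token is any run of letters, compared in lowercase.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    freqs = Counter(tokens)
    hapaxes = sorted(word for word, count in freqs.items() if count == 1)
    return len(freqs), len(tokens), hapaxes


def per_million(raw_token_count, sample_size):
    """Normalize a raw token count to a basis of 1 million words."""
    return raw_token_count * 1_000_000 / sample_size


sentence = "Deprived of all but his pride, David set out on his journey home."
types, tokens, hapaxes = count_types_and_tokens(sentence)
print(types, tokens)   # 12 13
print(len(hapaxes))    # 11 (every word except "his")

# Hypothetical raw count in the Newspapers sample (10.64 million tokens).
print(per_million(5_320, 10_640_000))  # 500.0
```

Only the token count is normalized in this sketch; type counts are deliberately left raw.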
Thus this is the first dimension of morphological productivity that will be considered in the discussion. The other aspect of productivity is the power of an affix to allow the formation of new word types, indicated by whatever number of such innovations is found in a corpus. Both measurements thus entail the counting of word types, which may appear problematic insofar as direct comparison of word type counts between samples of different size may seem unreliable. Unfortunately, although token frequency results obtained for samples of different size can and must be normalized to a common basis, normalizing type counts only exacerbates the problem by distorting results significantly. The best – and indeed fairly reliable and accurate –
17 The revised Second Edition available online along with the three Additions Series volumes and new material released quarterly.
course of action is therefore to compare raw (unnormalized) counts. Below we explain the rationale of this choice with an example. As one reads a book of 50,000 word tokens, the rate of word types appearing in the text is initially very high as almost each word token counts as a separate lexeme. As one continues reading, one encounters fewer new types, in inverse proportion to the number of tokens covered. Therefore, the difference between the numbers of types sampled at two cut-off points near the end of the text, for example at 95 and 99 per cent of the way through the text, should be negligible. We will test this hypothesis with three BNC samples of different sizes:

Newspapers – approximately 10.64 million word tokens
Non-Academic (Non-Fiction) – approximately 16.63 million word tokens
Popular Magazines – approximately 7.38 million word tokens

The three sub-corpora were tested for their rate of word type increase (Baayen’s vocabulary growth) using five different areas of nominalization: root+ment, root+ity, root+ness, root+ation and -ate+ion. Word type counts for each of these templates were established for each sub-corpus at comparable cut-off points to make possible direct comparison of raw counts (Tables 1, 2 and 3). 18

                   News    Non-Acad   Pop
Cut-off point 1    2.96    3.72       –
Cut-off point 2    5.64    6.20       7.38
Table 1 Type count cut-off points at X million word tokens
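The slow-down in the appearance of new types can be sketched with a short routine that reports, for chosen cut-off points, what share of a sample's total types has already been seen. The function name and the ten-token sample are invented for illustration; this is not the procedure run on the BNC, where cut-off points were dictated by the structure of the corpus.

```python
def type_coverage(tokens, cutoffs):
    """For each cut-off (number of tokens read so far), return the share of
    the sample's total types that have appeared by that point."""
    seen = set()
    types_at = {}
    wanted = set(cutoffs)
    for position, token in enumerate(tokens, start=1):
        seen.add(token)
        if position in wanted:
            types_at[position] = len(seen)
    total = len(seen)
    return {cutoff: types_at[cutoff] / total for cutoff in sorted(types_at)}


# Invented toy sample: most types turn up early, repeats dominate later.
sample = ["sadness", "purity", "judgement", "referral", "sadness",
          "purity", "sadness", "stardom", "purity", "sadness"]

print(type_coverage(sample, [5, 8, 10]))  # {5: 0.8, 8: 1.0, 10: 1.0}
```

In the toy sample four of the five types have appeared halfway through, and the last two tokens add nothing, which is the pattern the cut-off comparison is meant to detect.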
Popular magazines constitute the smallest sub-corpus examined in this study (7.38 million). Therefore we need to establish the percentage of total word types of a particular kind found in a sample (of whatever size) at the approximately 7 million cut-off point. Due to the structure of the BNC the cut-off points for newspapers and non-academic prose are 5.64 and 6.20 respectively. Additionally, we consider another intermediate cut-off point to illustrate the slow-down in the rate of type increase. Tables 2 and 3 state the percentages of total word type counts found at both cut-off points.

                   -ment   root+ity   root+ness   root+ation   -ate+ion
Cut-off point 1    90%     90%        85%         90%          88%
Cut-off point 2    97%     97%        92%         97%          97%
Table 2 Rate of word type increase for newspapers

18 Complete identity of cut-off points was not possible due to the structure of the BNC. Popular magazines cannot be further divided and are only considered as a whole.
A register-sensitive study of English nominalizations
                  -ment   root+ity   root+ness   root+ation   -ate+ion
Cut-off point 1   82%     90%        75%         85%          70%
Cut-off point 2   90%     95%        84%         94%          90%
Table 3 Rate of word type increase for non-academic prose
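The average coverage at the second cut-off point can be recomputed from Tables 2 and 3 with a short Python sketch. The dictionary and variable names are assumptions introduced here; the percentages are the cut-off point 2 values printed in the two tables.

```python
from statistics import mean

# Share of all word types already attested at cut-off point 2,
# copied from Table 2 (newspapers) and Table 3 (non-academic prose).
coverage_at_cutoff_2 = {
    "News": {"-ment": 97, "root+ity": 97, "root+ness": 92,
             "root+ation": 97, "-ate+ion": 97},
    "Non-Acad": {"-ment": 90, "root+ity": 95, "root+ness": 84,
                 "root+ation": 94, "-ate+ion": 90},
}

# Unweighted mean over the five structural templates.
average_coverage = {
    register: mean(values.values())
    for register, values in coverage_at_cutoff_2.items()
}

for register, average in average_coverage.items():
    print(f"{register}: {average:.1f}% of types by cut-off point 2")
```

The unweighted means come out at 96 per cent for newspapers and about 90.6 per cent for non-academic prose.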
On average, of the total word types conforming to the five structural templates, the great majority appear in the first 5 or 6 million word tokens. Specifically, in the case of newspapers, an average of 96 per cent appear by the 5.64 million cut-off point, and in non-academic prose 90 per cent appear by the 6.20 million cut-off point. Note that these two points still leave a large enough margin for some of the remaining types to surface before the 7.38 million cut-off point, thus increasing the word type coverage. We conclude that, given the sample sizes involved in this work, comparing raw word type counts is a reliable enough method to compare the productivity of the four suffixes across sub-corpora.
The second dimension of productivity considered in our analysis is the presence of novel formations. In order to identify lexical innovations in the BNC, the word lists obtained in the first part of the research were checked against the OED to confirm or disprove the novelty of each word. Any item found absent from the OED was listed as an innovation (nonce word or neologism as defined in Chapter 1). Establishing the newness of a word on the basis of its absence from the OED seems the most reasonable solution. 19 Unlike most dictionaries, the very aim of the OED is comprehensiveness and full coverage of the English lexicon. Certainly, the OED proves far superior to Webster’s Third New International Dictionary of the English Language used for the same purpose by Baayen and Renouf (1996) in their search for new words in a corpus of The Times newspaper. Webster’s Third, unlike the OED, does not aim at complete coverage. Although the OED may well fail to achieve this ambition in every detail, it certainly is the most comprehensive dictionary of English available. 20
Another point should be noted here. A contested issue among linguists is the extent to which dictionary-based studies are reliable indicators of morphological productivity.
21 In general, it is agreed that dictionaries unavoidably underestimate the productivity of the most productive processes as some of their products, due to their total regularity and predictability, inevitably escape the attention of lexicographers (see footnote 20 below). This, however, may be an advantage in an approach to lexical innovation such as that of this study: whatever BNC data are omitted by the OED lexicographers – and these will be recent and unestablished – may be assumed with a high degree of probability to be innovations. Leaving aside individual words, which may occasionally be erroneously considered new if absent from the OED, the strategy in question will certainly be useful in establishing the overall productivity of an affix and the relative productivity levels of respective affixes. We also presume that the joint use of corpus-based and dictionary-based methodologies is more useful than the use of only one or the other. 22 For example, the validity of the hapax legomena-based productivity measure put forward by Baayen and Lieber (1991) is seriously undermined by the fact that the majority of hapaxes are merely rare words that are not recent by any measure (see p. 118, footnote 7). In such cases, consulting the OED quickly resolves the matter. Of the 493 -ness hapaxes sampled here we found 132 to be absent from the OED and therefore innovative. All in all, we assume that the combination of corpus data and the OED database, which is used as a frame of reference, is a valid tool in the identification of new words.
19 Similarly, Plag (1999: 117) uses both corpus data and the OED as the basis for his investigations of productivity. The author concludes that “both the OED-based and corpus-based productivity measures are useful analytical tools.”
20 Bauer (2001) and Plag (1999) both admit that “even the OED lexicographers fall victim to the unavoidable tendency to include the more salient idiosyncratic forms and neglect the listing of regular derivatives” (Plag 1999: 98).
21 For example, see the criticism of Cannon’s (1987) dictionary-based investigations of productivity in Plag (1999), Bauer (2001), Plag et al. (1999), Baayen and Renouf (1996).
4. Results and discussion
Below, we look at the research results in the order stated in the Aims section.
The three points are repeated below for convenience and will be taken each in turn:
1) to investigate register variation with regard to the distribution of the twelve nominalizing suffixes in English, including both innovative and well-established words
2) to investigate the extent to which the suffixes are used in the coining of new lexical formations – both nonce words and neologisms as defined in Chapter 2
3) to compile a list of new formations appearing in the BNC (and not appearing in the OED), study their structural patterns, and investigate register variation with regard to the distribution of these lexical innovations
22 Plag (1999) appears to be of the same opinion.
4.1. Register variation among -ness, -ity, -ion and -ment nominalizations
Our discussion in this section is limited to the four suffixes -ness, -ity, -ion and -ment. This is due to the fact that, firstly, register variation among these suffixes has been investigated elsewhere and it is our first objective to compare our findings with those obtained by other authors. Secondly, of the twelve suffixes sampled, these four are clearly the most frequent (in token frequencies) abstract nominalizing suffixes in English and therefore, for reasons of comprehensiveness, are best handled on their own for the time being. Thirdly, we will postpone the discussion of the other suffixes as it is virtually impossible for technical reasons to incorporate statistical charts covering twelve suffixes and, additionally, break each one down as varying across six different genres.
In this section we discuss register variation by considering token frequencies. We start by comparing our research results with those reported by Biber et al. (1998), Biber et al. (1999) and Plag et al. (1999). Next we move on to look at our findings for genres that hitherto have been overlooked. Biber et al. (1998) analyze the distribution of -ness, -ity, -ion and -ment based on samples of academic prose and fiction (both from the Longman-Lancaster Corpus) and spoken English (from the London-Lund Corpus). Below we cite their findings, given as joint normalized token frequencies per 1 million words.
                                      Academic prose      Fiction           Speech
                                      (2.7 mill. words)   (3 mill. words)   (0.5 mill. words)
Nominalizations per 1 million words   44,000              11,200            11,300
Table 4 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment across registers (as reported by Biber et al. 1998:60)
Plag et al. (1999) limit their analysis to the contrast of spoken/written language and additionally distinguish in the spoken subcorpus between its demographic and context-governed parts. Their coverage of nominalizations includes -ness, -ity and -ion. The token frequencies per suffix which they cite add up to the following totals:
                                      Written            Spoken
                                      (89 mill. words)   (10 mill. words)
Nominalizations per 1 million words   20,760             7,800
Table 5 Joint token frequencies of the suffixes -ness, -ity and -ion across registers (based on Plag et al. 1999)
Although the exact numbers from the two sources may differ for a number of reasons, 23 the least that we can infer from these data is that spoken English has frequencies four times lower than academic prose and most probably significantly lower than non-fictional written language (fiction goes hand in hand with spoken language according to Table 4). Below we state relevant frequencies obtained in the present study. 24
23 The most obvious reason is the exclusion of -ment. Also, Plag et al. (1999) were probably more restrictive as they set up several criteria of selection with regard to the structural complexity of eligible nominalizations.
                                      Academic prose        Fiction               Spoken
                                      (15.43 mill. words)   (16.19 mill. words)   (10.33 mill. words)
Nominalizations per 1 million words   25,132                5,500                 5,990
Table 6 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment across registers
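Normalization to a common base of 1 million words, as used in Tables 4 to 6, can be sketched in Python as follows. The helper name `per_million` is an assumption, and the raw token count in the example is hypothetical: it is back-computed from the Table 6 rate purely for illustration and is not a figure reported in this study.

```python
def per_million(raw_tokens, corpus_size_in_words):
    """Normalize a raw token count to occurrences per 1 million words."""
    return raw_tokens / corpus_size_in_words * 1_000_000


# Size of the academic prose sub-corpus as given in Table 6.
ACADEMIC_WORDS = 15_430_000

# Hypothetical raw count, chosen so that it reproduces the Table 6 rate.
hypothetical_raw_count = 387_787

academic_rate = round(per_million(hypothetical_raw_count, ACADEMIC_WORDS))
print(academic_rate)
```

Rounding to the nearest whole number mirrors the treatment of normalized counts described in the footnotes.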
The differences between the three registers in Table 6 consistently match those in Table 4, thus corroborating our claim above. Fiction and spoken language have approximately the same proportion of nominalizations; namely, a quarter that of academic prose. 25 It is commonly believed that, overall, written registers are lexically richer, not only with respect to nominalizations but indeed as far as the type-token ratio is concerned (Biber et al. 1999). Morphologically speaking, Plag et al. (1999) hypothesize and subsequently show that derivation in spoken English is much less productive. These claims are consistent with our findings above. Biber et al. (1999) contribute two more registers for comparison with academic prose, fiction and speech. They narrow down spoken language to conversation (as opposed to other spoken registers) and additionally discuss the news. Conversation is reported to have by far the lowest number of derived nouns and news scores between fiction and academic prose. With the exception of -ness in fiction, the authors note the growing frequency of occurrence of nominalizations from conversation through fiction and news, to reach its peak values in academic prose. As the results of Biber et al. (ibid.) are not presented as explicitly as one might wish, this claim will now be compared with our findings. Table 7 reports our frequency counts for newspaper language (News).
                                      News (10.64 mill. words)
Nominalizations per 1 million words   11,229
Table 7 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment in News
24 All the normalized token counts given here and afterwards are rounded up or down to the nearest whole number.
25 Note that the numbers in Table 4 are roughly double those in Table 6. This, again, may be due to differences in criteria of data selection adopted by Biber et al. (1998). It is possible, for example, that their frequencies include prefixed formations.
In order to represent the differences in frequencies more clearly, the four registers are now brought together in the chart below:
[Chart: series Spoken, Fict, News, Acad; scale 0 to 30,000]
Figure 1 Normalized joint token frequencies of -ness, -ity, -ion and -ment across registers (per 1 million words)
Bearing in mind that Biber et al. (ibid.) replace the spoken register with conversation, and allowing for the discrepancies in methodology (see footnote 1 above), the pattern of relative frequencies plotted in Figure 1 correlates with their observation of increasing frequencies from conversation through fiction and newspapers to academic prose. We may conclude that the register of conversation has a slightly lower frequency of nominalizations than the broader register of spoken language (this is supported by the data in Tables 4 and 6, where spoken language scores above fiction). In what follows we will take spoken language, not conversation, for further comparison.
The increasing numbers of nominalizations occurring in the pattern noted by Biber et al. (ibid.) are only to be expected. Academic prose seeks to condense as much information – often abstract notions – as possible into the minimum of form. Nominalizations offer such efficiency and condensation of ideas as longer phrases, even clauses, can be effectively replaced by a single complex word (cf. text categorization vs. the manner in which text is categorized, his clumsiness vs. the fact that he is always clumsy). Typically, key words in such structures are shifted, or recategorized, to become complex nominals. 26 In the broader perspective of the theory of word formation, this particular effect of syntactic recategorization or transposing is one of the most basic functions of word formation, next to the labelling or referential function (i.e. inventing a word label for a new concept) (Plag 2003: 73-74, Booij 2005: 14, Lieber 2005: 406). Condensation of information is one possible motivation for syntactic recategorization, in which case abstract nominalizations are typically used. However, other syntactic categories can also be involved in similar operations, although possibly for different functional purposes. For example, Booij (2005) and Plag (2003) cite stylistic variation and text cohesion as possible results of recategorization in the following sentences (examples (a), (b) and (c) after Kastovsky 1986: 595 and quoted by Booij 2005: 14; sentence (d) after Plag 2003: 74):
a) He made fists [...] He defisted to gesture.
b) If that’s not civil, civilize it, and tell me.
c) [. . .] and whether our own conversation doesn’t sound a little potty. It’s the pottiness, you know, that is so awful.
d) Faye usually works in a different department. She is such a good worker that every department wants to have her on their staff.
26 Lieber (2005:406) refers to deverbal nominalizers (-al, -ment, etc.) as transposers of verbs to nouns.
Abstract nominalizations are thus to be singled out from among all other classes of complex words as particularly useable in information condensation and syntactic recategorization. This is simply due to their general categorial semantic paraphrase of “the act/state of...”, “the quality of...”. We also note that abstract nominalizations in general are insignificant in their use in the labelling (referential) function of word formation. Furthermore, in relation to register variation, we note that registers are diversified as to their preferences for particular types of syntactic structures (e.g. wh- clauses in more informal registers vs. noun-based clauses and nominalizations in academic prose; see section 3.5).
We now turn to two other registers, represented in the BNC as non-academic/non-fiction (Non-Acad) and Popular Magazines (Pop). We will then make comparisons across all six registers.
                                      Non-Acad              Pop
                                      (16.63 mill. words)   (7.38 mill. words)
Nominalizations per 1 million words   20,798                10,003
Table 8 Joint token frequencies of the suffixes -ness, -ity, -ion and -ment in Non-Acad and Pop
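As a consistency check, the joint frequencies reported in Tables 6, 7 and 8 can be compared with the sums of the per-suffix frequencies given for the individual registers later in this section. The Python sketch below is an illustration only; the variable names are assumptions, and a difference of 1 is tolerated because all normalized counts are rounded to whole numbers.

```python
# Per-suffix normalized token frequencies (per 1 million words),
# in the order -ness, -ity, -ion, -ment.
suffix_counts = {
    "Spoken":   (194, 621, 3510, 1664),
    "Fiction":  (896, 801, 2845, 958),
    "News":     (403, 1217, 6468, 3141),
    "Acad":     (797, 3670, 15822, 4843),
    "Non-Acad": (632, 2608, 12032, 5526),
    "Pop":      (504, 1248, 6146, 2104),
}

# Joint frequencies as reported in Tables 6, 7 and 8.
reported_totals = {
    "Spoken": 5990, "Fiction": 5500, "News": 11229,
    "Acad": 25132, "Non-Acad": 20798, "Pop": 10003,
}

# Difference between each summed total and the reported joint figure.
discrepancies = {
    register: sum(counts) - reported_totals[register]
    for register, counts in suffix_counts.items()
}
print(discrepancies)
```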
The non-academic/non-fiction super-genre exhibits a considerable amount of nominalization occurrences compared to all the other varieties. In fact, its frequency values come so close to those of academic prose that the two designations academic and non-academic may now appear confusing. Lee (2001: 59) gives an accurate and revealing description of the types of texts that constitute this sub-corpus.
“The “non-academic” genres relate to written texts (mainly books) sometimes called “non-fiction” […] They are usually texts written for a general audience, or “popularizations” of academic material, and are thus distinguished from texts in the parallel academic genres (which are targeted at university-level audiences, insofar as this can be determined). In deciding whether a text was academic or not, a variety of cues was used: (a) the “audience level (of difficulty)” estimated by the BNC compilers (coded in the file headers) (b) whether COPAC lists the book as being in the “short loan” collections of British universities (this works in one direction only: absence is not indicative of a work not being academic) (c) the publisher and publication series (academic publishers form a small and recognisable set, and some books have academic series titles, which help to place them in context).”
The subject matter, if not the depth of its treatment, can thus be expected to be parallel in both groups, hence the comparable frequencies. However, the manner in which these texts are popularized in non-academic form may be expected to involve a degree of language simplification, which in turn may result in a reduction of nominalizations. This assumption receives support in the form of a frequency discrepancy between the two sub-corpora of approximately 4,300 tokens per million words of text (25,132 vs. 20,798), illustrated also in Figure 2. Another conclusion to be drawn from the chart is that, at least with regard to the number of nominalizations, non-academic/non-fiction texts are much closer to academic prose than to fiction. Popular magazines, with 10,003 occurrences, are plotted in Figure 2 just below newspapers, which is to be expected from the overall profile of this genre.
[Chart: series Spoken, Fict, News, Acad, Non-Acad, Pop; scale 0 to 30,000]
Figure 2 Joint token frequencies of -ness, -ity, -ion and -ment across registers
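The ordering of the six registers by joint frequency can be made explicit with a small Python sketch. The figures are those of Tables 6, 7 and 8; the variable names are assumptions introduced for this illustration.

```python
# Joint normalized token frequencies (per 1 million words)
# of -ness, -ity, -ion and -ment, from Tables 6, 7 and 8.
joint_frequency = {
    "Spoken": 5990, "Fiction": 5500, "News": 11229,
    "Acad": 25132, "Non-Acad": 20798, "Pop": 10003,
}

# Registers from the lowest to the highest joint frequency.
ranking = sorted(joint_frequency, key=joint_frequency.get)
print(" < ".join(ranking))

# Gap between academic and non-academic prose, per 1 million words.
acad_gap = joint_frequency["Acad"] - joint_frequency["Non-Acad"]
print(acad_gap)
```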
We have thus differentiated the six registers with regard to the joint frequencies of nominalizations and observed the varied distribution of the four suffixes. We now turn to consider the registers one at a time to look into the relative contribution of each suffix in each register. For preliminary comparison, in Figure 3 we plot the ranges of frequencies side by side.
[Chart: series -ness, -ity, -ion, -ment; registers Spoken, Fict, News, Acad, Non-Ac, Pop; scale 0 to 16,000]
Figure 3 Normalized token frequencies of -ness, -ity, -ion and -ment across registers (organized by registers)
The suffix -ion is clearly the most frequent and perhaps the most productive of all. 27 Its distribution across the registers matches the total frequencies of all the suffixes plotted in Figure 2 (i.e. increasing as one moves along the scale of Fiction < Spoken < Pop < News < Non-Academic < Academic). Nominalizations in -ment are consistently plotted as the second most frequent type, although their distribution departs from the above-mentioned sequence (see below). More to the point, other suffix frequencies relative to one another vary from register to register. We will continue by taking each register in turn and discussing it in relation to the four suffixes.
Figure 4 plots the frequencies of each suffix relative to one another in spoken language. The lowest and highest frequencies in spoken language are 194 (for -ness) and 3,510 (for -ion), which means the frequency of -ion is 18 times higher than that of -ness. The measure for -ion is by far the highest and it is over twice as high as the second highest, that of -ment.
Figure 5 plots the four suffixes in fiction. Notably, the measure for -ness is here higher than that of -ity. The relative frequency of -ion is still substantial, now 3 times as high as that of the second most common (-ment). Interestingly, the measure for -ion is now only 3.5 times higher than that of the least frequent suffix (-ity). Noteworthy then are the balanced, evenly spread relative frequencies of -ness, -ity and -ment.
27 As these measures concern token frequencies, it is more appropriate to interpret them so, i.e. in terms of frequency, not productivity. We will refrain from making claims about morphological productivity until we proceed to investigate the number of distinct word types, and most importantly, the new word types. Similarly, the suffix -ment is consistently plotted as the second most frequent although it certainly should not be taken to be the second most productive.
[Chart values: -ness 194; -ity 621; -ion 3,510; -ment 1,664]
Figure 4 Normalized token frequencies of -ness, -ity, -ion and -ment in Spoken
[Chart values: -ness 896; -ity 801; -ion 2,845; -ment 958]
Figure 5 Normalized token frequencies of -ness, -ity, -ion and -ment in Fiction
[Chart values: -ness 403; -ity 1,217; -ion 6,468; -ment 3,141]
Figure 6 Normalized token frequencies of -ness, -ity, -ion and -ment in News
Figure 6 corresponds to the results obtained in news. The measure for -ion is again over twice as high as that of -ment, the second most frequent. It is now 16 times higher than the lowest measure, that of -ness. The ratios of frequencies of individual suffixes are comparable to those in spoken language (see Table 9 below). Figure 7 represents our frequency counts obtained for academic prose, where the frequencies reach their highest values (except for -ness and -ment). Compared to other registers, it is worth noting here the discrepancies in the ratios of individual suffixes relative to one another (see Table 9).
[Chart values: -ness 797; -ity 3,670; -ion 15,822; -ment 4,843]
Figure 7 Normalized token frequencies of -ness, -ity, -ion and -ment in Academic
Figures 8 and 9 plot the frequencies recorded for non-academic texts and popular magazines, respectively. The latter exhibit a notably distinct tendency in the values of the suffix ratios, given in Table 9.
[Chart values: -ness 632; -ity 2,608; -ion 12,032; -ment 5,526]
Figure 8 Normalized token frequencies of -ness, -ity, -ion and -ment in Non-Academic
[Chart values: -ness 504; -ity 1,248; -ion 6,146; -ment 2,104]
Figure 9 Normalized token frequencies of -ness, -ity, -ion and -ment in Pop
          -ion/-ness   -ion/-ment   -ion/-ity   -ment/-ness   -ment/-ity   -ity/-ness
Spoken    18/1         2.1/1        5.7/1       8.5/1         2.6/1        3.2/1
Fiction   3.3/1        3/1          3.5/1       1/1           1.1/1        0.8/1
News      16/1         2.1/1        5.3/1       7.7/1         2.5/1        3/1
Acad      19.8/1       3.2/1        4.3/1       7/1           1.3/1        4.6/1
Non-Ac    19/1         2.1/1        4.6/1       8.7/1         2.1/1        4.1/1
Pop       12.1/1       2.9/1        4.9/1       4.1/1         1.6/1        2.4/1
Table 9 Comparative frequency ratios of suffixes per register
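The suffix-to-suffix ratios of Table 9 can be recomputed from the per-suffix frequencies reported with Figures 4 to 9. The Python sketch below is an illustration under the assumption that each ratio is a simple quotient of two normalized frequencies; the function name `suffix_ratio` is introduced here. Recomputed values may differ slightly from the printed ones, depending on how the originals were rounded.

```python
# Normalized token frequencies per 1 million words (Figures 4 to 9).
frequencies = {
    "Spoken":   {"-ness": 194, "-ity": 621,  "-ion": 3510,  "-ment": 1664},
    "Fiction":  {"-ness": 896, "-ity": 801,  "-ion": 2845,  "-ment": 958},
    "News":     {"-ness": 403, "-ity": 1217, "-ion": 6468,  "-ment": 3141},
    "Acad":     {"-ness": 797, "-ity": 3670, "-ion": 15822, "-ment": 4843},
    "Non-Acad": {"-ness": 632, "-ity": 2608, "-ion": 12032, "-ment": 5526},
    "Pop":      {"-ness": 504, "-ity": 1248, "-ion": 6146,  "-ment": 2104},
}


def suffix_ratio(register, numerator, denominator):
    """Frequency of one suffix divided by that of another in a register."""
    counts = frequencies[register]
    return counts[numerator] / counts[denominator]


# Column order assumed for Table 9.
pairs = [("-ion", "-ness"), ("-ion", "-ment"), ("-ion", "-ity"),
         ("-ment", "-ness"), ("-ment", "-ity"), ("-ity", "-ness")]

for register in frequencies:
    row = [f"{suffix_ratio(register, a, b):.1f}" for a, b in pairs]
    print(register, row)
```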
We have seen in Figure 3 how the six registers differ with respect to the quantitative findings of suffix frequency. Furthermore, Table 9 shows the varying values of suffix-to-suffix ratios, indicating discrepancies in the contribution of each suffix relative to other suffixes. Thus, in the first column, the ratio of -ion to -ness averages around 18:1 (Spoken, News, Academic and Non-Academic). Popular magazines and especially fiction depart significantly from this pattern, scoring 12.1 and 3.3 respectively. These results mean that, for example, spoken language has 1 -ness word per 18 -ion words, whereas fiction has 1 -ness word per 3.3 -ion words and popular magazines have 1 -ness word per 12.1 -ion words. Also noteworthy are the discrepancies in the fourth and sixth columns. The -ment to -ness ratio averages around 8:1 for 4 out of 6 registers. Fiction and popular magazines, however, score 1 and 4.1 respectively. This is the second time that these two diverge substantially from the rest. The -ity to -ness ratio of fiction is yet again distinctly set apart from the other registers, scoring 0.8 against an average of 3.5 for the remaining registers.
With specific regard to pairs of functionally rival suffixes, i.e. -ion vs. -ment and -ity vs. -ness, we observe the following: the former pair maintains a relatively steady ratio of occurrence across registers with -ion being up to three times as frequent as -ment. As for the latter pair, we have stated above that the -ity to -ness ratio is remarkably low in fiction.
All in all, there are some noteworthy patterns of variation to be observed. Firstly, at one level, the actual range of frequency of individual suffixes is highly diversified in the pattern noted above (see Figures 3 – 9). Secondly, at another level, looking at the proportions of frequencies holding between suffixes (Table 9) there are still other patterns of distinction. Fiction is clearly set apart by its balanced contribution of -ment, -ness and -ity as well as its exceptionally extensive use of -ness (Figure 5, compared to Figures 4, 6, 7, 8, and 9). Additionally, popular magazines, with relatively low ratios of -ion to -ness and -ment to -ness, stand out from the rest of the registers in a way similar to fiction, i.e. they are characterized by a more balanced contribution of the suffixes. Compare the following counts for Fiction, News and Popular Magazines:
          -ness   -ity    -ion    -ment
Fiction   896     801     2,845   958
Pop       504     1,248   6,146   2,104
News      403     1,217   6,468   3,141
Table 10 Normalized token frequencies of -ness, -ity, -ion and -ment in Fiction, Pop and News
The more ‘flattened’ curve of frequency distribution of popular magazines suggests that, in terms of the kind of language employed, this register is to be differentiated from News by its inclination towards fiction-like vocabulary (more -ness words, fewer -ion and -ment words), a claim which is supported by our intuitive expectations of this register. To a lesser degree this claim is further strengthened by less noticeable suffix-to-suffix ratio differences across registers in Table 9 (the ratios in Pop are lower than the overall average in four of the six columns).
We will now proceed to take each suffix and see how they are distributed across the registers. Side by side, Figure 10 plots the distribution and quantitative presence of the four formatives in each register. There are enormous gaps in terms of frequency range between respective suffixes and the contrast is even more evident in this comparison of suffixes than it is in Figure 3, which compares the registers. Once again, the suffix -ion is clearly quantitatively dominant across the registers. The suffixes -ment and -ity are approximately within the same range and -ness is relatively rare.
[Chart: series Spoken, Fict, News, Acad, Non-Acad, Pop; suffixes -ness, -ity, -ion, -ment; scale 0 to 16,000]
Figure 10 Normalized token frequencies of the suffixes across registers (organized by suffixes)
We will first take the suffix -ness, which we have found to be exceptionally common in fiction, and its use across the registers. This is represented in Figure 11.
[Chart values: spoken 194; fict 896; news 403; acad 797; non-acad 632; pop 504]
Figure 11 Normalized token frequencies of -ness across registers
Clearly, -ness is indeed marked by its preference for fiction texts. A functional interpretation of these findings might be that -ness words – in comparison with words derived with -ity, the functional rival of -ness – seem more informal and less technical at the same time (e.g. tenaciousness vs. tenacity). There may be several reasons for this. Firstly, the suffix -ness may be preferred by speakers because it is a ‘safer’ option when little editing time is available in online production: it is easily parsed out, i.e. it is straightforwardly attached to its base with a clearly observable morpheme boundary with no adjustment or truncation operations (again, tenacious-ness vs. tenac(ious)-ity); it has no phonological effect on the base form either (cf. the change in vowel quality in tenacious – tenacity); hence, -ness derivatives are easily decomposed and interpreted by the hearer (cf. Hay 2003, Hay and Baayen 2003). Derivatives in -ity, on the other hand, may be both formally and semantically opaque. This in turn may be because -ity is Latinate and the exact ways in which it combines with (usually non-native) bases may be obscure to speakers of English. The suffix -ness, on the other hand, is native, although it is questionable whether etymological considerations of native or non-native origin in themselves influence the choice of one suffix over another. 28 Rather, as mentioned above, decomposability and full predictability of usage will be a more likely explanation. 29
Secondly, another reason why -ness words are more informal and particularly common in fiction is the very meaning of many of their bases. Many high-frequency -ness derivatives denote personal qualities or feelings that rarely surface in technical/scientific texts but often do in fiction, such as happiness, kindness, sadness, tenderness, etc. Such formations will not only be infrequent in technical texts but even more so in spoken language (Figure 11). Biber et al. (1998) argue that spoken language, due to its limited use of nominalizations overall, is more likely to express the same notions by means of adjectives, often attributed directly to the speaker or hearer (e.g. I’m happy, I feel sad, you were very kind). Otherwise, popular magazines display an increase in -ness derivatives, scoring higher than newspapers, the reverse of what was observed in the total count of nominalizations (Figure 2). We have discussed this result above in relation to the data presented in Table 10.
The discussion now moves on to the other suffixes and their occurrences across the registers. Figures 12 – 14 correspond to our results recorded for -ity, -ion and -ment. We note the preponderance of -ity over -ness in the academic register (and the reverse distribution in fiction – cf. Figures 11 and 12). This indicates that the processing factors such as parsability and transparency, which we cited in reference to -ness, may not be of paramount importance in academic texts. This is not surprising, given the lack of online production limitations and virtually unlimited editing time. This is also due to the lexical preference of academic discourse for the Latinate word stock.
Both -ness and -ity words have higher frequencies in fiction than in spoken language, although, as noted earlier, the total count of all nominalizations was higher in spoken language. On the other hand, -ion and -ment are more frequent in spoken language than in fiction, as shown in Figures 13 and 14. Popular magazines score higher than newspapers for -ness and -ity although the opposite tendency is observed when taking into account -ion and -ment as well as nominalizations overall.
28 Admittedly, free choice between -ness and -ity will only occasionally be an option for one and the same base form (if only because of the selectional restrictions of -ity). Instead, either suffix will usually be the more common alternative or indeed the only reasonable choice.
29 See Hay (2003), Hay and Baayen (2003) for a detailed treatment of decomposability’s effects on productivity. In a nutshell, “any factor which facilitates decomposition of complex forms should also facilitate the emergence of productivity” (Hay 2003: 151).
[Chart values: spoken 621; fict 801; news 1,217; acad 3,670; non-acad 2,608; pop 1,248]
Figure 12 Normalized token frequencies of -ity across registers
[Chart values: spoken 3,510; fict 2,845; news 6,468; acad 15,822; non-acad 12,032; pop 6,146]
Figure 13 Normalized token frequencies of -ion across registers
[Chart values: spoken 1,664; fict 958; news 3,141; acad 4,843; non-acad 5,526; pop 2,104]
Figure 14 Normalized token frequencies of -ment across registers
Interestingly, comparing across all suffixes, the frequency proportions between respective registers vary. In other words, compared to -ity, -ness is responsible for less of a difference between the frequency range of its most typical register (fiction) and that of the other registers. Table 11 states the quotients of relevant frequencies:
        1:2    1:3    1:4    1:5    1:6
-ness   1.12   1.41   1.77   2.22   4.61
-ment   1.14   1.75   2.62   3.32   5.76
-ion    1.31   2.44   2.57   4.50   5.56
-ity    1.40   2.94   3.01   4.58   5.90
Table 11 The frequency ratios of registers (‘1’ stands for the register that scores highest for a given suffix, ‘2’ stands for the second highest, ‘3’ stands for the third highest, etc.)
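The quotients in Table 11 can be reproduced from the per-suffix frequencies reported with Figures 11 to 14. In the Python sketch below, the helper `rank_ratios` and the truncation to two decimal places are assumptions made for illustration; truncation (rather than rounding) is inferred from the printed values and is not stated in the text.

```python
import math

# Normalized token frequencies per 1 million words, in the register
# order Spoken, Fiction, News, Acad, Non-Acad, Pop (Figures 11 to 14).
frequencies = {
    "-ness": [194, 896, 403, 797, 632, 504],
    "-ment": [1664, 958, 3141, 4843, 5526, 2104],
    "-ion":  [3510, 2845, 6468, 15822, 12032, 6146],
    "-ity":  [621, 801, 1217, 3670, 2608, 1248],
}


def rank_ratios(values):
    """Quotients of the highest value over the 2nd, 3rd, ... highest,
    truncated (not rounded) to two decimal places."""
    ordered = sorted(values, reverse=True)
    top = ordered[0]
    return [math.floor(top / other * 100) / 100 for other in ordered[1:]]


table_11 = {suffix: rank_ratios(values)
            for suffix, values in frequencies.items()}

for suffix, ratios in table_11.items():
    print(suffix, ratios)
```

Each list holds the 1:2 to 1:6 quotients for one suffix, so a suffix whose quotients grow faster separates the registers more strongly.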
We interpret the above measures as follows. The six registers exhibit the least quantitative variation relative to one another with respect to -ness derivatives. The distribution of -ness nominalizations across the registers is relatively evenly-spread (see also Figure 11). As we move down, through -ment, -ion to -ity, the ratio values increase, thus indicating an increasing capacity of respective suffixes to diversify or distance the registers relative to one another. Ultimately, the frequencies of -ity nominalizations set apart the six registers the most. For example, the highest frequency value of -ity (3,670) is 3 times higher than its fourth highest value (1,217). By way of comparison, the highest frequency value of -ness (897) is only 1.7 times higher than its fourth highest value (504). This means that, in this example, -ity nominalizations are responsible for twice as much quantitative contrast as -ness words. To summarize the discussion so far: • Overall, nominalizations are indeed distributed disproportionately across registers. When considered jointly in token frequencies, they are distributed in the following sequence of increasing frequencies: Spoken < Fiction < Pop < News < Non-Acad < Acad (Figure 2). Considered individually, the suffix -ness departs from this pattern in that it is most frequent in fiction. • Functionally, abstract nominalizations are particularly useful in syntactic recategorization rather than the labeling function of word formation. Condensation of information is typically the motivating factor, especially in academic prose. • Within a register, suffixes are also unevenly distributed, as indicated in Figure 3, reproduced below for convenience. • -ion is by far the most frequent suffix (in all registers). • -ment is the second most frequent • -ity is the third most frequent (except in fiction, where it is outnumbered by -ness). • The four suffixes consist of two pairs of rival suffixes: -ion vs. -ment and -ity
A register-sensitive study of English nominalizations
Figure 3 Normalized token frequencies of each suffix across registers [bar chart not reproduced; series: -ness, -ity, -ion, -ment; registers: Spoken, Fict, News, Acad, Non-Ac, Pop; vertical axis: 0 to 16,000]
vs. -ness. In the first pair, -ion is consistently more frequent across registers. In contrast, in the second pair, while -ity is on average 3.5 times as frequent as -ness, the suffix -ness outnumbers -ity in fiction. This, we have argued, may be due to the formal and semantic transparency of -ness words as well as the specific semantics of many base forms, which denote personal qualities or feelings.
• Fiction is notably balanced in its use of the suffixes, especially -ment, -ness and -ity (Figure 5 and Table 9).
• Popular magazines stand out from the rest of the registers in a way similar to fiction (albeit much less drastically); namely, they are characterized by a more balanced contribution of the suffixes (see discussion of Table 9).
• The suffixes ordered in the sequence -ness < -ment / -ion < -ity exhibit an increasing capacity to diversify the registers relative to one another (Table 11).
4.2. Register variation among -ance/-ence, -ship, -(c)y, -hood, -age, -dom, -ery, and -al nominalizations
As in the preceding section, below we discuss variability in the occurrence of several suffixes across registers. We will consider all relevant derivatives (as defined in section 4.2.3), both new and well-established. Figure 15 plots joint token frequencies obtained for the eight suffixes in question. The registers’ relative ranges of suffix frequencies clearly correlate with those observed for the other four suffixes depicted in Figure 2 (henceforth Group 1). Thus the sequence of increasing suffix frequencies observed earlier, Spoken < Fiction < Pop < News < Non-Acad < Acad, also holds true for this second group (henceforth Group 2) of less frequent suffixes. At least in terms of joint token frequencies, then, the two groups considered here follow a very similar pattern of distribution.
Chapter 4
Figure 15 Normalized joint token frequencies of the eight suffixes across registers [bar chart not reproduced; registers: Spoken, Fict, News, Acad, Non-Acad, Pop; vertical axis: 0 to 7,000]
A further point to note is that when the frequency data from Figures 2 and 15 are totalled, i.e. when all twelve suffixes are taken into consideration, the frequency discrepancies between particular registers are greater still, as shown in Figure 16.
Figure 16 Normalized joint token frequencies of the twelve suffixes across registers [bar chart not reproduced; registers: Spoken, Fict, News, Acad, Non-Acad, Pop; vertical axis: 0 to 35,000]
With regard to the frequency of individual suffixes, Figure 17 shows the relative contribution of each suffix to each register. Once more, the same sequence of increasing suffix frequencies is observable, and this is largely attributable to one dominant suffix. Just as the suffix -ion is clearly the most prominent in Group 1, so is -ance/-ence in Group 2. It is the deverbal variant of the suffix that is especially common across the registers, except in fiction, where de-adjectival formations are the most frequent. This observation again ties in with our earlier findings: fictional texts are distinguished from the other registers by their markedly extensive use of (de-adjectival) -ness and de-adjectival -ance/-ence. One likely explanation is that fictional texts focus on the descriptive portrayal of their characters, a task that may be achieved through the use of de-adjectival Nomina Qualitatis. This is the reason for the otherwise atypical
outnumbering of deverbal -ance/-ence nominalizations in fiction by de-adjectival formations ending in the same suffix.
Figure 17 Normalized token frequencies of each suffix across registers [bar chart not reproduced; series: -age, -al, de-adj. -ance, deverb. -ance, -cy, -dom, dev. -ery, other -ery, -hood, -ship; registers: Spoken, Fict, News, Acad, Non-Ac, Pop; vertical axis: 0 to 2,000]
For the purpose of overall comparison, we next plot in Figure 18 the frequency ranges of the most common suffixes from among all twelve formatives. The least frequent ones are left out of the chart for reasons of space.
Figure 18 Normalized token frequencies of each suffix across registers [bar chart not reproduced; series: -al, -age, de-adj. -ance, deverb. -ance, -cy, -ship, -ness, -ity, -ion, -ment; registers: Spoken, Fict, News, Acad, Non-Ac, Pop; vertical axis: 0 to 16,000]
The chart clearly illustrates the immense gap in frequency of occurrence between the suffix -ion and the less widely used suffixes, especially those from Group 2. Specifically, within the frame of reference of the six registers combined into one sample, the order of decreasing token frequency for the entire set of twelve suffixes is (per one million word tokens): -ion (8,365), -ment (3,241), -ity (1,848), deverbal -ance/-ence (1,052), de-adjectival -ance/-ence (873), -ness (618), -(c)y (437), -al (414), -ship (315), -age (285), -dom (148), deverbal
-ery (147), -hood (82), denominal/de-adjectival -ery (7). It is rather interesting to note that the suffix -ance/-ence (both deverbal and de-adjectival) is, with the exception of fiction, quite consistently more frequent than -ness, although the latter can safely be assumed to be more productive (see section 4.4 for evidence of productivity). This indicates that the frequency figures of an affix are not to be equated with its productive potential. The two types of nominalizations in -ance/-ence are thus the most frequent in Group 2. The order of frequency of the other suffixes varies from register to register. In this regard we now turn to consider each register in more detail.
Figure 19 Normalized token frequencies in Spoken and Fiction, Group 2 [bar chart not reproduced; series: -age, -al, de-adj. -ance, deverb. -ance, -cy, -dom, dev. -ery, other -ery, -hood, -ship; registers: Spoken, Fict; vertical axis: 0 to 600]
De-adjectival nominalizations in -(c)y are the next most common type in the spoken register (see Figure 19 above). Only slightly less numerous are items in -al and -ship. Overall, as expected, this register exhibits the least nominalization. Fictional texts, as already mentioned, are atypical in their relative proportion of deverbal and de-adjectival -ance/-ence derivatives. In addition, unlike in the spoken variety, the second most common suffix in fiction is -age. Newspapers and popular magazines employ approximately twice as many word tokens of these nominalizations as are used in the spoken register or fiction. The difference is especially noticeable in the occurrence of the suffix -ship, the second most frequent in newspapers, which is approximately five times as frequent there as it is in the spoken register and fiction. The suffix -age is the second most frequent in popular magazines. As in the spoken register, the suffix -(c)y is the second most frequent in academic and non-academic texts. Overall then, depending on which register is considered, one of the four suffixes -(c)y, -al, -ship and -age scores as the second most frequent in Group 2, following the suffix -ance/-ence.
Figure 20 Normalized token frequencies in News and Pop [bar chart not reproduced; series: -age, -al, de-adj. -ance, deverb. -ance, -cy, -dom, dev. -ery, other -ery, -hood, -ship; registers: News, Pop; vertical axis: 0 to 1,000]
Figure 21 Normalized token frequencies in Academic and Non-academic [bar chart not reproduced; series: -age, -al, de-adj. -ance, deverb. -ance, -cy, -dom, dev. -ery, other -ery, -hood, -ship; registers: Acad, Non-Ac; vertical axis: 0 to 2,000]
There are several interesting points to be made with respect to the distribution of some of the suffixes across the registers. On the whole, the distribution of individual suffixes tends to match the pattern of their joint dispersion, illustrated in Figure 15 (reproduced below). However, in a few cases, notable discrepancies occur. One such exception is the suffix -ery, which, by the standards set in Figure 15, is exceptionally frequent in newspapers. Figures 22 and 23 plot the frequencies of denominal/de-adjectival (e.g. japanesery, scallywaggery, assery, peacockery) and deverbal (e.g. forgery, flattery, debauchery) -ery nominalizations respectively. It is especially the former, rarer type that is clearly biased towards the language of newspapers and a little less so towards fiction, while at the same time being infrequent in the academic register. The examples cited above are typical of the jocular and eye-catching pragmatic effect conveyed by these derivatives, and this may be precisely the reason for their preference for these two genres, as is also evidenced by our discussion of word types and innovative coinages in -ery (see 4.4).
Figure 15 Normalized joint token frequencies of the eight suffixes across registers [reproduced; bar chart not shown; registers: Spoken, Fict, News, Acad, Non-Acad, Pop; vertical axis: 0 to 7,000]
Figure 22 Normalized token frequencies of denominal/de-adjectival -ery across registers [bar chart not reproduced; registers: Spoken, Fict, News, Acad, Non-Acad, Pop; vertical axis: 0 to 15]
Figure 23 Normalized token frequencies of deverbal -ery across registers [bar chart not reproduced; registers: Spoken, Fict, News, Acad, Non-Acad, Pop; vertical axis: 0 to 300]
Figure 24 Normalized token frequencies of -ship across registers [bar chart not reproduced; registers: Spoken, Fict, News, Acad, Non-Acad, Pop; vertical axis: 0 to 500]
Similarly, the suffix -ship is most frequent in the register of newspapers, closely followed by non-academic texts (see Figure 24). Also notable is the relatively high count of -ship nominalizations in popular magazines. However, to claim that the suffix -ship is used more creatively in these registers may be a hasty conclusion. One must remember that a high token count may well be the product of only a few high-frequency words rather than many different word types. By way of illustration, the word championship appears 2,153 times in BNC newspapers but only 3 times in academic texts. The word is thus recorded as a single -ship word type in both registers, but the token counts in the two registers are strikingly different. We will come back to this point and look more closely into word types and new lexemes in -ship in section 4.4.
To recapitulate this part of the discussion:
• When considered jointly, the nominalizations (suffixes) of both Group 1 and 2 are distributed across registers in the same sequence of increasing frequencies: Spoken < Fiction < Pop < News < Non-Acad < Acad (Figures 2 and 15). Individual suffixes, however, may depart from this pattern (-ness, -ment, -ery, -ship).
• Within a register, suffixes of Group 2 are unevenly distributed, as was the case in Group 1 (Figure 17).
• -ance/-ence is by far the most frequent suffix in Group 2 (in all registers), also outnumbering the suffix -ness of Group 1. The second most frequent is, depending on the register, -(c)y, -al, -ship or -age.
• Deverbal nominalizations in -ance/-ence are the more prominent type in all but fictional texts, where de-adjectival items are in the majority.
• The order of decreasing token frequency for Groups 1 and 2 is (per one million word tokens): -ion (8,365), -ment (3,241), -ity (1,848), deverbal -ance/-ence (1,052), de-adjectival -ance/-ence (873), -ness (618), -(c)y (437), -al (414), -ship
(315), -age (285), -dom (148), deverbal -ery (147), -hood (82), denominal/de-adjectival -ery (7).
In the next section we take a step forward to benefit from a structurally-oriented approach to register variation.
4.3. Considerations of morphological structure
4.3.1. Affix ordering
In English and many other languages with derivational morphology, affixes are not completely free to attach to any type of base or to one another. Rather, restrictions have been noted on possible base–affix and affix–affix combinations. For example, nominalizing -al only attaches to verbs ending in a stressed syllable (cf. propose – proposal, deny – denial); all verbs in -ize can only take -ation to make nominalizations, while other deverbal nominalizing suffixes such as -ment are ruled out (Plag 2003). Some such restrictions may be simple to phrase and describe in structural terms, for example on the grounds of prosodic structure in the case of -al derivation above and a specific morphological restriction placed on verbs in -ize. Other constraints may be more intricate and involve multiple requirements or limitations (see e.g. Plag 1999).
In this section we are interested in the constraints on suffix–suffix combinations, especially those involving nominalizing suffixes in the word-final position. For instance, Baayen and Plag (2008: 1) note that the word atomic can take the nominalizing suffix -ity, whereas the word atomless cannot, and the suffix -ness is the only option for a nominalizing formative (*atomlessity vs. atomlessness). Why can -ity follow the suffix -ic but not -less? It would be relevant to the present study to establish any principles governing permissible pairings of suffixes so as to confront these principles with our data retrieved from the BNC. Because we will be considering the internal structure of morphologically complex base forms in order to evaluate its influence on genre diversification, we first review several approaches to restrictions on affix ordering.
Plag and Baayen (2008) and Hay and Plag (2004) distinguish three main proposals of principles and mechanisms that constrain affixal combinatorial properties. First, stratum-oriented models (e.g. Siegel 1974, Allen 1978, Selkirk 1982, Kiparsky 1982) claim that affixes are hierarchically organized at different levels of the word-formational component of grammar. At each level, affixes share certain qualities, including combinatorial properties. The principles of affix ordering reflect this layered structure in that the suffixes in the upper stratum/strata are stipulated to be allowed to precede suffixes from the lower stratum/strata but not the other way round.
Opposed to this view are scholars who maintain that it is selectional restrictions of individual affixes that determine the combinatorial properties of affixes (e.g. Fabb 1988, Plag 1999). According to this view, each affix is to be studied on its own, rather than as belonging to a group sharing a number of properties, in order to ascertain its own set of idiosyncratic combinatorial properties. 30 As exemplified above, such affix-particular restrictions may be phonological and morphological but also syntactic and semantic. More recently, another hypothesis has been proposed by Hay (2002, 2003) who gives an account of affix ordering based on parsability (or decomposability) and processing complexity (Complexity-Based Ordering). The general claim here is that affixes can be ranked along a scale of processing complexity, with more separable (or parsable) affixes at one end of the hierarchy and less separable (less transparent) ones at the other. 31 Hay’s hypothesis has a strong psycholinguistic grounding: words with more parsable affixes “tend to be accessed via their parts in speech perception” while words with less parsable affixes “tend to be accessed as wholes” 32 (Hay and Plag 2004: 8). Affix ordering thus depends on whether an affix can be easily parsed out in processing. If so, according to Hay, more separable affixes can occur outside less separable affixes, but not vice versa.
30 This means that the benefit of generalizing across affixes is sadly lost.
31 In this respect, Hay’s model resembles level-ordering, where stratum 1 affixes are typically difficult to parse out, e.g. when attached to bound roots (cf. collision). Similarities between the two models can also be noticed elsewhere. For example, the notion of affix boundary strength is related to ordering in both, although Hay’s theory, unlike level-ordering, recognizes that boundary strength is gradient (Hay and Plag 2004: 8).
32 Note that “any individual affix occupies a range of separability – it is more separable in some words than others. As such, there are systematic word-based exceptions to ordering generalizations – cases in which words with low levels of decomposability can take an affix that comparably highly decomposable words might not (e.g. government is less decomposable than bafflement, leading governmental to be more acceptable than bafflemental)” (Hay and Plag 2004: 8-9). Varying degrees of boundary strength of one and the same affix across words are accounted for by several factors: phonotactic transition and frequency in particular. As regards phonotactics, “low probability or illegal phoneme transitions will be more likely to be decomposed than words containing the same affix, which exhibit fully legal phonotactics” (Hay 2003: 155). The other factor is relative frequency, i.e. the frequency of the derivative relative to the frequency of the base (see Hay 2002, 2003). In short, the higher the relative frequency, the less likely decomposition is. Conversely, the lower the relative frequency, the more likely decomposition is. By way of illustration, the word government is more frequent than govern (Hay 2003: 188) and therefore it is likely to be accessed and processed whole; bafflement is less frequent than baffle(d) and thus is subject to parsing.
Below we compare the three models by confronting their theoretical assumptions with empirical data. The following stratal hierarchy of English affixes is representative of level-ordering studies (after Spencer 1991: 79, and Hay and Plag 2003: 3):
Class I suffixes: +ion, +ity, +y, +al, +ic, +ate, +ous, +ive, +able, +ize
Class I prefixes: re+, con+, de+, sub+, pre+, in+, en+, be+
Class II suffixes: #ness, #less, #hood, #ful, #ly, #y, #like, #ist, #able, #ize
Class II prefixes: re#, sub#, un#, non#, de#, semi#, anti#
Affixes that are omitted from the list are of uncertain status in level-ordering models or are suspected of double membership. Such is the case of -ment affixation, which, as argued by Aronoff (1976) and Giegerich (1999), may either exceptionally occur at stratum 1 (when followed by -al, e.g. in governmental, developmental, etc.) or more commonly at stratum 2. Generally, however, -ment is attached at stratum 2 (Hay 2003: 172) so that it belongs with -ness in stratum 2, while -ion and -ity belong in stratum 1. Phrased in terms of affix ordering, one may predict that -ion and -ity should not attach outside stratum 2 suffixes (e.g. *-fulity) 33 and thus the model accounts in a principled way for the unacceptability of *atomlessity, cited above and contrasted with atomlessness.
However, the level-ordering model is notorious for empirical and theoretical weaknesses (numerous counterexamples, dual membership of affixes, vague predictions of ordering within levels; cf. for example Fabb 1988, Plag 1999, Hay and Plag 2004). Already in the stratal allocation of affixes represented above, inaccuracies may be observed: the stratum 2 suffix -ize is theoretically precluded from feeding -ate and -ion suffixation despite obvious counterexamples (e.g. computerization). Likewise, the combination -ability (e.g. readability) is difficult to account for in level-ordering on the grounds that -able, a stratum 2 affix, is followed by the stratum 1 affix -ity.
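The prediction of strict level-ordering, and the counterexamples just mentioned, can be made concrete in a few lines of code. The following is only a minimal sketch under stated assumptions: the stratum numbers follow the hierarchy quoted above, -able and -ize (listed in both classes there) are treated as stratum 2 as in the discussion of the counterexamples, and the names STRATUM, level_ordering_allows and mispredicted are hypothetical labels introduced here for illustration.

```python
# Stratum assignments taken from the hierarchy quoted in the text.
# Assumption: -able and -ize, which appear in both classes, are treated
# as stratum 2 here, mirroring the discussion of the counterexamples.
STRATUM = {
    "-ion": 1, "-ity": 1, "-al": 1, "-ic": 1, "-ous": 1, "-ive": 1,
    "-ness": 2, "-less": 2, "-hood": 2, "-ful": 2, "-able": 2, "-ize": 2,
}


def level_ordering_allows(inner: str, outer: str) -> bool:
    """True if strict level-ordering lets `outer` attach outside `inner`.

    The principle modelled is only 'stratum 1 before stratum 2': a
    stratum 1 suffix may not follow a stratum 2 suffix.
    """
    return STRATUM[outer] >= STRATUM[inner]


# (inner suffix, outer suffix, example word, acceptable according to the text?)
CASES = [
    ("-ic", "-ity", "atomicity", True),
    ("-less", "-ity", "atomlessity", False),
    ("-less", "-ness", "atomlessness", True),
    ("-ize", "-ion", "computerization", True),
    ("-able", "-ity", "readability", True),
]


def mispredicted(cases=CASES):
    """Return the example words for which the model's prediction is wrong."""
    return [
        word
        for inner, outer, word, acceptable in cases
        if level_ordering_allows(inner, outer) != acceptable
    ]


if __name__ == "__main__":
    print(mispredicted())
```

Under these assumptions the model correctly excludes *atomlessity but wrongly excludes computerization and readability, which is the over-restrictiveness discussed in the text.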
33 Affixes grouped at the same level are believed to share many other features (cf. any works of the level-ordering literature cited above). We only discuss the principle that ‘stratum 1 affixes always before stratum 2 affixes’ because it is crucial in affix ordering.
In the light of these problems with the model, we conclude that level-ordering cannot fully determine permissible affix combinations in English. An alternative way to model affix ordering is by selectional restrictions understood as “affix-particular properties governing the kinds of combinations that are allowed for that affix. Such restrictions can refer to phonological, morphological, syntactic or semantic characteristics of the elements to be combined.” (Plag and Baayen 2008: 2). An advantage of this approach is that each affix, not a group of affixes, is seen as combining with a specified set of morphological elements – this results in maximized accuracy of description. For example, the verbal suffix -en (in blacken) is known to attach exclusively to monosyllables ending in an obstruent (Plag 1999), which must be considered a very idiosyncratic and affix-particular requirement on the part of this suffix. By considering each affix on its own, descriptive adequacy is ensured.
On the other hand, modeling affix ordering through selectional restrictions may be criticized on the grounds that it is uneconomical, as the benefit of generalizing across affixes is lost – “[a]fter all, linguists want to believe that language in general and derivational morphology in particular is not just an accumulation of idiosyncrasies” (Hay and Plag 2003: 8).
In Hay’s (2003) model, affix–affix combinations are determined by considerations of phonological separability: affixes that are easily parsed out occur outside less separable affixes (notice the resemblance to level-ordering analyses, where stratum 1 affixes do tend to be less parsable than stratum 2 affixes). Structurally speaking, this entails that, for example, consonant-initial affixes, because they are more easily parsed out, follow vowel-initial affixes (e.g. -ous-ness). Parsability is also facilitated by considerations of frequency and phonotactics. First, an affix is more decomposable when the frequency of the derivative is lower than that of the base form (e.g. -ment is more parsable in discernment than it is in government) (Hay 2003: Chapter 8). Crucial here is the realization that one and the same affix may be more or less separable in different words, and may therefore combine with other affixes differently. This is precisely why governmental is attested while *discernmental is not: -ment in government is less parsable than it is in discernment and therefore may further be followed by -al in governmental, but not in *discernmental.
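The role of relative frequency in this argument can be sketched as follows. This is a deliberately crude illustration, not Hay’s own procedure: her account treats decomposability as gradient, whereas the sketch below uses a hard cut-off at a ratio of 1, and the token counts are invented placeholders rather than corpus figures. The function names relative_frequency and likely_access are likewise hypothetical.

```python
def relative_frequency(derivative_freq: float, base_freq: float) -> float:
    """Frequency of the derivative relative to the frequency of its base."""
    if base_freq <= 0:
        raise ValueError("base frequency must be positive")
    return derivative_freq / base_freq


def likely_access(derivative_freq: float, base_freq: float) -> str:
    """Classify a derivative by its relative frequency.

    Assumption: a hard threshold at 1 stands in for what is a gradient
    effect in the account summarized in the text.
    """
    if relative_frequency(derivative_freq, base_freq) > 1:
        return "whole-word"   # derivative more frequent than its base
    return "decomposed"       # base at least as frequent: affix is parsed out


# Invented token counts, for illustration only (not corpus data).
HYPOTHETICAL_COUNTS = {
    "government": (500, 50),   # (derivative, base 'govern')
    "discernment": (5, 40),    # (derivative, base 'discern')
}

if __name__ == "__main__":
    for word, (derived, base) in HYPOTHETICAL_COUNTS.items():
        print(word, likely_access(derived, base))
```

On this simplified reading, a word classified as "whole-word" is one in which the inner suffix is weakly parsable and may therefore be followed by a further suffix such as -al.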
Secondly, low probability or illegal phoneme transitions render an affix more segmentable. Low probability transitions are here understood as ones unlikely to occur morpheme-internally (such as /pf/) but possible at a morpheme boundary, e.g. in pipeful (Hay and Baayen 2003: 8). In such an analysis, pipeful is more decomposable than, for example, bowlful (Hay 2003: 155). All in all, Hay’s treatment of affix ordering offers a more reliable description than level-ordering, given the tendency of level-ordering to be both over-restrictive and over-powerful. Although the two approaches coincide to a certain degree, Hay’s theory is more discerning and flexible; it avoids across-the-board generalizations and instead recommends the combined consideration of a variety of factors: segmentability, relative frequency and phonotactics. Comparing Hay’s Complexity-Based Ordering with affix-particular selectional restrictions, it seems that Hay’s model may have the advantage of being a self-contained theoretical model that is able to capture important generalizations about the ordering of affixes. Selectional restrictions, on the other hand, aim and excel at item-specific descriptive adequacy, without any pretence at indicating recurrent
patterns across affixes. The debate about how to successfully model affix ordering has not yet been resolved. It is perhaps best to look for an effective solution in the interplay of a variety of factors, as in the analysis of Hay and Plag (2003), where parsing restrictions and selectional restrictions are found to coincide or complement one another.
The selectional restrictions of individual affixes will inevitably find their way into the next section of our discussion – we move away from generalizing about English affixes at large and narrow down to a selection of nominalizing suffixes: -ness, -ity, -ion and -(c)y. We will consider these formatives in root–suffix and suffix–suffix combinations at the word-final position and investigate the relevance of internal structure in register variation. Additionally, the suffix -ment will be considered as attaching to prefixed forms. 34 We noted above that the verbal suffix -en only attaches to monosyllabic bases ending in obstruents – these are neatly specified selectional restrictions of the suffix. However, few English affixes can be dealt with as conveniently and specifically. 35 In many cases, selectional restrictions are stipulated by simply listing the range of affixes that a given affix may attach to. Such is the case with -ity, -ion, and -(c)y; -ness is the simplest to tackle in that it does not seem to be restricted at all. Below we specify selectional restrictions of -ness, -ity, -ion, -ment and -(c)y as regards root–suffix and suffix–suffix combinations.
-ness seems unrestricted in its attachment to any adjective, regardless of the base form’s structural or semantic properties (Fabb 1988: 535, Hay and Plag 2003: 25). This amounts to the claim that all simplex adjectival bases and all adjective-forming suffixes can precede -ness.
-ity attaches to Latinate adjectival bases (Marchand 1969: 312, Szymanek 1989: 157, Lieber 2005: 407).
The suffix typically follows simplex bases (rigidity), -able/-ible (adaptability), -ive (transitivity), -ile (docility), -al (brutality), -ous (generosity) and -ic (periodicity) (Szymanek 1989: 158-161). Overall, for reasons of phonotactics, eligible adjectival suffixes preceding -ity seem to be limited to those ending in a consonant.
-ion is represented by several allomorphs surfacing in nominalization (Plag 1999: 114). It attaches to verbs ending in -ify and emerges as -cation (codification). Bases in -ize require the suffix to surface as -ation (neutralization). When attached to simplex bases ending in /t/ and verbs in -ate, the base-final [t] changes to [ʃ] (pollution, automation) (cf. Szymanek for a detailed discussion of representation of -ion).
34 See below for the rationale behind the selection of the suffixes.
35 Cf. Hay and Plag (2003: Table 5). For instance, many affixes attach to bases of broadly defined criteria, e.g. the semantic requirement on bases of -ship to be a person-denoting noun.
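The allomorphy of -ion described above can be illustrated with a purely orthographic sketch. It is an assumption-laden simplification: it operates on spelling rather than on phonological form, it covers only the three patterns exemplified in the text, and it says nothing about which verbs actually license -ion. The function name ion_nominalization is hypothetical.

```python
from typing import Optional


def ion_nominalization(verb: str) -> Optional[str]:
    """Return the spelled -ion nominalization for the patterns in the text.

    Orthographic approximation only (an assumption of this sketch):
      -ify  -> -ification  (codify   -> codification)
      -ize  -> -ization    (neutralize -> neutralization)
      -te   -> -tion       (pollute -> pollution, automate -> automation)
    Any other verb shape is left undecided and yields None.
    """
    if verb.endswith("ify"):
        return verb[:-1] + "ication"   # drop final 'y', add '-ication'
    if verb.endswith("ize"):
        return verb[:-1] + "ation"     # drop final 'e', add '-ation'
    if verb.endswith("te"):
        return verb[:-1] + "ion"       # drop final 'e', add '-ion'
    return None


if __name__ == "__main__":
    for v in ["codify", "neutralize", "pollute", "automate", "walk"]:
        print(v, "->", ion_nominalization(v))
```

The ordering of the checks matters only in that the more specific endings (-ify, -ize) are tested before the general -te case.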
-ment favours disyllabic bases with final stress and prefixed bases, although these are only notable tendencies (Lieber 2005: 407). Plag (1999: 73) asserts: “in general, the specification of the domain of -ment is extremely difficult.”
-(c)y attaches to three types of base forms: adjectives in -ate (degenerate – degeneracy), adjectives in -ant/-ent (redundant – redundancy) and nominal bases (delinquent – delinquency) (cf. Plag 1999: 85, Plag 2003: 110-111).
With these rule-specific restrictions in mind, we now move on to look at the influence of base-internal morphological structure on register variation.
4.3.2. Register variation among -ness, -ity, -ion, -ment and -(c)y nominalizations: structural effects
Because our objective here is the study of morphological base form patterns, our discussion focuses on those suffixes which attach to several types of base forms of a particular morphologically definable kind. For example, the suffix -(c)y is included for analysis as it regularly attaches to three types of base forms: adjectives in -ate (legitimate – legitimacy), adjectives in -ant/-ent (redundant – redundancy) and nominal bases (delinquent – delinquency). 36 In contrast, the suffix -ship is ignored in this section as it attaches to nominal bases whose morphological make-up does not allow any clear classification into distinct word-formational types. In all, five of the original twelve suffixes will be discussed below: -ness, -ity, -ion, -ment, and -(c)y.
Linguistic literature has long seen -ness nominalizations as more frequent and significant in fiction than in any other register (see section 1). That is also the picture emerging from the foregoing discussion (see Figure 11 below). Yet any such claims must inevitably be rectified once -ness nominalizations are investigated more closely. Below we present our findings on the effect of base forms’ morphological structure on register distinctions.
Both established and innovative forms are included in the analysis. Table 12 shows normalized token frequencies across the registers with reference to the morphological structure of the base form. 37 Figure 11 is reproduced for convenience of comparison.
36 Adjectival base types of other kinds are occasionally also involved (e.g. graphicacy, paramountcy) but these are infrequent (14 types) and too heterogenous to form a morphologically coherent group.
37 Note that frequencies observed for combinations of suffixes, when added together for a particular register, do not equal their respective total counts from Figure 11. This is because some of the items considered in the total counts did not fit any of the suffix combination templates. For example, while examining -ness, we have ignored certain derivational types, namely -al+ness, -ary+ness, -ate+ness. The same applies to our analysis of -ity, -ion and -ment. For example, sanctity and humility did not qualify for the [simplex (independent) root+ity] template on the grounds of their opacity. However, items exhibiting typical allomorphic alternations such as in toxic - toxicity were included in the [simplex root+ity] type.
Figure 11 Normalized token frequencies of -ness across registers [bar chart not reproduced; values: Spoken 194, Fict 896, News 403, Acad 797, Non-Acad 632, Pop 504]
Base type (example)                  Spoken   Fict   News   Acad   Non-Acad   Pop
simplex root+ness (fakeness)         148      711    298    416    393        356
simplex root+y+ness (creepiness)     3.7      25     11     6      11         25
-ful+ness (stressfulness)            1.8      18.8   6.4    25.4   17.7       12.4
-ish+ness (quirkishness)             0.3      3.5    1.6    0.9    1.7        2.5
-ous+ness (curvaceousness)           10       43     19     104    51         27
-ed+ness (datedness)                 4        15     5      21     13         10
-ive+ness (declarativeness)          8        12     18     109    63         27
-less+ness (depthlessness)           7        29     17     32     28         13
-ing+ness (reassuringness)           3        5      10     20     19         9
Table 12 Frequencies of -ness word tokens per 1 million tokens of text
Our findings above clearly show that further sub-division of -ness nominalizations reveals additional register variation. Although fiction has the highest frequency of -ness words overall, the only two groupings of morphological features for which -ness words are the most frequent in fiction are simplex+ness and -ish+ness. In other cases, -ness is as representative of fiction as it is of some other registers (root+y+ness, -ful+ness, -ed+ness, -less+ness). In yet other cases, fiction is outnumbered substantially by frequency counts in other registers, for example in academic texts (-ous+ness, -ive+ness and -ing+ness). In view of these facts, our perception of the distribution of -ness across registers needs revision in order to allow for these newly-found patterns. The highest total count of -ness derivatives in fiction is predominantly attributable to items conforming to the morphological template simplex adjectival root+ness (711 items out of the total count of 896, see Table 12). The occurrence of -ish+ness words, although highest in fiction (3.5 items), is here negligible. Otherwise, all other instantiations of -ness may be predicted to be equally or less frequent in fiction than in at least one other of the six registers (see Table 12). 38 On a more global scale then, claims about a universal preference of an affix for any one register may be rejected as inadequate and superficial.
Admittedly, this inconsistency of -ness is not entirely haphazard. It seems to be the case that register preferences of particular base form types predetermine the varied distribution of -ness. In particular, simplex nouns are preferred in less formal registers such as spoken language and fiction (Biber et al. 1999: 322-323) and, presumably, the same also holds true of derived nouns in the sense that, in those registers, simplex roots are the preferred bases for -ness suffixation. With derived adjectives acting as base forms, the adjectival suffix itself may be an important factor.
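The register preferences read off Table 12 can be checked mechanically. The sketch below is an illustration only: the figures are those of Table 12 as printed (tokens per 1 million words, columns in the order Spoken, Fict, News, Acad, Non-Acad, Pop), while the names TABLE_12, REGISTERS and top_registers are hypothetical labels introduced here.

```python
REGISTERS = ["Spoken", "Fict", "News", "Acad", "Non-Acad", "Pop"]

# Values as printed in Table 12 (tokens per 1 million words of text).
TABLE_12 = {
    "simplex root+ness":   [148, 711, 298, 416, 393, 356],
    "simplex root+y+ness": [3.7, 25, 11, 6, 11, 25],
    "-ful+ness":           [1.8, 18.8, 6.4, 25.4, 17.7, 12.4],
    "-ish+ness":           [0.3, 3.5, 1.6, 0.9, 1.7, 2.5],
    "-ous+ness":           [10, 43, 19, 104, 51, 27],
    "-ed+ness":            [4, 15, 5, 21, 13, 10],
    "-ive+ness":           [8, 12, 18, 109, 63, 27],
    "-less+ness":          [7, 29, 17, 32, 28, 13],
    "-ing+ness":           [3, 5, 10, 20, 19, 9],
}


def top_registers(row):
    """Return every register that shares the highest value in a row."""
    best = max(row)
    return [reg for reg, value in zip(REGISTERS, row) if value == best]


if __name__ == "__main__":
    for template, row in TABLE_12.items():
        print(f"{template:22s} {', '.join(top_registers(row))}")
```

Ties are reported rather than broken, so that a template such as root+y+ness, where two registers share the top value, is not attributed to a single register.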
The distribution of suffixes such as -ous and -ive, which represent learned vocabulary and are therefore more frequent in academic writing (Biber et al. 1999: 532), may be the very reason for the high number of words in -ousness and -iveness in academic texts. Similarly, the suffixes -ish and -y may safely be regarded as more characteristic of less formal registers and thus explain the preponderance of words in -ish+ness and -i+ness in those registers (see Table 12). However, this correlation does not work without exceptions. Despite the fact that the suffixes -less and -ful are by a narrow margin the most common in fiction, as shown by Biber et al. (1999), derivatives in -less+ness and -ful+ness are found in this study to be somewhat more frequent in academic texts (see Table 12). Both stem-final suffixes (-less and -ful) and the suffix -ness are more characteristic of fiction and thus, theoretically speaking, their combination would be expected to be an even stronger force driving words in -less+ness and -ful+ness towards fiction. This, however, is not the case. Similarly, the affix combinations -ed+ness and -ing+ness are more common in academic discourse. One plausible explanation here is that these are cases of several interacting patterns: in the first, the nature of the base form imposes a certain patterning on the part of the derivative (e.g. bases in -less and -ful push -ness words towards fiction); in another, -ness is more common in fiction; in yet another, nominalizations on the whole tend to gravitate towards more formal registers.

[38] Nominalizations in -ari+ness, -al+ness and -ate+ness have not been considered here as separate suffix combinations although they are included in the total -ness counts (Figure 10). The reason is that they are very rare in the BNC, both as types (around a dozen each) and tokens. Still, we have noted that all three types are most frequent in academic prose.

Turning now to an analysis of frequency proportion between the registers, we compare the ratios of total -ness frequencies cited in Figure 11 with the ratios of suffix combination frequencies. The total -ness frequencies yield the following ratios:

              1:2     1:3     1:4     1:5     1:6
-ness        1.12    1.41    1.77    2.22    4.61
Table 13 Register-to-register ratios for -ness overall (‘1’ stands for the register that scores highest for a given suffix, ‘2’ stands for the second highest, ‘3’ stands for the third highest, etc.)

The suffix strings identified earlier are now compared with one another with regard to the quantitative divergence between registers.

                 1:2     1:3     1:4     1:5     1:6
root+ness       1.70    1.80    1.99    2.38    4.80
root+y+ness     1.00    2.27    2.27    4.16    6.75
-ful+ness       1.35    1.43    2.04    3.96   14.11
-ish+ness       1.40    2.05    2.18    3.88   11.66
-ous+ness       2.03    2.41    3.85    5.47   10.40
-ed+ness        1.40    1.61    2.10    4.20    5.25
-ive+ness       1.73    4.03    6.05    9.08   13.62
-less+ness      1.10    1.14    1.88    2.46    4.57
-ing+ness       1.05    2.00    2.22    4.00    6.66
Table 14 Register-to-register ratios per suffix combination of -ness
We have concluded earlier that, in general, compared with the other three suffixes, -ness words are more evenly distributed across registers. From Table 14 it appears that particular suffix combinations contribute differently to the overall distributional pattern of -ness. The combinations -ive+ness and -ous+ness, which have been identified above as more academic-like, are perhaps the most varied in their distribution across registers, as indicated by relatively high frequency ratios. On the other hand, words in -less+ness and, to a lesser extent, words containing unsuffixed base forms, seem the most evenly distributed. On this basis we assume that a type of base form that is relatively frequent in fiction will also be relatively evenly distributed across other registers. Conversely, the more a type of base form leans towards academic texts, the more unbalanced will be its spread across the registers.
It is now vital to investigate further in order to establish whether similar claims hold for the other three suffixes. Below a similar sub-division of -ity nominalizations is presented along with token frequencies per register. Figure 12 is also reproduced for convenience of comparison.

Figure 12 Normalized token frequencies of -ity across registers (bar chart: spoken 621; fict 801; news 1,217; acad 3,670; non-acad 2,608; pop 1,248)

                                Spoken    Fict    News    Acad   NonAcad     Pop
simplex root+ity (fraility)        243     350     563    1306      1079     598
-able+ity (deniability)            109      60     144     453       340     176
-al+ity (annuality)                 60      97     117     549       260     150
-ic+ity (crypticity)                58      34      89     100       115      64
-ous+ity [39] (fibrosity)           20      81      54     166       112      64
-ile+ity (virility)                  7      23      27     130        77      43
-ive+ity (tentativity)              53      34      60     442       286     105
Table 15 Frequencies of -ity word tokens per 1 million tokens of text

Internal division of -ity nominalizations does not reveal quite as many further register distinctions as in the case of -ness. Rather, our results in Table 15 largely coincide with those observed for total frequency counts of -ity in Figure 12. However, there are several differences to point out. Firstly, although fiction scores more -ity words than spoken language overall, some pairings of affixes are preferred in speech, namely -able+ity, -ic+ity, and -ive+ity (Table 15). Secondly, although newspapers employ twice as many total nominalizations as fiction and consistently generate more -ity nominalizations than fiction, the -ous+ity combination is considerably more frequent in fiction. This is also confirmed by our findings of word type counts (section 4), which indicate a general type-and-token preference on the part of this affix pairing.
More importantly, however, as noted above in our analysis of register-to-register ratios (Table 11), of the four suffixes under discussion, the suffix -ity introduces the most quantitative distancing between registers. Below we consider how particular affix pairings contribute to this characteristic of -ity. The total -ity frequencies cited in Figure 12 yield the following ratios:

             1:2     1:3     1:4     1:5     1:6
-ity        1.40    2.94    3.01    4.58    5.90
Table 16 Register-to-register ratios for -ity (‘1’ stands for the register that scores highest for a given suffix, ‘2’ stands for the second highest, ‘3’ stands for the third highest, etc.)
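The register-to-register ratios can be recomputed from the normalized frequencies given with Figure 12. The following is only a minimal sketch: the function name register_ratios is hypothetical, and the use of truncation rather than rounding to two decimals is an assumption inferred from the printed values (e.g. 3,670 / 621 = 5.909..., printed as 5.90), not a procedure stated in the text.

```python
# Normalized -ity token frequencies per register, as given with Figure 12.
ity_tokens = {
    "spoken": 621, "fict": 801, "news": 1217,
    "acad": 3670, "non-acad": 2608, "pop": 1248,
}

def register_ratios(freqs):
    """Divide the highest frequency by each lower one (1:2 ... 1:6).

    Quotients are truncated to two decimals using integer arithmetic.
    Truncation (not rounding) is an assumption inferred from the table;
    the input frequencies must be integers.
    """
    ranked = sorted(freqs.values(), reverse=True)
    top = ranked[0]
    return [(top * 100 // lower) / 100 for lower in ranked[1:]]

ratios = register_ratios(ity_tokens)
print(ratios)  # [1.4, 2.94, 3.01, 4.58, 5.9]
```

Under these assumptions the output agrees with the -ity row of Table 16. The same function applied to a single row of Table 15 gives the corresponding row of per-combination ratios.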
We now compare the above quotients with those calculated for specific affix combinations.

               1:2     1:3     1:4     1:5     1:6
root+ity      1.21    2.18    2.31    3.73    5.37
-able+ity     1.33    2.57    3.14    4.15    4.15
-al+ity       2.11    3.66    4.69    5.65    9.15
-ous+ity      1.48    2.04    2.59    3.07    8.30
-ile+ity      1.68    3.02    4.81    5.65   18.57
-ic+ity       1.15    1.29    1.79    1.98    3.38
-ive+ity      1.54    4.20    7.36    8.33   13.00
Table 17 Register-to-register ratios per suffix combination of -ity

[39] The suffix -ous in the derived nominalizations may either surface as an allomorphic alternation (numerosity) or sometimes it is truncated (ambiguity). Both types are included in this category.

It appears that particular suffix combinations contribute differently to the quantitative distancing of registers relative to one another. The fact that -ic+ity has the lowest ratios implies that word tokens ending in that string are the most evenly distributed of all. Derivatives of the type [root+ity] are the second most evenly distributed type. Other combinations bring about greater divergence in frequency between registers.
Let us now focus on a similar sub-division of -ion nominalizations, presented in Table 18 along with token frequencies per register. Figure 13 is reproduced below to allow comparison of findings.

Figure 13 Normalized token frequencies of -ion across registers (bar chart: spoken 3,510; fict 2,845; news 6,468; acad 15,822; non-acad 12,032; pop 6,146)
We have noted earlier that newspapers are in general more prolific in nominalizations than popular magazines (Figure 2). It has also been established that -ion derivatives overall are slightly more frequent in newspapers (Figure 13). And yet further sub-division of -ion reveals that it is only due to -ate+ion derivatives that newspaper language has more -ion nominalizations (Table 18). All other suffix combinations with -ion have higher frequencies in popular magazines. As the difference in the number of -ate+ion words between newspapers and popular magazines is substantial, it is in itself responsible for the overall higher frequency count of all -ion words in newspapers. The uneven qualitative distribution of suffix combinations is thus a noteworthy observation.

                                         Spoken    Fict    News    Acad   NonAcad     Pop
unsuffixed root+ation (preventation)        743     763    1148    3251      2401    1473
-ate+ion (fundoplication)                  1026     734    2055    4816      3783    1645
-ize+ation (autonomization)                  91      25       3     383       325      21
-ify+cation (extensification)                52      23      58     377       223      79
unsuffixed root+(it)ion (spendition)       1269    1111    2886    5833      4751    3187
Table 18 Frequencies of -ion word tokens per 1 million tokens of text

Nominalizations in -ize+ation appear highly biased towards academic and, to a lesser extent, non-academic prose. They are sparse virtually everywhere else and particularly so in fiction, newspapers and popular magazines. Interestingly, in this respect, spoken language substantially outnumbers the above-mentioned three registers (Table 18). Frequency proportions between registers in other suffix combinations seem to vary too. Below we consider each of them in turn, beginning with ratios of total -ion frequencies as specified in Figure 13:

             1:2     1:3     1:4     1:5     1:6
-ion        1.31    2.44    2.57    4.50    5.56
Table 19 Register-to-register ratios for -ion overall (‘1’ stands for the register that scores highest for a given suffix, ‘2’ stands for the second highest, ‘3’ stands for the third highest, etc.)

The above is to be compared with the following ratios calculated for specific suffix combinations:

                  1:2     1:3     1:4     1:5      1:6
root+ation       1.35    2.20    2.83    4.26     4.37
-ate+ion         1.27    2.34    2.92    4.69     6.56
-ize+ation       1.17    4.20   15.32   18.23   127.66
-ify+cation      1.69    4.77    6.50    7.25    16.39
root+(it)ion     1.22    1.83    2.02    4.59     5.25
Table 20 Register-to-register ratios per suffix combination of -ion

It is rather evident that root base forms give rise to -ion nominalizations that are the most evenly spread across the registers, as well as being among the most frequent. Words ending in -ize+ation, which we have noted to be highly biased towards academic and non-academic prose, are perhaps the most unevenly distributed. Note that a similar pattern has been shown to hold for the distribution of -ous+ness and -ive+ness, also typical of academic prose and also the most unevenly distributed. Similarly, nominalizations in -al+ity, -ile+ity, and -ive+ity, preferred in academic texts, tend to be relatively infrequent in other registers. This is another indication of a tendency we have noted before: typically academic suffix combinations are infrequent in all other registers whereas combinations which are relatively frequent in fiction are evenly distributed across other registers.
Below, the suffix -ment is analyzed in a similar fashion. This time, two types of base forms are considered. Figure 14 is repeated for comparison of frequencies.

Figure 14 Normalized token frequencies of -ment across registers (bar chart: spoken 1,664; fict 958; news 3,141; acad 4,843; non-acad 5,526; pop 2,104)
The first template, X-ment, accounts for the great majority of word types sampled; thus, the results for X-ment in Table 21 correlate closely with those in Figure 14. The frequency proportions of en-X-ment, however, are quite different.

                              Spoken    Fict    News    Acad   NonAcad     Pop
X-ment (configurement)          1629     895    3060    4637      5376    2023
en-X-ment (ensheathment)          35      63      81     206       150      81
Table 21 Frequencies of -ment word tokens per 1 million tokens of text

Apparently the reason for this divergence is that many X-ment words are high-frequency items (e.g. government, development, management), whose frequency matches, indeed determines, the overall patterning and high token frequency of total -ment derivatives across all registers. On the other hand, en-X-ment nominalizations are but a minor type alongside the dominant one, which happens to follow a different distributional pattern. Firstly, unlike in the X-ment type, fiction scores twice the number of en-X-ment items found in spoken language. Secondly, newspapers and popular magazines are on the same level, as opposed to the 30 per cent contrast between the two in the top row of Table 21. Thirdly, academic and non-academic texts pattern alternately as the leader in the distribution of one or the other type of base form. This observation is further supported by analogous findings of word type counts (section 4.4), which indicate an inclination of the X-ment type towards non-academic texts and of the en-X-ment type towards academic prose, thus implying a functional divergence.
The suffix -(c)y is taken under consideration below. Figure 25 plots total frequencies of the suffix across the registers while Table 22 breaks down the total into respective base forms.

Figure 25 Normalized token frequencies of -(c)y across registers (bar chart: spoken 148; fict 159; news 393; acad 748; non-acad 681; pop 313)

                              Spoken    Fict    News    Acad   NonAcad     Pop
-ant+(c)y (reflectancy)           70      70     152     381       324     153
-ate+(c)y (appropriacy)           21      49      43     151        94      54
noun+(c)y (infancy)               55      33     177     152       233      83
Table 22 Frequencies of -(c)y word tokens per 1 million tokens of text

Perhaps the only points where the data from the table depart significantly from those in Figure 25 are the relatively low count of -ate+(c)y items and the high frequency of noun-based items in newspapers. The abundance of nominalizations of the latter type may be due to the fact that most of them, and especially the most popular and frequent, are associated with journalism, politics and current affairs (presidency, candidacy, papacy, delinquency, constituency, accountancy).
In summary, this section has analyzed the internal composition of nominalizations and shown explicitly that, assuming the same rightmost suffix, various suffix combinations and types of base forms pattern differently with regard to register preferences. This is especially evident in the case of the suffix -ness, where, depending on particular affix pairings, the highest values of frequency fluctuate across the registers, including the two polarized extremes of fiction and academic prose. It has also been noted that the varied distribution of -ness in most cases overlaps with the distributional preferences of the stem-final suffix (or lack thereof). We have concluded that claims about an affix’s register distribution must necessarily be revised to accommodate finer distinctions of combinations of that affix with distinct types of base forms.
To conclude, we refer back to the first set of research questions put forward at the beginning of this chapter. The BNC data have produced results very similar to those reported by other authors when it comes to estimated total frequencies of all nominalizations per register and the increasing number of nominalizations observable in the sequence: fiction < spoken < pop < news < non-academic < academic. Academic prose has long been recognized as the most productive of nominalizations and this is borne out by our findings. The abundance of -ness words in fiction, as noted by Biber et al. (1998) and (1999), has also been confirmed.
Other than that, individual suffixes have received little treatment in the study of register variation. To rectify this situation, we have offered a more in-depth quantitative analysis of each suffix and each register. Most importantly, we have then adopted a more detailed qualitative approach by considering the types of affix combinations that may bear on the distributional preferences of the rightmost suffix. The morphological structure of base forms has indeed proved a significant distributional factor: whether the stem is simplex or complex, and, if complex, whether the stem-final suffix itself is preferred in any particular register. This in turn served us as another basis for considering register variation at a completely new level, one where affix pairings, not single affixes, are considered. As a result, we have revised previous claims of affix distribution so as to fit the newly found patterns.

4.4. Morphological productivity and lexical innovations

In this section we discuss the same twelve suffixes with respect to their potential to form distinct word types. In our discussion above, observations of register variation were all based on frequency of occurrence and this in turn is customarily measured by means of the number of word tokens. In contrast, investigations of morphological productivity will be more likely to benefit from counts of different word types rather than reports of the number of times one and the same form occurs in a corpus. For example, to declare that there are 1,007 -ity word types in our sample of the BNC is a better reflection of the productivity of this suffix than to say that the word security occurs in the same corpus 13,677 times. Token and type frequencies thus indicate two different dimensions of a suffix’s usability. In Procedure, we stated total type counts for the twelve suffixes, and the five top scoring were: -ion 1,752, -ness 1,700, -ity 1,007, -ance/-ence 312 and -ment 302. The two dominating suffixes, -ion and -ness, are clearly within the same range. Yet this is to be contrasted with token frequencies of the same suffixes which amount to (normalized per one million tokens) 8,365 for -ion and 618 for -ness.[40] The suffix -ness may therefore be said to be used extensively in a great number of lexemes, possibly new.
Also, these words will be of relatively low frequency. On the other hand, the suffix -ion is used in approximately the same number of lexemes but with a much higher frequency. The suffixes -ity and -ment stand in a similar relation to each other: word types in -ity are over three times as numerous as those in -ment but the latter suffix is more common in token counts (1,848 tokens in -ity and 3,241 in -ment). These two dimensions of quantitative discrepancies between individual suffixes can be represented graphically as follows:

Figure 10 Normalized token frequencies across registers (bar chart: suffixes -ness, -ity, -ion, -ment; registers Spoken, Fict, News, Acad, Non-Acad, Pop)
Figure 26 Raw word type counts across registers (bar chart: suffixes -ness, -ity, -ion, -ment; registers Spoken, Fict, News, Acad, Non-Acad, Pop)

[40] The same numerical relations may be expressed in the form of token-type ratio as: 4.8 for -ion, 0.4 for -ness, 1.8 for -ity and 10.7 for -ment. Higher values of the quotient indicate high frequency words and/or a low number of word types. We give each suffix’s value of the ratio as the discussion proceeds.

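The token-type ratios quoted in footnote 40 follow directly from the figures given in the running text. The sketch below is illustrative only: the helper name token_type_ratio is hypothetical, and rounding to one decimal place is an assumption that happens to match the printed values. Note that the quotient divides a normalized token frequency (per one million tokens) by a raw type count, exactly as the text does.

```python
# (normalized tokens per million, raw type count), as quoted in the text.
suffixes = {
    "-ion": (8365, 1752),
    "-ness": (618, 1700),
    "-ity": (1848, 1007),
    "-ment": (3241, 302),
}

def token_type_ratio(tokens, types):
    """Token frequency divided by type count, rounded to one decimal (assumed)."""
    return round(tokens / types, 1)

ratios = {suffix: token_type_ratio(tok, typ) for suffix, (tok, typ) in suffixes.items()}
print(ratios)  # {'-ion': 4.8, '-ness': 0.4, '-ity': 1.8, '-ment': 10.7}
```

A low value, as for -ness, corresponds to many distinct lexemes that each occur rarely; a high value, as for -ment, corresponds to few lexemes that occur very often.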
With regard to the joint distribution of nominalizations across the registers, token frequencies and type counts of the twelve suffixes are compared below. Figures 16 and 27 plot the joint distributions of all the twelve suffixes.

Figure 16 Normalized joint token frequencies of the twelve suffixes across registers (bar chart: registers Spoken, Fict, News, Acad, Non-Acad, Pop)
Figure 27 Raw joint types of the twelve suffixes across registers (bar chart: registers Spoken, Fict, News, Acad, Non-Acad, Pop)

A comparison of Figures 16 and 27 shows that fictional texts are lexically richer in word type counts than may be expected from token frequency considerations. This is mostly attributable to the abundance of -ness word types in fiction (see below for a discussion of -ness types). Otherwise, the distributional pattern of nominalizations is essentially the same in both charts. The pages to follow will briefly consider each affix’s word tokens and word types to compare the merits of the two measures and discuss their respective weight for the purposes of this study. Next, we will discuss distributional facts pertaining to morphological productivity proper.
Our investigations of morphological productivity are of two kinds: productivity in the broader sense, understood as the range of different word types, established and innovative, produced by an affix (Baayen’s vocabulary size), and productivity at its most essential, understood as the capacity of a morphological rule, here an affix, to form new lexemes. The latter in particular is our main objective. We aim to establish the degree of productivity of the four suffixes based on new arrivals to the lexicon of English.[41] Therefore we isolate innovative nominalizations in the manner stated in Methodology and study word-formational patterns that influence the probability of a new word coming into existence. Additionally we will note any patterns of register variation that may present themselves as the discussion proceeds.

[41] The reasons behind adopting this approach to measuring productivity were discussed in section 1. The inapplicability of Baayen’s hapax-conditioned measurements is also considered therein.

The suffix -ness (token-type ratio: 0.4)

In general, the suffix -ness yields word types and word tokens in approximately the same proportions of register distribution (see Figures 28 and 11 below).

Figure 28 Raw type count of -ness across registers (bar chart: spoken 275; fict 918; news 490; acad 762; non-acad 787; pop 586)
Figure 11 Normalized token frequencies of -ness across registers (bar chart: registers Spoken, Fict, News, Acad, Non-Acad, Pop)
The ratios of frequencies between registers will be somewhat lower in the case of word type counts but the registers’ relative ranges of frequencies compared to one another are alike. The only immediately noticeable difference is that non-academic texts have 25 more word types than academic texts (as opposed to the reverse tendency in word token frequencies). This means that, comparing the two registers, non-academic texts employ and possibly coin more distinct word types within a smaller number of word tokens (i.e. the texts are more saturated lexically) whereas academic prose tends to repeat the same items more often. The same pattern is also present when affix combinations are taken into account individually, as attested in Tables 23, 25, 27 and 30 below. Note that the token counts are normalized, but the type counts are given raw.

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens       148     711     298     416        393     356
types        145     337     213     267        278     242
Table 23 [simplex root+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens       3.7      25      11       6         11      25
types         29     148      59      40         73     109
Table 24 [simplex root+y+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens       1.8    18.8     6.4    25.4       17.7    12.4
types         10      49      23      42         43      28
Table 25 [-ful+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens       0.3     3.5     1.6     0.9        1.7     2.5
types          4      26      15       6         17      14
Table 26 [-ish+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens        10      43      19     104         51      27
types         16      81      37      75         79      40
Table 27 [-ous+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens         4      15       5      21         13      10
types         15      57      24      61         57      29
Table 28 [-ed+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens         8      12      18     109         63      27
types         18      47      32     101         78      44
Table 29 [-ive+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens         7      29      17      32         28      13
types         12      69      32      54         61      34
Table 30 [-less+ness] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens         3       5      10      20         19       9
types          2       9       3      11          7       4
Table 31 [-ing+ness] normalized tokens and raw types across registers

Tables 23 to 31 show that either fiction or academic prose leads in word token and type counts for every type of base form. This again suggests a high degree of polarization in the preference of -ness and its affix combinations to appear with increasing frequency at either end of the fiction-academic frequency continuum. Comparing the numbers of tokens and types in other registers above, one notes how the two measures represent two different dimensions of one and the same notion: distribution. For instance, although fiction and popular magazines have the same frequency of tokens in -y+ness, fictional texts are much more productive, as indicated by a type count which is over a third higher (148 as against 109 types). Moreover, a relatively large number of tokens need not necessarily be accompanied by a proportionally high number of types and, conversely, relatively low word token frequencies are not certain to correspond to low word type counts. This is best illustrated in Table 27, where nominalizations in -ous+ness are clearly most frequent in academic prose but are nevertheless the most varied in terms of word types in fiction. The two registers are in a similar relationship, though not as dramatically apparent, in the case of -less+ness derivation in Table 30 and -ful+ness in Table 25.
Overall, compared to considerations of token frequency, the counts of word types in -ness bring more evidence of the preference of the suffix to be used in fiction. In 6 out of 9 different categories of base forms, -ness lexemes are most numerous in fiction, although sometimes by a narrow margin. This contrasts with the ratio of 2 out of 9 by the criterion of token frequencies. The two combinations root+ness and -ish+ness once again confirm their preference for fiction. Still, in three cases of affix combinations academic prose persists as the most productive genre (-ed+ness, -ing+ness and -ive+ness).
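The counts of 6 out of 9 (types) and 2 out of 9 (tokens) can be checked mechanically against Tables 23 to 31. The sketch below is an illustration rather than the procedure used in the study: the helper sole_leader is hypothetical, and treating a tie for first place (as with -y+ness tokens in fiction and popular magazines) as having no single leader is an assumption made here.

```python
REGISTERS = ["spoken", "fict", "news", "acad", "non-acad", "pop"]

# (normalized tokens, raw types) per register, copied from Tables 23-31.
TABLES = {
    "root+ness":   ([148, 711, 298, 416, 393, 356], [145, 337, 213, 267, 278, 242]),
    "root+y+ness": ([3.7, 25, 11, 6, 11, 25], [29, 148, 59, 40, 73, 109]),
    "-ful+ness":   ([1.8, 18.8, 6.4, 25.4, 17.7, 12.4], [10, 49, 23, 42, 43, 28]),
    "-ish+ness":   ([0.3, 3.5, 1.6, 0.9, 1.7, 2.5], [4, 26, 15, 6, 17, 14]),
    "-ous+ness":   ([10, 43, 19, 104, 51, 27], [16, 81, 37, 75, 79, 40]),
    "-ed+ness":    ([4, 15, 5, 21, 13, 10], [15, 57, 24, 61, 57, 29]),
    "-ive+ness":   ([8, 12, 18, 109, 63, 27], [18, 47, 32, 101, 78, 44]),
    "-less+ness":  ([7, 29, 17, 32, 28, 13], [12, 69, 32, 54, 61, 34]),
    "-ing+ness":   ([3, 5, 10, 20, 19, 9], [2, 9, 3, 11, 7, 4]),
}

def sole_leader(values):
    """Return the register with the strictly highest value, or None on a tie."""
    top = max(values)
    leaders = [reg for reg, val in zip(REGISTERS, values) if val == top]
    return leaders[0] if len(leaders) == 1 else None

token_leaders = {combo: sole_leader(tok) for combo, (tok, _) in TABLES.items()}
type_leaders = {combo: sole_leader(typ) for combo, (_, typ) in TABLES.items()}

fict_by_tokens = sum(reg == "fict" for reg in token_leaders.values())
fict_by_types = sum(reg == "fict" for reg in type_leaders.values())
acad_by_types = sum(reg == "acad" for reg in type_leaders.values())
print(fict_by_tokens, fict_by_types, acad_by_types)  # 2 6 3
```

With the tie rule stated above, fiction is the sole leader in two combinations by tokens and in six by types, while academic prose leads the remaining three by types.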
Of the two affix combinations we have earlier concluded, on the basis of token statistics, to be typically academic, only -ive+ness remains unequivocally so by the criterion of type count, while academic lexemes in -ous+ness are marginally outnumbered by those in fictional texts. Morphological productivity and token frequency distributions alike are thus shown here to be subject to register variation, although the patterns may not always overlap completely.
To conclude so far, the two measurements, of token and type counts, produce notably different results when distributional considerations are concerned. They are therefore preferably kept apart for two distinct tasks: measuring frequency of occurrence and measuring morphological productivity respectively. Thus the suffix -ness may be said to be most productive in fiction, except for the three suffix pairings -ed+ness, -ive+ness and -ing+ness, which are most productive in academic prose (none of the other registers leads in word type counts for any type of base form). At the same time, frequency of occurrence data specified as token frequency shows explicitly that the distribution of -ness is highly varied across the registers and is best presented as the distribution of its various base forms (see 4.1). Both measurements indicate that spoken language has both the lowest frequency of occurrence and the lowest productivity rate of -ness (across all the base form types).
Let us now examine another facet of productivity, understood as the potential of the suffix to allow the creation of new words. A cross-reference of the BNC and the OED (see Methodology for details and discussion) has generated the numbers of lexical innovations in -ness given in Table 32 (the formations themselves are listed in the Appendix). The final column in the table corresponds to the number of new lexemes found in the BNC outside the six sub-corpora. We include the counts of these items as relevant to the productivity of the affix.

             Spoken    Fict    News    Acad   NonAcad     Pop   Other
new types         9      41      14      31        28      35      31
Table 32 Totals of new word types in -ness across registers

Measured by the number of new word types, the suffix -ness is at its most productive in fiction and its decreasing potential to coin new words across the registers is represented by the sequence: fiction > popular magazines > academic prose > non-academic prose > newspapers > spoken language. We now turn to examine each type of base form to establish their degree of contribution to the overall productivity of -ness. Table 33 specifies the new word type counts across types of base forms and across the registers. The final column corresponds to new word types of morphological composition that does not match any of our nine word-formational templates.

             root     -y   -ful   -ish   -ous    -ed   -ive   -less   -ing   other
New types      21     39      4     10      2     42      6      14      5      25
Table 33 Totals of new word types in -ness across types of base form

The suffix -ness is at its most productive in the combination -ed+ness (as in brainedness), which generated 42 items (see Appendix for a detailed list); the least productive is -ous+ness, with only 2 novel creations. New word types in -ness are then most likely to emerge when their base form is of a particular kind, that is, in decreasing order of probability of emergence: -ed+ness, -y+ness, root+ness, -less+ness, -ish+ness, -ive+ness, -ing+ness, -ful+ness, -ous+ness. The type of base form involved is thus a factor of paramount importance in the capacity of an affix to coin new words and in the linguist’s measurement of this capacity. Below we look at how different types of bases are distributed across the registers.
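The ordering of base forms just given can be reproduced by sorting the counts in Table 33. This is a minimal sketch; the variable names are hypothetical, and the residual "other" column is left out because the text ranks only the nine named templates.

```python
# New word types in -ness per type of base form, copied from Table 33
# (the residual "other" column, 25 items, is excluded from the ranking).
new_types = {
    "root": 21, "-y": 39, "-ful": 4, "-ish": 10, "-ous": 2,
    "-ed": 42, "-ive": 6, "-less": 14, "-ing": 5,
}

# Sort base-form types from most to least productive of new lexemes.
ranking = sorted(new_types, key=new_types.get, reverse=True)
print(ranking)
# ['-ed', '-y', 'root', '-less', '-ish', '-ive', '-ing', '-ful', '-ous']
```

Since no two templates share the same count, the order is unambiguous and matches the sequence stated in the text.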

               Spoken    Fict    News    Acad   NonAcad     Pop   Other
root+ness           1       4       1       3         3       7       7
root+y+ness         6      12       7       1         5      13       5
-ful+ness           0       0       0       2         0       1       2
-ish+ness           0       1       2       0         2       4       2
-ous+ness           0       0       0       0         1       0       1
-ed+ness            0      14       0       7         9       5       8
-ive+ness           1       0       0       5         0       0       0
-less+ness          0       5       0       3         1       3       2
-ing+ness           0       0       0       2         2       0       1
other               1       5       4       8         5       2       3
Table 34 New word types in -ness across types of base form and across registers

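As a consistency check, the per-register columns of Table 34 can be summed and compared with the totals in Table 32. The sketch below assumes the column order Spoken, Fict, News, Acad, NonAcad, Pop, Other for both tables; the variable names are hypothetical.

```python
REGISTERS = ["spoken", "fict", "news", "acad", "non-acad", "pop", "other"]

# Rows of Table 34: new -ness word types per base-form type and register.
table_34 = {
    "root+ness":   [1, 4, 1, 3, 3, 7, 7],
    "root+y+ness": [6, 12, 7, 1, 5, 13, 5],
    "-ful+ness":   [0, 0, 0, 2, 0, 1, 2],
    "-ish+ness":   [0, 1, 2, 0, 2, 4, 2],
    "-ous+ness":   [0, 0, 0, 0, 1, 0, 1],
    "-ed+ness":    [0, 14, 0, 7, 9, 5, 8],
    "-ive+ness":   [1, 0, 0, 5, 0, 0, 0],
    "-less+ness":  [0, 5, 0, 3, 1, 3, 2],
    "-ing+ness":   [0, 0, 0, 2, 2, 0, 1],
    "other":       [1, 5, 4, 8, 5, 2, 3],
}

# Totals of new -ness word types per register, copied from Table 32.
table_32 = dict(zip(REGISTERS, [9, 41, 14, 31, 28, 35, 31]))

# Sum each register column of Table 34.
column_totals = {
    reg: sum(row[i] for row in table_34.values())
    for i, reg in enumerate(REGISTERS)
}
print(column_totals == table_32)  # True
```

The column sums agree with Table 32 for every register. The row sums, by contrast, need not equal the per-template totals in Table 33, presumably because one and the same new lexeme may occur in more than one register.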
Intuitively speaking, one would expect high numbers of lexical innovations to correlate with high numbers of word types overall. However, this is the case in only three out of nine combinations (-ive+ness, -less+ness and -ing+ness). Comparing our findings of total type counts and new type counts, words in -ive+ness and -ing+ness (but not -ed+ness) once again prove their academic character (in token, type and new type counts) whereas root+y+ness and -less+ness are consistently fictional in character. Notably, popular magazines take the lead in the number of new lexemes elsewhere (root+ness, root+y+ness and -ish+ness), which again testifies to the predisposition of this register to lexical innovation.

The suffix -ity (token-type ratio: 1.8)

In Figures 12 and 29 below we plot -ity word type and word token counts across the registers.

Figure 12 Normalized token frequencies of -ity across registers (bar chart: registers Spoken, Fict, News, Acad, Non-Acad, Pop)
Figure 29 Raw type count of -ity across registers (bar chart: spoken 267; fict 401; news 371; acad 744; non-acad 599; pop 416)
The proportions in word type counts indicate that the registers are not as diversified as may be inferred from token frequencies. We will now consider word type counts across the registers and types of base form.

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens       243     350     563    1306       1079     598
types         77     112      99     124        111      99
Table 35 [simplex root+ity] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens       109      60     144     453        340     176
types         44      55      78     196        151     101
Table 36 [-able+ity] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens        60      97     117     549        260     150
types         52      76      72     159        117      77
Table 37 [-al+ity] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens        20      81      54     166        112      64
types         26      54      39      65         58      46
Table 38 [-ous+ity] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens         7      23      27     130         77      43
types         10      17      15      23         18      13
Table 39 [-ile+ity] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens        58      34      89     100        115      64
types         11      11       9      54         35      13
Table 40 [-ic+ity] normalized tokens and raw types across registers

          Spoken    Fict    News    Acad   Non-Acad     Pop
tokens        53      34      60     442        286     105
types         14      15      12      50         37      18
Table 41 [-ive+ity] normalized tokens and raw types across registers

The overall productivity of -ity is not as prone to register variation as that of -ness. Words in -ity are consistently the most numerous in academic prose, in both token and type counts. Fiction has notably fewer types and, generally, there is less of a quantitative gap between the registers, with lexemes in root+ity being the most evenly spread (as opposed to -ic+ity by token measures).⁴² Let us now examine the ability of the suffix to coin novel lexemes. Table 42 gives the numbers of new -ity lexemes found in each register.

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        5    11     6    50        21   18     34
Table 42 Totals of new word types in -ity across registers
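As a minimal sketch (not part of the original analysis), the ordering of registers implied by the counts in Table 42 can be obtained by sorting them. The dictionary and function names are illustrative assumptions; only the counts come from the table.

```python
# New -ity word types per register, copied from Table 42 (table order).
new_ity_types = {
    "spoken language": 5,
    "fiction": 11,
    "newspapers": 6,
    "academic prose": 50,
    "non-academic prose": 21,
    "popular magazines": 18,
    "other": 34,
}


def rank_registers(counts):
    """Return register names ordered from most to fewest new types."""
    ordered = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ordered]


ranking = rank_registers(new_ity_types)
print(" > ".join(ranking))
```

The same function could be applied to the corresponding tables for the other suffixes, bearing in mind that tied counts keep their table order.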
Gauged by the number of new word types, the suffix -ity is at its most productive in academic prose and its decreasing potential to coin new words across the registers is represented by the sequence: academic prose > other > nonacademic prose > popular magazines > fiction > newspapers > spoken language. We now turn to examine each type of base form to establish their degree of contribution to the overall amount of -ity lexical innovation. Table 43 specifies the new word type counts across types of base form and across the registers.
⁴² Similarly, by the criterion of word types, root+ness derivatives are the most evenly spread across the registers.
            root  -able  -al  -ous  -ile  -ic  -ive  other
new types      7     65   18     4     0   12    13     10
Table 43 Totals of new word types in -ity across types of base form
As was the case with -ness, the productivity of -ity and its probability of coining new word types is subject to limitations of the type of base form involved. And so by far the most productive and most likely to give rise to new forms is the string -able+ity (e.g. rinsability), accounting for half the total of -ity innovations. In decreasing order of probability of emergence, -ity nominalization involves the following word-formational strings: -able+ity, -al+ity, -ive+ity, -ic+ity, root+ity, -ous+ity, -ile+ity (see Appendix for a complete list of derivatives). Below we look at how types of base form are distributed across the registers.

             Spoken  Fict  News  Acad  Non-Acad  Pop  Other
root+ity          0     3     1     3         0    0      0
-able+ity         2     1     5    24        12   13     19
-al+ity           1     2     0    10         1    0      4
-ous+ity          0     0     0     2         0    1      1
-ile+ity          0     0     0     0         0    0      0
-ic+ity           1     3     0     5         4    1      1
-ive+ity          0     0     0     5         2    1      6
other -ity        1     2     0     1         2    2      3
Table 44 New word types in -ity across types of base form and across registers
Academic texts are here clearly the most typical area for innovative -ity nominalization, with the exception of root+ity, where there are just as many new lexemes in academic prose as there are in fiction, and -ive+ity, represented by six items in miscellaneous texts (too heterogeneous to describe or classify as a unified genre).

The suffix -ion (token-type ratio: 4.77)

Figures 13 and 30 plot the token and type counts respectively. Again, as was the case with -ness and -ity, the type counts indicate a more balanced spread of relevant lexemes across the registers. Especially noteworthy are the increased values in spoken language and fiction (as was also the case with -ity type counts in Figure 29). In Tables 45 – 49, we break down -ion nominalizations into particular affix pairings to examine further their productivity and the complexities of their distribution. Tables 45 – 49 specify token and type counts across the registers.
Figure 30 Raw type count of -ion across registers (spoken 700, fict 868, news 752, acad 1,389, non-acad 1,244, pop 828)

Figure 13 Normalized token frequencies of -ion across registers (spoken 3,510, fict 2,845, news 6,468, acad 15,822, non-acad 12,032, pop 6,146)
          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens       743   763  1148  3251      2401  1473
types        137   172   154   214       210   162
Table 45 [unsuffixed root+ation] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens      1026   734  2055  4816      3783  1645
types        271   387   340   546       505   366
Table 46 [-ate+ion] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens        91    25     3   383       325    21
types         74    51    17   304       215    36
Table 47 [-ize+ation] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens        52    23    58   377       223    79
types         29    39    37    76        67    45
Table 48 [-ify+cation] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens      1269  1111  2886  5833      4751  3187
types        176   201   191   227       227   200
Table 49 [unsuffixed root+(it)ion] normalized tokens and raw types across registers
It is interesting to note that, as was also the tendency with -ness and -ity, word types in -ion are particularly evenly spread across the registers when the suffix is attached to simplex root base forms (Tables 45 and 49). Here again, academic texts are the prevailing domain of nominalization as both token and type counts are the highest in this genre. As the suffix is uncontroversially and consistently the most productive in academic prose, we now turn to consider the new types in -ion. Table 50 presents the numbers of new -ion lexemes found in each register.

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        5     2     7    78        34    5     23
Table 50 Totals of new word types in -ion across registers
The totals of new word types calculated per register show that the suffix -ion is at its most productive by far in academic prose with more than twice the number of new lexemes found there than in the runner-up register. The suffix’s evident decreasing potential to coin new words across the registers is represented by the sequence: academic prose > non-academic prose > newspapers > spoken language / popular magazines > fiction. The low score of fiction here is a noteworthy finding, although hardly unexpected, assuming the highly academic disposition of the suffix. We now turn to examine each type of base form to establish their degree of contribution to the overall number of innovations in -ion. Table 51 specifies the counts of new word types across types of base form.

            root+ation  -ate+ion  -ize+ation  -ify+cation  root+(it)ion
new types           10        40          83            9             4
Table 51 Totals of new word types in -ion across types of base form
Once more we note a substantial gradation in the productivity of individual pairings of word-formational elements. As was the case with -ness and -ity, word types of one particular kind are the most productive and the most likely to occur. And so, by far the most productive and most likely to give rise to new forms is the string -ize+ation, comprising almost 60 per cent of the total of new forms in -ion. In decreasing order of probability of emergence, the following word-formational strings may be expected to generate other -ion nominalizations: -ate+ion, root+ation, -ify+cation, root+(it)ion. Table 52 specifies in more detail the distribution of each type of base form in each register.

               Spoken  Fict  News  Acad  Non-Acad  Pop  Other
root+ation          1     0     1     4         1    0      3
-ate+ion            1     0     1    24        12    2      2
-ize+ation          3     1     3    44        19    3     14
-ify+cation         0     1     1     5         2    0      2
root+(it)ion        0     0     1     1         0    0      3
other -ion          0     0     0     0         0    0      0
Table 52 New word types in -ion across types of base form and across registers
The distribution of individual kinds of base form is not surprising, complying with our distribution findings in Table 50. Academic prose is the field typically prolific in novel nominalizations, and the two pairings -ize+ation and -ate+ion are especially productive.

The suffix -ment (token-type ratio: 10.73)

Figures 14 and 31 below plot the word token and word type counts obtained for nominalizations in -ment.

Figure 31 Raw type count of -ment across registers (spoken 140, fict 194, news 172, acad 231, non-acad 228, pop 161)

Figure 14 Normalized token frequencies of -ment across registers (spoken 1,664, fict 958, news 3,141, acad 4,843, non-acad 5,526, pop 2,104)
Compared to word tokens, the typical more balanced spread of different lexemes across the registers is by now just what we expect, as is the familiar increase of values in spoken language and fiction. Fiction texts in particular employ a relatively wide variety of distinct word types within a low number of word tokens (as was also the case with -ity and -ion). Relevant derivatives are divided below into particular strings of base form to examine further their productivity and distribution. Tables 53 and 54 give token and type counts across the registers.

          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens      1629   895  3060  4634      5376  2023
types        113   159   140   180       181   131
Table 53 [root+ment] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad   Pop
tokens        35    63    81   206       150    81
types         27    38    32    54        50    32
Table 54 [en-root+ment] normalized tokens and raw types across registers
Word types in -ment are particularly evenly spread across the registers, which is most likely attributable to the now familiar tendency for root-based derivatives to be evenly distributed. Non-academic texts go hand in hand with academic ones as far as the root+ment word type count is concerned, although the former have a higher token count. In contrast, academic texts lead in both token and type counts of en-root+ment nominalizations. Let us now turn to consider new types in -ment. Table 55 gives the totals of new -ment lexemes.

            root+ment  en-root+ment  other
new types           4             5      1
Table 55 Totals of new word types in -ment across types of base form
New lexemes in -ment too seem structurally balanced as both types of base form produce parallel counts of innovations. On the whole, new types in -ment are rare: 10 items in 100 million tokens of text. Considering the fact that word tokens in -ment have been found to be consistently the second most frequent across the registers (Figure 3), one must assume the suffix to be marginally productive and its widespread presence in word tokens must be attributed to (often lexicalized) high frequency items, such as government, management, development. This claim is also supported by the suffix’s low number of all word types relative to the total of word tokens (see Figure 32 below).

Figure 32 Raw types and normalized tokens in -ment
Still, acknowledging the low productivity of the suffix, the few new types observed are fairly evenly distributed, with at least one item out of the total ten in each register (see Table 56 below).⁴³

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        1     1     1     2         3    3      1
Table 56 Totals of new word types in -ment across registers

⁴³ Table 56 specifies the number of new types in each register. The numbers add up to twelve although the total of new types in the BNC is ten. This is because some of the innovations may appear more than once in the corpus, and hence in more than one register. The same applies to all new word type counts in this study.
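The discrepancy between the register counts and the corpus total noted for Table 56 can be illustrated with a small sketch. The labels w01 to w10 are hypothetical placeholders: the source reports only the counts per register, not which items recur, so the particular overlaps below are invented purely for illustration.

```python
# Per-register sets of new -ment types, sized to match Table 56.
# The item labels and the two overlaps (w05, w07) are assumptions.
new_ment_by_register = {
    "Spoken": {"w01"},
    "Fict": {"w02"},
    "News": {"w03"},
    "Acad": {"w04", "w05"},
    "Non-Acad": {"w05", "w06", "w07"},
    "Pop": {"w07", "w08", "w09"},
    "Other": {"w10"},
}

# Adding up the register counts counts a recurring item once per register.
register_sum = sum(len(items) for items in new_ment_by_register.values())

# The union counts each distinct new type once for the whole corpus.
distinct_types = set().union(*new_ment_by_register.values())

print(register_sum, len(distinct_types))
```

Any assignment in which two register slots are filled by items already counted elsewhere would give the same two totals.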
The suffix -(c)y (token-type ratio: 2.3)

The suffix -(c)y is the last which we consider in terms of particular base forms realized as affix pairings. Figures 25 and 33 below plot token and type counts across the registers. Again we note the relatively balanced distribution of word types.

Figure 33 Raw type count of -(c)y across registers (spoken 72, fict 104, news 96, acad 142, non-acad 135, pop 102)

Figure 25 Normalized token frequencies of -(c)y across registers (spoken 148, fict 159, news 393, acad 748, non-acad 681, pop 313)
Our results of type counts coincide with those of token frequencies in that the register of newspapers exhibits a preference for the template [noun+(c)y], which is more frequent in newspapers than in academic texts (Table 59). As indicated earlier, this may be due to the fact that most lexemes of this type are associated with issues that are customarily the subject of interest of journalism (e.g. presidency, candidacy, etc.). They are, however, even more common, in both token and type counts, in non-academic texts. The other two affix pairings prevail in academic writing (Tables 57 and 58).
          Spoken  Fict  News  Acad  Non-Acad  Pop
tokens        70    70   152   381       324  153
types         37    57    46    73        67   52
Table 57 [-ant+(c)y] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad  Pop
tokens        21    49    43   151        94   54
types         14    17    16    30        26   19
Table 58 [-ate+(c)y] normalized tokens and raw types across registers

          Spoken  Fict  News  Acad  Non-Acad  Pop
tokens        55    33   177   152       233   83
types         16    22    26    26        31   23
Table 59 [noun+(c)y] normalized tokens and raw types across registers
Word types in -(c)y are relatively few in number and so are the suffix’s innovations. Altogether, eight new types have been isolated in the entire BNC. Tables 60 and 61 specify the counts of new word types across types of base form and across the registers.

            -ant+(c)y  -ate+(c)y  noun+(c)y  other
new types           3          3          0      2
Table 60 Totals of new word types in -(c)y across types of base form

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        0     0     0     5         1    1      3
Table 61 Totals of new word types in -(c)y across registers
With respect to lexical innovations in -(c)y across the registers, new word types incline towards academic prose. Structurally, the two affix pairings -ant+(c)y and -ate+(c)y, which we have stated to be characteristic of academic prose in terms of token and type counts, give rise to six out of the eight innovative derivatives. We also note that the [noun+(c)y] pairing is completely unproductive and that the remaining two items (generacy, ethicacy) are ‘odd ones out’ that fit none of the three structural templates.

The remaining suffixes (-age, -al, -ance/-ence, -dom, -ship, -ery, -hood) are discussed below only with reference to the number and distribution of their types and new word types. Typological considerations of base forms are ignored as there are no clear structural patterns suggesting any classification or division beyond that of simplex base forms (e.g. shrinkage, fandom).

The suffix -ance/-ence (token-type ratio: 5.2 de-adjectival; 7.3 deverbal)

Similarly to the suffix -ment, nominalizations in -ance/-ence are frequent in token counts but low in word types (see token-type ratios for both variants of the suffix). We noted in 4.2 that de-adjectival -ance/-ence derivatives were more frequent than deverbal ones only in fiction. This preference of fiction for de-adjectival items is also confirmed in our word type counts (see Figures 34 – 37).

Figure 34 Raw type count of de-adjectival -ance/-ence across registers (spoken 72, fict 122, news 99, acad 118, non-acad 124, pop 101)

Figure 35 Normalized token frequencies of de-adjectival -ance/-ence across registers (spoken 327, fict 549, news 620, acad 1,696, non-acad 1,057, pop 584)
Figure 36 Raw type count of deverbal -ance/-ence across registers (spoken 91, fict 100, news 89, acad 121, non-acad 115, pop 96)

Figure 37 Normalized token frequencies of deverbal -ance/-ence across registers (spoken 469, fict 442, news 934, acad 1,863, non-acad 1,458, pop 767)
The suffix is only moderately productive in its new word type counts: in the entire BNC corpus, five de-adjectival and nine deverbal nominalizations have been found. Considered with respect to register distribution, most of these fourteen innovations are found in academic texts and, surprisingly, in the spoken variety (five in each).

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        1     0     1     2         1    0      0
Table 62 Totals of new word types in de-adjectival -ance/-ence across registers

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        4     1     0     3         1    2      0
Table 63 Totals of new word types in deverbal -ance/-ence across registers
The suffix -dom (token-type ratio: 2.4)

The distribution of the suffix -dom is interesting for several reasons. Firstly, it is popular magazines that boast the largest number of its lexemes although the token counts are the highest in academic prose (Figures 38 and 39). Secondly, its new types are also most likely to appear in popular magazines (10 out of the total 21, see Table 64). Thirdly, a third of the total number of all word types in -dom are innovative items (21 out of 62). Bearing in mind the suffix’s low token and type counts, this exceptionally high relative percentage of new types is well worth our notice.

Figure 38 Raw type count of -dom across registers (spoken 8, fict 17, news 18, acad 21, non-acad 26, pop 28)

Figure 39 Normalized token frequencies of -dom across registers (spoken 47, fict 79, news 107, acad 266, non-acad 216, pop 63)
            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        1     0     2     1         3   10      4
Table 64 Totals of new word types in -dom across registers
The suffix -ery (token-type ratio: 5.0 deverbal; 0.2 denominal/de-adjectival)

We have noted before the significant tendency of this suffix (both variants, see Figures 40, 41, 22 and 23 below) to gravitate towards the registers of fiction and, especially, newspapers. These findings are confirmed by our word type counts, where the two registers score respectively 34 and 37 lexemes. Equally noteworthy are the unusually low type and token counts for denominal and de-adjectival derivatives in academic texts (Figures 40 and 22). Although the two variants, denominal/de-adjectival and deverbal, are represented by 29 word types each, the deverbal items are far more frequent in token counts (7 and 147 tokens per million tokens respectively; see the two token-type ratios above). The denominal/de-adjectival derivatives are thus less common (perhaps unique) also in functional terms: the rareness of the formal template noun/adjective+ery finds its reflection in the special pragmatic effect of these words, one of word play and jocular creativity.

Figure 40 Raw type count of denominal/de-adjectival -ery across registers (spoken 2, fict 15, news 15, acad 6, non-acad 13, pop 10)

Figure 22 Normalized token frequencies of denominal/de-adjectival -ery across registers (spoken 0.6, fict 9, news 15, acad 3, non-acad 7, pop 7)
Figure 41 Raw type count of deverbal -ery across registers (spoken 17, fict 22, news 19, acad 17, non-acad 18, pop 18)

Figure 23 Normalized token frequencies of deverbal -ery across registers (spoken 70, fict 84, news 251, acad 170, non-acad 176, pop 132)
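The two token-type ratios quoted for -ery can be approximated from the figures given in the text (29 word types for each variant; 147 and 7 tokens per million). The sketch below assumes that the ratio is the normalized token frequency divided by the number of word types; the text does not state the formula explicitly, and the function name is illustrative.

```python
def token_type_ratio(tokens_per_million, type_count):
    """Normalized token frequency divided by the raw number of word types."""
    return tokens_per_million / type_count


# Figures quoted in the text for the two variants of -ery.
deverbal_ratio = token_type_ratio(147, 29)
denominal_deadjectival_ratio = token_type_ratio(7, 29)

print(f"deverbal: {deverbal_ratio:.2f}")
print(f"denominal/de-adjectival: {denominal_deadjectival_ratio:.2f}")
```

Under this assumption the results fall close to the quoted 5.0 and 0.2; small differences are to be expected if the published ratios were computed from unrounded token frequencies.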
New word types in -ery are few in number (5 denominal/de-adjectival and 6 deverbal items) but still gravitate towards journalistic language (newspapers and popular magazines) and fiction (Tables 65 and 66). The two variants of the suffix are noted here again to have produced parallel numbers of innovations, as was the case with the total of all word types. Innovative nominalizations such as skin-flintery and weirdery are typical representatives of the eye-catching expressivity conveyed in many denominal and de-adjectival -ery derivatives. It is perhaps for this reason that the suffix is most typical of the press.

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        0     1     2     0         1    1      0
Table 65 Totals of new word types in denominal/de-adjectival -ery across registers

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        0     1     1     0         0    2      2
Table 66 Totals of new word types in deverbal -ery across registers
The suffix -hood (token-type ratio: 1.1)

Academic prose has the highest counts of types, tokens and new types of -hood nominalizations (see the charts below). Of the 73 types identified in the BNC, 21 are innovative. This high proportion of new types is parallel to that of the suffix -dom (see above). The two suffixes, which are semantically similar, thus also fall within the same range of word type counts and are comparably productive.

Figure 42 Raw type count of -hood across registers (spoken 16, fict 31, news 26, acad 38, non-acad 35, pop 32)

Figure 43 Normalized token frequencies of -hood across registers (spoken 30, fict 73, news 61, acad 134, non-acad 98, pop 63)
            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        1     2     1     7         4    3      7
Table 67 Totals of new word types in -hood across registers
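The proportions of innovative items reported for -dom (21 out of 62) and -hood (21 out of 73) can be compared directly. The following is a minimal sketch using only the counts given in the text; the variable names are illustrative.

```python
from fractions import Fraction

# (new types, all types) as reported in the text for each suffix.
type_counts = {"-dom": (21, 62), "-hood": (21, 73)}

# Share of each suffix's word types that are new formations.
new_type_share = {
    suffix: Fraction(new, total) for suffix, (new, total) in type_counts.items()
}

for suffix, share in new_type_share.items():
    print(f"{suffix}: {float(share):.1%} of word types are new")
```

On these counts the share is about a third for -dom and somewhat under a third for -hood.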
The suffix -ship (token-type ratio: 1.6)

We have observed in 4.2 that tokens in -ship are most numerous in newspapers. Nevertheless our type counts indicate yet another pattern of distribution, typical of joint frequencies of nominalizations overall, in which the academic and non-academic texts are customarily the most prominent. Here, the non-academic texts score the highest total (Figures 44 and 24).

Figure 44 Raw type count of -ship across registers (spoken 49, fict 67, news 73, acad 96, non-acad 109, pop 69)

Figure 24 Normalized token frequencies of -ship across registers (spoken 134, fict 89, news 495, acad 376, non-acad 479, pop 306)
Total word types in -ship and the suffix’s innovations are consistently most numerous in non-academic texts (6 out of the total 18; see Table 68). It is interesting to note that 9 out of the 18 nominalizations are derived from complex nouns ending in -man (blinkmanship, teamsmanship), which may suggest an area of vocabulary that is particularly likely to undergo -ship nominalization.

            Spoken  Fict  News  Acad  Non-Acad  Pop  Other
new types        1     2     3     2         6    3      1
Table 68 Totals of new word types in -ship across registers
The suffixes -age (token-type ratio: 4.4) and -al (token-type ratio: 8.2)

Nominalizations in -age and -al are merely mentioned here by virtue of their complementing the broad class of Nomina Actionis. They are however not further discussed for two reasons. Firstly, their type and token distributions are predictable from the general pattern noted for all the nominalizing suffixes considered jointly. Secondly, neither seems productive at all as no innovative derivatives have been found in the BNC.

In summary, in this section we have seen that morphological productivity too is subject to register variation. In short, innovative formations are highly differentiated according to the variety of language in which they are more likely to appear. Furthermore, certain affix pairings are far more likely to lend themselves to word coinage than others. Specifically, with particular regard to new word types, we have noted the following points:

• The decreasing numbers of new formations across the registers are represented in the sequence: academic prose (181 innovations) > non-academic prose (103 innovations) > popular magazines (83 innovations) > fiction (59 innovations) > newspapers (39 innovations) > spoken language (28 innovations).
• The productivity of -ness decreases along the scale: -ed+ness (42 new types) > -y+ness (39) > root+ness (21) > -less+ness (14) > -ish+ness (10) > -ive+ness (6) > -ing+ness (5) > -ful+ness (4) > -ous+ness (2) (168 new word types altogether; including other disregarded patterns and items found outside the six registers studied here).
• The decreasing potential of -ness to coin new words across the registers is represented by the sequence: fiction (41 new types) > popular magazines (35) > academic prose (31) > non-academic prose (28) > newspapers (14) > spoken language (9).
• The productivity of -ity decreases along the scale: -able+ity (65) > -al+ity (18) > -ive+ity (13) > -ic+ity (12) > root+ity (7) > -ous+ity (4) > -ile+ity (0) (129 new word types altogether).
• The decreasing potential of -ity to coin new words across the registers is represented by the sequence: academic prose (50) > non-academic prose (21) > popular magazines (18) > fiction (11) > newspapers (6) > spoken language (5).
• The productivity of -ion decreases along the scale: -ize+ation (83) > -ate+ion (40) > root+ation (10) > -ify+cation (9) > root+(it)ion (4) (146 new word types altogether).
• The decreasing potential of -ion to coin new words across the registers is represented by the sequence: academic prose (78) > non-academic prose (34) > newspapers (7) > spoken language / popular magazines (5) > fiction (2).

Compared to the three suffixes above, the remaining ones are marginally productive. With respect to each individual formative we have noted the following:

• The 10 new types in -ment we have identified do not seem to point to any clear structural or register-related preferences of the suffix.
• The suffix -(c)y is virtually unproductive with nominal base stems, and the 8 de-adjectival innovations that we have identified clearly lean towards academic prose.
• The suffix -ance/-ence yields 14 new types (notably in the spoken and academic varieties).
• A third of the total number of word types in -dom and -hood are innovative items (21 out of 62 and 21 out of 73 respectively). Innovations in -dom lean strongly towards popular magazines (10 out of 21) whereas those in -hood are most numerous in academic prose (7 out of 21).
• The suffix -ery yields 11 new types. Both new and established types exhibit a strong tendency for newspapers, popular magazines and fiction. They are uncommon in academic prose; many are stylistically expressive.
• The suffix -ship yields 18 new types, 9 of which are derived from complex nouns ending in -man. New types in -ship as well as those established are most common in non-academic prose.
• The suffixes -age and -al yield no new types.

As the contents of this chapter are by far the most crucial to this dissertation, the fundamental conclusions drawn here are recapitulated below in Conclusions.
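The shares attributed to the leading pairings in this section (-able+ity as about half of the -ity innovations, -ize+ation as almost 60 per cent of those in -ion) can be rechecked from the counts listed in the summary. This is a minimal sketch, not part of the original analysis; the function and variable names are illustrative assumptions.

```python
# New word types per base-form pairing, as listed in the summary.
ity_new_types = {"-able+ity": 65, "-al+ity": 18, "-ive+ity": 13, "-ic+ity": 12,
                 "root+ity": 7, "-ous+ity": 4, "-ile+ity": 0}
ion_new_types = {"-ize+ation": 83, "-ate+ion": 40, "root+ation": 10,
                 "-ify+cation": 9, "root+(it)ion": 4}


def leading_share(counts, total):
    """Return the most productive pairing and its share of `total`."""
    pairing = max(counts, key=counts.get)
    return pairing, counts[pairing] / total


# Totals as reported in the text; the 129 for -ity includes 10 'other' items
# that belong to none of the listed pairings.
ity_top, ity_share = leading_share(ity_new_types, 129)
ion_top, ion_share = leading_share(ion_new_types, 146)

print(f"{ity_top}: {ity_share:.1%} of new -ity types")
print(f"{ion_top}: {ion_share:.1%} of new -ion types")
```

The totals are passed in explicitly rather than summed from the dictionaries because, for -ity, the reported total also covers items outside the listed pairings.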
Conclusions

While it is evident that productive word formation rules explain the majority of novel complex words at the level of competence, other explanations must be sought at the level of performance [...] [D]ifferent types of lexical creations emerge in different genres, determined by context and motivated by a variety of phonological, morphological, semantic, pragmatic and stylistic factors. (Munat 2007: 163, 181)
Following the assumption expressed in the above quotation by Munat (2007), in this study we have looked at English nominalizations in order to investigate their structure and productivity viewed from the perspective of their distribution across genres. As predicted by Munat, we have observed that purely structural factors interact with considerations of context (in this case register) in differentiating English nominalizations across language varieties. Turning back to the research questions posed in the Introduction and, in more detail, in section 4.1, we note the following findings. The BNC data have produced results very similar to those reported by other authors when it comes to estimated total frequencies of all nominalizations per register, and the increasing number of nominalizations is observable in the sequence: fiction < spoken < pop < news < non-academic < academic (see 4.1 and 4.2). An important exception to this generalization is that, as we noted in Figure 4.27, in terms of word type count, fictional texts score much higher in the scale (in third place after academic and non-academic texts), thus indicating lexical richness in this genre. Beyond collective accounts of nominalizations, individual suffixes have received limited treatment in the study of register variation, and none whatsoever as far as base-internal complexity is concerned. To rectify this situation, we have offered a more in-depth analysis of each suffix and each register. More importantly, we have then adopted a more detailed qualitative approach by considering the types of affix combinations that may have a bearing on the distributional preferences of the rightmost suffix. The morphological structure of base forms has indeed proved a significant distributional factor (see 4.3.2 for illustration). We have concluded that claims about an affix’s distribution must necessarily be
revised to accommodate finer distinctions concerning the combinations of that affix with distinct types of base forms.

With respect to productivity and innovative formations we have also noted patterns and tendencies on the part of the suffixes that are traceable to register variation. In short, innovative formations are highly differentiated according to the variety of language in which they are more likely to appear (e.g. innovations in -ness are most common in fiction, but those in -ity are most frequent in academic texts). This in turn is further conditioned by the type of suffix combination (or simplex base form) involved. We have noted the following points (reproduced from section 4.4):

• The decreasing numbers of new formations across the registers are represented in the sequence: academic prose (181 innovations) > non-academic prose (103 innovations) > popular magazines (83 innovations) > fiction (59 innovations) > newspapers (39 innovations) > spoken language (28 innovations).
• The productivity of -ness decreases along the scale: -ed+ness (42 new types) > -y+ness (39) > root+ness (21) > -less+ness (14) > -ish+ness (10) > -ive+ness (6) > -ing+ness (5) > -ful+ness (4) > -ous+ness (2) (168 new word types altogether; including other disregarded patterns and items found outside the six registers studied here).
• The decreasing potential of -ness to coin new words across the registers is represented by the sequence: fiction (41 new types) > popular magazines (35) > academic prose (31) > non-academic prose (28) > newspapers (14) > spoken language (9).
• The productivity of -ity decreases along the scale: -able+ity (65) > -al+ity (18) > -ive+ity (13) > -ic+ity (12) > root+ity (7) > -ous+ity (4) > -ile+ity (0) (129 new word types altogether).
• The decreasing potential of -ity to coin new words across the registers is represented by the sequence: academic prose (50) > non-academic prose (21) > popular magazines (18) > fiction (11) > newspapers (6) > spoken language (5).
• The productivity of -ion decreases along the scale: -ize+ation (83) > -ate+ion (40) > root+ation (10) > -ify+cation (9) > root+(it)ion (4) (146 new word types altogether).
• The decreasing potential of -ion to coin new words across the registers is represented by the sequence: academic prose (78) > non-academic prose (34) > newspapers (7) > spoken language / popular magazines (5) > fiction (2).

Compared to the three suffixes above, the remaining ones are marginally productive. With respect to each individual formative we have noted the following:
• The 10 new types in -ment we have identified do not seem to point to any clear structural or register-related preferences of the suffix.
• The suffix -(c)y is virtually unproductive with nominal base stems, and the 8 de-adjectival innovations that we have identified clearly lean towards academic prose.
• The suffix -ance/-ence yields 14 new types (notably in the spoken and academic varieties).
• A third of the total number of word types in -dom and -hood are innovative items (21 out of 62 and 21 out of 73 respectively). Innovations in -dom lean strongly towards popular magazines (10 out of 21) whereas those in -hood are most numerous in academic prose (7 out of 21).
• The suffix -ery yields 11 new types. Both new and established types exhibit a strong tendency for newspapers, popular magazines and fiction. They are uncommon in academic prose; many are stylistically expressive.
• The suffix -ship yields 18 new types, 9 of which are derived from complex nouns ending in -man. New types in -ship as well as those established are most common in non-academic prose.
• The suffixes -age and -al yield no new types.
In all, three different measures have been used in this work in order to discuss word frequency distributions: token, type, and new type counts (see Chapter 4). The first was used to calculate frequency of occurrence, which is commonly considered in studies of register variation; the second was used to estimate the vocabulary size generated by a given affix; and the third was employed with the specific aim of establishing morphological productivity in its most basic sense, i.e. the extent to which a suffix is used in the coinage of new words. At all three levels of analysis, most of the nominalizing suffixes studied in this thesis exhibited patterns of variation attributable to register-related and structural preferences.
Some of the patterns were strong and evident enough to persist at all these levels;¹ some were level-specific² and thus especially worthy of attention in individual areas of study (register variation, language change, neology, morphological productivity, affix ordering, etc.). Interestingly, with reference to structural preferences, we have identified patterns of apparently unexpected distribution at one of the levels³ and still others that persisted across the three levels.⁴
Although nominalizations in general have been studied before with reference to their frequency of occurrence across language varieties, little has been said, in that context, about individual noun-forming affixes and virtually nothing about affix pairings. Indeed, the effect of any kind of base-internal morphological structure on register variation has been overlooked. Hopefully, the present dissertation fills this gap with an initial step into the subject area. We have concluded that any claims about an affix’s frequency, distribution and degree of productivity need to accommodate finer distinctions concerning combinations of that affix with distinct types of base forms. Such investigations are likely to disclose further facts about morphological variation across registers.

¹ For example, the general preference of the suffix -ion is to appear in academic prose, in token, type and new type counts. In the same vein, the suffix -ness leans towards fiction. The suffix -ery, on the other hand, is typical of the press (especially de-adjectival) and uncommon in academic texts.
² For example, the suffix -ment is the second most frequent in token counts across the registers, but has a relatively low type count.
³ For example, -able+ity nominalizations in speech are almost twice as frequent as they are in fiction although fiction scores more -ity tokens overall. Secondly, although newspapers have more tokens in -ion than popular magazines, this is only thanks to -ate+ion derivatives. All other suffix combinations with -ion have higher frequencies in popular magazines.
⁴ For example, the string -ive+ness is most typical of academic texts, in token, type and new type counts although the suffix -ness is most frequent, again on all three counts, in fiction.
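The three measures referred to in these conclusions (token, type and new type counts) can be illustrated with a minimal Python sketch. It is not the procedure followed in the study: the function name `suffix_counts`, the toy word list and the small set standing in for dictionary-attested words are all assumptions made for illustration, and matching on word endings is a deliberate simplification (it would, for instance, also pick up unrelated words such as "witness").

```python
from collections import Counter


def suffix_counts(tokens, suffix, established):
    """Return (token count, type count, new-type count) for a suffix.

    tokens      -- iterable of word forms from a (hypothetical) corpus sample
    suffix      -- orthographic ending to look for, e.g. "ness"
    established -- set of words treated as already attested (a stand-in
                   for a dictionary check; an assumption of this sketch)
    """
    # Token count: every occurrence of a word with the suffix.
    hits = [t.lower() for t in tokens if t.lower().endswith(suffix)]
    freq = Counter(hits)
    # Type count: distinct word forms with the suffix.
    types = set(freq)
    # New-type count: types absent from the reference word list.
    new_types = {w for w in types if w not in established}
    return len(hits), len(types), len(new_types)


# Toy data, invented for illustration only.
tokens = ["happiness", "Happiness", "zombieness", "darkness",
          "fakeness", "darkness", "table"]
established = {"happiness", "darkness"}

print(suffix_counts(tokens, "ness", established))  # (6, 4, 2)
```

Running the same function separately on the texts of each register would give per-register figures of the kind compared in the lists above.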
Appendix: Lexical innovations in the BNC
The following lists contain novel derivatives of the Nomina Actionis and Nomina Qualitatis types as found in the BNC and cross-referenced for their absence from the OED. The order in which the suffixes appear reflects the decreasing number of innovative formations produced. The sections for -ness, -ity, -ion, -ment and -cy are subdivided according to their respective base forms.
1. -ness derivatives (168 types)

simplex root+ness
gaganess, avidness, creoleness, routineness, zombieness, fakeness, bronzeness, eliteness, eerieness, hungness, snagginess, whackiness, feistiness, yumminess, campness, basqueness, dutchness, germanness, lairiness, bolshieness, crapness

simplex root+y+ness
creepiness, curviness, fizziness, goriness, battiness, pointiness, tartiness, snagginess, tickliness, trebliness, spindliness, plebbiness, scratchiness, pressiness, lardiness, hippiness, iffiness, chewiness, boxiness, ashiness, boominess, bosominess, blotchiness, bobbliness, looniness, wrinkliness, fudginess, fugginess, grapiness, fiestiness, crabbiness, creakiness, cleansiness, clubbiness, yumminess, zinginess, scariness, clumpiness, snappiness

-ful+ness
colourfulness, skillfulness, stressfulness, threatfulness

-ish+ness
prattishness, quirkishness, warderishness, ampishness, elvishness, groupishness, flemishness, clubbishness, cornishness, swedishness

-ed+ness
woolly-headedness, well-definedness, swollen-headedness, brainedness, constructedness, community-mindedness, double-voicedness, dumb-mindedness, dumbfoundedness, disembodiedness, datedness, activatedness, calculatedness, blue-bloodedness, fiery-headedness, expectedness, extrovertedness, emptyheadedness, enclosedness, mild-manneredness, loosemindedness, many-partedness, knittedness, knuckleheadedness, lightfootedness, slow-wittedness, thick-skulledness, flawedness, quick-mindedness, rewardedness, self-employedness, self-enclosedness, serious-mindedness, understatedness, sustainedness, twistedness, well-craftedness, quantitative-mindedness, grey-mindedness, group-mindedness, good-humouredness, quick-footedness

-ive+ness
elitiveness, connectiveness, qualitativeness, exploitativeness, declarativeness, pre-emptiveness

-less+ness
seamlessness, knoblessness, vigourlessness, womanlessness, talentlessness, rudderlessness, caringlessness, authorlessness, depthlessness, skinlessness, hall-lessness, egglessness, tiplessness, strokelessness

-ing+ness
wittingness, never-endingness, reassuringness, staggeringness, throbbingness

-ous+ness
curvaceousness, impecuniousness

-ness other
fourfoldness, offhandness, part-timeness, creepy-crawliness, soft-spokenness, well-spokenness, chineseness, kafkaesqueness, westwardness, down-to-earthness, over-the-topness, statesmanliness, policemanliness, grandmotherliness, incendiariness, mandatoriness, digitalness, mentalness, mexicanness, italianness, serbianness, caribbeanness, drivenness, writtenness, likableness
2. -ion derivatives (146 types)

root+ation
departmentation, preventation, occidentation, experientation, forestation, impactation, ingestation, whistlation, vegation, spoilation

-ate+ion
fundoplication, dessication, inducation, infarcation, carbocation, acidication, eratication, entrophication, gasication, quantication, encapsidation, glucuronidation, tracheation, orpoagation, tubeligation, relitigation, fifferentiation, sediation, protentiation, carbamylation, gemmulation, gammascintillation, vasuolation, ethylation, canullation, aracylation, arculation, phosgenation, boronation, amination, trypsination, sulfonation, silanation, electroporation, privatation, physophorylation, poladenylation, mannosylation, platination, scalation

-ize+ation
autonomization, factionalization, residualization, analogization, narratization, textualization, tertiarization, stylolitization, palatization, inferiorization, hysterization, judicialization, notorization, ontologization, microtization, metisization, politization, profitization, syncretization, colocalization, collateralization, automization, anteriorization, cartelization, capillarization, historicization, hydridization, gallicization, kigerianization, lebanization, informatization, pedagogization, victorianisation, vacualisation, vacularisation, eroticisation, encapsulisation, ednaisation, conglomeratisation, conisation, conscientization, rasterization, aestheticization, constitutionalization, editorialization, dialogization, grammatization, lumpenization, lupinization, vesicularization, vapourization, weticization, swahiliization, sociologization, costallisation, cretinisation, ironisation, koreanisation, informationisation, historisation, holbornisation, planetisation, contractorization, credentialization, sectoralisation, podsolisation, rectangularisation, activisation, los angelesisation, barnardisation, calcitisation, hostelisation, lebanonisation, columnarisation, dipthongisation, epithelialisation, cancerisation, archivisation, aridisation, accreditisation, productionisation, productisation, spectacularisation

-ify+cation
extensification, boogification, factification, sandification, chelseafication, chlorification, urification, eradification, hollywoodification

root-(it)ion
temption, accredition, spendition, distention
3. -ity derivatives (129 types)

root+ity
fraility, skimmity, ursinity, ovinity, simularity, periodity, dispersity

-able+ity
deniability, wickability, conditionability, configurability, eraseability, enjoyability, imageability, favourability, deliverability, countability, selectability, rinseability, severability, sailability, sellability, rinsability, licensability, maximisability, influenceability, indictability, interruptibility, cleanability, distinguishability, adjustability, generalisability, routability, dialability, caterbility, bangability, admissability, explainability, generatability, expendability, excludability, tellability, trappability, hummability, gropeability, germinability, coolability, permedability, normalisability, parkability, shaggability, shapability, slurpability, dubitability, drapeability, drillability, fudgeability, explicability, fanciability, feelability, fillability, allowability, appreciability, assertibility, avoidability, attributability, challengeability, buyability, callability, developability, crashability, compromisability

-al+ity
genitality, metricality, anality, figurality, cyclicality, contextuality, radiality, hierarchicality, annuality, versality, beneficiality, clonality, controversiality, coventionality, focality, objectuality, subliminality, consensuality

-ous+ity
fibrosity, coterminosity, polysemity, heaviosity

-ic+ity
lithogenicity, echogenicity, nomicity, thrombogenicity, technicity, poeticity, lipophilicity, homogenicity, crypticity, urbanicity, homotheticity, staticity

-ive+ity
recessivity, aggressivity, resultativity, tentativity, permissivity, exhaustivity, competitivity, transmittivity, prospectivity, generativity, adaptivity, intuitivity, intensivity

-ity other
automacity, exogeneity, diachroneity, instaneity, mediterraneity, eponymity, effemininity, specularity, subversity, crepuscularity
4. -hood derivatives (21 types)
martyrhood, grandparenthood, servanthood, riderhood, disneyhood, entity-hood, marinehood, hackerhood, rainhood, pet-hood, placehood, urchinhood, weatherhood, limbhood, siblinghood, dominion-hood, grand-parenthood, godparenthood, faggothood, chaphood, familihood

5. -dom derivatives (21 types)
adventuredom, indiedom, wrinklydom, baggydom, gothdom, jazzdom, labeldom, liberaldom, magyardom, hagdom, hackdom, frockdom, celebritydom, faggotdom, computerdom, superpowerdom, wimpdom, orcdom, slobdom, serbdom, scruffdom
6. -ship derivatives (18 types)
caretakership, nomineeship, highwaymanship, judgemanship, audienceship, backwoodsmanship, craftspersonship, conferencemanship, contractorship, dreamership, blinkmanship, yesmanship, slave-ship, statemanship, postmanship, settlorship, scoutmastership, teamsmanship

7. -ment derivatives (10 types)

root-ment
regardment, dispersement, configurement, gruntlement

en-root-ment
enrapturement, enserfment, ensheathment, emparkment, emboxment

-ment other
exsheathment
8. Deverbal -ence/-ance derivatives (9 types)
exceedence, exitance, emittance, recognisance, reverberance, improvance, awardance, attainance, actuance

9. -cy derivatives (8 types)

-ant+cy
reflectancy, surfactancy, confluency

-ate+cy
appropriacy, corporacy, deliberacy

-cy other
generacy, ethicacy

10. Deverbal -ery derivatives (6 types)
advisery, mobbery, debunkery, handcuffery, whammery, snoggery

11. Denominal/de-adjectival -ery derivatives (5 types)
macabrery, show-bizzery, skin-flintery, chestnutery, weirdery

12. De-adjectival -ence/-ance derivatives (5 types)
emollience, afference, amorance, ductilance, itinerance
References
Aarts, Jan (1991) “Intuition-based and observation-based grammars.” In: K. Aijmer and B. Altenberg (eds.), English Corpus Linguistics: Studies in Honour of Jan Svartvik. London: Longman, 44-62. Aarts, Bas (2000) “Corpus linguistics, Chomsky and fuzzy tree fragments.” In: C. Mair and M. Hundt (eds.), Corpus Linguistics and Linguistic Theory. Papers from the Twentieth International Conference on English Language Research on Computerized Corpora (ICAME 20), Freiburg im Breisgau 1999. Amsterdam: Rodopi, 5-13. Adams, Valerie (1973) An Introduction to Modern English Word-Formation. London: Longman. Adams, Valerie (2001) Complex Words in English. Pearson Education Limited. Aitchison, Jean (1994) Words in the Mind. (2nd edition) Oxford: Blackwell. Allen, Margaret (1978) Morphological Investigations. PhD thesis, University of Connecticut. Algeo, John (1980) “Where do all the new words come from?” American Speech 55, 264-277. Algeo, John (1993) “Desuetude among new English words.” International Journal of Lexicography 6:4, 281-293. Anderson, Mona (1979) Noun Phrase Structure. PhD dissertation. Storrs, CT: University of Connecticut. Anttila, Arto (2002) “Variation and phonological theory.” In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 206-243. Aronoff, Mark (1976) Word Formation in Generative Grammar. Cambridge, MA: MIT Press. Aston, Guy (1995) “Say ‘thank you’: some pragmatic constraints in conversational closings.” Applied Linguistics 16:1, 57-86. Aston, Guy and Lou Burnard (1998) The BNC Handbook: Exploring the British National Corpus with SARA. Edinburgh Textbooks in Empirical Linguistics. Edinburgh: Edinburgh University Press. Baayen, Harald (1992) “Quantitative aspects of morphological productivity.” In: G. Booij and J. van Marle (eds.), Yearbook of Morphology 1991. Dordrecht: Kluwer, 109-149. Baayen, Harald (2001) Word Frequency Distributions. Dordrecht: Kluwer Academic Publishers.
210
References
Baayen, Harald and Rochelle Lieber (1991) “Productivity and English derivation: a corpus-based study.” Linguistics 29:4, 801-844. Baayen, Harald and Antoinette Renouf (1996) “Chronicling The Times: productive lexical innovations in an English newspaper.” Language 72, 69-96. Bayley, Robert (2002) “The quantitative paradigm.” In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 117-141. Baker, Paul, Andrew Hardie and Anthony McEnery (2006) A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press. Baldi, Philip and Chantal Dawar (2000) “Creative processes.” In: G. E. Booij, C. Lehmann, and J. Mugdan (eds., in collaboration with W. Kesselheim and S. Skopeteas), 963-972. Bauer, Laurie (1983) English Word-Formation. Cambridge: Cambridge University Press. Bauer, Laurie (1988) Introducing Linguistic Morphology. Edinburgh: Edinburgh University Press. Bauer, Laurie (1994) Watching English Change. An Introduction to the Study of Linguistic Change in Standard Englishes in the Twentieth Century. London and New York: Longman. Bauer, Laurie (2000) “System vs. norm: coinage and institutionalization.” In: G. E. Booij, C. Lehmann and J. Mugdan (eds., in collaboration with W. Kesselheim and S. Skopeteas), 832-840. Bauer, Laurie (2001) Morphological Productivity. Cambridge: Cambridge University Press. Bauer, Laurie (2002) “Inferring variation and change from public corpora”. In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 97-114. Bauer, Laurie (2004) A Glossary of Morphology. Edinburgh: Edinburgh University Press. Bauer, Laurie (2005) “Productivity: theories.” In: P. Štekauer and R. Lieber (eds.), 315-334. Bauer, Laurie and Antoinette Renouf (2001) “A corpus-based study of compounding in English.” Journal of English Linguistics 29:2, 101-123. Biber, Douglas (1988) Variation across Speech and Writing. Cambridge: Cambridge University Press. 
Biber, Douglas (1990) “Methodological issues regarding corpus-based analyses of linguistic variation.” Literary and Linguistic Computing 5, 257-69. Biber, Douglas (1993) “Representativeness in corpus design.” Literary and Linguistic Computing 8, 241-57. Biber, Douglas (1994) “An analytical framework for register studies.” In: D. Biber and E. Finegan (eds.), 31-56. Biber, Douglas (1995) Dimensions of Register Variation: A Cross-Linguistic Comparison. Cambridge: Cambridge University Press.
References
211
Biber, Douglas (2004) “Conversation text types: a multi-dimensional analysis.” 7es Journées Internationales d'Analyse Statistique des Données Textuelles (JADT). Biber, Douglas and Edward Finegan (eds.) (1994) Sociolinguistic Perspectives on Register. New York and Oxford: Oxford University Press. Biber, Douglas and Edward Finegan (1994) “Situating register in sociolinguistics.” In: D. Biber and E. Finegan (eds.), 3-14. Biber, Douglas, Susan Conrad and Randi Reppen (1998) Corpus Linguistics: Investigating Language Structure and Use. Cambridge: Cambridge University Press. Biber Douglas, Stig Johansson, Geoffrey Leech, Susan Conrad, Edward Finegan (1999) Longman Grammar of Spoken and Written English. Pearson Education Limited. Biber, Douglas and Susan Conrad (2001) “Register variation: a corpus approach.” In: D. Schiffrin, D. Tannen and H. Hamilton (eds.), The Handbook of Discourse Analysis, Oxford: Blackwell Publishing, 175-196. Bongers, Herman (1947) The History and Principles of Vocabulary Control. Worden: Wocopi. Booij, Geert (2005) The Grammar of Words. Oxford: Oxford University Press. Booij, Geert E., Lehmann, C., and Mugdan, J. (eds., in collaboration with Kesselheim, W. and S. Skopeteas) (2000) Morphologie / Morphology. Ein internationales Handbuch zur Flexion und Wortbildung / An International Handbook on Inflection and Word Formation. 1. Halbband / Volume 1. Berlin/New York: Mouton de Gruyter. Brekle, Herbert E. (1978) “Reflections on the conditions for the coining, use and understanding of nominal compounds.” In: W. Dressler and W. Meid (eds.), Proceedings of the Twelfth International Congress of Linguists, Vienna, August 28 – September 2, 1977. Innsbruck, 68-77. Burnard, Lou (2000) Reference Guide for the British National Corpus (World Edition). Oxford Universtity Computing Services. Accessible from: http://www.natcorp.ox.ac.uk/docs/ userManual/urg.pdf Carstairs-McCarthy, Andrew (2002) An Introduction to English Morphology. 
Edinburgh: Edinburgh University Press. Cedergren, Henrieta and David Sankoff (1974) “Variable rules: performance as a statistical reflection of competence.” Language 50, 333-355. Chafe, Wallace L. and Jane Danielewicz (1987) “Properties of spoken and written language.” In: R. Horowitz and S. J. Samuels (eds.), Comprehending Oral and Written Language. New York: Academic Press. Chambers, John K. (1995) Sociolinguistic Theory. Oxford: Blackwell. Chambers, John K. (2002) “Studying language variation: an informal epistemology.” In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 3-15.
212
References
Chambers, John K., Peter Trudgill and Natalie Schilling-Estes (eds.) (2002) The Handbook of Language Variation and Change. Oxford: Blackwell Publishing. Cheshire, Jenny (2002) “Sex and gender in variationist research.” In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 423-443. Chomsky, Noam (1965) Aspects of the Theory of Syntax. Cambridge, MA: MIT Press. Chomsky, Noam (1970) “Remarks on Nominalizations.” In: R. Jacobs and P. Rosenbaum (eds.), Readings in English Transformational Grammar. Waltham, Mass.: Ginn and Company, 184-221. Chomsky, Noam (1984) Modular Approaches to the Study of the Mind. San Diego: San Diego University Press. Chomsky, Noam (1995) The Minimalist Program. Cambridge, MA: MIT Press. Chomsky, Noam and Morris Halle (1968) The Sound Pattern of English. New York: Harper and Row. Clark, Eve and Herbert Clark (1979) “When nouns surface as verbs.” Language 55, 767-811. Coseriu, Eugenio (1967) Teoría del lenguaje y lingüistica general: Cinco estudios, 2nd ed., Madrid: Gredos. Coseriu, Eugenio (1975) “System, Norm und Rede“. In Eugenio Coseriu, Sprachteorie und allgemine Sprachwissenschaft. Munchen: Wilhelm Fink, 11-101. Cowie, Claire (2000) “The discourse motivations for neologising: action nominalization in the history of English.” In: J. Coleman and C. Kay (eds.), Lexicology, Semantics and Lexicography: Selected Papers from the Fourth G. L. Brook Symposium. Amsterdam: John Benjamins, 179-208. Cowie, Claire (2006) “Economical with the truth: register categories and the function of -wise viewpoint adverbs in the British National Corpus.” ICAME Journal 30, 5-36. Crystal, David (1991) Dictionary of Linguistics and Phonetics. Oxford: Blackwell Publishing. Crystal, David (2000) “Investigating nonceness: lexical innovation and lexicographic coverage.” In: R. Boenig and K. Davis (eds.), Manuscript, Narrative and Lexicon: Essays on Literary and Cultural Transmission in Honor of Whitney F. Bolton. 
Lewisburg: Bucknell University Press; London: Associated University Presses, 218-231. Crystal, David (2006) Words Words Words. Oxford: Oxford University Press. Downing, Pamela (1977) “On the creation and use of English compound nouns.” Language 53, 810-842.
References
213
Dressler, Wolfgang and Lavinia Merlini Barbaresi (1994) Morphopragmatics: Diminutives and Intensifiers in Italian, German, and other Languages. Berlin: Mouton de Gruyter. Dressler, Wolfgang (1981) “General principles of poetic license in word formation.” In: H. Weydt (ed.), Logos Semantikos, vol. II. Berlin: De Gruyter, 423-431. Dura, Elżbieta (2006) “Extracting current language use from the web.” Poznań Studies in Contemporary Linguistics 41, 73-85. Fabb, Nigel (1988) “English suffixation is constrained only by selectional restrictions.” Natural Language and Linguistic Theory 6, 527-539. Fischer, John L. (1958) “Social influences on the choice of a linguistic variant.” Word 14, 47-56. Fischer, Roswitha (1998) Lexical Change in Present-Day English: A CorpusBased Study of Motivation, Institutionalization, and Productivity of Creative Neologisms. Tübingen: Gunter Narr. Francis, W. Nelson (1992) “Language corpora B.C.” In J. Svartvik (ed.), Directions in Corpus Linguistics. Berlin: Mouton de Gruyter, 17-32. Fries, Charles (1952) The structure of English: An Introduction to the Construction of English Sentences. New York: Harcourt, Brace and World. Fries, Charles and Aileen Traver (1940) English Word Lists: a Study of their Adaptability and Instruction. Washington, DC: American Council of Education. Giegerich, Heinz J. (1999) Lexical Strata in English: Morphological Causes, Phonological Effects. Cambridge: Cambridge University Press. Guy, Gregory R. (1991) “Explanation in variable phonology.” Language Variation and Change 3, 1-22. Guy, Gregory (2007) “Variation and phonological theory.” In: R. Bayley and C. Lucas (eds.), Sociolinguistic Variation. Theories, Methods and Application. Cambridge: Cambridge University Press, 5-23. Haegeman, Liliane (1987) “Register variation in English: some theoretical observations.” Journal of English Linguistics 20:2, 230-48. Harris, Randy A. (1993) The Linguistics Wars. Oxford: Oxford University Press. Hartmann, Reinhard R.K. 
and Gregory James (1998) Dictionary of Lexicography. London: Routledge. Hay, Jennifer (2002) “From speech perception to morphology: affix-ordering revisited.” Language 78:3, 527-555. Hay, Jennifer (2003) Causes and Consequences of Word Structure. New York and London: Routledge. Hay, Jennifer and Harald Baayen (2003) “Phonotactics, parsing and productivity.” Italian Journal of Linguistics 1, 99-130.
214
References
Hay, Jennifer and Ingo Plag (2004) “What constrains possible suffix combinations? On the interaction of grammatical and processing restrictions in derivational morphology.” Natural Language and Linguistic Theory 22, 565-596. Hazen, Kirk (2007) “The study of variation in historical perspective.” In: R. Bayley and C. Lucas (eds.), Sociolinguistic Variation: Theories, Methods and Application. Cambridge: Cambridge University Press, 70-89. Henry, Alison (2002) “Variation and syntactic theory”. In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 267-282. Hohenhaus, Peter (1996) Ad-hoc-Wortbildung: Terminologie, Typologie und Theorie kreativer Wortbildung im Englischen. Frankfurt/M.: Peter Lang. Hohenhaus, Peter (1998) “Non-lexicalizability as a characteristic feature of nonce word-formation in English and German.” Lexicology 4:2, 237-280. Hohenhaus, Peter (2005) “Lexicalisation and institutionalisation.” In: P. Štekauer and R. Lieber (eds.), 353-373. Hohenhaus, Peter (2007) “How to do (even more) things with nonce words (other than naming).” In: J. Munat (ed.), 15-38. Holmes, Janet (1992) An Introduction to Sociolinguistics. London and New York: Longman. Huddleston, Rodney and Geoffrey K. Pullum (2002) The Cambridge Grammar of the English Language. Cambridge: Cambridge University Press. Hudson, Richard A. (1996) Sociolinguistics (2nd edition). Cambridge: Cambridge University Press. Hunston, Susan (2002) Corpora in Applied Linguistics. Cambridge: Cambridge University Press. Isaacson, David (1997) “New word sources.” Reference Services Review 25:2, 53-64. Jespersen, Otto (1909-1949) A Modern English Grammar on Historical Principles. Copenhagen: Munksgaard. Johansson, Stig (2004) “Corpus linguistics – past, present, future”. In: J. Nakamura, N. Inoue, and T. Tabata (eds.), English Corpora Under Japanese Eyes. Amsterdam & New York: Rodopi, 3-24. 
Kastovsky, Dieter (1978) “Zum gegenwartigen Stand der Wortbildungslehre des Englischen.” Linguistik und Didaktik 36, 351-366. Kastovsky, Dieter (1986) “The problem of productivity in word formation.” Linguistics 24, 585-600. Käding, Friedrich W. (1879) Häufigkeitswörterbuch der deutschen Sprache. Steglitz: privately published. Kennedy, Graeme D. (1998) An Introduction to Corpus Linguistics. London: Longman.
References
215
Kiparsky, Paul (1982) “Lexical Morphology and Phonology.” In: The Linguistic Society of Korea (ed.), Linguistics in the Morning Calm. Seoul: Hanshin Publishing Co., 1-91. Labov, William (1963) “The social motivation of a sound change.” Word 19, 273-309. Labov, William (1966) The Social Stratification of English in New York City. Washington, DC: Center for Applied Linguistics. Labov, William (1969) “Contraction, deletion, and inherent variability of the English copula.” Language 45, 715-762. Labov, William (1997) “Resyllabification.” In: F. Hinskens, R. Van Hout and L. Wetzels (eds.), Variation, Change and Phonological Theory, Amsterdam, Philadelphia: John Benjamins Publishing Company, 145-179. Lee, David Y.W. (2001) “Genres, registers, text types, domains and styles: clarifying the concepts and navigating a path through the BNC jungle.” Language Learning & Technology 5:3, 37-72. Leech, Geoffrey (1991) “The state of the art in corpus linguistics”. In K. Aijmer and B. Altenberg (eds.), English Corpus Linguistics: Linguistic Studies in Honour of Jan Svartvik, London: Longman, 8-29. Lees, Robert (1960) The Grammar of English Nominalizations. The Hague: Mouton de Gruyter. Lehrer, Adrienne (1996a) “Identifying and interpreting blends: an experimental approach.” Cognitive Linguistics 7:4, 359-390. Lehrer, Adrienne (1996b) “Why neologisms are important to study.” Lexicology vol. 2/1, 63-73. Lehrer, Adrienne (1998) “Scapes, holics, and thons: the semantics of English combining forms.” American Speech 73:1, 3-28. Lehrer, Adrienne (2003) “Understanding trendy neologisms.” Italian Journal of Linguistics/Rivista di Linguistica 15:2, 284-300. Lehrer, Adrienne (2007) “Blendalicious.” In: J. Munat (ed.), 115-136. Lieber, Rochelle (1981) On the Organization of the Lexicon. Outstanding Dissertations in Linguistics, Garland Publishing, Inc. Lieber, Rochelle (1992) Deconstructing Morphology. Chicago and London: University Of Chicago Press. 
Lieber, Rochelle (2005) “English word-formation processes.” In: P. Štekauer and R. Lieber (eds.), 429-448. Lipka, Leonhard (2002) English Lexicology: Lexical Structure, Word Semantics and Word-Formation. Tübingen: Gunter Narr. Lipka, Leonhard, Susanne Handl and Wolfgang Falkner (2004) “Lexicalization & institutionalization. The state of the art in 2004.” In: SKASE Journal of Theoretical Linguistics 1:2004, 2-19.
216
References
Mahlberg, Michaela (2005) English General Nouns: a Corpus Theoretical Approach. Studies in Corpus Linguistics 20. Amsterdam: John Benjamins. Malicka-Kleparska, Anna (1988) Rules and Lexicalisations: Selected English Nominals. Lublin: Redakcja Wydawnictw KUL. Marchand, Hans (1969) The Categories and Types of Present-Day English Word-Formation. 2nd revised edition. München: C. H. Beck. Marle, Jaap van (1985) On the Paradigmatic Dimension of Morphological Creativity. Dordrech: Foris Publications. Marle, Jaap van (1990) “Rule-creating creativity: analogy as a synchronic morphological process.” In: Contemporary Morphology, W. Dressler, H. C. Luschutzky, E. Pfeiffer and J.R. Rennison (eds.), Berlin: Mouton de Gruyter, 267-282. Marle, Jaap van (1992) “The relationship between morphological productivity and frequency: a comment on Baayen’s performance oriented conception of morphological productivity.” In G. Booij and J. Van Marle (eds.), Yearbook of Morphology 1991. Dordrecht: Kluwer, 151-163. McArthur, Tom (1992) The Oxford Companion to the English Language. Oxford: Oxford University Press. McDavid, Raven I., Jr. (1948) “Postvocalic /r/ in South Carolina: a social analysis.” American Speech 23, 194-203. McEnery, Tony & Andrew Wilson (2001) Corpus Linguistics (2nd edition). Edinburgh Textbooks in Empirical Linguistics. Edinburgh: Edinburgh University Press. Meyer, Charles F. (2004) English Corpus Linguistics: An Introduction. Cambridge: Cambridge University Press. Milroy, Lesley and Matthew Gordon (2003) Sociolinguistics: Method and Interpretation. Oxford: Blackwell Publishing. Montgomery, Michael (2007) “Variation and historical linguistics.” In: R. Bayley and C. Lucas (eds.), Sociolinguistic Variation: Theories, Methods and Application. Cambridge: Cambridge University Press, 110-132. Munat, Judith (2007) “Lexical creativity as a marker of style in science fiction and children’s literature.” In: J. Munat (ed.). 163-185. Munat, Judith (ed.) 
(2007) Lexical Creativity, Texts and Contexts. Studies in Functional and Structural Linguistics 58. Amsterdam / Philadelphia: John Benjamins Publishing Company. Ooi, Vincent (1998) Computer Corpus Lexicography. Edinburgh: Edinburgh University Press. Palmer, Harold (1933) Second Interim Report on English Collocations. Tokyo: Institute for Research in English Teaching. Plag, Ingo (1999) Morphological Productivity: Structural Constraints in English Derivation. Berlin and New York: Mouton de Gruyter.
References
217
Plag, Ingo (2003) Word-Formation in English. Cambridge: Cambridge University Press. Plag, Ingo (2005) “Productivity” In: K. Brown (ed.), Encyclopedia of Language and Linguistics, 2nd Edition, Vol. 10. Oxford: Elsevier, 121-128. Plag, Ingo, Christiane Dalton-Puffer, C. and Harald Baayen (1999) “Morphological productivity across speech and writing.” English Language and Linguistics 3:2, 209-228. Plag, Ingo, and Harald Baayen (2008) “Parsing is not weaknessless: suffix ordering revisted.” Submitted for publication. Pyles, Thomas and John Algeo (1993) The Origins and Development of the English Language (4th ed.) Fort Worth: Harcourt Brace Jovanovich. Quirk, Randolph (1974) The Linguist and the English Language. London: Edward Arnold. Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik (1972) A Grammar of Contemporary English. London: Longman. Quirk, Randolph, Sidney Greenbaum, Geoffrey Leech and Jan Svartvik (1985) A Comprehensive Grammar of the English Language. London: Longman. Renouf, Antoinette (2007) “Tracing lexical productivity and creativity in the British Media: ‘The Chavs and the Chav-Nots’.” In: J. Munat (ed.), 61-92. Renouf, Antoinette and Harald Baayen (1998) “Aviating among the hapax legomena: morphological grammaticalisation in current British newspaper English.” In: A. Renouf (ed.), Explorations in Corpus Linguistics. Amsterdam: Rodopi, 181-189. Roberts, Ian and Anna Rousseau (2003) Syntactic Change: A Minimalist Approach to Grammaticalization. Cambridge: Cambridge University Press. Rosch, Eleanor (1978) “Principles of categorization.” In: E. Rosch, and B. Lloyd (eds.), Cognition and Categorization. Hillsdale, NJ: Lawrence Erlbaum, 27-48. Römer, Ute (2005) Progressives, Patterns, Pedagogy: A Corpus-driven Approach to English Progressive Forms, Functions, Contexts and Didactics. Amsterdam: John Benjamins. Rúa, Paula López (2007) “Keeping up with the times: Lexical creativity in electronic communication.” In: J. Munat (ed.), 137-162. 
Rubach, Jerzy (1984) “Segmental phonology of English and cyclic phonology.” Language 60, 21-54.
Sampson, Geoffrey (1992) “Probabilistic parsing.” In: J. Svartvik (ed.), Directions in Corpus Linguistics. Berlin: Mouton de Gruyter.
Santa Ana, Otto (1992) “Locating the linguistic cycle in vernacular speech: Chicano English and the exponential hypothesis.” In: J. M. Denton, G. P. Chan and C. P. Canakis (eds.), CLS 28: Papers from the 28th Regional Meeting of the Chicago Linguistic Society, Vol. 2: The Cycle in Linguistic Theory. Chicago: CLS, 277-287.
Schilling-Estes, Natalie (2002) “Linguistic structure.” In: J. K. Chambers, P. Trudgill and N. Schilling-Estes (eds.), 203-205.
Schultink, Henk (1961) “Produktiviteit als Morfologisch Fenomeen.” Forum der Letteren 2:1, 110-125.
Selkirk, Elisabeth (1982) The Syntax of Words. Cambridge: The MIT Press.
Siegel, Dorothy (1974) Topics in English Morphology. PhD thesis, MIT.
Smółkowa, Teresa (2001) Neologizmy we współczesnej leksyce polskiej. Kraków: PAN.
Spencer, Andrew (1991) Morphological Theory. Oxford: Blackwell.
Štekauer, Pavol (2002) “On the theory of neologisms and nonce-formations.” Australian Journal of Linguistics 22:1, 97-112.
Štekauer, Pavol (2005) Meaning Predictability in Word Formation: Novel Context-Free Naming Units. Studies in Functional and Structural Linguistics 54. Amsterdam/Philadelphia: John Benjamins Publishing Company.
Štekauer, Pavol and Rochelle Lieber (eds.) (2005) Handbook of Word-Formation. Dordrecht: Springer.
Stockwell, Robert and Donka Minkova (2001) English Words: History and Structure. Cambridge: Cambridge University Press.
Stubbs, Michael (1993) “British traditions in text analysis: from Firth to Sinclair.” In: M. Baker, G. Francis and E. Tognini-Bonelli (eds.), Text and Technology: In Honour of John Sinclair. Amsterdam: John Benjamins Publishing Company, 1-33.
Szymanek, Bogdan (1989) Introduction to Morphological Analysis. Warszawa: Państwowe Wydawnictwo Naukowe.
Szymanek, Bogdan (2005) “The latest trends in English word-formation.” In: P. Štekauer and R. Lieber (eds.), 429-448.
Tagliamonte, Sali A. (2006) Analysing Sociolinguistic Variation. Cambridge: Cambridge University Press.
Thiel, Gisela (1973) “Die Semantischen Beziehungen in den Substantivkomposita der Deutschen Gegenwartssprache.” Muttersprache 83, 377-404.
Thomson, A. J. and A. V. Martinet (1984) A Practical English Grammar. Oxford: Oxford University Press.
Thorndike, Edward (1921) A Teacher’s Wordbook. New York: Columbia Teachers College.
Tognini-Bonelli, Elena (2001) Corpus Linguistics at Work. Amsterdam: John Benjamins.
Warren, Beatrice (1990) “The importance of combining forms.” In: W. Dressler, H. Luschützky, O. Pfeiffer and J. Rennison (eds.), Contemporary Morphology: Trends in Linguistic Studies. Berlin: Mouton de Gruyter, 111-132.
Weinreich, Uriel, William Labov and Marvin I. Herzog (1968) “Empirical foundations for a theory of language change.” In: W. Lehmann and Y. Malkiel (eds.), Directions for Historical Linguistics. Austin: University of Texas Press, 95-195.
Wolfram, Walt and Ralph W. Fasold (1974) The Study of Social Dialects in the United States. Englewood Cliffs, NJ: Prentice-Hall.
Yule, George (1996) The Study of Language (2nd edition). Cambridge: Cambridge University Press.
Zwicky, Arnold and Geoffrey Pullum (1987) “Plain morphology and expressive morphology.” In: J. Aske, N. Beery, L. Michaelis and H. Filip (eds.), Berkeley Linguistics Society: Proceedings of the Thirteenth Annual Meeting, General Session and Parasession on Grammar and Cognition. Berkeley, California: Berkeley Linguistics Society, 330-340.

Dictionaries
Algeo, John (ed.) (1991) Fifty Years among the New Words: A Dictionary of Neologisms 1941-1991. Cambridge: Cambridge University Press.
Ayto, John (1990) The Longman Register of New Words. London: Longman.
Ayto, John (2006) Movers and Shakers: A Chronology of Words That Shaped Our Age. New York: Oxford University Press.
Barnhart, Clarence, Sol Steinmetz and Robert K. Barnhart (1973) A Dictionary of New English. London: Longman.
Green, Jonathon (1991) Neologisms: New Words since 1960. London: Bloomsbury Publishing.
Knowles, Elizabeth (1997) The Oxford Dictionary of New Words. Oxford, New York: Oxford University Press.