Writing the corpus-based history of spoken English: The elusive past of a cleft construction

Christian Mair
University of Freiburg

Abstract

The past two decades have seen considerable advances in the corpus-based “real-time” investigation of linguistic change in English, both in older stages of the language and in progress now. Inevitably, given our present resources, most claims about changes in the language as a whole have been based on written data. Against this backdrop, the present paper seeks to define the potential and limitations of the corpus-based “real-time” study of change in the spoken language, where even for a well documented language such as English the major problem is the paucity of corpus data. In the absence of recordings of suitable quality, the study of real speech in real time will never be pushed back further than the early 20th century, but as I will make clear with the example of the WW I Phonographische Kommission recordings, a number of interesting resources may well deserve more corpus-linguistic attention than they have received so far. Considerable progress is also likely in the study of the history of the spoken language “by proxy”, i.e. through speech-based genres, of which vast amounts have recently been made available for corpus-linguistic study (Old Bailey, Literature Online, Google N-grams). Particularly with regard to grammar, though, more attention needs to be paid to the question of what is really speech-like in supposedly speech-based genres and which features of spoken syntax are likely to be edited out of the written rendering. Cleft constructions, present both in written and spoken English, but structurally and statistically more richly represented in the latter, will serve as illustration of this point.

1. Introduction

The history of a language can only be studied in real time if primary data have been preserved. In all other cases we have to resort to the relative chronologies of reconstruction, as in historical and comparative linguistics, or to extrapolations in “apparent time,” from contrasts between older and younger speakers at a given synchronic point in time, as in sociolinguistic variationism. For most of the history of the English language, the only technology available for preserving primary language data was writing. The mechanical and, subsequently, electric and electronic recording and storage of speech sounds did not start until 1877, when Thomas Alva Edison recorded himself reciting the nursery rhyme “Mary had a little lamb” on the phonograph. Sound recordings have survived in sufficient amounts and quality for linguistic analysis only from the first quarter of the 20th century. For the historical linguist, the use of writing to record utterances is unproblematic as long as it can be assumed to do no more than take down on the page what has been said. However, this assumption of the innocence of the new technology is in all likelihood justified only for the very first stages of literacy, for example when Old English scribes first attempted to record their people’s oral poetry for posterity, using a slightly modified version of the Latin script. By the late Old English period, English boasted a standardised orthography based on the West Saxon dialect, and after the Early Modern “Gutenberg” revolution there is no doubt that the history of the written language increasingly followed a dynamic of its own. The result is that although written and spoken English of course instantiate the same underlying grammar, there are nevertheless drastic contrasts in the frequencies with which individual constructions occur in the two modalities, to the point that a normal spoken focussing construction of the type illustrated in (1) would be as unlikely to surface in contemporary writing as the complex nominal premodifications in (2) would be to occur in spontaneous speech:

(1) what is a common occurrence is you’ll have somebody coming into a college to do a workshop on work with the disabled or dance with the disabled and you’ll go along to that workshop and it would be full of able bodied students […] (DCPSE DI A01)

(2) In a recent paper (Kemball-Cook et al, 1990), we demonstrated a modified sodium dodecyl sulphate polyacrylamide gel electrophoresis (SDS-PAGE) method for visualization of factor VIII heavy chain (FVIII HC) polypeptides. (FLOB J09)

Example (1) can be brought into line with the requirements of written norms by the addition of the conjunction that: “what is a common occurrence is that you’ll have somebody coming ….” However, as I shall argue below, we are not dealing here with a simple case of optional that, such as is, for example, commonly encountered in object noun clauses after verbs of saying and thinking, where presence or absence of that is largely a matter of formality and register. Rather, (1) represents an independent focussing construction which has remained restricted to the spoken language. The reason for this is probably that, by the prescriptive standards of the written norm, the structure is felt to be anacoluthic. The written norm generally requires complex sentences which are syntactically fully integrated, with constituent clauses being clearly identifiable as either main or subordinate, and no unintegrated fragments left behind. The variant with that meets these requirements: a complex sentence with an overall subject – predicate – predicative complement structure, in which the subject is realised as a finite nominal relative clause and the complement as a finite noun clause. Such an exhaustive analysis is impossible for the that-less variant. In structural terms, the second part works as an independent main clause – a plausible analysis which leaves the first part (what is a common occurrence is) an unintegrated syntactic fragment. Lack of syntactic integration of this kind is not exceptional but pervasive in cleft constructions in spoken English. Calude, for example, finds 30 unintegrated wh-clefts in a total of 74 instances in her study of the Wellington corpus (2009: 170). In her analysis of a standard example from the literature similar to (1), Miller and Weinert’s (1998: 292) cause what you’re doin’ is you’re goin up the side of the allotments, she also argues that “the cleft constituent is, in effect, a fully independent clause, which is separate from the rest of the cleft” (2009: 170). Apart from clefts, there are several other constructions with weak or non-existent syntactic integration which are generally absent from written language or tend to be purged from written representations of spoken language – such as, for example, left and right dislocation of noun phrases (cf. this man, I can’t stand him or he gets on my nerves, this man). A focus construction similar to the one illustrated in (1) will serve as a test case for our exploration of the prehistory of spoken English below. The complex noun-phrase premodification illustrated in example (2), by contrast, serves as a device to maximally compress information, which is not a priority in most types of spoken communication. Such clear contrasts between grammatical preferences in speech and writing raise two questions:

(a) Are our corpus-based histories really histories of English, or just histories of the written (and standard) language, potentially misrepresenting or even disregarding developments in informal speech?

(b) If so, how can we recover the largely lost or submerged diachrony of the spoken language? For after all, constructions such as the one represented in example (1) are probably not recent innovations but have a history of their own.

At the present juncture, these are important questions in the development of corpus linguistics. For the remoter periods in the history of English, considerable efforts are being made at the moment to make spoken language accessible through speech-based written genres such as informal letters or witness depositions (cf., e.g., CEEC, the Corpus of Early English Correspondence; CSC, the Corpus of Scottish Correspondence; or the Old Bailey Corpus). For this approach to be successful, we need a precise idea of the ways in which speech-based writing resembles actual spoken language and in which it does not – questions which are currently being explored in a growing body of corpus-linguistic literature from diverse theoretical perspectives and with focus on various periods in the history of the language (cf. Collins 2001; Short, Semino and Wynne 2002; Mollin 2007; Culpeper and Kytö 2010; Moore 2011). As Culpeper and Kytö emphasise, “[f]or the historical context, the base-line expectation should never be that speech report is faithful: we simply do not have the evidence to support that expectation” (2010: 81).

For the recent historical period in which recorded sound is available, two partly compatible avenues of research are being explored. The majority of corpus linguists documenting change in the spoken language follow standard practice in the field by reducing speech to orthographic transcription. As examples, consider two flagship projects, namely the Diachronic Corpus of Present-Day Spoken English (DCPSE) and the Corpus of Contemporary American English (COCA). The advantages of this strategy are obvious. Orthographically transcribed spoken language can be obtained, stored and searched fairly easily in fairly large quantities. The price to pay is that certain issues which are central in the emergence of new constructions in the spoken language, such as the syntax-prosody interface, can be investigated only indirectly or not at all. Also, there is circumstantial evidence that even the most conscientious orthographic transcriptions are subject to unacknowledged homogenising and standardising pressures (Mollin 2007). A comparatively less well-trodden path is the compilation of audio corpora or multi-modal corpora aligning sound and transcription (see Andersen 2010 for a recent survey of such projects). In spite of the considerable legal, technological and logistical challenges, this is a path also worth taking.

2. Real speech in real time: spoken English 1900 to the present

Adopting a loose definition, we could argue that the corpus-based investigation of language change in real time is almost as old as the discipline of historical philology itself, for it is the method which Otto Jespersen, for example, applied when he compared, say, the increasing frequency of progressive forms in the diachronic corpus represented by successive English translations of the Bible (1909-49, IV: 177). Providing data for the study of change in real time was the motivation behind the compilation of the two pioneering historical computer corpora of English, the Helsinki Corpus of English Texts, and ARCHER, A Representative Corpus of Historical English Registers. Real-time historical investigations in the narrowest sense became possible when Brown and LOB, the standard synchronic reference corpora of American and British English (1961), were complemented by their 1990s (Frown and FLOB), 1930s (B-LOB [completed], “pre-Brown” [ongoing]) and 1900s versions (ongoing). Note, however, that none of these ventures involved genuine spoken data. The investigation of the history of real speech in real time thus has a very short tradition. There are some pioneering studies which restrict themselves to pronunciation and are not based on corpora in the usual sense of the term. Harrington, Palethorpe and Watson (2005), for example, investigate various segmental phonetic changes in the Queen’s Christmas broadcasts between 1952 and 2002, while Price (2008) documents the changing pronunciations of Australian newsreaders. Pronunciation change in vernacular varieties of English is in focus in two sociolinguistically informed documentation projects, ONZE (Older New Zealand English) and NECTE (Newcastle Electronic Corpus of Tyneside English). The first general-purpose corpus designed for the study of ongoing change in contemporary spoken English is DCPSE, the Diachronic Corpus of Present-Day Spoken English, which covers the period between the late 1950s and the early 1990s. As a small (c. 900,000 words) corpus which reduces speech to orthographic transcription, it is particularly suitable for the study of morphosyntactic features of mid to high frequency, such as the following type of variability:

(3a) What I did was to call the police. [to-infinitive]
(3b) What I did was call the police. [unmarked infinitive]
(3c) What I did was I called the police. [finite clause]

This is a specific sub-type of the wh-cleft (or “pseudo-cleft”) construction, which Quirk et al. define as “essentially an SVC sentence with a nominal relative clause as subject or complement” (1985: 1387). In our example, the nominal relative or “cleft” clause (what I did) contains a form of the pro-verb do rather than some other verb, and in (3a) and (3b) the clefted constituent is realised as a marked or unmarked infinitival clause. These are common structural options in spoken and written English and recognised as such in the two major reference grammars of present-day English (Quirk et al. 1985: 1387-1389, Huddleston and Pullum 2002: 1420-1423) and in Biber et al. (1999: 959-960). What these reference grammars do not state is that they are also subject to strong ongoing diachronic change, with 20th century corpora showing a very clear trend from the older (3a) to the historically more recent variant (3b). This trend shows up very clearly in “real-time” analyses of the Brown family of corpora and the DCPSE (Mair 2012). A synchronic cross-variety comparison of ten sub-corpora of the International Corpus of English (Mair and Winkle 2012 1) shows that most New Englishes represented in this corpus collection, among them all natively spoken varieties, are affected by this development. Note that in the analyses which follow, the category of pseudo-clefts is defined somewhat more loosely than in the reference grammars and includes not only what-clefts proper, but also alternatives mostly involving quantifiers or ordinal numbers and adjectives such as all, the only/first/next/last thing:

(3d) All/The only thing/The next thing etc. I did was to call the police. [to-infinitive]
(3e) All/The only thing/The next thing etc. I did was call the police. [unmarked infinitive]
(3f) All/The only thing/The next thing etc. I did was I called the police. [finite clause]
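The three-way contrast in (3a) to (3f) can be approximated in a corpus search by looking at the first word after the copula. The following Python sketch is only an illustration under simplifying assumptions and is not the retrieval procedure used for the counts reported in this paper: the opener list is limited to the forms named in the examples, only the string did was is covered, comma variants are ignored, and a finite clause is recognised only when it begins with a subject pronoun. The names OPENER, CLEFT and classify are hypothetical.

```python
import re
from typing import Optional

# Hypothetical opener list, taken from examples (3a)-(3f); not exhaustive.
OPENER = r"(?:what|all|the (?:only|first|next|last) thing)"

# Opener, a short subject (one to three words), pro-verb "did", copula "was",
# then the first word of the clefted constituent.
CLEFT = re.compile(
    rf"\b{OPENER}\s+(?:\w+\s+){{1,3}}?did\s+was\s+(?P<first>\S+)",
    re.IGNORECASE,
)

# Simplifying assumption: a finite complement starts with a subject pronoun.
SUBJECT_PRONOUNS = {"i", "you", "he", "she", "it", "we", "they"}


def classify(sentence: str) -> Optional[str]:
    """Return the complement type of a do-pseudo-cleft, or None if no match."""
    match = CLEFT.search(sentence)
    if match is None:
        return None
    first = match.group("first").lower().strip(",.;:")
    if first == "to":
        return "to-infinitive"
    if first == "that":
        return "that-clause"
    if first in SUBJECT_PRONOUNS:
        return "finite clause"
    # Anything else is treated as an unmarked infinitive; this needs manual checking.
    return "unmarked infinitive"


for example in (
    "What I did was to call the police.",
    "What I did was call the police.",
    "The only thing I did was I called the police.",
):
    print(classify(example), "|", example)
```

A heuristic of this kind over-generates (the fallback category in particular) and would have to be followed by manual inspection of every hit.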

This extension is not unproblematic (though widely practiced in specialist studies on the subject – cf., e.g., Tognini-Bonelli 1992; Calude 2009: 57-58). Unlike What I did was call the police and I called the police, pairs such as The only thing I did was call the police and I called the police are not fully equivalent in terms of truth value. However, they are comparable in their discourse function and in their impact on the information structure of the utterance. What is more, the “loose” variants of the construction (examples (3d) and (3e)) pre-date the core pseudo-clefts by around two centuries in terms of historical origin (Traugott 2008). This is, of course, a powerful argument for including them in a diachronic study such as the present one. What we will be concerned with in section 3 below is the status of (3c) and (3f), which are almost exclusively attested in contemporary spoken data (and hence almost absent from even such a rich resource as the Corpus of Historical American English (COHA)). The existence of this option is recognised in Huddleston and Pullum (2002: 1422, n. 32), who point out that “it is possible, in relatively informal style, for the value phrase to be a declarative content clause: What they did was they threw us out and locked the door.” One explanation for the absence of this construction from written and formal styles (and its near absence from reference grammars of contemporary English) is that it might be a genuine innovation which, originating in spontaneous speech, has not yet made it into written or formal registers. Alternatively, it may not be a new construction at all, but a construction of old standing in the spoken language which, being considered anacoluthic, has just not registered in our predominantly written and formal record of the English language. To answer this question, we can look at authentic spoken material pre-dating DCPSE, in the hope that there is sufficient data to help us trace the history of this construction in the first half of the 20th century. Beyond that time, we shall have to rely on speech-based genres.
If the speech-like material from past ages provides continuous positive evidence for the use of the construction, we can take its existence in the spoken language for granted. If the evidence remains discontinuous or scant, the situation is more complex. Either the construction is really of recent historical origin, or it is among those structures which tend to be normalised and standardised automatically as part of the writing routine even in speech-based genres. As even this simple example chosen for illustration shows, compiling the ideal diachronic spoken corpus is like squaring the circle; it is impossible to meet the conflicting requirements of total authenticity (real speech rather than speech-based writing) and time-depth (continuous coverage of the three or four centuries usually required for major syntactic changes to unfold rather than the one-hundred-year time window opened up by sound recording). Ideally, we would like:

(a) informal and spontaneous speech
(b) in recordings of high quality
(c) in recordings opening up the maximum possible time depth
(d) in recordings which are aligned to orthographic transcriptions
(e) in amounts large enough to make possible investigations beyond the level of segmental phonology
(f) coming from a broad range of speakers to level out idiolectal bias.

As can be seen, DCPSE succeeds on criterion a) and largely also on criteria e) and f), but clearly fails on criteria b), c) and d). Failure on b) and d) can be remedied, as the recordings are available in digital format, but remain to be anonymised and aligned to the transcription (a process which, as anyone familiar with the problem will appreciate, is extremely laborious and time-consuming). 2 Failure on criterion c) is permanent, as audio-documentation of spontaneous face-to-face interaction does not extend into the first half of the 20th century. An early 20th-century source of spoken data, which deserves to be made more widely known to the historical (corpus-)linguistic community because of its potential value as a source for the diachrony of spoken English, is the WW I Phonographische Kommission recordings, currently the database for a Freiburg-based PhD project on “A real-time history of dialect death and koinéisation in 20th-century England” (Holz in progress 3). In 1915 the Königlich Preußische Phonographische Kommission [Royal Prussian Phonographic Commission] was founded by a number of mostly Berlin-based linguists who realised that the presence of a large number of prisoners of war provided a unique opportunity to record speech samples and music from a population of very diverse linguistic and cultural background. The Anglicist Alois Brandl (1855-1940) was an active member of this group and helped assemble a large collection of folk songs and dialect recordings, some of which were analysed by himself in a number of publications in the 1920s which – rather innovative at the time – often included the recordings as shellac records (e.g. Brandl 1926-27). Owing to subsequent political events – World War II and the partition of Germany in the Cold War being the most prominent ones – this valuable body of data was largely forgotten.
Fortunately, most of the recordings, which were part of the Berliner Lautarchiv, survived and were digitised professionally in the 1990s, though not in a specialist-linguistic project but as part of a wider project to save Berlin’s digital heritage (Ziegler 2000; Mahrenholz 2003, Lange 2010). Copies of 821 Berlin sound files were acquired by the British Library and incorporated into its sound archive collections under the heading “Berliner Lautarchiv British & Commonwealth Recordings” (C1315). A small selection of these recordings has been made publicly available for listening. 4 Although the recordings comprise several hours of speech in their totality, they unfortunately do not add up to a balanced corpus of spoken English. Speakers were asked to read passages from the Bible, in particular the Parable of the Prodigal Son (Luke XV, 11-32), to tell short folk tales, recite folk poetry, sing songs or – in some cases – simply to count. Given the state of recording technology at the time and the research priorities of the compilers, the recording of informal conversation was neither feasible nor felt to be desirable. All things considered, the data are a unique resource for the study of historical dialect phonology, of a quality which fully meets the standards for instrumental phonetic analysis, as the following measurements of a Norwich informant’s long high back and stressed central vowels (lexical sets GOOSE and STRUT) illustrate:

Figure 1. Measurements for vowels in two sons, produced by Norfolk informant born in 1898 5 and recorded in 1917

As the phrase two sons – from the Biblical parable of the Prodigal Son – recurs in many other recordings, comparative studies across different speakers are feasible. On the question formulated above, however (age of the “spoken” type of pseudo-cleft construction illustrated in (3c) and (3f)), the material does not hold the answer.

3. Speech by proxy: spoken language as reflected in speech-based genres

The history of (3c) and (3f), i.e. the constructional type What/All etc. I did was I called the police, implies two questions. First, how old is the construction? And second, how did it arise? We know from previous research on specificational cleft constructions involving infinitives (Traugott 2008) that the focus uses of the construction originated from predicational uses, originally not involving what itself but alternative openings such as all. Thus, a complex sentence such as all I did was to help the police originally had a predicational meaning, equivalent to all I did was done in order to help the police. The basis for the emergence of the focus uses was provided by contextually ambiguous cases, such as the following one from the Old Bailey Corpus:

(4) Whether I struck her or not, I cannot say; but what I did was to defend myself, or I should have been murdered by her. (Old Bailey, 1755)

Example (4) allows a purposive paraphrase (“what I did was done in order to defend myself”) or a focus one (“defend myself was what I did”), whereas only the latter is contextually plausible in (5) below, from the same corpus, but more than 100 years later:

(5) How she was brought out from the bed to that place I do not know; all I did was to help her across the hall to the bath-room door. (Old Bailey, 1880)

As has been mentioned, Traugott also points out that the core pseudo-cleft constructions with what arose later than the related patterns with all etc., which also argues for a conservative analysis of the 1755 example. Note that the finite-clause constructions illustrated in (3c) and (3f) are never ambiguous between a purposive and a focus reading. Thus, the relevant variants of (4) and (5) are both clear examples of focus constructions:

(4a) what I did was I defended myself, or I should have been murdered
(5a) all I did was I helped her across the hall to the bath-room door

Nevertheless, the construction is attested in speech-like contexts as far back as the variant with the to-infinitive, as the following examples from LION, the “Literature Online” database, show. 6 Also note that, as in the case of the infinitive, constructions with introducers other than what (e.g. all or, as in (6) below, the next thing) pre-date those with what, though in view of the very small number of examples this may be due to chance. What is sobering to the historian of spoken English, however, is that this truly massive database, containing, among other things, the text of more than 350,000 poems, plays and works of fiction and therefore well positioned to yield up examples from speech-based genres, contains only six 7 relevant examples spanning the period from the late 17th to the late 20th centuries:

(6) Ay, Sir; and I thank you, the next thing you did, was, you begot me; the Consequence of which was as follows [...] (Thomas Otway, The Atheist [1684], Literature Online database)

(7) ‘But be that as it may,’ says he, ‘you’re improving tenants, and I’m confident my brother will consider ye; so what you’ll do is, you’ll give up the possession to-morrow to myself, that will call for it by cock-crow, just for form’s sake; and then go up to the castle with the new lase ready drawn [...]’ (Maria Edgeworth, The Absentee [1812], Literature Online database)

(8) [...] we didn’t roll it down at all, sir: all we did was, we tipped it down just as carefully [...] (Robert Traill Spence Lowell, Antony Brade [1874], Literature Online database)

(9) “I hain’t lied to you,” said poor Philip, “‘n’ I guess the most stealin’ ever I done was I took a St. Bart’s trap I thought they’d left. […]” (Robert Traill Spence Lowell, Antony Brade [1874], Literature Online database)

(10) then Mrs. Sorenson told us: “now, what we are going to do is we are going to tell each other what we did during the rainstorm! […]” (Charles Bukowski: “we ain’t got no money, honey, but we got rain” [1992], Literature Online database)

(11) In McDonald’s yesterday there was this woman smoking in the non-smoking section. So what I did was, I went over and said “Go ahead, Dear, blow that smoke in my face.” (Helen Conkling, “In the Harvey Street Diner” [1997], Literature Online database)

Note that three out of four pre-20th century attestations involve not the core subcategory of what-pseudo clefts, but alternatives (the next thing you did, all we did, the most stealin’ ever I done), justifying in retrospect the decision taken here to collect these as well in a diachronic study concerned with the historical origin of the construction as a whole. Note further that every single example is from speech-like contexts in literary works: staged speech from a Restoration comedy in (6), direct speech from fictional narrative in (7) to (9), with additional vernacular touch, and extracts from the work of two poets cultivating a conscious conversational tone in (10) and (11). This is worth pointing out in view of previously voiced suspicions that it may in fact be the fictional and invented literary representation which provides the most authentic record of the spoken language of bygone days, and not non-fictional speech-based genres such as court transcripts or witness depositions. Where transcribed witness depositions may focus on the content and on the precise words which were used, at the expense of grammatical constructions, fictional speech, with no pre-existing content outside the work to report, might provide the more direct window on language structure and form, because it is through them that a writer manages to animate a voice:

To adapt the title of Collins (2001) […] voices can be reanimated. In such cases one might expect faithfulness to the linguistic characteristics of speech, rather than the specifics of what was actually said, and these characteristics can be retrieved with the corpus method. (Culpeper & Kytö 2010: 81)

Having established the age of the construction, we can now move on to explore its origin. Does it derive by reduction from a variant with a subordinate clause explicitly marked by the conjunction that: what I did was I called the police ← what I did was that I called the police? Or is it better understood as a construction sui generis which represents emergent syntax (Hopper 2001, 2004) in informal speech and is blocked from entering written usage because it is felt to be anacoluthic – much like dislocations or copy pronouns 8 are usually edited out from written texts? From a purely chronological point of view, the reduction analysis is possible. LION contains the following instance of a finite clause introduced by the conjunction that which was produced a few decades before the oldest attestation of the unintroduced alternative:

(12) No Madam (I answer’d) ‘tis not Blacius but Izadora which has done it, that glorious confession shee made him in my favour was the essentiall cause of it, all that Blacius did, was, that he kill’d me not, but ‘twas his vnequall’d Daughter gave me my Life by giving me that which makes me value it […] (Roger Boyle, Earl of Orrery, Parthenissa I [1655], Literature Online database)

For a reduction analysis to be plausible rather than merely possible, however, we would expect the supposedly full variant to be more common than the reduced one. This is clearly not the case, as that-clauses and the corresponding unintroduced ones remain equally rare almost into the 20th century:

(13) He a Tradesman? ‘Tis meer Scandal, he never was one. All that he did was, that he was very obliging, very officious, and as he was a grand Connoisseur in Stuffs, he used to pick them up every where, have ‘em carried to his House and gave ‘em to his Friends for Mony. (Henry Baker and James Miller, The Cit Turn’d Gentleman [1739], Literature Online database)

(14) […] The only just Thing the Rogues did, was, That when the Spaniards came on Shore, they gave my Letter to them, and gave them Provisions and other Relief, as I had ordered them to do [...] (Daniel Defoe, The Farther Adventures of Robinson Crusoe [1719], Literature Online database)

(15) “I told you no lie,” said Hawkins, trying to stand his ground. “All I did was that I didn’t answer your letters because I couldn’t get out of that accursed engagement, and I didn’t know what to say to you, and then the next thing I knew was that you were engaged, without a word of explanation to me or anything.” (E. Oe. Somerville & Martin Ross, The Real Charlotte [1894], Literature Online database)

Essentially the same distribution as in Literature Online is presented by another major corpus, the Corpus of Historical American English (COHA). There is practically no usable evidence until the second half of the 20th century, and even at this late stage the statistics do not warrant far-reaching conclusions. Thus, in the third person present (search string does is) 9 we find a total of seven relevant examples from 1996 onwards, six without and one with that. The two variants are illustrated by (16) and (17) respectively:

(16) And Clinton has been lucky. The first few years he had some nicks and had to miss a couple of games, but he hasn’t had anything really serious. Now what this does is it points to the offseason and how crucial the offseason conditioning and training and the rehab is, how crucial that will be for his ultimate success to continue on. (COHA, Washington Post, 2006)

(17) On its first day of operation in 1932, 48,611 cars crossed the skyway; today the number averages 85,000, according to the Department of Transportation. “The thing that the Pulaski Skyway does is that it allows you to leap over all the railyards, the Meadowlands, the industrial wastelands that pepper that area,” said Jeffrey M. Zupan, a senior fellow for transportation at the Regional Plan Association. (COHA, New York Times, 2007)

Another potentially promising source of data covering the late 17th to the early 20th centuries was the Old Bailey Proceedings. However, using similar search strategies to the ones employed for Literature Online, no relevant examples were discovered. Either the proceedings are not as speech-like in their rendering of the witnesses’ morpho-syntax as might be expected, or the amount of text held in this database is too small for a historical documentation of the very rare construction studied here. The reverse problem is faced when using the Google N-gram viewer (Michel et al. 2010). Searches for strings such as did was that, restricted to the crucial period of 1600 to 1800, yield up to several hundred examples, but the quality of the returns leaves much to be desired. After sifting out irrelevant and multiple returns 10 or returns from more recent periods mis-assigned to earlier periods, one is left with data that largely overlap with those from Literature Online. We now turn from COHA and the other databases mentioned, with their speech-based material, to the Corpus of Contemporary American English (COCA), which contains orthographically transcribed real speech from media contexts. In these transcribed spoken data we encounter a very different picture, with statistics which now clearly show the variant with that to be a marginal option. A search for what I did was I (including variants with a comma after did, after was or after both) returns 55 relevant hits, whereas what I did was that I (including variants with commas) yields just one:

(18) But on looking into it, yes, I made a dumb mistake. What I did was that I confused two attempted assassinations -- two attempts to assassinate President Ford, both happened in September 1975. (COCA, NPR Weekend, 1992)

These are three typical examples of the alternative construction:

(19)

And you see how it’s really nice and caramelizing, and I didn’t put any extra oil or butter in there, there’s enough fat just from the steak. And what I did was I put kosher salt. (COCA, CBS Early, 2007)

(20)

Well, what I did was I went and did my own checking among people, both who knew her and who were friendly to her and people who were not friendly to her. (COCA, Ind. Geraldo, 1994)

(21)

Yeah, I’m going to show you how to do that. Here is already the plain glass plate, and what I’ve done is, I just cut the image out of a calendar and I didn’t put it on the top, what I did, was, I flipped the plate over and glued it on the back, because you don’t want the image to show through. (COCA, NBC, TodaySat, 1998)
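The two search strings and their comma variants can be approximated offline with regular expressions. The following is a minimal sketch only, assuming the transcripts are available as plain text; the names BARE, THAT and classify are hypothetical and do not reproduce the query syntax of the COCA interface.

```python
import re

# Optional comma after "did" and/or after "was", mirroring the
# punctuation variants included in the searches described above.
BARE = re.compile(r"\bwhat I did,? was,? I\b", re.IGNORECASE)
THAT = re.compile(r"\bwhat I did,? was,? that I\b", re.IGNORECASE)


def classify(utterance: str) -> str:
    """Label an utterance by the type of cleft complement it contains."""
    if THAT.search(utterance):
        return "that-clause"
    if BARE.search(utterance):
        return "bare finite clause"
    return "no match"


samples = [
    "What I did was that I confused two attempted assassinations",
    "And what I did was I put kosher salt.",
    "what I’ve done is, I just cut the image out of a calendar and "
    "I didn’t put it on the top, what I did, was, I flipped the plate over",
]

for text in samples:
    print(classify(text), "|", text[:40])
```

Because the patterns are tied to the literal string what I did, they pick up the second cleft in (21) but not the first (what I’ve done is, I ...), which is why additional instances of the construction have to be found by reading the hits in context.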

Note that, in addition to the one targeted by the search, example (21) contains another example of the construction (“what I’ve done is, I just cut the image out of a calendar”). Taken all together, the examples suggest that we can be optimistic about the syntactic realism of the transcribed broadcasts, which are the chief source of spoken language in COCA. This is noteworthy as these transcriptions are not produced by linguistically trained personnel. One typical feature of spoken syntax is preserved better in these texts than in historical court transcripts (cf. Old Bailey) or fiction (LION).

4. Conclusion

The present paper has explored the history of a particular type of focussing construction which is strongly associated with spoken English, namely specificational clefts of the type What I did was I called the police. In the absence of sufficient spoken data older than the mid 20th century – recorded or transcribed – the research had to rely on the investigation of “speech by proxy” in selected speech-based written genres. The relevant findings show that finite clause complements are not recent innovations but can be continuously if sparsely attested from the 17th century onwards. As regards their origin, they should not be seen as reduced versions of a full construction with a that-clause, but as independent innovations in spoken syntax which did not make it into standard and formal written usage. As in the related case illustrated in example (1) above, the reason is lack of full and explicit grammatical marking of main-clause and subordinate-clause status, which leads to the construction being considered syntactically incomplete and hence unsuitable by the editorial standards governing competent writing.

Beyond exploring the immediate phenomenon under study, the paper has aimed to formulate a few recommendations to put the diachronic study of spoken language on a sounder footing. First and most importantly, the study of the history of real speech in real time should be made a priority in corpus-linguistic research. It has been possible to record human voices for almost 150 years. From the first quarter of the 20th century onwards archives contain treasures which are widely dispersed and often unknown but would clearly reward the attention of corpus-linguists and historical linguists. I have referred to the WW I Phonographische Kommission recordings as a case in point. If this resource and similar ones are identified, salvaged and developed as corpora, pioneering ventures such as the Newcastle Electronic Corpus of Tyneside English (NECTE) or the Older New Zealand English (ONZE) project could soon be complemented by data of similar quality and research potential for many other standard and non-standard varieties.

For the period before sound recording, systematic studies should describe in which ways the several supposedly speech-based genres explored by historians of spoken English are like speech and in which ways they are not. For the early history of the construction studied here, literary representations of speech seemed closer to authentic spoken language than the court records collected in the Old Bailey corpus. As regards the recent past, finite clause complements in clefts were not among the features of spoken syntax likely to be dropped from transcriptions produced by non-linguists. Other features, on the other hand, may pattern differently, and no premature generalisations should be drawn from the results of the present study. Systematic investigations of other spoken constructions are required for comparison, and for a full assessment of the value that a particular type of “speech by proxy” has for the reconstruction of the pre-history of spoken English.

In all, the study of the history of spoken English remains a difficult and challenging enterprise, caught up in a paradox which we could pointedly formulate thus: for the recent past, the era of sound recording, we can hear the sounds, but we still have to “read” the voices through reconnecting the recordings to their sociolinguistic and cultural context. For the time before recording technology, we have to read the voices of the past literally, from the written sources, and therefore worry about precisely what it is that we “hear” without the sound.


Notes

1

The two studies mentioned also document the status of minor additional variants, such as What I should have done is called the police [past participle] or What I am doing is calling the police [V-ing], which can be disregarded for the present argument. The early history of the construction from the 16th century onwards is studied in Traugott 2008. Variability between the marked and unmarked infinitives in 20th century English is analysed in several papers by Rohdenburg (1998, 2000, 2006), though not from a diachronic perspective.

2

DCPSE has overlap with ICE-GB in its recent data. The sound-files shared between the corpora can be obtained as part of the ICE-GB package.

3

Holz’ study situates itself in the context of other real-time studies of phonetic change and dialect genesis, the prime example of which is the work of Gordon, Hay and others on early New Zealand English (e.g. Gordon et al. 2004; Trudgill et al. 2000; Trudgill et al. 2003). As in the present case, the work of this group was inspired by the (re-)discovery of authentic recorded material of unique size and quality.

4

See http://sounds.bl.uk/Accents-and-dialects/Berliner-Lautarchiv-British-and-Commonwealth-recordings.

5

The British Library’s http://sounds.bl.uk site gives the informant’s name as “Fred Eccles” and the date of birth, wrongly, as 1888 (corrected to 1898 later on in the entry). Brandl (1927) published the same recording as a shellac record with accompanying transcription and commentary and also confirms the 1898 birth date (1927: 9).

6

The specificational clefts were targeted through seven searches for the strings DO + BE + I / you / he / she / it / we / they, using LION’s options for lemmatised search and for searches sensitive to spelling variation. Searches for the alternative constructions with the conjunction that (DO + BE + THAT + I etc.), which will be reported on below, over-taxed the system, so that the number of search strings had to be multiplied and strings containing non-standard spellings may have been missed.

7

Or seven, if one counts an example from a Penguin Classics English translation of Dostoyevsky’s “House of the Dead,” which has found its way into LION: “Well, the first thing we did was, we went into a public house.”

8

Cf. constructions such as: “the kids, don’t you love them?”, “I met him again in the pub, the old fool”, exemplifying dislocation, and “a new toy he was very keen for his kids to get it too”, exemplifying a copy pronoun.

9

As searches in COCA and COHA are sensitive to punctuation marks, I also searched for does, is. This precaution notwithstanding, the search will miss cases in which material other than a comma intervenes between the verbs do and be, as in: “The only thing we did to help was to take a little of the stuff out of the spare drum and stow it in our two drums, to leave him some room” (COHA, fiction, 1961). Such examples are central to the argument of Rohdenburg (2006), who assumes that structurally more explicit variants (in his case the to-infinitive) preferably occur in structurally more complex environments (in his and our case those created by the intervening material). The possible relevance of this factor for the distribution of finite-clause complements cannot be investigated, however, because unlike Rohdenburg, who studied the more common infinitival complements, I do not have a statistically critical mass of data.

10

A typical example is the following, from Bunyan’s Pilgrim’s Progress: “But that which put glory of grace into all he did was that he did it out of pure love to his country,” which – in addition to not representing the relevant construction anyway – is returned dozens of times.

References

(a) Corpora and data bases consulted

COCA – Corpus of Contemporary American English, compiled by Mark Davies (Brigham Young University), http://corpus.byu.edu/coca/.

COHA – Corpus of Historical American English, compiled by Mark Davies (Brigham Young University), http://corpus.byu.edu/coha/.

DCPSE – The Diachronic Corpus of Present-Day Spoken English, compiled by Bas Aarts (University College London), http://www.ucl.ac.uk/english-usage/projects/dcpse/.

FLOB – The Freiburg Update of the LOB Corpus, compiled by Christian Mair (University of Freiburg), http://icame.uib.no/cd/.

Google N-gram Viewer, http://books.google.com/ngrams/.

LION – Literature Online, Chadwyck Healey, http://lion.chadwyck.com/.

Old Bailey Corpus, compiled by Magnus Huber (University of Giessen), http://www.uni-giessen.de/oldbaileycorpus/; cf. also Old Bailey Online, http://www.oldbaileyonline.org/.

(b) Works cited

Andersen, G. (2010), ‘How to use corpus linguistics in sociolinguistics’, in: A. O’Keeffe and M. McCarthy (eds.) The Routledge handbook of corpus linguistics. London: Routledge. 547-562.

Biber, D., et al. (1999), The Longman grammar of spoken and written English. London: Longman.

Brandl, A. (1926-1927), Englische Dialekte. Lautbibliothek. Series of 20 pamphlets. Berlin: Preußische Staatsbibliothek.

Brandl, A. (1927), Englische Dialekt – Norfolk: Dialektort Aslacton bei Norwich. Lautbibliothek 6. Berlin: Preußische Staatsbibliothek.

Calude, A. (2009), Cleft constructions in spoken English. Saarbrücken: VDM Verlag Dr. Müller.

Collins, D. E. (2001), Reanimated voices: speech reporting in a historical-pragmatic perspective. Amsterdam: Benjamins.

Culpeper, J., and M. Kytö (2010), Early Modern English dialogues: spoken interaction as writing. Cambridge: CUP.

Gordon, E., L. Campbell, J. Hay, M. Maclagan, A. Sudbury and P. Trudgill (2004), New Zealand English: its origins and evolution. Cambridge: CUP.

Harrington, J., S. Palethorpe and C. Watson (2005), ‘Deepening or lessening the divide between diphthongs: an analysis of the Queen’s annual Christmas Broadcasts’, in: W. Hardcastle and J. Mackenzie Beck (eds.) A figure of speech: festschrift for Jonathan Laver. Mahwah NJ: Lawrence Erlbaum. 227-261.

Holz, J. (in progress), Dialect levelling and koinéization in early 20th century Britain: an analysis of the WW I Lautkommission Recordings. PhD dissertation, University of Freiburg i. Br.

Hopper, P. (2001), ‘Grammatical constructions and their discourse origins: prototype or family resemblance?’, in: M. Pütz and S. Niemeier (eds.) Applied cognitive linguistics I: theory and language acquisition. Berlin: Mouton de Gruyter. 109-129.

Hopper, P. (2004), ‘The openness of grammatical constructions’, Chicago Linguistic Society 40: 239-256.

Huddleston, R., and G. K. Pullum (2002), The Cambridge grammar of the English language. Cambridge: CUP.

Jespersen, O. (1909-49), A modern English grammar on historical principles. 7 vols. Copenhagen: Munksgaard; London: Allen & Unwin.

Lange, B. (2010), ‘Archiv und Zukunft: Zwei historische Tonsammlungen für das Humboldt-Forum’, Trajekte 10, April 2010: 4-6.

Mahrenholz, J.-K. (2003), ‘Zum Lautarchiv und seiner wissenschaftlichen Erschließung durch die Datenbank IMAGO’, in: M. Bröcker (ed.) Berichte aus dem ICTM-National-Komitee Deutschland XII. Bamberg: Universitätsbibliothek. 131-152. Electronic version at: (accessed 16 June 2010).

Mair, C. (2012), ‘Using “small” corpora of written and spoken English to document ongoing grammatical change: the case of specificational clefts in 20th century English’, in: M. Krug and J. Schlüter (eds.) Research methods in language variation and change. Cambridge: Cambridge University Press.

Mair, C., and C. Winkle (2012), ‘Change from to-infinitive to bare infinitive in specificational cleft sentences: data from World Englishes’, in: M. Hundt and U. Gut (eds.) Mapping unity in diversity worldwide. Amsterdam: Benjamins. 243-262.

Michel, J.-B., et al. (2010), ‘Quantitative analysis of culture using millions of digitized books’, Science 1199644, Published online 16 December 2010.

Miller, J., and R. Weinert (1998), Spontaneous spoken language: syntax and discourse. Oxford: Clarendon.

Mollin, S. (2007), ‘The Hansard hazard. Gauging the accuracy of British parliamentary transcripts’, Corpora 2: 187-210.

Moore, C. (2011), Quoting speech in Early English. Cambridge: CUP.

Price, J. (2008), ‘New news old news: a sociophonetic study of spoken Australian English in news broadcast speech’, Arbeiten aus Anglistik und Amerikanistik 33: 285-310.

Quirk, R., et al. (1985), A comprehensive grammar of the English language. London: Longman.

Rohdenburg, G. (1998), ‘Clarifying structural relationships in cases of increased complexity in English’, in: R. Schulze (ed.) Making meaningful choices in English. Tübingen: Narr. 189-205.

Rohdenburg, G. (2000), ‘The complexity principle as a factor determining grammatical variation and change in English’, in: I. Plag and K. P. Schneider (eds.) Language use, language acquisition and language history: (mostly) empirical studies in honour of Rüdiger Zimmermann. Trier: WVT. 25-44.

Rohdenburg, G. (2006), ‘Processing complexity and competing sentential variants in present-day English’, in: W. Kürschner and R. Rapp (eds.) Linguistik international: Festschrift für Heinrich Weber. Lengerich: Pabst. 51-67.

Short, M., E. Semino and M. Wynne (2002), ‘Revisiting the notion of faithfulness in discourse presentation using a corpus approach’, Language and Literature 11: 325-355.

Tognini Bonelli, E. (1992), ‘“All I’m saying is …”: the correlation of form and function in pseudo-cleft sentences’, Literary and Linguistic Computing 2: 30-41.

Traugott, E. (2008), ‘“All that he endeavoured to prove was ...”: On the emergence of grammatical constructions in dialogic contexts’, in: R. Cooper and R. Kempson (eds.) Language in flux: dialogue coordination, language variation, change and evolution. London: Kings College Publications. 143-177.

Trudgill, P., G. Lewis and M. Maclagan (2003), ‘Linguistic archaeology: the Scottish input to New Zealand English’, Journal of English Linguistics 31: 103-124.

Trudgill, P., E. Gordon, G. Lewis and M. Maclagan (2000), ‘Determinism in new dialect formation’, Journal of Linguistics 26: 299-318.

Ziegler, S. (2000), ‘Die akustischen Sammlungen: Historische Tondokumente im Phonogramm-Archiv und im Lautarchiv’, in: H. Bredekamp, J. Brüning and C. Weber (eds.) Theater der Natur und Kunst. Berlin: Henschel. 197-207.