Automatically Extracting Typical Syntactic Differences from Corpora

June 8, 2009

Wybo Wiersma∗, University of Groningen, [email protected]

Abstract

We develop an aggregate measure of syntactic difference for automatically finding typical syntactic differences between collections of text. With this measure it is possible to mine for statistically significant syntactic differences between, for example, the English of learners and natives, or between related dialects, and to find not only the absence or presence, but also the under- and overuse of specific constructs. We have already applied our measure to the English of Finnish immigrants in Australia to look for traces of Finnish grammar in their English. The outcomes of this detection process were analysed and found to be very useful; a report on this analysis is included in this paper. Besides explaining our method, we also go into the theory behind it, including the less well-known branch of statistics called permutation statistics, and the custom normalisations required for applying these tests to syntactic data. We also explain how to use the software we developed to apply this method to new corpora, and give some suggestions for further research.

1 Introduction

Languages are always changing and are never homogeneous or completely isolated. For example, language contact is a common phenomenon,

∗ Written by Wybo Wiersma as his BA thesis, but meant to be published with John Nerbonne (University of Groningen, [email protected]) and Timo Lauttamus (University of Oulu, [email protected]) as co-authors. They contributed substantial parts to the research design, the analysis and the paper.

and one which may even be growing due to the increased mobility of recent years. There are also differences in language usage between the various professions, classes and subcultures in society, whether or not under the influence of education and of old and new media such as newspapers, television and the internet. And lastly, there are of course differences between regional dialects, which may also be changing, merging, or branching under the influence of the above-mentioned and many other factors, including their own complex internal dynamics. But while these rich fields for investigation are being explored, we still seem to have no way of assaying the aggregate differences in language usage between various groups. For example, most of the cross-linguistic research into Second Language Acquisition (SLA) has so far focused on examining typical second-language learners' errors, such as absence of the copula, absence of prepositions, different (deviant) uses of articles, loss of inflectional endings, and deviant word order, and on making inferences about them to explain potential substrate influence (Odlin, 1989; Odlin, 1990; Odlin, 2006a; Odlin, 2006b). As Weinreich famously noted: "No easy way of measuring or characterizing the total impact of one language on another in the speech of bilinguals has been, or probably can be devised. The only possible procedure is to describe the various forms of interference and to tabulate their frequency." (Weinreich, 1968, p. 63) This paper proposes, explains and tests a computational technique for measuring the aggregate

degree of syntactic difference between two varieties of language. With it we attempt to measure the "total impact" in Weinreich's sense, and we have come a long way, albeit with respect to a single linguistic level, syntax. It may make the data of many linguistic, sociolinguistic and dialectological studies amenable to the more powerful statistical analysis currently reserved for numerical data. Naturally we wanted more than a measure which simply assigns a numerical value to the difference between two syntactic varieties; we also wanted it, like any useful statistical measure, to be able to show us whether, and if so, how significant the difference is. In addition, we wanted to be able to examine the sources of the difference, both to gain confidence in the measure and to answer linguistic questions, such as those about the relative stability or volatility of syntactic structures. Thus the technique presented not only offers a significance value for the aggregate difference, but also allows one to pinpoint the syntactic differences responsible for it.

In the second and third sections we introduce, explain and discuss our method. In the fourth we go into some practical results it has produced as applied to the English of Finnish immigrants in Australia. The fifth goes deeper into the statistical theory behind it, and contains a full mathematical description of the method. The sixth section explains how to use the software we produced for this research, and the seventh gives some suggestions for possible future research. We then conclude and give some acknowledgements.

1.1 Related Work

Thomason and Kaufmann (1988) and van Coetsem (1988) noted, nearly simultaneously, that the most radical (structural) effects in language contact situations are to be found in the language of switchers, i.e., in the language used as a second or later language. In line with this we looked at the English of immigrants in our example research. Poplack and Sankoff (1984) introduced techniques for studying lexical borrowing and its phonological effects, and Poplack, Sankoff and Miller (1988) went on to exploit these advances in order to investigate the social conditions in which contact effects flourish best. We follow Aarts and Granger (1998) most closely, who suggest focusing on tag sequences in learner corpora, just as we do. We add to their suggestion a means of measuring the aggregate difference between two varieties, and show how we can test whether that difference is statistically significant. Nathan Sanders (2007) has, in the meantime, extended our method to use the parse-tree leaf-path ancestors of Sampson (2000) instead of n-grams. These are the tags along the routes from the root tag of the parse tree up to, and not including, each word of the sentence. He used it on the British part of the ICE (International Corpus of English) corpus, which is already fully parsed. In this paper, however, we only report on the method using n-grams, but we do see his extension as promising for the technique. Baayen, van Halteren & Tweedie (1996) work with full parses on an authorship recognition task, while Hirst & Feiguina (2007) apply partial parsing in a similar study, obtaining results that allow them to distinguish a notoriously difficult author pair, the Brontë sisters. They also establish that their technique can work even for short texts (500 words and fewer), which could be an enormous advantage in practical applications of methods such as the one presented here, or even of a combination of our methods.

2 Method

The fundamental idea of the proposed method is to tag the corpus to be investigated syntactically, to create frequency vectors of n-grams (trigrams, with n = 3, for example) of POS (part-of-speech) tags, and then to compare and analyse these using a permutation test. This results in both a general measure of difference and a list of the POS-n-grams that are most responsible for the difference.

In five steps the method is:
• POS-tag two collections of comparable material
• Take n-grams (1- to 5-grams) of the POS-tags
• Compare their relative frequencies using a permutation test
• Sort the significant POS-n-grams by extent of difference
• Analyse the results

Now we will describe these steps in more detail, starting with the tagging of two comparable collections of text.

2.1 Tagging two collections of comparable material

We start with two collections of comparable material: think of two sets of interviews with people from different dialect areas, or essays from two different grades, and other similar pairs of samples. In our example research into the English of Finnish immigrants in Australia we used interviews divided between two generations of arrival in Australia: one group was aged 16 or younger when they disembarked, the other 17 or older. The interviews come from the Finnish Australian English Corpus by Greg Watson (1996), reduced to the parts that were free conversation, leaving a total of 305,000 words.

Then the material is POS-tagged; in other words, syntactic categories are assigned to all words in all the sentences. For practical reasons this is best done automatically, with a POS-tagger. There are many taggers, both statistical and rule-based (such as CLAWS, the TreeTagger, Tsujii's Tagger, and ALPINO for Dutch),1 but we use Thorsten Brants' Trigrams 'n Tags (TnT) tagger, a hidden Markov model tagger which has performed at state-of-the-art levels in organized comparisons, achieving a precision of 96.7% correct tag assignments on the material of the Penn Treebank corpus (Brants, 2000). And as our chosen tagger is a statistical tagger, we also need a tag-set and a corpus to train it on. For this we chose the British part of the ICE corpus (Nelson et al., 2002). This corpus is fully tagged and checked by hand, so it forms a sound basis. It uses the TOSCA-ICE tagset, a linguistically sensitive set designed by linguists (not computer scientists) and consisting of 270 POS tags (Garside et al., 1997).

2.2 Taking n-grams of POS-tags

Then, in order to be able to look at syntax instead of just single POS-tags, the tags are collected into n-grams, i.e. sequences of POS tags as they occur in corpora. For a sentence such as "The cat sat on the mat", the tagger assigns the following POS labels:

The/ART(def) cat/N(com,sing) sat/V(intr,past) on/PREP(ge) the/ART(def) mat/N(com,sing) ./PUNC(per)
And for a sentence such as "We'll have a roast leg of lamb tomorrow (...)" (extracted from our data), the tagger assigns the following POS labels:

We/PRON(pers,plu) 'll/AUX(modal,pres,encl) have/V(montr,infin) a/ART(indef) roast/N(com,sing) leg/N(com,sing) of/PREP(ge) lamb/N(com,sing) tomorrow/ADV(ge)
These are then collected into n-grams. The trigrams are as follows: (first sentence) ART(def)-N(com,sing)-V(intr,past), ..., ART(def)-N(com,sing)-PUNC(per), and (second sentence) PRON(pers,plu)-AUX(modal,pres,encl)-V(montr,infin), ..., ART(indef)-N(com,sing)-N(com,sing), ..., PREP(ge)-N(com,sing)-ADV(ge). In our research into the English of Finnish immigrants to Australia we used POS trigrams, and thus collected about 47,000 different kinds of trigrams (47,000 POS-trigram-types, in other words) from our corpus. For optimization reasons, and for a reason that we will come back to in section 5.3.2, we removed all trigrams that occurred five times or fewer, leaving us with a total of 8,300 POS-trigram-types.
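The collection of n-grams from a tagged sentence can be sketched in a few lines. This is our own minimal illustration, not the authors' software; the tag strings are TOSCA-ICE-style labels taken from the example above, and the helper name `pos_ngrams` is hypothetical:

```python
from collections import Counter

def pos_ngrams(tags, n=3):
    """All n-grams (as tuples) from one sentence's POS tag sequence."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Illustrative tags for "The cat sat on the mat."
tags = ["ART(def)", "N(com,sing)", "V(intr,past)",
        "PREP(ge)", "ART(def)", "N(com,sing)", "PUNC(per)"]

counts = Counter(pos_ngrams(tags, n=3))

# Over a whole corpus, rare trigram types (here: five occurrences or
# fewer, as in the paper) would be removed before further analysis:
frequent_types = {t: c for t, c in counts.items() if c > 5}
```

A seven-tag sentence yields five trigrams; run over all sentences of both sub-corpora, the `Counter` accumulates the per-type frequencies that feed the permutation test of the next step.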

2.3 Compare frequencies using a Permutation Test

The next step consists of counting how frequently each of the POS-n-grams (the 8,300 trigrams in our case) occurs in both of the data-sets, or sub-corpora. These counts result in two vectors, or in more familiar phrasing, two table rows in a 2 × 8,300 element table, with for each n-gram a column of two cells containing frequency counts, one for each group. These form the input for our statistical test. And even though there is no convenient test in classical statistics that we can apply to check whether the differences between vectors containing 8,300 elements are statistically significant, we may fortunately turn to permutation tests with a Monte Carlo technique in this situation (Good, 1995).

We will explain the test in detail (with all formulas) in section 5, but the fundamental idea of a permutation test is very simple. We first measure the difference between two sets of data in some convenient fashion, obtaining a degree of difference; let us call this the base case. We then extract two sets at random from the total of all data pooled together, which become our new sets, and we calculate the difference between these two in the same fashion. We then check whether the difference is the same as or bigger than in the base case, and if it is, we take note of this. We repeat this process ten thousand times, taking different random sets every time, and in the end we sum the total number of times the difference was at least as extreme as in the base case. This value is then divided by the ten thousand times we tried, and the outcome is, in standard hypothesis-testing terms, our p-value. It represents how many times in ten thousand (standardized to one) a difference as large as that found in the base case would have occurred by pure chance.

But before a statistical difference can be determined, a measure for the difference has to be chosen, and we have to decide what elements to permute. For various statistical reasons explained in section 5.1 we permuted speakers, using our so-called subject normalisation, and we normalised for frequency using our between-types normalisation (see section 5.3 for both). As our measure we developed RSquare, which takes the square of the difference between each normalised n-gram count for the two groups; more on this in section 5.4. For now it is important that some normalisations and a suitable measure are needed to get meaningful results from a permutation test, that we permuted authors instead of sentences or n-grams, and that the test provides us with a p-value for the overall difference between the two sets of data.

2.4 Sort POS-n-grams by extent of difference

In addition to testing the aggregate difference between the two sub-corpora, we also wanted to be able to extract the POS-n-grams that were most responsible for the difference, in order to be able to mine the corpus for the linguistic sources of the difference, to understand these better, and also to be able to test the method. We get a list of responsible POS-n-grams by looking at the vectors from the base case again, and applying a permutation test to all the individual POS-n-grams. In other words, we do 8,300 permutation tests, one for each n-gram. So for each individual n-gram the RSquare value of the base case is kept, and then in each permutation the RSquare value for the same n-gram is compared against it, counting the cases in which it is as large or larger, and using these counts to arrive at p-values, as in the original permutation test. For practical reasons we ran these tests all at the same time, by keeping track of and comparing all the n-gram RSquare values besides the aggregate RSquare value.

It might look as if there is an issue with so-called familywise errors here, because of the 8,300 tests, of which 5% (about 415) would get a p-value below 0.05 by pure chance. We cover this by requiring the aggregate difference to be significant at a p level of 0.05 first (as an omnibus test for the possible truth of the overall null hypothesis: there being no overall difference). This guards us against familywise errors under the complete null hypothesis (Westfall and Young, 1993). More on this, and possible improvements, in section 7.3.

Once we have this list of significant POS-n-grams we sift them into the groups for which they are typical. We do this for each n-gram by comparing the value found in the base case for that n-gram with the expected value for it based on the group size. If the groups are of the same size, the expected values are the same for both groups (half the corpus-wide count for the n-gram) and it is simply a case of looking at which of the base-case values is larger, but if group sizes differ, we need to use the expected values based on group size (see the third normalisation, section 5.3.3).

So we now have two lists, one for each group. Then for each group we sort the n-grams in its list by how characteristic the n-gram type is for that group. The extent of typicality can be determined both relatively, normalised for frequency, and absolutely; more on these options in sections 5.3.2 and 5.3.1. What is important here is that the n-grams are sorted in this step.

As noted, we only select and sort the significant POS-n-grams. The requirement of significance filters out any n-grams which occur only or mostly in one group, but for which this is probably due to chance. For example, if we selected n-grams for which more than, say, 80% of occurrences fall in one group, then we would also select n-grams occurring only once, purely by chance, in one of the groups (these would be 100% characteristic, but never significant). At the end of this step we thus have two lists of typical, significant POS-n-grams which can be analysed for linguistic causes of the aggregate difference between the groups. Moreover, because we sort them to bring the most characteristic n-grams to the top, we are certain to see the most relevant data first, which is especially useful when analysing the data only up to a specific cut-off point, such as the top 200 of each list.

1 List of taggers: http://www-nlp.stanford.edu/links/statnlp.html#Taggers (CLAWS and the TreeTagger are on this list); Tsujii's Tagger: http://wwwtsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/; ALPINO: http://www.let.rug.nl/vannoord/alp/Alpino/ (only for Dutch)
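The permutation scheme of sections 2.3 and 2.4 can be sketched as follows. This is a deliberately simplified version of our own making: it permutes speakers, as the method requires, but uses plain relative frequencies instead of the normalisations of section 5, and the function names are illustrative only:

```python
import random
from collections import Counter

def rsquare(group_a, group_b):
    """Sum of squared differences between the groups' relative n-gram
    frequencies; a simplified stand-in for the RSquare measure of section 5.4."""
    freq_a, freq_b = Counter(), Counter()
    for speaker in group_a:
        freq_a.update(speaker)
    for speaker in group_b:
        freq_b.update(speaker)
    tot_a = sum(freq_a.values()) or 1
    tot_b = sum(freq_b.values()) or 1
    grams = set(freq_a) | set(freq_b)
    return sum((freq_a[g] / tot_a - freq_b[g] / tot_b) ** 2 for g in grams)

def permutation_test(group_a, group_b, trials=10000, seed=1):
    """Monte Carlo permutation test over speakers (not sentences): the
    p-value is the share of random regroupings whose difference is at
    least as extreme as the observed one."""
    rng = random.Random(seed)
    base = rsquare(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        if rsquare(pooled[:len(group_a)], pooled[len(group_a):]) >= base:
            extreme += 1
    return extreme / trials
```

Each speaker is represented as a `Counter` of POS-trigram counts. Two clearly separated groups yield a small p-value, while two groups of identical composition yield a p-value of 1.0, since every permutation is at least as extreme as a base difference of zero.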

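The sifting and sorting of section 2.4 can likewise be sketched. The function below is a hypothetical illustration that ignores the significance filtering and the normalisations of section 5.3: each n-gram is assigned to the group where its count exceeds the expectation based on group size alone, and the lists are sorted by the size of that surplus:

```python
def sift_and_sort(counts_a, counts_b):
    """Assign each n-gram to the group that overuses it relative to the
    expectation from group size, sorted by extent of overuse.
    A sketch only; the paper's real normalisations are in section 5.3."""
    size_a = sum(counts_a.values())
    size_b = sum(counts_b.values())
    typical_a, typical_b = [], []
    for gram in set(counts_a) | set(counts_b):
        total = counts_a.get(gram, 0) + counts_b.get(gram, 0)
        # Expected count in group a if the n-gram were spread evenly
        # over the material of both groups:
        expected_a = total * size_a / (size_a + size_b)
        surplus = counts_a.get(gram, 0) - expected_a
        if surplus > 0:
            typical_a.append((surplus, gram))
        else:
            typical_b.append((-surplus, gram))
    typical_a.sort(reverse=True)
    typical_b.sort(reverse=True)
    return typical_a, typical_b
```

For equal-sized groups with counts {"X": 30, "Y": 10} and {"X": 10, "Y": 30}, "X" is sifted into the first group's list and "Y" into the second's, each with a surplus of 10 over its expected count of 20.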
2.5 Analyse the results

The last step consists of analysing the data. This comes down to putting the top-x n-grams in context and comparing them with various linguistic hypotheses about the aggregate difference. This obviously requires intimate knowledge of the languages and/or dialects under comparison. It is where the computation ends and the linguistic skills begin. Still, to make things easier we did develop some tools to help the analysis, e.g. a program to find examples from the corpus for given POS-n-grams. See section 6 for more on this and the other tools for applying the method.

Besides tools and linguistic skills, there are two important facts to keep in mind while analysing the results. The first is that the method will find differences only between the two sets provided for comparison. That is, if a certain POS-n-gram is over- or under-used in both groups, relative to the general population, then the method will not find it as being characteristic of either group. So when dealing with two generations of second-language learners, as we did, one has to be careful not to make claims that involve a comparison to the broader population of native speakers. Thus one either has to make the right comparison, or has to limit one's claims.

The second thing to note is that when a certain (set of) POS-n-grams is significantly characteristic of a group, this does not mean that one's explanation for its being so is significant too. That is, we do not claim that POS-n-grams can play an explanatory role in the account of language contact by themselves. A POS-n-gram's being significantly characteristic of one group or another is only a fact, and a fact that can be accounted for by multiple theories. Thus there can be more than one explanation for the over- or under-use of POS-n-grams, and different samples from the corpus can suggest different explanations. So caution is warranted here. To conclusively test hypotheses using this method one has to formulate them in advance, and only then test them using the method, preferably on multiple, representative data-sets (also see section 7.3). But all these cautions aside, the method is quite powerful, and it has proven useful at least for data mining, as the analysis we made of the English of Finnish immigrants in Australia shows. More on this in section 4.

3 Rationale

Now we will discuss the reasons for, and assumptions behind, our method, and after that some of its features and options.

3.1 Why this Method

An important feature of our method is that it can identify not only deviant syntactic uses (errors), but also the overuse and underuse of linguistic structures, whose importance is emphasized by researchers on second-language acquisition (Coseriu, 1970; Ellis, 1994; Bot et al., 2005). According to these experts it is misleading to consider only errors, as second-language learners likewise tend to overuse certain possibilities and tend to avoid (and therefore underuse) others. For example, de Bot et al. (2005) suggest that non-transparent and difficult constructions are systematically avoided even by very good second-language learners. In a similar vein, Thomason (2001, 148) argues that learners often ignore or fail to learn certain target-language distinctions that are opaque to them at early to middle stages of their learning process. And we expected to find this kind of SLA behaviour in our migrants' data.

This brings us to a second feature of the method: it can be applied to rather rough data, such as transcripts of speech, the writing of second-language learners, or texts from a different dialect, or even the intersection of these, as in our case: transcripts of the speech of Australian immigrants, using a tagger trained on British data (a different dialect). And as noted in the introduction it can be applied

to any language, using any tag-set for which there is a tagger or a tagged corpus available (TnT can be trained on any tagged corpus). We note that natural language processing work on tagging has compared different tag sets, noting primarily the obvious: that larger sets result in lower accuracy (Manning and Schütze, 1999, 372 ff.). Still, since we aimed to contribute to the study of language contact and second-language learning, we chose a linguistically sensitive set, that is, a large set designed by linguists, not computer scientists; but other choices are possible, and may lead to lower error rates or to more linguistic precision.

A third benefit of this method is that it delivers numerical data, and also provides significance levels with them. This allows one to go beyond the anecdotal and give a more rigorous grounding to dialectological or linguistic hypotheses. It can be important not only in the study of language contact, but also in the study of second-language acquisition. And it is also more generally linguistically significant, since contact effects are, for example, prominent in linguistic structure and well-recognized confounders in the task of historical reconstruction. Numerical measures of syntactic difference may enable these fields to look afresh at many issues.

It is important, however, to know that our method hinges on two assumptions, even if we think that they are reasonable and sound. We will introduce and discuss them now.

3.2 POS-n-grams show Syntax

The pivotal assumption behind our method is that POS-n-grams offer a good aggregate representation of syntax. A first objection to it could be that POS-n-grams do not reflect syntax completely and that we should thus focus on full parse-trees instead. However, a practical problem with using parse-trees is first of all that there is no syntactic equivalent of an international phonetic alphabet (International Phonetic Association, 1949), i.e. an accepted means of noting (an interesting level of) syntactic structure for which there is reasonable scientific consensus. Moreover, the ideal system would necessarily reflect the hierarchical structure of dependency found in all contemporary theories of syntax, whether indirectly based on dependencies or directly reflected in constituent structure. And since it is unlikely that researchers will take the time to hand-annotate large amounts of data, meaning we shall need automatically annotated data, we encounter a second problem with parse-trees, viz., that our parsers, the automatic data annotators capable of full annotation, are not yet robust enough for this task, especially for rougher data such as that of spoken language or non-natives (even the best score only about 90% per constituent on edited newspaper prose).

A second objection to the use of POS-n-grams could be that syntax concerns more than POS-n-grams. In response we wish to deny that this is a genuine problem for the development of a measure of difference. We note that our situation in measuring syntactic differences is similar to other situations in which effective measures have been established. For example, even though researchers in first-language acquisition are very aware that syntactic development is reflected in the number of categories, rules and/or constructions used, the degree to which principles of agreement and government are respected, the fidelity to adult word-order patterns, etc., they still largely agree that the very simple mean length of utterance (MLU) is an excellent measure of syntactic maturity (Ritchie and Bhatia, 1998). Similarly, in spite of their obvious simplicity, life expectancy, infant mortality rates and even average height are considered reliable indications of health when large populations are compared. We therefore continue, postulating that the measure we propose will correlate with syntactic differences as a whole, even if it does not measure them directly.

In fact we can be rather optimistic about using POS-n-grams, given the consensus in syntactic theory that a great deal of hierarchical structure is predictable given knowledge of the lexical categories, in particular given the lexical head. Sells (1982, §§ 2.2, 5.3, 4.1) demonstrates that this was common to theories in the 1980s (Government and Binding theory, Generalized Phrase Structure Grammar, and Lexical Functional Grammar), and the situation has changed little in the successor theories (Minimalism and Head-Driven Phrase Structure Grammar). Even though the consensus of twenty years ago has been relaxed in recognition of the autonomy of constructions (Fillmore and Kay, 1999), it is still the case that syntactic heads have a privileged status in determining a projection of syntactic structure. So it is likely that even individual POS-n-grams typical of a group of language users can give the skilled linguist at least a good indication of what the differences in the underlying syntax of their speech or writing might be.

3.3 Part of Speech Taggers are Accurate Enough

The method's second assumption is that POS-taggers are accurate enough for the POS-n-grams to reflect syntax at an aggregate level. First of all, the facts: as noted, we used Thorsten Brants' Trigrams 'n Tags tagger. We used it with the TOSCA-ICE tagset of 270 tags (Garside et al., 1997), of which 75 were never instantiated in our material. We trained the tagger on the British ICE corpus, which totals 1,000,000 words. Since our material was spoken English (see section 4), we also did some experiments with training TnT on only the spoken half of the ICE corpus, but performance was better when using the whole corpus. Even then, using the British ICE corpus was suboptimal, as the material we wished to analyze was the English of Finnish emigrants to Australia, but we were unable to acquire sufficient tagged Australian material.

We tested the TnT tagger on a sample of 1,000 words from our material which was tagged by hand. We found that the tagger was correct for 81.2% of POS-tags. The accuracy is poor compared to that on newspaper texts, but we are dealing with conversation here, including the conversation of non-natives, so it is quite good. Still, n-grams consist of multiple POS-tags, so performance is worse for larger n-grams: 56.1% for 3-grams (roughly what one would expect if tagging errors were independent, as 0.812³ ≈ 0.54), and even 39.0% for 5-grams. In addition we also performed our small test using the reduced ICE tagset, which consists of only 20 tags. For this, performance was a bit better: 86.7% for 1-grams, 66.6% for 3-grams, and 51.8% for 5-grams. See table 1 for the full set of results.

Table 1: Performance of TnT on our data. Results per n-gram size for both the full and the reduced tagsets.

N-grams   full tagset   reduced tagset
1-grams   81.2%         86.7%
2-grams   67.5%         76.2%
3-grams   56.1%         66.6%
4-grams   46.7%         58.8%
5-grams   39.0%         51.8%

So the performance of TnT is not that good for 3-grams and longer n-grams, making it arguable that our method is impractical, since about half of the 3-grams of the full tagset are erroneous. But fortunately we can fall back on a property of the statistics of large numbers here, namely that as the errors produced by automatic taggers are more or less random, they will cancel each other out, effectively annulling the influence of tagging errors. This cancelling of errors happens because in most cases we can expect the tagger to make similar mistakes in both groups, so they favor neither of the groups. A tentative analysis confirmed this: we found that the tagging errors in both groups seem to be of a similar kind, and the error rates in both groups are very similar too, as can be seen in table 2 (also see the results in section 4.6 for a further confirmation).

Table 2: Performance of TnT for the groups. Results of both groups, for the full and reduced tagsets.

N-grams   youth   adults
1-grams   81.7%   80.9%
2-grams   67.3%   67.7%
3-grams   56.0%   56.2%
4-grams   47.6%   46.2%
5-grams   40.4%   38.2%

Thus the results of the method, in the form of a p-value and the POS-n-grams responsible for the difference, will be due mostly to the correctly tagged POS-n-grams, and so the method is capable of delivering meaningful results. This is especially so when the individually detected POS-n-grams are analysed and looked up in the corpus, as in that case any tagging errors that show up in the results can be corrected by hand or left out of the analysis.

4 Tangible Results

In this section we will describe our research into the English of Finnish immigrants in Australia. We will especially examine the trigrams responsible for the difference. We have performed experiments with bigrams and with combinations of n-grams for larger n, but we restrict the discussion here to trigrams in order to simplify presentation. We mainly test their usefulness for linguistic analysis here, in order to evaluate the method.

4.1 Our Research

As noted, we applied the procedure described in section 2 to the Finnish Australian English Corpus,

a corpus of interviews compiled in 1994 by Greg Watson of the University of Joensuu, Finland (Watson, 1995; Watson, 1996). The informants were all Finnish emigrants to Australia, and they are classified into two groups in this report: (1) the ADULTS (group 'a', the adult immigrants), who were 18 or older upon arrival in Australia; and (2) the JUVENILES (group 'j', the juvenile immigrant children of these adults), who were born in Finland and were all under the age of 18 when they disembarked.

The goal of this research was to use our method to detect the linguistic sources of the syntactic variation between the two groups, and we examined the degree of what we call syntactic 'contamination' in the English of the adult speakers. We wanted to detect the linguistic sources of the variation between the two groups and interpret the findings from (at least) two perspectives: universal vs. contact influence. In our reading, the notion of 'universal' is concerned not with hypotheses about Chomskyan universals and their applicability to SLA, or with hypotheses about potential processing constraints in any detail, but rather with more general properties of the language faculty and natural tendencies in the grammar, called 'vernacular primitives' by Chambers (2003, 265-266). To explain differential usage by the two groups, we also draw upon the strategies, processes and developmental patterns that second-language learners usually evince in their interlanguage regardless of their mother tongue (Faerch and Kasper, 1983; Larsen-Freeman and Long, 1991; Ellis, 1994; Thomason, 2001). In addition to looking for 'universal' and contact influences, we also distinguish between adult immigrants and immigrant children, based on Lenneberg's (1967) critical age hypothesis, which suggests a possible biological explanation for successful second-language (L2) acquisition between age two and puberty, and in any case predicts a difference in the expected L2 proficiency between the juveniles and the adults. Thus we also tested our idea about measuring syntactic differences by looking for such a difference. The issue is not remarkable, but it allowed us to verify whether the measure is functioning.

4.2 The Corpus and the Interviewees

Greg Watson interviewed the subjects between 1995 and 1998. They were farmers and skilled or semi-skilled working-class Finns who had left Finland in the 1960s at the age of 25-40, some with children. The average age of the juveniles was 6 at the time of arrival and 36 at the time of the interview, as opposed to the adults, who were, on average, 30 at the time of arrival and 59 at the time of the interview. The genders are almost equally represented in the two groups. There are sixty-one conversations with adults and twenty-seven with juveniles, each lasting approximately 65 to 70 minutes. The interviews were transcribed in regular orthography by trained language students and later checked by the compiler of the corpus. We used only those sections of the interviews which consist of relatively free conversation, i.e. a total of 305,000 word tokens. Of these, 84,000 were by the juveniles and 221,000 by the adults. Neither group of speakers was formally tested for their proficiency in English. By observing the data, however, we can confidently say that the average level of the adults' English is considerably lower than that of the juveniles, who received their school education in Australia, as opposed to the adults, who were educated in Finland. The sentences of the juveniles were also substantially longer (an MLU of 27.1 tokens) than those of the adults (16.3 tokens), confirming a difference in their language skills.

The rest of this description is more hypothetical, and based on a similar account of a number of immigrant groups of European origin in the United States (Lauttamus and Hirvonen, 1995). It is, however, supported by the biographical interview data elicited from our informants in Australia. The adult immigrants typically went on speaking Finnish at home and carried on most of their social life in that language. They struggled to learn English, with varying success. Even the best learners usually retained a distinct foreign accent and some other foreign features in their English. But we can say that they were marginally bilingual, as most of them could communicate successfully in English in at least some situations, although their immigrant language was clearly dominant. The immigrant parents also spoke their native language to their children (the juveniles), so this

Automatically Extracting Typical Syntactic Differences from Corpora

‘learner language’ or interlanguage (Pietil¨a, 1989; Hirvonen, 1988). Thus it is expected that the English spoken by first-generation Finnish Australians is affected only to a lesser extent in its morphosyntax. But as this research is only about morphosyntax, we will not go further into vocabulary and phonology here. Also we expect that the adults show more contact-induced effects in their speech than the juveniles, since the former only temporarily shift to English, as in the case of an interview, whereas the latter have already shifted to English as their language of everyday communication.

generation usually learned the ethnic tongue as their first language. The oldest child might not have learned any English until he or she went to school. The younger children often started learning English earlier, from their older siblings and friends. At any rate, during their teen years the second-generation children became more or less fluent bilinguals. Their bilingualism was usually English-dominant: they preferred to speak English to each other, and it was sometimes difficult to detect any foreign features at all in their English. As they grew older and moved out of the Finnish communities, their immigrant language started to deteriorate from lack of regular reinforcement. Some give up speaking their ethnic language altogether. In the second generation it was still common to marry within the ethnic group. But even if the spouses shared an ethnic heritage, they were usually not comfortable enough in the ethnic language to use it for everyday communication with each other. Therefore they did not speak it to their children, either. For an ethnic language to stay alive as a viable means of communication for a longer time than two generations would require a continuous influx of new immigrants but, as (Watson, 1995, 229) points out, since the late 1970’s onwards immigration to Australia has been quite restricted, to all nationalities. 4.3

June 8, 2009

In line with this we are also led to assume here that the children of Finnish Australian immigrants represent a case of shift without interference similar to that of ‘urban immigrant groups of European origin in the United States’ (Thomason and Kaufmann, 1988, 120). The immigrants maintain their own ethnic languages for the first generation, while their children (and grandchildren) shift to the English of the community as a whole with hardly any interference from the original languages. Thus even members of the first generation of immigrants (adults) demonstrate a variety of achievements, including native-like ability, and members of the second generation (juveniles) speak natively (Piller, 2002), and language attrition does not wait till the third generation but begins with the first generation (Waas, 1996; Schmid, 2002; Schmid, 2004; Cook, 2003; Jarvis, 2003).

The Language Contact Situation

From a typological point of view, the following generalisations can be made. The basic pattern is one of maintenance of the ethnic language by the Finnish immigrant generation (adults) and the subsequent shift from Finnish to English by the second generation (juveniles). The first generation is characterised as ‘marginally bilingual’ and in contrast with the fluent bilinguals of the second generation, they can also be regarded as non-fluent bilinguals, or as L2 (second language, English) learners with some degree of L1 (Finnish) interference. Here Finnish is linguistically dominant over English, whereas English is socially dominant over Finnish. It is expected that this interference from their Finnish begins with sounds (phonology) and morphosyntax, not with vocabulary. We expect this because in the English spoken by first-generation Finnish Americans phonological and morphosyntactic patterns also often deviate from standard (American) English in the manner typical of

In addition Clyne & Kipp (2006) also note that the language groups ‘with a very low shift to English’ are those that have recently arrived from Southeast and East Asia and Africa, groups with different ‘core values such as religion, historical consciousness, and family cohesion’. The highshift groups tend to be ones ‘for whom there is not a big cultural distance from Anglo-Australians’. Although there is no mention of Finnish Australians in Clyne & Kipp, we argue that they belong to the high-shift group. Consequently, we expect to find most of the evidence for syntactic contamination in the English of first-generation Finnish Australians, as the second generation may have already shifted to English without any interference from Finnish. 9

Automatically Extracting Typical Syntactic Differences from Corpora

4.4 Significance and Trigrams

After applying the method we found that the groups clearly differed in the distribution of POS trigrams they contain (p < 0.0001). This means that the difference between the original two sub-corpora was in the largest 0.01% of the Monte Carlo permutations, and thus highly significant. We also find genuinely deviant syntax patterns if we inspect the individual trigrams responsible for the difference between the two sub-corpora. As noted in section 2.5, only differences between the groups can be found using our method, and thus we have no evidence of potential contamination of the juveniles’ L2 acquisition relative to native speakers. Therefore we cannot yet prove or refute our hypothesis about the strength of contact influence as opposed to that of the other factors. Nevertheless we can still deduce much from the POS trigrams that contributed to the statistically significant differences that we did find in the data. It should be remembered that we only used the POS trigrams that were found to be statistically significant and typical (individual α levels of 0.05). We limited our analysis to a practically random selection of 137 trigrams from the 308 trigrams found to be significant and typical for the adult speakers.² They were sorted by their relative typicality (see section 5.3.2 for the normalisation) and analysed using 20 random sentences from the corpus which contained them. The findings can be described and listed as follows:

1. Overuse of hesitation phenomena (pauses, filled pauses, repeats, false starts etc.) and parataxis (particularly with ‘and’ and ‘but’, not only at the phrasal level but also at the clausal level) by the adults as opposed to the juveniles. These arise from difficulties in speech processing in general and lexical access in particular. Their speech is less fluent.

2. The adults demonstrate overuse (and underuse) of the indefinite and definite articles. The first-generation speakers supply redundant definite articles, as in ‘in the Finland’. It appears that hypercorrection (or overgeneralisation), common in learner language, may be a contributing factor, as Finnish has no article system.

3. Omission of the primary (copula) ‘be’ and of the primary verb ‘be’ in the progressive (present and past) is frequent in the adults as opposed to the juveniles. Learners who rely on auditory input, such as the less educated adults, leave out unstressed elements because they do not perceive them and, consequently, cannot process them.

4. Underuse of the existential (expletive) ‘there’ and the anaphoric ‘it’ in subject position is frequently attested in the English of the adults, for example in ‘often wine if is some visitors’, where the speaker is aiming at ‘...if there are some visitors’. This can be explained in terms of substratum influence from Finnish, which would assign the subject argument of the copula verb ‘be’ to the NP ‘some visitors’ and, consequently, would not mark the subject in the position before the copula.

5. The adults produce utterances where the negator ‘not’ is placed in pre-verbal position. This can be ascribed to Finnish substratum transfer (shift-induced interference), because Finnish always has the negative item (ei, inflected for person and number like any verb in standard Finnish) in pre-verbal position. Nevertheless, examples can be found in other non-standard varieties of English, e.g. in Spanish interlanguage English (Odlin, 1989, 1040).

6. The adults misuse the pronoun ‘what’ as a relative pronoun or complementiser. This can be ascribed to substratum transfer from Finnish, where the plural forms of ‘what’ are used as interrogative or relative pronouns. We argue that this usage of ‘what’ as a relative is overgeneralised on the model of Finnish, even though we know that it is found in some substandard English.

7. The adults extend the simple present (as opposed to the past tense and the progressive) to describe not only present but also past or future events. This can be partly ascribed to Finnish substratum transfer, as Finnish has no equivalent progressive forms or ways to express future events in its repertoire and mainly uses the simple present in these functions. Pietilä (1989, 176) also reports on this in the English of first-generation Finnish Americans.

² Originally it was the top 176, obtained using our earlier, slightly less robust normalisation.

8. There is also evidence that the adults have acquired some formulae such as ‘that’s’ and ‘what’s’, as in ‘that’s is a’. Acquired as fixed phrases, they have apparently been processed as single elements. Learners often produce formulae or ready-made chunks as their initial utterances. Acquired formulae cannot be ascribed to substratum transfer, as they tend to be recurrent in any interlanguage.

Table 3 below lists the number of trigrams supporting each of the above claims:

Table 3: Trigram counts for the eight findings. The index number corresponds with the numbered findings list in the text. The ‘short name’ is a descriptive name that corresponds to the full table of trigrams at the end of the paper. ‘Count’ is the number of occurrences in the top. The ‘[useless]’ label refers to the trigrams that were not fruitful for analysis.

Index  Short name     Count
0      [useless]       24
1      Disfluent       90
2      Articles        39
3      No-Be            1
4      No-There         3
5      Pre-Negator      3
6      Relative-What    3
7      Present          3
8      Formulae         5

It should be noted that we have the most evidence for disfluency and the over- and underuse of articles, though the other findings are also well supported. In addition, because this method also detects over- and underuse, the notion of avoidance does not imply total absence of a feature in either group.

4.5 Qualitative Analysis

In the English of the adult speakers of Finnish emigrants to Australia we have detected a number of features that we describe as ‘contaminating’ the interlanguage, and that can be attributed to Finnish substratum transfer. These features include overuse (and underuse) of articles (2), omission of the existential ‘there’ (4), use of the negator ‘not’ in pre-verbal position (5), misuse of ‘what’ as a relative or complementiser (6), and overuse of the simple present to describe past and future events (7).

These findings are supported by Ellis (1994, 304-306), who argues that avoidance (‘underrepresentation’) and overuse (‘overindulgence’) can result from transfer: learners avoid using non-transparent L2 constructions, underusing them, and overuse simpler alternatives. Accordingly, the overuse of parataxis coupled with hesitation phenomena by the adults (1), for example, may partly result from their inability or unwillingness to produce subordinate constructions (hypotaxis).

On the other hand, what makes our argument for the role of contact somewhat less convincing is that many of these ‘deviant’ features might also be ascribed to more ‘universal’ properties of the language faculty, such as the hesitation phenomena (1), the absence of the copula ‘be’ (3) and the acquired formulae (8), while others may be explained in terms of both factors (such as 5, 6 and 7).

However, our findings support other empirical evidence, reviewed, for example, by Larsen-Freeman and Long (1991, 96-113) and Ellis (1994, 299-345), that shows how the learner’s L1 influences the course of L2 development at all levels of language, although transfer seems to be more conspicuous in phonology, lexis and discourse than in morphosyntax. See our earlier paper (Lauttamus et al., 2007) for a more extensive qualitative analysis.

On the basis of the analysis so far, we contend that the statistical evidence obtained from our data indeed reflects syntactic distance between the two varieties of English and, consequently, aggregate effects of the difference in the two groups’ English proficiency. We also confirm that the adults are (still) showing features of ‘learner’ language, and therefore those of ‘temporary’ or ‘partial’ shift, whereas the juveniles seem to show those of shift without interference. These findings are in accordance with Lenneberg’s (1967) critical age hypothesis, and with the many others who point out that age (of onset) is a crucial determinant of successful L2 acquisition.
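The per-trigram selection procedure described in section 4.4 can be sketched in a few lines: for each trigram type, the observed between-group difference is compared with the differences obtained under random re-divisions of the speakers. The code below is our own simplified illustration, not the software released with this research; the group sizes, fraction vectors and function names are toy stand-ins.

```python
import random

def trigram_p_values(group_a, group_b, n_permutations=1000, seed=1):
    """For each trigram index, estimate how often a random re-division of
    the speakers yields a between-group difference at least as large as
    the observed one (a one-sided Monte Carlo p-value)."""
    random.seed(seed)
    n_types = len(group_a[0])
    size_a = len(group_a)
    pooled = group_a + group_b

    def per_type_diff(a, b):
        # absolute difference of the per-group mean fractions, per trigram type
        return [abs(sum(v[i] for v in a) / len(a) - sum(v[i] for v in b) / len(b))
                for i in range(n_types)]

    observed = per_type_diff(group_a, group_b)
    at_least_as_extreme = [0] * n_types
    for _ in range(n_permutations):
        random.shuffle(pooled)
        diff = per_type_diff(pooled[:size_a], pooled[size_a:])
        for i in range(n_types):
            if diff[i] >= observed[i]:
                at_least_as_extreme[i] += 1
    return [c / n_permutations for c in at_least_as_extreme]

# toy example: speakers as fraction vectors over three trigram types;
# the groups differ sharply on types 0 and 1, and not at all on type 2
adults    = [[0.6, 0.2, 0.2], [0.7, 0.1, 0.2], [0.65, 0.15, 0.2], [0.6, 0.2, 0.2]]
juveniles = [[0.2, 0.6, 0.2], [0.1, 0.7, 0.2], [0.15, 0.65, 0.2], [0.2, 0.6, 0.2]]
p = trigram_p_values(adults, juveniles)
significant = [i for i, pv in enumerate(p) if pv <= 0.05]
```

Note that, as in the real method, the speakers (not the n-grams or sentences) are the units being permutated; section 5.2 explains why.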

4.6 Quantitative Analysis

Now we come to a more quantitative analysis, in which we examine the consequences of sorting the n-grams, and check the performance of TnT on the top n-grams to see if errors are indeed filtered out as noise. N-grams can be sorted in two ways, namely by absolute and by relative typicality. The first normalisation (the between-subjects normalisation, section 5.3.1) produces absolute data in the sense that the total number of occurrences of a trigram still plays a role, so more frequent trigrams end up at the top. The second normalisation (the between-types normalisation, section 5.3.2) normalises for frequency, and thus produces relative data, moving the most typical trigrams to the top.

We start with a scatterplot (figure 1) of the distributions produced by sorting the typical and significant trigrams by the two normalisations. Please note again that only the significant trigrams are sorted, and that exactly the same trigrams are sorted (but differently) on both lists. As can be seen in the scatterplot, the more frequent trigrams are generally less typical, and vice versa. Thus the two ways of normalising really do produce differently ordered lists of trigrams. To an extent this distribution is to be expected, as it is less likely that something very frequent is confined to only one group. Constructs that are frequent in a language are heard or seen more often, and are in most cases easier, and thus more likely to be overused, rather than erroneously being made up by many learners at the same time (and becoming frequent that way). So frequent trigrams are — as expected — in many cases less typical.

Our second graph (figure 2) lists the cumulative percentage of useful trigrams found at each rank position. This graph is much like the recall graphs shown for Information Retrieval systems. For example, for the ‘absolute’ data, after the highest-ranking 120 trigrams have been looked at, approximately 40% of the interesting ones in the total set of 308 have been seen. What it shows is slightly surprising: the absolute data, that is, the data normalised with just the between-subjects normalisation (the black line), is better than the relative data (normalised for frequency, the grey line). The difference is modest for the whole collection of useful trigrams, but quite a bit bigger for the useful trigrams when discarding those that have to do with hesitation phenomena (the dotted lines, ‘> disfluency’). When discarding the first two categories (both disfluency and articles), however, the results are more alike again. This suggests two things, namely that the overuse of articles occurs in frequent, but not very typical, trigrams, and that the same, to a lesser extent, might be true for hesitations. And while not conclusively proven, it might be a good idea to look at the absolute data first in later research, especially if many n-grams are found or when time constraints are strict.

Lastly, we also hand-checked the performance of TnT on the 20 examples found for each of the analysed trigrams from the top-308 list. While doing this we took into consideration a bandwidth of three words on each side of the trigram, so the context was taken into account. The results are very promising: compared to the performance of TnT on the whole corpus, performance on the significant n-grams is up by almost 20 percentage points, 76.3% as opposed to the overall 56.5%. This means that the errors introduced by the tagger are indeed to a great extent random noise that is filtered out by our method, confirming what we argued in section 3.3. Note also that 76.3% very closely approaches the ceiling of TnT’s 81.2% performance on 1-grams. This means both that the method will likely do even better on cleaner data, and that even though we might still be missing something due to systematic tagging errors, it is unlikely to be much, or to cause false overall significance.

Thus the evaluation of our method via our experiment on the corpus containing the English of Finnish emigrants to Australia is promising, in that the method works well in distinguishing two different groups of speakers and in highlighting relevant syntactic deviations between the two groups, all while providing a precision that can be called very reasonable.
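The cumulative percentage of useful trigrams at each rank position, as plotted in figure 2, is straightforward to compute from a ranked list. A minimal sketch (our own illustration; the trigram names and usefulness labels below are hypothetical):

```python
def cumulative_recall(ranked_trigrams, useful):
    """Given trigrams in rank order and the set judged useful, return the
    percentage of all useful trigrams seen after each rank position."""
    total = len(useful)
    recall, seen = [], 0
    for trigram in ranked_trigrams:
        if trigram in useful:
            seen += 1
        recall.append(100.0 * seen / total)
    return recall

ranked = ["DT-NN-IN", "UH-UH-CC", "VB-DT-NN", "PRP-VB-RB"]  # hypothetical ranking
useful = {"UH-UH-CC", "PRP-VB-RB"}                          # hypothetical labels
curve = cumulative_recall(ranked, useful)   # -> [0.0, 50.0, 50.0, 100.0]
```

Computing one such curve per ranking (absolute and relative) gives exactly the two lines compared in the graph.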

Figure 1: Distribution of top trigrams, with absolute values on the X-axis (only between-subjects normalised) and relative values on the Y-axis (normalised for frequency, between-types normalised).

Figure 2: Recall of useful trigrams, both for the absolute data (only between-subjects normalised) and the relative data (normalised for frequency, between-types normalised).

5 Statistical Theory

The test we need has to be able to check whether the differences between vectors containing 8,300 elements are statistically significant, and how significant the individual differences are. It also has to be a method that will not find significance for tiny differences due to the large number of variables alone. Fortunately there are permutation tests, which can fulfill these requirements when used wisely (Good, 1995). Kessler (2001) contains an informal introduction for an application within linguistics.

5.1 Permutation Tests

As noted, the fundamental idea behind permutation tests with Monte Carlo permutation is very simple, and its formulas are relatively simple too: we measure the difference between the two original sets in some convenient fashion, obtaining δ(A, B). We then apply the Monte Carlo permutations (random permutations), which means extracting two sets of the same size as the base sets at random from A ∪ B. We call these two random sets A1, B1, and we calculate the difference between these two in the same fashion, δ(A1, B1), recording whether δ(A1, B1) >= δ(A, B), i.e., whether the difference between the two randomly selected subsets of the entire set of observations is as extreme as, or even more extreme than, the original. If we repeat this process of permutation, say, 10,000 times (including the one for the original sets), then counting the number of times n we obtain differences at least as extreme as in the base case allows us to calculate how strongly the original two sets differ from a chance division with respect to δ: if the two sets were not genuinely different, the original division into A and B was likely to the degree of p = n/10,000. Put into more standard hypothesis-testing terms, p is the p-value, i.e. the probability of seeing the difference between the two sets if there is no difference in the populations they represent. Based on this value we may reject (or retain) the null hypothesis that there is no difference between the two sets.

Permutation tests are quite different from parametric tests such as the (M)ANOVA and the t-test in that they make fewer assumptions and are applicable to more kinds of data, while being just as sensitive as, or even more sensitive than, their parametric counterparts. They do not require the data to be normally distributed (to follow a bell curve, or any known probability distribution for that matter), to be homoscedastic (to have regular and finite variance), nor to be collected from subjects that were randomly selected (selected by random sampling). And this is good news, as most linguistic research violates at least one or two of these requirements. Permutation tests lack these three demands because they are all related to being able to make inferences about the population, and permutation tests leave out any assumptions about the relationship between the data under scrutiny and the population. Permutation tests just tell how likely it is that two data sets would differ by chance.

Another reason for obviating the first two of the three requirements of parametric tests is that permutation tests implicitly generate their own significance tables. That is, if one were to plot the difference values obtained for all 10,000 permutations, one would get the probability distribution for the data as it is used at that time. This is also what makes them more sensitive than parametric tests such as the t-test: with those, the distribution and its matching significance table are always a rough estimation, while with permutation tests they are exactly tailored to the data. Kempthorne and Fisher remark about this: “conclusions [of the ANOVA] have no justification beyond the fact that they agree with those which could have been arrived at by this elementary method [(the permutation test)]” (Edington, 1987, 11). So the permutation test is the truly correct one, and the corresponding parametric test is valid only to the extent that it results in the same statistical decision. The only downside of permutation tests is that they require quite some computation, but this has become much less of a problem now that fast computers are abundant and cheap. Permutation tests are also easier to understand and more transparent than parametric tests: to most non-mathematicians the three requirements and the significance tables that come with most parametric tests remain a bit opaque, if not magical, while every step of a permutation test can be fully understood and followed by laymen.
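The Monte Carlo procedure just described can be written down in a few lines. The following is our own illustrative sketch, not the software released with this research, and the toy δ used here (a difference of group means over scalar observations) is only a stand-in for the vector-valued measure defined in section 5.3.

```python
import random

def permutation_test(a, b, delta, n=10000, seed=0):
    """Monte Carlo permutation test: the p-value is the share of random
    re-divisions of a+b (the original division included) whose difference
    is at least as extreme as the observed delta(a, b)."""
    rng = random.Random(seed)
    observed = delta(a, b)
    pooled = list(a) + list(b)
    extreme = 1                      # the original division counts as one permutation
    for _ in range(n - 1):
        rng.shuffle(pooled)
        if delta(pooled[:len(a)], pooled[len(a):]) >= observed:
            extreme += 1
    return extreme / n

# toy delta: absolute difference of group means, for scalar observations
mean_diff = lambda a, b: abs(sum(a) / len(a) - sum(b) / len(b))
p = permutation_test([1, 2, 3], [1, 2, 4], mean_diff, n=1000)
```

Any difference measure δ can be plugged in unchanged; only the permutated elements must satisfy the exchangeability requirement discussed in section 5.2.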

5.2 Exchangeability and Relevance

Permutation tests have only two real requirements, which are relatively straightforward. First, one needs to be alert to exactly what relationships between the data sets are being measured by one’s test and its permutations. What is important here is the exchangeability of elements under the null hypothesis, or in other words that there is no interdependence between the elements that are being permutated. We can best illustrate this by explaining our decision to measure the difference in syntax by permutating the speakers in the two groups, and not the n-grams or the sentences, as we did previously (Nerbonne and Wiersma, 2006). If we had permutated separate n-grams, then we might — besides the differences between the groups — also be measuring the syntactic relationships between (overlapping) n-grams within sentences. And when permutating the sentences of multiple different speakers across two groups, we could also be measuring differences between the personal syntactic styles of the subjects, instead of just those differences that are caused by their being members of the two groups. So only the speakers can be considered truly independent and exchangeable. More formally, ‘independent’ here means that there is no higher-than-chance probability relationship among the elements (in statistical terms there should be no autocorrelation). For example, between the leaves of a single plant there is an extremely high probability that they are of a similar type, as they were all formed by the same plant. Thus when we have plants of different species standing on each end of our desk, it is not a good idea to snap off their leaves and permutate them when trying to find out whether their differences in leaf colour could have been caused by the positions of their pots, or by chance, because the leaves of each plant were (mostly) formed by their plant’s genes, not by the position of its pot. This example also allows us to stress the important role the null hypothesis and the alternative hypothesis play in determining what interdependence is and is not allowed. If the plant-mutilation example were not about the location of the desk plants, but about finding out whether the colour of their leaves depends on their species (a different alternative hypothesis, with as null hypothesis leaves of various types growing randomly on the plants), then it would be perfectly fine to permute their leaves and find a zero p-value accordingly. One measures something other than, and often a lot more than, one thinks (genetic relation instead of pot position in the example), and this is why permutating elements that are not exchangeable under the null hypothesis leads to false significance. And as noted, interdependence between elements suggested by the alternative hypothesis (e.g. genetic relationship, when that is what is being tested for) is no problem.³

Secondly, there is an important difference between statistical significance and effect size, or even more broadly, relevance, which — while important with any statistical test — is often overlooked. A statistically significant result is not necessarily a relevant result. Significance measures the likelihood of a difference being due to chance. Thus very small but consistent differences can always be made significant by increasing the size of the dataset: if a difference is consistent over a large dataset, it is likely not a coincidence, even if the difference is very small. But on the other hand this very small difference might not be very relevant. For example, if body height were related to presentation skills, but only when averaged over millions of people, with quite some variation in between, then facing a tall presenter, or even having a class of tall pupils presenting, would not tell you very much about the quality of the upcoming presentation(s). Again, the problem in this example is not that the relationship is not significant; it is just not big enough, and thus not relevant, as in not informative for the audience of a presentation. So even though sheer corpus size should not lead to higher-than-warranted estimates of statistical significance when using permutation tests, such as can happen with Chi-square and similar classical tests, having a bigger corpus will still make it easier to find significance, as is warranted by having more data (Agresti, 1996; Edington, 1987).⁴ So the corpus that is used should not be too big, and one should never apply the method without keeping an eye on the possible relevance of the results for the research questions at hand. In short: when used wisely, permutation tests are a very suitable tool, also for finding significant syntactic differences and the POS-n-grams that contribute to such a difference.
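The point about sample size and relevance is easy to demonstrate with a toy simulation (our own example, not drawn from the corpus): a fixed, small mean difference of 0.1 is nowhere near significant for two groups of 6 observations, but becomes clearly significant for two groups of 1200, even though the effect is exactly the same size.

```python
import random

def mean_diff_p(a, b, n_perm=1000, seed=0):
    """One-sided Monte Carlo p-value for the absolute difference of means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = a + b
    extreme = 1                      # the original division counts as one permutation
    for _ in range(n_perm - 1):
        rng.shuffle(pooled)
        if abs(sum(pooled[:len(a)]) / len(a) - sum(pooled[len(a):]) / len(b)) >= observed:
            extreme += 1
    return extreme / n_perm

base = [-1.0, 0.0, 1.0]                                    # within-group variation
small_a, small_b = base * 2, [x + 0.1 for x in base * 2]   # 6 observations each
large_a, large_b = base * 400, [x + 0.1 for x in base * 400]   # 1200 each
p_small = mean_diff_p(small_a, small_b)
p_large = mean_diff_p(large_a, large_b)
# the effect (a mean difference of 0.1) is identical in both comparisons,
# but only the large sample makes it statistically significant
```

The large-sample result is significant yet may still be irrelevant for a given research question; the p-value alone does not decide that.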

5.3 Normalisations

Each measurement of difference that is part of the permutation test — whether the difference is between the original two samples or between two samples which arise through permutations — is applied to the collection of POS-n-gram frequencies once these have been normalised.

We first describe the normalisation that is required. This normalisation is called the BETWEEN-SUBJECTS NORMALISATION. It excludes a whole range of factors that could otherwise violate the requirements of permutation tests, and it is thus quite conservative and robust. It normalises for differences in the size of texts, instead of for variations in sentence length, like the IN-PERMUTATION NORMALISATION described in our previous paper on the method (Nerbonne and Wiersma, 2006), which was both more complex and a bit less conservative.

The second normalisation we used is called the BETWEEN-TYPES NORMALISATION, and it normalises for differences in frequency between n-gram types. Using it is optional, and its purpose is mainly the elimination of frequency as a factor, allowing one to detect significance even if the typical n-grams are the less frequent ones. It too differs from the normalisation that we used for the same purpose before (the BETWEEN-PERMUTATIONS NORMALISATION): the between-types normalisation is much simpler to calculate on the output of the between-subjects normalisation, and part of its function is now delegated to a third normalisation.

The BETWEEN-GROUPS NORMALISATION is that third normalisation, and it has a really different function from the others, as its output is not used as input for a measure or a permutation test, but is solely meant for detecting for which of the two groups a given n-gram is typical. So it can only be used after individual n-grams have been selected by using one of the other normalisations. It normalises for group size, so it can detect what constitutes over- or underuse, even if the groups are of different sizes.

³ The requirement of exchangeability under the null hypothesis is also known as the requirement of random assignment in some of the literature.
⁴ In our 2006 paper we formulated this a bit less clearly and conservatively. In fact we thought then that corpus size could be ignored, which is wrong.

not change when authors are permutated between them, so the requirement for constant group-sizes is not violated. The normalisation functions, in short, by correcting for the size of the text by the speaker as measured by the number of POSn-grams inside it. It has the possible disadvantage that it is not always possible, easy or feasible to split up a corpus according to authorship. For example some texts, such as those in the Bible, have multiple or contested authors. Solutions can be found for this, such as splitting up the text into more or less random sizeable chunks (for example chapters), and then treating them as belonging to different authors. Still one should be aware of the extra things that might be measured because of such complications. For the between-subjects normalisation we first collect from the tagger a sequence of counts ci of POS-tag-n-grams (index i) for each subject (c). Giving us one vector (cs ) per subject.

5.3.1

Then we use the fraction-vectors in this list as the elements we permutate in our permutation test, instead of n-grams or sentences, effectively permutating subject speakers. We permutate these subjects between two groups, which we shall refer to as juveniles (‘j’) and adults (‘a’) as we did in our example research. Each permutation results in two lists — one for each group — containing the fraction-vectors of the subjects ending up in that group (f sj for a subject-vector in the juveniles

cs = < cs1 , cs2 , ..., csn > After that we normalise for total number of POS-n-grams produced by a given subject. We do this by calculating the sum of the subject’s ngram-counts, and then dividing each of his individual n-gram counts by it. This gives us an ngram fraction vector (f s ) for the subject. Ns =

Pn

s i=1 ci

f s = < ..., fis (= csi /N s ), ... >

Pn

s i=1 fi

=1

After we have done this for all the subjects, we have a list (F) that contains vectors, namely the fraction-vectors of all subjects. F = < f s 1 , f2s , ..., fns >

The Between-Subjects Normalisation

The between-subjects normalisation is, as noted, applied to texts belonging to single authors or speakers. Its purpose is to make the collection of POS-n-grams that make up the texts of a speaker comparable to that of other speakers so that no one subject has more influence on the groups vector than any other. This is also important for permutating, as it ensures that the sizes of the groups in terms of n-grams counts will 17

Automatically Extracting Typical Syntactic Differences from Corpora

June 8, 2009

group). So for each permutation, including the base-case, we produce two lists of fraction-vectors (Fj and Fa ).

The inner part of the formula, the total count for each n-gram (si ), can be calculated for the whole vector (s) as follows:

Fj = < f sj 1 , f2sj , ..., fnsj > Fa = < f sa 1 , f2sa , ..., fnsa >

s = < ..., si (= sji + sai ), ... > N-grams with large summed authors’ fraction counts are those with high frequencies in the original sub-corpora. And the normalisation under discussion strips away the role of frequency and thus allows us to find significance if there are typical n-grams, even if they are less frequent. The values thus normalised will be 0.5 on average when the groups are of the same size. They will be skewed in the direction of the larger group when one of the groups is bigger than the other. Therefore the output of this normalisation cannot directly be used to determine if there is under- or over-use (as 0.3 can be over-use for a small group, and extreme under-use for a bigger one). It is primarily meant for determining the overall (corpus-wide) p-value without regard to frequency. It has no influence on the p values reported for individual n-grams, because it is only a linear transformation when looked at per n-gram. In addition we do use it for sorting the n-grams that were found to be significant, as it will cause infrequent, but typical n-grams to move to the top (see section 2.4). The normalisation is applied to the base-case and all subsequent permutations, and the pair of arrays it produces (tj and ta ) is ready to be used as input for a measure.

As the last step of this normalisation for each group we sum the fraction-vectors of all subjects that ended up in that group (summing the fractions per n-gram for different authors in the group, not across different n-grams), giving us a vector holding per-n-gram sums for each group: sj and sa (note this last step is also done for for all permutations). n sj = Fji Pi=1 n a a s = i=1 Fi

P

The pair of single arrays it produces at each permutation (sj and sa ) can, besides being already directly usable when no further normalisation is done for frequency, also be used as input for the second normalisation, the between-types normalisation. 5.3.2 The Between-Types Normalisation The purpose of the between-types normalisation is the removal of the influence of absolute frequencies in n-gram counts. This is useful because as can be seen in the graph for trigrams (figure 3), the frequency of POS-n-gram types follows Zipfs law, and thus a few very frequent (relatively uninteresting) n-grams could have too much influence on the reported significance of the difference between the two groups. So this normalisation allows one to find significance if there are enough n-gram types that are typical for the two original sub-corpora, regardless of their frequency. It is similar to the second step of the subject normalisation (the fractions per author), except that the linear transformation is applied over the total count for each n-gram type, instead of over all n-grams of each author: for each POS-n-gram type i in each group (sub-corpus) g ∈ {a, j}, the summed authors fraction count of the group sgi is divided by the total count (both as provided by the subject normalisation) for that type si . The outer part of the formula for tg is:

5.3.3 The Between-Groups Normalisation The between-groups normalisation is applied to the output of the first normalisation (the betweensubjects normalisation) and it is meant for detecting whether an n-gram is being over-used in one group or the other. The average normalised value will be 1, with cases of over-use in the group having a normalised value of bigger than 1, and cases of under-use having a value smaller than 1. It is calculated by dividing the found per-n-gram count by the average value for that n-gram in that group. It works because the average value for each of these n-grams across permutations is equal to the expected value for a group of that size under the null-hypothesis. More formally it is arrived at in the following way: for each POS-n-gram i in each group (subcorpus) g ∈ {a, j}, the summed authors fraction

tj = < ..., tji (= sji /si ), ... > ta = < ..., tai (= sai /si ), ... > 18
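The between-subjects step, and the between-types step built on top of it, can be sketched in a few lines of Python. This is an illustrative re-implementation, not the Perl code of the ComLinToo; the POS-trigram labels and toy counts are invented.

```python
from collections import Counter

def subject_fractions(counts):
    # Between-subjects normalisation for one speaker: divide each
    # POS-n-gram count c^s_i by the speaker's total N^s, so that every
    # speaker contributes a fraction vector summing to 1.
    total = sum(counts.values())
    return {ng: c / total for ng, c in counts.items()}

def group_sums(fraction_vectors):
    # Sum the fraction vectors of the subjects in one group per n-gram,
    # yielding the group vector s^j or s^a.
    sums = Counter()
    for fv in fraction_vectors:
        sums.update(fv)
    return dict(sums)

def between_types(s_j, s_a):
    # Between-types normalisation: divide each group's summed fraction
    # count s^g_i by the per-type total s_i = s^j_i + s^a_i.
    types = set(s_j) | set(s_a)
    t_j, t_a = {}, {}
    for ng in types:
        s_i = s_j.get(ng, 0.0) + s_a.get(ng, 0.0)
        t_j[ng] = s_j.get(ng, 0.0) / s_i
        t_a[ng] = s_a.get(ng, 0.0) / s_i
    return t_j, t_a

# Invented toy data: POS-trigram counts for two juvenile and two adult speakers.
juveniles = [{"DT-NN-VB": 6, "NN-VB-RB": 2}, {"DT-NN-VB": 3, "NN-VB-RB": 1}]
adults = [{"DT-NN-VB": 4, "PR-VB-DT": 4}, {"DT-NN-VB": 5, "PR-VB-DT": 5}]

s_j = group_sums(subject_fractions(c) for c in juveniles)
s_a = group_sums(subject_fractions(c) for c in adults)
t_j, t_a = between_types(s_j, s_a)
```

Because each fraction vector sums to 1, each group vector sums to the number of subjects in that group, and for equally sized groups the between-types values t^j_i and t^a_i average 0.5, as described above.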


Figure 3: Distribution of the trigrams in the FAEC corpus in a log-log graph, with the ideal Zipf distribution in grey.


count s^g_i is divided by the average of the count for that type in that group across all permutations, s̄^g_i (both as provided by the between-subjects normalisation). The outer part of the formula for o^g is:

o^j = <..., o^j_i (= s^j_i / s̄^j_i), ...>
o^a = <..., o^a_i (= s^a_i / s̄^a_i), ...>

For large numbers of permutations the inner part of the formula, the average count s̄^g_i, can be calculated as (s^j_i + s^a_i)/2 when the groups are of equal size, and as follows when the groups differ in size:

s̄^j_i = (s^j_i + s^a_i) · N^j / (N^j + N^a)
s̄^a_i = (s^j_i + s^a_i) · N^a / (N^j + N^a)

where N^j and N^a are the number of subjects in the juveniles and adults group, respectively. The values thus normalised will, as can be noted, be 1 on average across permutations, and thus, when summed for large corpora, be equal to the number of POS-n-gram types:

Σ_{i=1}^{n} o^j_i = Σ_{i=1}^{n} o^a_i = n

We note again, firmly, that the per-POS-n-gram values output by this normalisation are only useful for detecting over- and under-use when used together with information on per-POS-n-gram statistical significance, as provided by one of the other normalisations. First, it cannot be used to find typical or extreme POS-n-grams without regard to significance, because infrequent n-grams are especially likely to have high values relative to s̄^g_i. For example, an n-gram occurring only once, in one sub-corpus, can get a value of 1/0.5 = 2 (it is indeed very typical for that sub-corpus), while with a count of one it clearly will not be statistically significant (it moves between equally sized sub-corpora with a chance of 50% during permutations). Secondly, this normalisation is not suitable for generating per-n-gram p-values itself when group sizes differ, because then its output is not symmetrical. To illustrate this we can look at our single n-gram again: occurring in the smaller sub-corpus it can produce very large normalised values, such as 1/0.2 = 5, while in the bigger sub-corpus it will always produce smaller values, such as 1/0.8 = 1.25. This can lead to false significance by introducing fluctuations in the normalised group sizes, and by turning two-sided measures, such as RSquare, into one-sided ones (more on RSquare in section 5.4). With these two caveats in mind, it can be used as a convenient normalisation for assigning the detected POS-n-grams to the list of the group by which they are over-used. Just like the previous normalisations, this normalisation is applied to the base case and all subsequent permutations, and for each it produces a pair of arrays (o^j and o^a).

5.4 Measures

The choice of vector difference measure, e.g. cosine vs. χ², does not affect the proposed technique greatly, and alternative measures can easily be swapped in. Accordingly, we have worked with cosine and with two measures inspired by the RECURRENCE (R) metric introduced by Kessler (2001, p. 157 ff.). We call our measures R and RSQUARE. The advantage of our R and RSquare metrics is their transparency: they are simple aggregates that allow one to see easily how much each n-gram contributed to the difference. We also used them to calculate a separate p-value per n-gram.

Our R is calculated as the sum of the absolute differences of each cell with respect to the average for that cell. If we have collected our data into two vectors (s^j, s^a), and i is the index of a POS-n-gram, then R is equal for the two groups, and it simply looks as follows:

R = Σ_{i=1}^{n} |s^j_i - s̄^g_i|

with the average between the two groups for each n-gram cell being:

s̄^g_i = (s^j_i + s^a_i) / 2

The RSquare measure emphasises a few large differences more than many small ones; it squares the per-cell differences instead of taking their absolute values, and is thus calculated in this way (with s̄^g_i as for R):

RSquare = Σ_{i=1}^{n} (s^j_i - s̄^g_i)²

Both R and RSquare are two-sided measures. This means that the same R(Square) value is obtained when an n-gram exceeds the average value by a certain amount (say x) as when it falls below the average by that same amount. Tests using them are therefore two-tailed, which is statistically conservative, and both very typical and atypical n-grams are detected as significant. Assignment to the groups for which they are typical is consequently done as a separate step, using the between-groups normalisation (see 5.3.3). All in all we obtained the best results with the RSquare measure, and we have therefore used it for our example research.
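A minimal sketch of the RSquare measure, and of a permutation test over subject fraction vectors of the kind described in section 5.3.1, follows. This is again an illustrative Python re-implementation rather than the ComLinToo code, and the toy data is invented.

```python
import random
from collections import Counter

def rsquare(s_j, s_a):
    # RSquare: the sum over n-gram cells of the squared difference
    # between the group value and the per-cell average (s^j_i + s^a_i)/2.
    # By symmetry the value is the same for either group.
    types = set(s_j) | set(s_a)
    total = 0.0
    for ng in types:
        avg = (s_j.get(ng, 0.0) + s_a.get(ng, 0.0)) / 2.0
        total += (s_j.get(ng, 0.0) - avg) ** 2
    return total

def group_vector(fraction_vectors):
    # Sum per-subject fraction vectors into one per-group vector.
    sums = Counter()
    for fv in fraction_vectors:
        sums.update(fv)
    return dict(sums)

def permutation_p(subjects, n_j, n_perms=5000, seed=42):
    # Permute the subject fraction vectors between two groups of fixed
    # sizes (the first n_j subjects form the base-case first group) and
    # report the fraction of permutations whose RSquare is at least as
    # large as the observed, base-case one.
    def stat(vecs):
        return rsquare(group_vector(vecs[:n_j]), group_vector(vecs[n_j:]))
    observed = stat(subjects)
    rng = random.Random(seed)
    vecs = list(subjects)
    hits = 0
    for _ in range(n_perms):
        rng.shuffle(vecs)
        if stat(vecs) >= observed:
            hits += 1
    return (hits + 1) / (n_perms + 1)

# Invented toy data: four subjects per group with disjoint n-gram use.
juv = [{"A": 1.0} for _ in range(4)]
adu = [{"B": 1.0} for _ in range(4)]
p = permutation_p(juv + adu, n_j=4)
```

With only eight subjects there are just C(8,4) = 70 distinct group assignments, two of which reproduce the perfect separation, so the p-value converges to about 2/70 ≈ 0.03; smaller p-values require more subjects.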

6 Software

Our method can easily be applied to new questions and different corpora with the help of the COMPUTATIONAL LINGUISTICS TOOLSET (the ComLinToo).5 It is Free Software licensed under the Affero GPL (available for free, and for modification, but without warranty, as long as derivatives remain free). It runs under Linux (and should work on Macs too), is written in Perl, and consists of about 40 separate command-line tools, grouped by various tasks such as corpus preparation, examination and sense disambiguation. For our research it was extended with the permstat and tagging tasks. We will now introduce and explain the corpus preparation, tagging and permstat tasks, but only to the extent that this is necessary for applying our method.

6.1 Step by Step

Everything starts with selecting a corpus. Keep in mind that it either should be tagged already, or that at least a tagged corpus in the same language has to be available to train the part-of-speech tagger on. Once a corpus is selected, it should be transformed to plain-text format, with every separate text in a new file, every sentence on a new line, and words and punctuation separated by spaces (these are the wordrow files). This transformation can be done with the corpussplitter.pl (where splitting into files is needed), corpusstripper.pl and corpusrefiner.pl tools. They require a custom, but small, software library for each corpus type; these are currently available for the various International Corpus of English corpora6 and for the Finnish Australian English Corpus.

Besides transforming the corpus, some metadata about it should also be provided, most notably a division: a set of two files, each containing a list of the names of the files that belong to one of the two groups stored below the corpusData directory. It can be created with the corpusdivider.pl tool, based on collected metadata. See the test corpus that is included with the tool set for an example. All tools in the ComLinToo provide a short description and a detailed list of possible arguments when they are called with '-h', for example:

./corpussplitter.pl -h

Also note that most tools have default configuration values, which are set in each script's config.pl file. These are used whenever a command-line argument is not provided.

Once the corpus is prepared, it needs to be tagged, if it is not a tagged corpus already. For this the ComLinToo uses Thorsten Brants' TnT tagger (Brants, 2000). Sadly we cannot distribute it with our tool set for licensing reasons, but it can be obtained separately and is free for research purposes.7 Tagging with TnT uses the corpus2tnt.pl, tagtnt.pl and tnt2tagrow.pl scripts. If TnT first needs to be trained on a tagged corpus, traintnt.pl should be run before tagging. After the tagging is done, tagrow files matching the wordrow files are created. These files (one for each text) contain the tags for each sentence on a new line, with tags separated by spaces.

The tools for the tagging task, and for every other task, can also be called grouped together, with just one command line, by using the goall.pl scripts, such as tagginggoall.pl. When called with '-lt', such a script reports the possible tasks, and the calls to the other tools that it will make to fulfil them. In this way you can get a feel for the order in which the scripts are normally called, and for their most usual arguments. The goall.pl scripts also accept arguments, such as the corpus (with -c) or the sub-corpora division that should be used (-d), and they pass these on to the tools they call. An example goall call is:

./tagginggoall.pl -c test -lt

And finally, when the corpus has been cleaned and is fully tagged, the permstat task can be performed. Key tools here are (besides permstatgoall.pl) textpermutator.pl and permutationstatter.pl. The former generates the actual permutations, and the latter applies one or more normalisations and a measure to them to arrive at actual test results. For the normalisations and the measures, small plugin libraries are again used. It should be noted that although both scripts use zip-compression for their intermediate output files, fairly large amounts of disk space may be required: in our set-up, with a corpus of 305,000 words and 10,000 permutations, we needed more than 10 GB per experiment. The other two tools used for the task are permutationstattabler.pl and permstatresultselector.pl. The tabler tool transforms the output of permutationstatter.pl to a tab-delimited format for easy importing into most software packages (such as SPSS and Excel), and the latter selects and sorts significant n-grams for easy inspection. The list of n-grams obtained in this way can then be looked up in the corpus (with random examples) by using the tagsamplefinder.pl tool, from the examine task. Browsing through the tools in the various task sets, and reading their descriptions, can give you further guidance on which tools may be useful for you.

5 The Computational Linguistics Toolset is available from http://en.logilogi.org/Wybo Wiersma/User/Com Lin Too
6 The ICE-corpora are available from http://www.ucl.ac.uk/english-usage/ice/
7 TnT is available here: http://www.coli.uni-saarland.de/~thorsten/tnt/

7 Further Research

First we will go into possible applications of the method, then we will make some suggestions for further testing and analysis, and we will conclude this section with some suggestions for improvements.

7.1 Applications of the Method

As already briefly mentioned in the introduction, the method as presented here can be applied to many data-sets. It can offer quantitatively based answers to questions well beyond our data of two generations of immigrants. More specific questions could be asked, such as the time course of second-language acquisition, or the relative importance of factors influencing the degree of difference, such as the mother tongue of the speakers, other languages they know, the length and time of their experience in the second language, the role of formal instruction, etc. Beyond this, comparisons could be made between dialects and variants of languages (such as the various Englishes collected in the ICE-corpora). Different discourses can also be compared, such as the language of lawyers or academics as compared to that of laymen or students. The method might furthermore be applied to the analysis of literary styles, such as the syntactic difference between romance and detective novels. The change of syntax through time could also be followed, as reflected in novels or newspapers published in different decades or even different centuries. In addition it can, for example, be used to trace the grammar of a source language (say Ancient Greek) in translations (say of the Bible).

In short, as long as a common tagset exists or can be devised for a corpus of comparable material, and the comparison answers one or more meaningful questions, the method can be used, and not just for purely scientific purposes. It might even lead to very concrete applications in teaching: second-language learners, aspiring creative writers, or journalism students could be shown what grammatical structures they, either collectively or individually, are over- or under-using compared to native speakers, famous authors, or acclaimed journalists, respectively. And if the method is confirmed to work for small corpus sizes and/or is improved and calibrated sufficiently, many more applications may be thought of, such as e-mail classification or author identification. More on analysis and possible improvements now.

7.2 Analysis of the method

A first thing that could be analysed more in depth is the range of corpus sizes at which the method is effective in finding genuine differences. Note that ideally both a minimum and a maximum size should be established. A minimum could be found by trying the method on sub-corpora of various sizes, composed of material whose syntax is known to be different. For finding a maximum it would be necessary to apply the method to material that is known to be very similar, and then to take the biggest corpus size at which no significant difference is found. It should be noted that agreeing on a maximum size especially could be hard, as a degree of subjectivity is involved in deciding what 'very similar' entails,

and it likely varies with the research questions. It might also be valuable to further analyse the various types of errors that the method can produce. Obviously there are tagging errors, and there is the slight possibility of problematic (that is, significance-causing) patterns in them. So far we found none in our tentative analysis, and we even found better tagger performance for the significant n-grams, as noted in sections 3.3 and 4.6. Still, having these errors even more thoroughly examined and tested would add to the trust in the method. In addition, one or more re-applications to studies where syntax differences were found by other means might be useful for additional verification.

Another opportunity for analysis is testing the method with various tagsets and n-gram sizes: especially tagsets of various precisions, and larger n-grams. We have not experimented with different tagsets apart from the full and reduced ICE-tagsets, and there we limited our (preliminary) analysis to the effect of tagset size on significance for various n-gram sizes. We found that there appears to be a bandwidth between 1- and 5-grams at which it is easier to find significance, but apart from noting decreased significance at the edges, we have not yet been able to map it conclusively.8 Our results here are also limited, since they derive from just one corpus size and just our two ICE tagsets. In addition to quantitative analysis of the method, as proposed above, there are also further opportunities for qualitative analysis of the usefulness of the results for linguists or other researchers when different corpus sizes and tagsets are used.

7.3 Improvements to the method

Strictly speaking the method as currently presented is only a method for data-mining, and not for hypothesis-testing. This is because, especially when the typical n-grams are going to be used, and not just the p-value that states whether the two sub-corpora differ, no hypothesis is set before testing, and thus strictly speaking the results cannot be considered proven. To alleviate this theoretical problem, it would be possible (especially when there is much data at hand) to add cross-validation steps. This means that the corpus is split into two parts (of sizes, say, 20% to 80%), the second part is used to mine for typical n-grams, and these are then confirmed by applying the method again, but now to the first part. This process can be repeated so that all data is used for mining and confirming in turn. Doing this would make the results more robust.

Another improvement would be to add stronger protection against family-wise errors, such as protection against partial null hypothesis family-wise errors (part of the null hypothesis being true, as for separate n-grams). In the literature on neuroimaging, where comparisons between hundreds of thousands of data points also have to be made, proposals and solutions for this with regard to permutation tests have recently been put forth, such as threshold techniques (Nichols and Holmes, 2002; Nichols and Hayasaka, 2003). Applying these or similar solutions to our method could make the individual significance of all found n-grams fully certain as well.

Another small improvement would be to add start- and end-of-sentence anchors. This might make the tagger perform better, and in addition it will produce more fine-grained results: with such anchors, mid-sentence n-grams are differentiated from those at the beginning or end of sentences, leading to different, and maybe even more informative, results. In addition, as already noted in section 5.4, it could be useful to experiment with various measures. It would be especially useful if a measure were found that could standardise or calibrate the method in some way. This would make corpus-wide comparisons in particular more useful, as we would not only have an overall p-value, but also a standardised overall difference value, making it easier to compare the outcomes of different studies.

A way to improve the method in a more radical sense, especially as taggers and parsers become more accurate, could be to go beyond POS-tags, to chunking, partial parsing, or full parsing. Chunking attempts to recognise non-recursive syntactic constituents, '[the man] with [the red hair] in [the camel-hair jacket] waved', while partial parsing attempts to infer more abstract structure where possible: '[S [NP [the man] with [the red hair]] in [the camel-hair jacket] waved]'. Note that the latter example is only partially parsed in that the phrase 'in the camel-hair jacket' is not attached in the tree. In general, chunking is easier and therefore more accurate than partial parsing, which, however, gives a more complete account of the latent hierarchical structure of the sentence. There is a trade-off between the resolution or discrimination of the technique and its accuracy. Going beyond POS-tags is especially feasible in cases where tagging accuracy is less of an issue, such as newspaper text or novels, as Sanders and others have shown (Sanders, 2007; Baayen et al., 1996; Hirst and Feiguina, 2007).

8 We found no significance for 1-grams from the reduced tagset (there was significance for the full tagset), and slightly reduced significance (but still < 0.01) for 5-grams from the full tagset.

8 Conclusion

Weinreich (1968) regretted that there was no way to "measure or characterize the total impact of one language on another in the speech of bilinguals" (1968, p. 63), and speculated that there could not be. This paper has proposed a way of going beyond counts of individual phenomena to a measure of aggregate syntactic difference. The technique proposed follows Aarts and Granger (1998) in using part-of-speech n-grams. We argue that such lexical categories are likely to reflect a great deal of syntactic structure, given the tenets of linguistic theory according to which more abstract structure is, in general, projected from lexical categories. We went beyond Aarts and Granger in showing how entire vectors of POS n-grams may be used to characterise aggregate syntactic distance, and in particular by showing how these, and individual n-gram counts, can be analysed statistically.

The technique was implemented using an automatic POS-tagger, several normalisations and permutation statistics, and it was shown to be effective on the English of Finnish immigrants to Australia. We were able to detect various forms of interference of their L1 (Finnish) in the English of the adult speakers, such as over-use of articles and the placement of not in pre-verbal position, as well as some forms due to universal contact influences, such as the over-use of hesitation phenomena. The ComLinToo, the software implementing the method, including the normalisations, is freely available. It is developed to allow easy application to other data-sets, and generalisation to n-grams of any size.

There are many possibilities for future research. First of all the method can be applied to a wide range of data-sets and used to answer many questions, such as questions about the factors influencing language learning, comparisons between discourses and literary styles, and maybe even questions relevant to teaching. Secondly the method could be further analysed, by testing it with various corpus, tagset and n-gram sizes, and through more qualitative analysis. Lastly there are also still many opportunities to improve the method: by adding and evaluating more statistical safeguards, by experimenting with various measures, and by using chunking or parsing instead of POS-tags, especially for cleaner data. Thus while we fall short of Weinreich's goal of assaying "total impact" in that we focus on syntax, we do take a large step in this direction by showing how to aggregate syntactic differences, test them for statistical significance, and examine the syntactic structures responsible for the difference.

9 Acknowledgments

We are grateful to Lisa Lena Opas-Hänninen and Pekka Hirvonen of the University of Oulu, who made the data available and consulted extensively on its analysis. We also thank audiences at the 2005 TaBu Dag, Groningen; at the workshop Finno-Ugric Languages in Contact with English II, held in conjunction with Methods in Dialectology XII at the Université de Moncton, Aug. 2005; the Sonderforschungsbereich 441, "Linguistic Data Structures", Tübingen, Jan. 2006; the Seminar on Methodology and Statistics in Linguistic Research, University of Groningen, Spring 2006; the Linguistic Distances Workshop at Coling/ACL 2006 in Sydney, July 2006; and the Thirteenth International Conference on Methods in Dialectology, Leeds, Aug. 2008; and especially Livi Ruffle, for useful comments on our earlier paper.

References Aarts, Jan and Sylviane Granger. 1998. Tag sequences in learner corpora: A key to interlanguage grammar and discourse. In Granger, Sylviane, editor, Learner English on Computer, pages 132 – 141. Longman, London. Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. Wiley, New York. Association, International Phonetics. 1949. The Principles of the International Phonetic Association. IPA, London. 24

Automatically Extracting Typical Syntactic Differences from Corpora

June 8, 2009

Baayen, H., H. van Halteren, and F. Tweedie. 1996. Outside the cave of shadows. Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11.

Jarvis, S. 2003. Probing the effects of the L2 on the L1: A case study. In Cook, V., editor, Effects of the Second Language on the First, pages 81–102. Multilingual Matters, Clevedon, UK.

Bot, Kees de, Wander Lowie, and Marjolijn Verspoor. 2005. Second Language Acquisition: An Advanced Resource Book. Routledge, London.

Kessler, Brett. 2001. The Significance of Word Lists. CSLI Press, Stanford. Larsen-Freeman, D. and M.H. Long. 1991. An Introduction to Second Language Acquisition Research. Longman, London.

Brants, Thorsten. 2000. TnT - a statistical part of speech tagger. In 6th Applied Natural Language Processing Conference, pages 224 – 231. ACL, Seattle.

Lauttamus, T. and P. Hirvonen. 1995. English interference in the lexis of American Finnish. The New Courant, 3.

Chambers, J.K. 2003. Sociolinguistic Theory: Linguistic Variation and its Social Implications. Blackwell, Oxford.

Lauttamus, Timo, John Nerbonne, and Wybo Wiersma. 2007. Detecting Syntactic Contamination in Emigrants: The English of Finnish Australians. SKY Journal of Linguistics, 20.

Clyne, M. and S. Kipp. 2006. Australia’s community languages. International Journal of the Sociology of Language, 180.

Lenneberg, E. 1967. Biological Foundations of Language. John Wiley, New York.

Cook, V. 2003. Effects of the Second Language on the First. Multilingual Matters, Clevedon, U.K.

Manning, Chris and Hinrich Sch¨utze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.

Coseriu, Eugenio. 1970. Probleme der kontrastiven Grammatik. Schwann, D¨usseldorf.

Nelson, Gerald, Sean Wallis, and Bas Aarts. 2002. Exploring Natural Language: working with the British component of the International Corpus of English. John Benjamins Publishing Company, Amsterdam/Philadelphia.

Edington, Eugene S. 1987. Randomization Tests. Marcel Dekker Inc., New York. Ellis, R. 1994. The study of second language acquisition. Oxford University Press, Oxford.

Nerbonne, John and Wybo Wiersma. 2006. A measure of aggregate syntactic distance. In Nerbonne, J. and E. Hinrichs, editors, Linguistic Distances, pages 82 – 90. PA: ACL, Shroudsburg.

Faerch, C. and G. Kasper. 1983. Strategies in interlanguage communication. Longman, London. Fillmore, Charles and Paul Kay. 1999. Grammatical constructions and linguistic generalizations: the what’s x doing y construction. Language, 75.

Nichols, Thomas and Satoru Hayasaka. 2003. Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical methods in medical research, 12.

Garside, Roger, Geoffrey Leech, and Tony McEnery. 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London/New York.

Nichols, Thomas E. and Andrew P. Holmes. 2002. Nonparametric Permutation Tests For Functional Neuroimaging: A Primer with Examples. Human brain mapping, 15.

Good, Phillip. 1995. Permutation Tests: a practical guide to resampling methods for testing hypotheses. Springer, New York.

Odlin, Terence. 1989. Language Transfer: Crosslinguistic Influence in Language Learning. Cambridge University Press, Cambridge.

Hirst, G. and O. Feiguina. 2007. Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, 22.

Odlin, Terence. 1990. Word-order transfer, metalinguistic awareness and constraints on foreign language learning. In VanPatten, B. and J. Lee, editors, Second Language Acquisition/Foreign Language Learning, pages 95–117. Multilingual Matters, Clevedon, U.K.

Hirvonen, P. 1988. American Finns as language learners. In M.G. Karni, O. Koivukangas and E.W. Laine, editors, Finns in North America. Proceedings of Finn Forum III 5-8 September 1984, pages 335–354. Institute of Migration, Turku, Finland.

Odlin, Terence. 2006a. Could a contrastive analysis ever be complete? In Arabski, J., editor, Crosslinguistic Influence in the Second Language Lexicon, pages 22–35. Multilingual Matters, Clevedon, U.K.


Waas, M. 1996. Language Attrition Downunder: German Speakers in Australia. Peter Lang, Frankfurt.

Watson, G. 1995. A corpus of Finnish-Australian English: A preliminary report. In Muikku-Werner, P. and K. Julkunen, editors, Kielten väliset kontaktit, pages 227–246. University of Jyväskylä, Jyväskylä.

Odlin, Terence. 2006b. Methods and inferences in the study of substrate influence.

Pietilä, P. 1989. The English of Finnish Americans with Reference to Social and Psychological Background Factors and with Special Reference to Age. Turun yliopiston julkaisuja, 188.

Watson, Greg. 1996. The Finnish-Australian English corpus. ICAME Journal: Computers in English Linguistics, 20.

Piller, I. 2002. Passing for a native speaker: Identity and success in second language learning. Journal of Sociolinguistics, 6.

Weinreich, Uriel. 1968. Languages in Contact. Mouton, The Hague.

Westfall, Peter H. and S. Stanley Young. 1993. Resampling Based Multiple Testing: Examples and Methods for p-value Adjustment. John Wiley & Sons, Inc., New York.

Poplack, Shana and David Sankoff. 1984. Borrowing: the synchrony of integration. Linguistics, 22.

Poplack, Shana, David Sankoff, and Christopher Miller. 1988. The social correlates and linguistic processes of lexical borrowing and assimilation. Linguistics, 26.

Ritchie, William C. and Tej K. Bhatia. 1998. Handbook of Child Language Acquisition. Academic, San Diego.

Sampson, Geoffrey. 2000. A proposal for improving the measurement of parse accuracy. International Journal of Corpus Linguistics, 5.

Sanders, Nathan C. 2007. Measuring Syntactic Differences in British English. In Proceedings of the Student Research Workshop, pages 1–7. Omnipress, Madison.

Schmid, M. 2002. First Language Attrition, Use and Maintenance: The Case of German Jews in Anglophone Countries. John Benjamins, Amsterdam.

Schmid, M. 2004. Identity and first language attrition: A historical approach. Estudios de Sociolingüística, 5.

Sells, Peter. 1982. Lectures on Contemporary Syntactic Theories. CSLI, Stanford.

Thomason, Sarah and Terrence Kaufmann. 1988. Language Contact, Creolization and Genetic Linguistics. University of California Press, Berkeley.

Thomason, Sarah G. 2001. Language Contact: An Introduction. Edinburgh University Press, Edinburgh.

Van Coetsem, Frans. 1988. Loan Phonology and the Two Transfer Types in Language Contact. Publications in Language Sciences 27. Foris Publications, Dordrecht.

Automatically Extracting Typical Syntactic Differences from Corpora

June 8, 2009

Table 4: All 137 inspected trigrams, with samples. The 'rank' is the ranking of the trigram in the top 308 when sorted using the between-types normalisation. Each trigram is listed with three example occurrences from the corpus; the usages assigned to the trigrams name the phenomena from table 3: Disfluent, Articles, Relative-What, Present, Pre-Negator, No-There, No-Be, Formulae, or [useless].

Rank | Trigram | Examples (three occurrences from the corpus)
2 | ??? N(com,plu) ADV(ge) | 4 years ago; 4 points per; 4 years ago
4 | N(com,plu) PRON(nom) PRON(pers,sing) | stones what I; things what I; brains what I
5 | ADV(ge) PREP(ge) INTERJEC | today on uh; still in ah; sometimes in uh
6 | V(cop,pres,encl) INTERJEC N(com,sing) | 's ah blump; 's uh bread; 's uh um
12 | PRON(pers,plu) V(intr,pres) PREP(ge) | we stay in; we live in; we stay in
16 | N(com,sing) INTERJEC INTERJEC | um uh uh; uh uh uh; " uh uh
17 | INTERJEC INTERJEC INTERJEC | oh oh oh; uh uhm uh; huh huh huh
18 | CONJUNC(subord) INTERJEC INTERJEC | uh market uh; that uh uh; Because uh uh
21 | N(com,sing) INTERJEC PRON(pers,plu) | couple oh we; sportsman oh we; election uh they
22 | PRON(univ,sing) PRON(univ,sing) N(com,sing) | every every day; every every time; each each party
24 | CONNEC(ge) PREP(ge) N(prop,sing) | And in Finland; But after Susan; And in Finland
26 | CONJUNC(coord) INTERJEC UNTAG | not cook r; and uh fish; and uh tha
31 | PRON(univ) N(com,sing) PREP(ge) | all kind of; all kind of; all kind of
32 | N(com,sing) V(cop,past) ADJ(ge) | weather was different; load was different; husband was alive
34 | CONNEC(ge) INTERJEC ADV(ge) | uh I never; but uh not; but uh not
38 | CONJUNC(coord) INTERJEC N(com,plu) | and uh carrots; this time people; and uh bookbinders
41 | CONNEC(ge) INTERJEC PREP(ge) | And ah in; then uh in; And uh on
42 | ART(indef) INTERJEC N(com,sing) | a ah sewage; a uh champagne; a uh driver
43 | ADV(phras) PREP(ge) ADJ(ge) | in in Finnish; on through near; up to ready
46 | CONJUNC(subord) INTERJEC PREP(ge) | that uh for; that uh in; that uh before
48 | REACT INTERJEC REACT | yeah oh yeah; yeah oh yeah; Yeah oh yeah
49 | PRON(dem,sing) N(com,sing) INTERJEC | that k uh; that dough uh; that place uh
50 | V(intr,past) FRM FRM | cared you know; said you know; started you know
53 | INTERJEC REACT CONJUNC(coord) | Uh yeah and; Oh yeah and; Oh yeah and
54 | PRON(dem,sing) INTERJEC N(com,sing) | that ah vitamin; that ah doctor; uh celebration year
56 | AUX(do,pres,neg) V(montr,infin) ADV(ge) | don't have really; don't watch anymore; don't know nowadays
57 | PREP(ge) N(prop,sing) ??? | in Australia ?; in Australia ?; in Australia ?
58 | INTERJEC UNTAG UNTAG | uh ba A; mit se s; was i was
60 | V(montr,infin) PRON(dem,sing) PRON(one,sing) | explain that one; say that one; like that one
65 | V(cop,past) ADJ(edp) N(com,plu) | was 5 weeks; was 9 years; was 7 years
66 | V(montr,infin) CONJUNC(subord) INTERJEC | see like ah; think that ah; that ah ah
68 | CONJUNC(subord) INTERJEC N(com,sing) | that uh top; that uh place; because uh system
75 | INTERJEC NUM(card,sing) N(com,sing) | uh draftswo woman; Oh one thing; Oh one time
78 | N(com,plu) PRON(nom) PRON(pers,plu) | leftovers what we; minutes what they; staff what they
81 | CONJUNC(subord) INTERJEC CONJUNC(subord) | that uh that; when ah when; that uh when
84 | INTERJEC ADV(inten) ADV(inten) | Ah little bit; ah too much; Oh much much
85 | INTERJEC N(com,sing) V(cop,pres) | uh humour is; uh game is; uh life is
87 | CONJUNC(coord) ADV(ge) CONNEC(ge) | but then then; and now but; and then then
89 | PRON(pers,plu) PRON(pers,plu) V(montr,past) | us we had; we we did; they they had
90 | PRON(pers,plu) V(intr,pres) ADV(ge) | we stay here; we come here; we live still
91 | INTERJEC CONJUNC(subord) PRON(pers) | uh when you; uh if you; Uh when you
92 | PREP(ge) ART(indef) N(prop,sing) | in a Tuusula; in a Finland; in a Kouvola
93 | CONJUNC(coord) INTERJEC N(prop,sing) | Mt.Isa Alice Springs; uh uh May; and uh Price
94 | REACT CONJUNC(coord) PRON(pers,sing) | no and I; yeah but I; Yes but he
95 | ADJ(ge) PRON(dem,sing) N(com,sing) | different that way; good that time; all that common
96 | ADV(ge) CONJUNC(subord) INTERJEC | anywhere because uh; now that uh; underneath when ah
97 | CONNEC(ge) INTERJEC PRON(dem,sing) | And uh that; And ah that; uh so that
98 | CONJUNC(coord) INTERJEC INTERJEC | ah because ah; diamonds and uh; but uh uh
99 | PRON(dem,sing) V(cop,pres,encl) INTERJEC | that 's uh; That 's uh; That 's uh
100 | CONJUNC(coord) INTERJEC N(com,sing) | uh and gene; not ice hockey; not ice hockey
101 | CONJUNC(subord) PRON(pers,plu) V(cop,pres) | when they are; if they become; because we are
102 | ADJ(ge) PRON(pers,plu) V(montr,pres) | long they buy; Pure they jump; good we have
103 | ART(indef) NUM(ord) N(com,sing) | a third dog; a last moment; next door neighbour
104 | PREP(ge) N(com,sing) INTERJEC | to p uh; in paper uh; in wintertime uh
105 | INTERJEC PRON(pers,sing) AUX(modal,pres,neg) | can't I can't; uh I can't; uhm I can't
106 | CONJUNC(subord) PRON(pers) AUX(do,pres,neg) | If you don't; if you don't; because you don't
108 | PRON(pers,plu) AUX(modal,pres,neg) V(montr,infin) | they can't do; We can't avoid; we can't have
109 | PRON(pers,plu) AUX(prog,pres) ADV(ge) | are not anymore; they are not; We are not
110 | ADV(inten) ADJ(ge) PRON(dem,sing) | very close that; quite witty that; too young that
111 | PREP(ge) PRON(dem,sing) INTERJEC | before that uh; for that uh; for that ah
113 | INTERJEC PRON(dem,sing) N(com,sing) | ah that wartime; that market place; uh this church
114 | INTERJEC ADV(ge) PRON(pers,plu) | ah yesterday we; Oh sometimes they; uh sometimes we
115 | ADV(ge) V(intr,ingp) ADV(ge) | always saying well; not coming easy; not going there
116 | REACT CONJUNC(coord) ADV(ge) | but not not; yeah but then; yes but otherwise
117 | PRON(pers,plu) V(intr,pres) PRON(pers,plu) | they are they; we run they; we are we
118 | ART(def) INTERJEC N(com,sing) | the uh outback; the uh committee; the uh operation
122 | INTERJEC PRON(quant) PREP(ge) | uh most of; ah most of; uh most of
123 | PRON(ass) N(com,plu) CONJUNC(coord) | some mates but; some vegetables or; some bikinis and
124 | REACT FRM FRM | Yeah oh dear; Yeah I know; I mean like
126 | INTERJEC NUM(ord) N(com,sing) | uh last night; ah other sort; ah next H
127 | ADV(ge) PRON(pers,plu) V(intr,pres) | first we live; Sometimes we go; now they travel
128 | PREP(ge) ART(indef) PREP(ge) | in a in; in a in; West Coast like
132 | CONJUNC(coord) ADV(ge) ADJ(ge) | and probably quick; but there certain; and not big
136 | V(montr,pres) PRON(ass) N(com,sing) | some corner sittin; have some sort; make some soup
140 | V(intr,pres) ADV(ge) PREP(ge) | stand there like; go now at; move here on
141 | ART(indef) PREP(ge) ART(indef) | a as a; a as a; a in a
144 | INTERJEC PREP(ge) PRON(dem,sing) | ah in that; in that that; uh after that
145 | PRON(pers,sing) AUX(perf,pres) V(cop,edp) | I have been; I have been; I have been
146 | CONNEC(ge) PRON(dem,sing) N(com,sing) | And that way; but this life; And that teacher
147 | N(com,sing) CONJUNC(subord) INTERJEC | time because uh; pension because oh; daughter that ah
151 | ADJ(ge) CONJUNC(coord) INTERJEC | bloody and uh; ready and uh; Western and uh
152 | PRON(pers,plu) V(montr,past) INTERJEC | they got uh; they got uh; had ah oh
153 | INTERJEC PREP(ge) N(prop,sing) | ah in Australia; Good Friday 19; uh in Australia
154 | INTERJEC N(com,sing) INTERJEC | uh station uh; mill and ah; uh language uh
155 | INTERJEC CONJUNC(coord) ADV(ge) | uh but strictly; uh and generally; uh and maybe
156 | INTERJEC ADV(ge) ADV(inten) | not very much; uh not too; uh maybe sort
157 | CONNEC(ge) PRON(pers,plu) PRON(pers,plu) | but we we; So we we; but they they
159 | CONNEC(ge) INTERJEC PRON(pers,plu) | ah then they; Well uh they; And uh they
160 | ADV(ge) ADV(ge) ADV(inten) | not um too; not not much; um not that
161 | V(intr,infin) CONJUNC(coord) INTERJEC | walk and uh; die and uh; happen but uh
162 | PREP(ge) INTERJEC N(com,sing) | in ah constitution; uh um meat; in uh news
164 | INTERJEC ADV(inten) ADJ(ge) | uh bit different; uh nearly wet; Oh too busy
165 | CONJUNC(coord) INTERJEC PREP(ge) | on sort of; uh then with; and uh on
166 | PRON(pers,plu) V(cxtr,pres) PRON(pers,sing) | they call it; they call it; we call it
168 | INTERJEC CONJUNC(subord) PRON(pers,plu) | Oh when we; ah when we; uh when we
169 | N(prop,sing) CONJUNC(coord) INTERJEC | Australia but uh; Hawaii and uh; Fortune and uh
170 | PRON(dem,sing) V(cop,pres,encl) PRON(univ) | that 's all; that 's all; that 's all
173 | ART(def) N(com,sing) INTERJEC | the olde uh; the wife ah; the picture uh
174 | N(com,sing) CONJUNC(coord) INTERJEC | fruit but uh; hospital and uh; trouble but ah
176 | N(com,sing) PRON(nom) PRON(pers,plu) | ball what we; thing what they; office whatever they
178 | CONNEC(ge) INTERJEC CONNEC(ge) | and uh then; And uh but; And uh but
180 | CONJUNC(coord) CONJUNC(coord) INTERJEC | and and ah; and and uh; and and ah
181 | INTERJEC ADV(ge) PRON(pers,sing) | uh maybe it; oh maybe I; uh now I
182 | INTERJEC CONJUNC(coord) INTERJEC | uh and uh; uh and uh; uh and uh
183 | CONJUNC(coord) INTERJEC PRON(pers,plu) | and uh we; ah when they; uh we we
184 | AUX(modal,pres) V(montr,infin) PRON(dem,sing) | can turn that; can repeat that; can say that
185 | INTERJEC INTERJEC N(com,sing) | uh uh life; uh uh 500; uh uh experience
186 | CONNEC(ge) CONJUNC(subord) PRON(pers,plu) | And when we; then when we; Only that they
187 | INTERJEC PRON(pers,sing) PRON(pers,sing) | I I I; I I I; I I I
191 | CONNEC(ge) INTERJEC CONJUNC(coord) | And uh and; And uh and; And ah and
198 | CONNEC(ge) INTERJEC PRON(pers,sing) | uh I I; well ah it; then uh I
201 | INTERJEC PRON(dem,sing) V(cop,pres,encl) | Uh that 's; ah that 's; Oh that 's
202 | CONJUNC(coord) INTERJEC CONJUNC(subord) | but ah when; but ah because; and uh after
204 | N(com,sing) INTERJEC N(com,sing) | um uh bleedin; blackcurrant ah jelly; day uh day
208 | ADV(ge) ADV(ge) ADV(ge) | up usually now; not really not; slowly slowly back
210 | INTERJEC CONNEC(ge) INTERJEC | mhm but uh; uh but uh; ah but ah
212 | PRON(pers,sing) V(cop,past) ADJ(edp) | I was scared; I was 12; I was 10
213 | PRON(dem,plu) N(com,plu) CONJUNC(coord) | these things and; and and and; those machines and
215 | PRON(dem,sing) ADJ(ge) N(com,sing) | that personal accident; this strange feeling; this little girl
216 | CONJUNC(coord) INTERJEC CONNEC(ge) | and ah then; and ah then; but ah but
219 | CONJUNC(coord) INTERJEC ADV(ge) | ah that then; uh not recently; and uh back
221 | INTERJEC N(com,plu) N(com,plu) | uh bank accountants; uh beans peas; uh church dances
222 | CONJUNC(coord) INTERJEC PRON(dem,sing) | and ah that; and uh that; uh oh that
224 | INTERJEC ADJ(ge) N(com,sing) | uh different kind; uh long jump; uh small piece
226 | INTERJEC N(com,sing) CONJUNC(coord) | uh bread and; rice porridge or; uh trade and
228 | ADV(ge) CONJUNC(coord) PRON(pers,plu) | properly or they; somehow and they; then and they
233 | INTERJEC N(com,sing) PREP(ge) | porridge cup of; uh part of; uh sort of
239 | INTERJEC FRM FRM | uh you know; Ah you mean; uh you know
240 | PRON(pers,sing) V(cop,pres,encl) INTERJEC | it 's uh; it 's uh; It 's uh
241 | PRON(pers,plu) V(montr,pres) ART(indef) | they get a; we have a; they have a
243 | INTERJEC N(com,sing) N(com,sing) | uh money production; by quality control; uh um meat
247 | INTERJEC CONJUNC(subord) PRON(pers,sing) | uh because I; Uh if I; if I it
253 | PRON(pers,plu) V(montr,past) PRTCL(to) | we had to; we had to; they wanted to
261 | INTERJEC N(prop,sing) N(prop,sing) | uh Billy Crystal; uh Saint George; uh Family Feud
263 | CONJUNC(coord) INTERJEC CONJUNC(coord) | and uh and; and uh and; and uh and
269 | NUM(card,sing) N(com,sing) PREP(ge) | one day at; one place to; one side of
276 | N(com,plu) CONJUNC(coord) INTERJEC | shorts and uh; people and uh; nights and ah
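The trigrams in Table 4 are sequences of three consecutive POS tags taken from the tagged corpus. As a minimal sketch of how such trigrams can be collected and ranked for over- or underuse, the following uses invented mini-corpora and ranks by raw frequency difference; the paper's actual between-types normalisation additionally corrects the counts for corpus size, and significance is then assessed with permutation tests.

```python
from collections import Counter

def pos_trigrams(tagged_sentences):
    # Each sentence is a list of POS tags; count every run of three consecutive tags.
    counts = Counter()
    for tags in tagged_sentences:
        for i in range(len(tags) - 2):
            counts[tuple(tags[i:i + 3])] += 1
    return counts

# Toy corpora with invented tag sequences in the style of Table 4.
finnish_english = [
    ["CONNEC(ge)", "INTERJEC", "PRON(pers,plu)", "V(intr,pres)", "PREP(ge)"],
    ["PRON(pers,plu)", "V(intr,pres)", "PREP(ge)", "N(prop,sing)"],
]
native_english = [
    ["PRON(pers,plu)", "V(intr,pres)", "ADV(ge)", "PREP(ge)"],
]

f = pos_trigrams(finnish_english)
n = pos_trigrams(native_english)

# Rank by raw frequency difference (a stand-in for the between-types
# normalisation): the most negative difference is the most learner-typical.
ranked = sorted(set(f) | set(n), key=lambda t: n[t] - f[t])
print(ranked[0])  # -> ('PRON(pers,plu)', 'V(intr,pres)', 'PREP(ge)')
```

Since `Counter` returns zero for absent keys, trigrams occurring in only one of the two corpora need no special handling in the ranking step.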
