Automatically Extracting Typical Syntactic Differences from Corpora
Wybo Wiersma∗, University of Groningen
[email protected]
June 8, 2009
Abstract

We develop an aggregate measure of syntactic difference for automatically finding typical syntactic differences between collections of text. With the use of this measure it is possible to mine for statistically significant syntactic differences between, for example, the English of learners and natives, or between related dialects, and to find not only absence or presence, but also under- and overuse of specific constructs. We have already applied our measure to the English of Finnish immigrants in Australia to look for traces of Finnish grammar in their English. The outcomes of this detection process were analysed and found to be very useful; a report on this is included in this paper. Besides explaining our method, we also go into the theory behind it, including the less well-known branch of statistics called permutation statistics, and the custom normalisations required for applying these tests to syntactic data. We also explain how to use the software we developed to apply this method to new corpora, and give some suggestions for further research.
1 Introduction
Languages are always changing and never homogeneous or completely isolated. For example, language contact is a common phenomenon,∗ and one which may even be growing due to the increased mobility of recent years.

∗ Written by Wybo Wiersma as his BA thesis, but meant to be published with John Nerbonne (University of Groningen, [email protected]) and Timo Lauttamus (University of Oulu, [email protected]) as co-authors. They contributed substantial parts to the research design, the analysis and the paper.

There are also differences in language usage between various professions, classes and subcultures in society, whether or not under the influence of education and of old and modern media, such as newspapers, television and the internet. And lastly, there are of course differences between regional dialects, which might also be changing, merging, or branching under the influence of the above-mentioned and many other factors, including their own complex internal dynamics.

But as these rich fields for investigation are being explored, we nonetheless still seem to have no way of assaying the aggregate differences in language usage between various groups. For example, most of the cross-linguistic research into Second Language Acquisition (SLA) has so far focused on examining typical second-language learners' errors, such as absence of the copula, absence of prepositions, different (deviant) uses of articles, loss of inflectional endings, and deviant word order, and on making inferences about them to explain potential substrate influence (Odlin, 1989; Odlin, 1990; Odlin, 2006a; Odlin, 2006b). As Weinreich famously noted: "No easy way of measuring or characterizing the total impact of one language on another in the speech of bilinguals has been, or probably can be devised. The only possible procedure is to describe the various forms of interference and to tabulate their frequency." (Weinreich, 1968, p. 63)

This paper proposes, explains and tests a computational technique for measuring the aggregate
degree of syntactic difference between two varieties of language. With it we attempt to measure the "total impact" in Weinreich's sense, and we have come a long way, albeit with respect to a single linguistic level, syntax. It may make the data of many linguistic, sociolinguistic and dialectological studies amenable to the more powerful statistical analysis currently reserved for numerical data.

Naturally we wanted more than a measure which simply assigns a numerical value to the difference between two syntactic varieties; we also wanted it, like any useful statistical measure, to be able to show us whether, and if so, how significant the difference is. In addition, we also wanted to be able to examine the sources of the difference, both in order to win confidence in the measure and to answer linguistic questions, such as those about the relative stability or volatility of syntactic structures. Thus the technique presented not only offers a significance value for the aggregate difference, but also allows one to pinpoint the syntactic differences responsible for it.

In the second and third sections we introduce, explain and discuss our method. In the fourth we go into some practical results it has produced as applied to the English of Finnish immigrants in Australia. The fifth goes deeper into the statistical theory behind it, and contains a full, mathematical description of the method. The sixth and seventh sections contain an explanation of how to use the software we produced for doing this research, and some suggestions for possible future research, respectively. We will then conclude and give some acknowledgements.

1.1 Related Work

Thomason and Kaufmann (1988) and van Coetsem (1988) noted, nearly simultaneously, that the most radical (structural) effects in language contact situations are to be found in the language of switchers, i.e., in the language used as a second or later language. In line with this we looked at the English of immigrants in our example research. Poplack and Sankoff (1984) introduced techniques for studying lexical borrowing and its phonological effects, and Poplack, Sankoff and Miller (1988) went on to exploit these advances in order to investigate the social conditions in which contact effects flourish best.

We follow Aarts and Granger (1998) most closely, who suggest focusing on tag sequences in learner corpora, just as we do. We shall add to their suggestion a means of measuring the aggregate difference between two varieties, and show how we can test whether that difference is statistically significant. Nathan Sanders (2007) has, in the meantime, extended our method to use the parse tree leaf-path ancestors of Sampson (2000) instead of n-grams. These are the tags along the routes from the root tag of the parse tree up to, but not including, each word of the sentence. He used it on the British part of the ICE (International Corpus of English) corpus, which is already fully parsed. In this paper, however, we only report on the method using n-grams, but we do see his extensions as promising for the technique. Baayen, van Halteren & Tweedie (1996) work with full parses on an authorship recognition task, while Hirst & Feiguina (2007) apply partial parsing in a similar study, obtaining results that allow them to distinguish a notoriously difficult author pair, the Brontë sisters. They also establish that their technique can work for even short texts (500 words and fewer), which could be an enormous advantage in considering practical applications of methods such as the one presented here, or even a combination of our methods.

2 Method

The fundamental idea of the proposed method is to tag the corpus to be investigated syntactically, to create frequency vectors of n-grams (trigrams, with n = 3, for example) of POS (part-of-speech) tags, and then to compare and analyse these using a permutation test. This results in both a general measure of difference and a list of the POS-n-grams that are most responsible for the difference.

In five steps the method is:

• POS-tag two collections of comparable material
• Take n-grams (1- to 5-grams) of the POS-tags
• Compare their relative frequencies using a permutation test
• Sort the significant POS-n-grams by extent of difference
• Analyse the results
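As a concrete illustration, the data flow of the first three steps can be sketched as follows. This is a toy sketch, not the software described in section 6: the dictionary lookup stands in for a real tagger such as TnT, and all names and the miniature "corpora" are illustrative only.

```python
from collections import Counter

# Step 1: POS-tag two collections (a toy lookup tagger stands in for TnT).
TOY_LEXICON = {"the": "ART", "a": "ART", "cat": "N", "dog": "N",
               "mat": "N", "sat": "V", "saw": "V", "on": "PREP"}

def tag(sentence):
    """Assign a POS tag to every word (UNK for out-of-lexicon words)."""
    return [TOY_LEXICON.get(w, "UNK") for w in sentence.split()]

# Step 2: take n-grams of the POS tags (trigrams here).
def ngrams(tags, n=3):
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

# Step 3 (input): per-group frequency vectors over a shared n-gram vocabulary.
def frequency_vectors(group_a, group_b, n=3):
    count_a = Counter(g for s in group_a for g in ngrams(tag(s), n))
    count_b = Counter(g for s in group_b for g in ngrams(tag(s), n))
    vocab = sorted(set(count_a) | set(count_b))
    return vocab, [count_a[g] for g in vocab], [count_b[g] for g in vocab]

group_a = ["the cat sat on the mat", "a dog sat on the mat"]
group_b = ["the dog saw a cat", "the cat saw the dog"]
vocab, vec_a, vec_b = frequency_vectors(group_a, group_b)
```

The two vectors returned here correspond to the two rows of the 2 × 8,300 frequency table described in section 2.3 below.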
Now we will describe these steps in more detail, starting with the tagging of two comparable collections of text.

2.1 Tagging two collections of comparable material

We start with two collections of comparable material. For this you can think of two sets of interviews with people from different dialect areas, essays from two different grades, and other similar pairs of samples. In our example research into the English of Finnish immigrants in Australia we used interviews divided between two generations of arrival in Australia. One group was aged 16 or younger when they disembarked, and the other was 17 or older. The interviews come from the Finnish Australian English Corpus by Greg Watson (1996), and are reduced to the parts that were free conversation, leaving a remaining total of 305,000 words.

Then the material is POS-tagged, or in other words, syntactic categories are assigned to all words in all the sentences. For practical reasons this is best done automatically, with a POS-tagger. There are many taggers, both statistical and rule-based (such as CLAWS, the TreeTagger, Tsujii's Tagger, and ALPINO for Dutch),[1] but we use Thorsten Brants' Trigrams 'n Tags (TnT) tagger, a hidden Markov model tagger which has performed at state-of-the-art levels in organized comparisons, achieving a precision of 96.7% correct tag assignments on the material of the Penn Treebank corpus (Brants, 2000). And as our chosen tagger is a statistical tagger, we also need a tag-set and a corpus to train it on. For this we choose the British part of the ICE corpus (Nelson et al., 2002). This corpus is fully tagged and checked by hand, so it forms a sound basis. It uses the TOSCA-ICE tagset, which is a linguistically sensitive set, designed by linguists (not computer scientists) and consisting of 270 POS tags (Garside et al., 1997).

2.2 Taking n-grams of POS-tags

Then, in order to be able to look at syntax instead of just single POS-tags, the tags are collected into n-grams, i.e. sequences of POS tags as they occur in corpora. For a sentence such as "The cat sat on the mat", the tagger assigns the following POS labels:

The   ART(def)
cat   N(com,sing)
sat   V(intr,past)
on    PREP(ge)
the   ART(def)
mat   N(com,sing)
.     PUNC(per)

And for a sentence such as "We'll have a roast leg of lamb tomorrow (...)" (extracted from our data), the tagger assigns the following POS labels:

We        PRON(pers,plu)
'll       AUX(modal,pres,encl)
have      V(montr,infin)
a         ART(indef)
roast     N(com,sing)
leg       N(com,sing)
of        PREP(ge)
lamb      N(com,sing)
tomorrow  ADV(ge)
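Collecting such a tag sequence into n-grams amounts to sliding a window of length n over it. The following sketch does this for the first example sentence (the tag strings are copied from the TOSCA-ICE labels above; the helper name is illustrative):

```python
def pos_ngrams(tags, n):
    """All contiguous n-grams (as tuples) over a sentence's POS tags."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]

tags = ["ART(def)", "N(com,sing)", "V(intr,past)", "PREP(ge)",
        "ART(def)", "N(com,sing)", "PUNC(per)"]

trigrams = pos_ngrams(tags, 3)
# First trigram: ('ART(def)', 'N(com,sing)', 'V(intr,past)')
# Last trigram:  ('ART(def)', 'N(com,sing)', 'PUNC(per)')
```

A sentence of 7 tags thus yields 5 trigrams, and in general a sentence of length L yields L - n + 1 n-grams.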
These are then collected into n-grams. Trigrams are as follows: (first sentence) ART(def)-N(com,sing)-V(intr,past), ..., ART(def)-N(com,sing)-PUNC(per), and (second sentence) PRON(pers,plu)-AUX(modal,pres,encl)-V(montr,infin), ..., ART(indef)-N(com,sing)-N(com,sing), ..., PREP(ge)-N(com,sing)-ADV(ge). In our research into the English of Finnish immigrants to Australia we used POS trigrams, and thus collected about 47,000 different kinds of trigrams, in other words 47,000 POS-trigram-types, from our corpus. For optimization reasons, and for a reason that we will come back to in section 5.3.2, we removed all trigrams that occurred five times or fewer, leaving us with a remaining total of 8,300 POS-trigram-types.

[1] List of taggers: http://www-nlp.stanford.edu/links/statnlp.html#Taggers (CLAWS and the TreeTagger are on this list); Tsujii's Tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/~tsuruoka/postagger/; ALPINO: http://www.let.rug.nl/vannoord/alp/Alpino/ (only for Dutch).

2.3 Compare frequencies using a permutation test

The next step consists of counting how frequently each of the POS-n-grams (the 8,300 trigrams in our case) occurs in both of the data-sets, or sub-corpora. These counts result in two vectors, or in more familiar phrasing, in two table rows in a 2 × 8,300 element table, with for each n-gram a column of two cells containing frequency counts, one for each group. These form the input for our statistical test. And even though there is no convenient test in classical statistics that we can apply to check whether the differences between vectors containing 8,300 elements are statistically significant, we fortunately may turn to permutation tests with a Monte Carlo technique in this situation (Good, 1995).

We will explain it in detail (with all formulas) in section 5, but the fundamental idea of a permutation test is very simple. We first measure the difference between two sets of data in some convenient fashion, obtaining a degree of difference; let us call this the base case. We then extract two sets at random from the total of all data pooled together, which become our new sets, and we calculate the difference between these two in the same fashion. We then check whether the difference is the same or bigger than in the base case, and if it is, we take note of this. We repeat this process with the random sets ten thousand times, taking different random sets every time, and in the end we sum the total number of times the difference was at least as extreme as in the base case. This value is then divided by the ten thousand times we tried, and the outcome of that is — in standard hypothesis testing terms — our p-value. This represents how many times in ten thousand (standardized to one) a difference as large as that found in the base case would have occurred by pure chance.

But before a statistical difference can be determined, a measure for the difference has to be chosen, and we have to decide what elements to permute. For various statistical reasons explained in section 5.1 we permuted speakers, using our so-called subject normalisation, and we normalised for frequency using our between-types normalisation (see section 5.3 for these). As our measure we developed RSquare, which takes the square of the difference between each normalised n-gram count for the two groups; more on this in section 5.4. For now it is important that some normalisations and a suitable measure are needed to get meaningful results from a permutation test, that we permuted authors instead of sentences or n-grams, and that the test provides us with a p-value for the overall difference between the two sets of data.

2.4 Sort POS-n-grams by extent of difference

In addition to testing the aggregate difference between the two sub-corpora, we also wanted to be able to extract the POS-n-grams that were most responsible for the difference. This in order to be able to mine the corpus for the linguistic sources of the difference, to understand these better, and also to be able to test the method. We get a list of responsible POS-n-grams by looking at the vectors from the base case again, and applying a permutation test to all the individual POS-n-grams. In other words, we do 8,300 permutation tests, one for each n-gram. So for each individual n-gram the RSquare value of the base case is kept, and then in each permutation the RSquare value for the same n-gram is compared to it for being as large or larger, counting these, and using them to arrive at p-values, as in the original permutation test. For practical reasons we did these tests all at the same time, by keeping track of and comparing all the n-gram RSquare values, besides the aggregate RSquare value.

It might look as if there is an issue with so-called family-wise errors, because of the 8,300 tests, of which 5% (about 415) would get a p-value below 0.05 by pure chance. We cover this by requiring the aggregate difference to be significant at a p level of 0.05 first (as an omnibus test, for the possible truth of the overall null hypothesis: there being no overall difference). This guards us against complete-null-hypothesis family-wise errors (Westfall and Young, 1993). More on this, and possible improvements, in section 7.3.

Once we have this list of significant POS-n-grams we sift them into the groups for which they are typical. We do this for each n-gram by comparing the value found in the base case for that n-gram with the expected value for it based on the group size. If the groups are of the same size, the expected values are the same for both groups (half the corpus-wide count for the n-gram) and it is simply a case of looking at which of the base-case values is larger, but if group sizes differ, we need to use the expected values based on group size (see the third normalisation, section 5.3.3).

So we have two lists now, one for each group. Then for each group we sort the n-grams in its list by how characteristic the n-gram type is for that group. The extent of typicality can be determined both relatively, normalised for frequency, and absolutely, but more on these options in sections 5.3.2 and 5.3.1. What is important here is that they are sorted in this step.

As noted, we only select and sort the significant POS-n-grams. The requirement of significance filters out any n-grams which occur only or mostly in one group, but for which this is probably due to chance. For example, if we selected n-grams for which more than, say, 80% of occurrences fall in one group, then we would also select n-grams occurring only once, purely by chance, in one of the groups (these would be 100% characteristic, but never significant). At the end of this step we thus have two lists of typical, significant POS-n-grams which can be analysed for linguistic causes of the aggregate difference between the groups. Moreover, because we sort them to bring the most characteristic n-grams to the top, we are certain to see the most relevant data first, which is especially useful when analysing the data only up to a specific cut-off point, such as, for example, the top 200 of each list.
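The permutation test of sections 2.3 and 2.4 can be sketched as follows. This is a simplified illustration, not our actual implementation: it permutes whole speakers, as described above, but uses a plain sum of squared count differences as a stand-in for the normalised RSquare measure of section 5, so the subject and between-types normalisations are omitted, and the speaker data are toy values.

```python
import random
from collections import Counter

def group_vector(speakers, vocab):
    """Sum the per-speaker n-gram counts into one group frequency vector."""
    total = Counter()
    for counts in speakers:
        total.update(counts)
    return [total[g] for g in vocab]

def sq_diff(vec_a, vec_b):
    """Sum of squared per-n-gram differences (stand-in for RSquare)."""
    return sum((a - b) ** 2 for a, b in zip(vec_a, vec_b))

def permutation_test(speakers_a, speakers_b, iterations=10000, seed=1):
    rng = random.Random(seed)
    vocab = sorted({g for s in speakers_a + speakers_b for g in s})
    base = sq_diff(group_vector(speakers_a, vocab),
                   group_vector(speakers_b, vocab))
    pooled = speakers_a + speakers_b
    at_least_as_extreme = 0
    for _ in range(iterations):
        rng.shuffle(pooled)                      # permute whole speakers
        perm_a = pooled[:len(speakers_a)]
        perm_b = pooled[len(speakers_a):]
        if sq_diff(group_vector(perm_a, vocab),
                   group_vector(perm_b, vocab)) >= base:
            at_least_as_extreme += 1
    return at_least_as_extreme / iterations      # the Monte Carlo p-value

# Each speaker is a Counter of POS-trigram counts (toy data: the two
# groups use the two trigram types in opposite proportions).
speakers_a = [Counter({"ART-N-V": 9, "N-V-PREP": 1}) for _ in range(5)]
speakers_b = [Counter({"ART-N-V": 1, "N-V-PREP": 9}) for _ in range(5)]
p = permutation_test(speakers_a, speakers_b, iterations=2000)
```

The per-n-gram tests of section 2.4 follow the same pattern: one base-case value and one extreme-count per n-gram are tracked inside the same permutation loop.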
2.5 Analyse the results

The last step consists of analysing the data. This comes down to putting the top-x n-grams in context and comparing them with various linguistic hypotheses about the aggregate difference. This obviously requires intimate knowledge of the languages and/or dialects under comparison. It is where the computation ends, and the linguistic skills begin. Still, to make things easier we did develop some tools to help the analysis, e.g. a program to find examples from the corpus for given POS-n-grams. See section 6 for more on this and the other tools for applying the method.

Besides tools and linguistic skills, there are two important facts to keep in mind while analysing the results. The first is that the method will find differences only between the two sets provided for comparison. That is, if a certain POS-n-gram is over- or under-used in both groups, relative to the general population, then the method will not find it as being characteristic for both groups. So when dealing with two generations of second language learners, as we did, one has to be careful not to make claims that involve a comparison to the broader population of native speakers. Thus one either has to make the right comparison, or has to limit one's claims.

The second thing to note is that when a certain (set of) POS-n-grams is significantly characteristic for a group, this does not mean that one's explanation for it being so is significant too. That is, we do not claim that POS-n-grams can play an explanatory role in the account of language contact by themselves. A POS-n-gram being significantly characteristic for one group or another is only a fact, and a fact that can be accounted for by multiple theories. Thus there can be more than one explanation for the over- or under-use of POS-n-grams, and different samples from the corpus can suggest different explanations. So caution is warranted here. To conclusively test hypotheses using this method one has to formulate them in advance, and only then test them using the method, preferably using multiple, representative data-sets (also see section 7.3). But besides all these cautions, the method is quite powerful, and has been shown to be useful for at least data mining, as in the analysis we made of the English of Finnish immigrants in Australia. More on this in section 4.

3 Rationale

Now we will discuss the reasons for, and assumptions behind, our method, and after that some of its features and options.

3.1 Why this Method
An important feature of our method is that it can identify not only deviant syntactic uses (errors), but also the overuse and underuse of linguistic structures, whose importance is emphasized by researchers on second-language acquisition (Coseriu, 1970; Ellis, 1994; Bot et al., 2005). According to these experts it is misleading to consider only errors, as second language learners likewise tend to overuse certain possibilities and tend to avoid (and therefore underuse) others. For example, de Bot et al. (2005) suggest that non-transparent and difficult constructions are systematically avoided even by very good second-language learners. In a similar vein, Thomason (2001, 148) argues that learners often ignore or fail to learn certain target language distinctions that are opaque to them at early to middle stages of their learning process. And we expected to find this kind of SLA behaviour in our migrants' data.

This brings us to a second feature of the method: it can be applied to rather rough data, such as transcripts of speech, the writing of second-language learners, or texts from a different dialect, or even the intersection of these, as in our case: transcripts of the speech of Australian immigrants, using a tagger trained on British data (a different dialect). And as noted in the introduction, it can be applied to any language, using any tag-set for which there is a tagger or a tagged corpus available (TnT can be trained on any tagged corpus). We note that natural language processing work on tagging has compared different tag sets, noting primarily the obvious, that larger sets result in lower accuracy (Manning and Schütze, 1999, 372 ff.). Still, since we aimed to contribute to the study of language contact and second-language learning, we chose a linguistically sensitive set, that is, a large set designed by linguists, not computer scientists, but other choices are possible, and may lead to lower error rates or more linguistic precision.

A third benefit of this method is that it delivers numerical data, and also provides significance levels with them. This allows one to go beyond the anecdotal and give a more rigorous grounding to dialectological or linguistic hypotheses. It can be important not only in the study of language contact, but also in the study of second-language acquisition. And it is also more generally linguistically significant, since contact effects are, for example, prominent in linguistic structure and well-recognized confounders in the task of historical reconstruction. Numerical measures of syntactic difference may enable these fields to look afresh at many issues.

It is important, however, to know that our method hinges on two assumptions, even if we think that they are reasonable and sound. We will introduce and discuss them now.

3.2 POS-n-grams show Syntax

The pivotal assumption behind our method is that POS-n-grams offer a good aggregate representation of syntax. A first objection to it could be that POS-n-grams do not reflect syntax completely, and that we should thus focus on full parse-trees instead. However, a practical problem with using parse-trees is first of all that there is no syntactic equivalent of the International Phonetic Alphabet (Association, 1949), i.e. an accepted means of noting (an interesting level of) syntactic structure for which there is reasonable scientific consensus. Moreover, the ideal system would necessarily reflect the hierarchical structure of dependency found in all contemporary theories of syntax, whether indirectly based on dependencies or directly reflected in constituent structure. Since it is unlikely that researchers will take the time to hand-annotate large amounts of data, meaning we shall need automatically annotated data, we encounter a second problem with parse-trees, viz., that our parsers, the automatic data annotators capable of full annotation, are not yet robust enough for this task, especially for rougher data, such as that of spoken language or non-natives (even the best score only about 90% per constituent on edited newspaper prose).

A second objection to the use of POS-n-grams could be that syntax concerns more than POS-n-grams. In response we wish to deny that this is a genuine problem for the development of a measure of difference. We note that our situation in measuring syntactic differences is similar to other situations in which effective measures have been established. For example, even though researchers in first language acquisition are very aware that syntactic development is reflected in the number of categories, rules and/or constructions used, the degree to which principles of agreement and government are respected, the fidelity to adult word order patterns, etc., they are still in large agreement that the very simple mean length of utterance (MLU) is an excellent measure of syntactic maturity (Ritchie and Bhatia, 1998). Similarly, in spite of their obvious simplicity, life expectancy, infant mortality rates and even average height are considered reliable indications of health when large populations are compared. We therefore continue, postulating that the measure we propose will correlate with syntactic differences as a whole, even if it does not measure them directly.

In fact we can be rather optimistic about using POS-n-grams, given the consensus in syntactic theory that a great deal of hierarchical structure is predictable given knowledge of lexical categories, in particular given the lexical head. Sells (1982, §§ 2.2, 5.3, 4.1) demonstrates that this was common to theories in the 1980s (Government and Binding theory, Generalized Phrase Structure Grammar, and Lexical Functional Grammar), and the situation has changed little in the successor theories (Minimalism and Head-Driven Phrase Structure Grammar). Even though the consensus of twenty years ago has been relaxed in recognition of the autonomy of constructions (Fillmore and Kay, 1999), it is still the case that syntactic heads have a privileged status in determining a projection of syntactic structure. So it is likely that even individual POS-n-grams typical for a group of language users can give the skilled linguist at
least a good indication of what the differences in the underlying syntax of their speech or writing might be.

3.3 Part-of-Speech Taggers are Accurate Enough

The method's second assumption is that POS-taggers are accurate enough for the POS-n-grams to reflect syntax at an aggregate level. First of all the facts: as noted, we used Thorsten Brants' Trigrams 'n Tags tagger. We used it with the TOSCA-ICE tagset, consisting of 270 tags (Garside et al., 1997), of which 75 were never instantiated in our material. We trained the tagger on the British ICE corpus, which totals 1,000,000 words. Since our material was spoken English (see section 4), we also did some experiments with training TnT on only the spoken half of the ICE corpus, but performance was better when using the whole corpus. Even then, using the British ICE corpus was suboptimal, as the material we wished to analyze was the English of Finnish emigrants to Australia, but we were unable to acquire sufficient tagged Australian material.

We tested the TnT tagger with a sample of 1,000 words from our material which was tagged by hand. We found that the tagger was correct for 81.2% of POS-tags. The accuracy is poor compared to newspaper texts, but we are dealing with conversation, including the conversation of non-natives, so it is pretty good. Still, n-grams consist of multiple POS-tags, so performance is worse for larger n-grams: namely 56.1% for 3-grams, and even 39.0% for 5-grams. In addition we also performed our small test using the reduced ICE tagset, which consists of only 20 tags. For this, performance was a bit better, namely 86.7% for 1-grams, 66.6% for 3-grams, and 51.8% for 5-grams. See table 1 for the full set of results.

Table 1: Performance of TnT on our data. Results per n-gram size for both the full and the reduced tagsets.

N-grams   full tagset   reduced tagset
1-grams   81.2%         86.7%
2-grams   67.5%         76.2%
3-grams   56.1%         66.6%
4-grams   46.7%         58.8%
5-grams   39.0%         51.8%

So the performance of TnT is not that good for 3-grams and longer n-grams, making it arguable that our method is impractical, due to about half of the 3-grams of the full tagset being erroneous. But fortunately we can fall back on a property of the statistics of large numbers here, namely that as the errors produced by automatic taggers are more or less random, they will cancel each other out, effectively annulling the influence of tagging errors. This cancelling of errors happens because in most cases we can expect the tagger to make similar mistakes in both groups, so they favor neither of the groups. A tentative analysis confirmed this, as we found that the tagging errors in both groups seem to be of a similar kind, and also the error rates in both groups are very similar, as can be seen in table 2 (also see the results in section 4.6 for a further confirmation).

Table 2: Performance of TnT for the groups. Results of both groups, for the full and reduced tagsets.

N-grams   youth   adults
1-grams   81.7%   80.9%
2-grams   67.3%   67.7%
3-grams   56.0%   56.2%
4-grams   47.6%   46.2%
5-grams   40.4%   38.2%

Thus the results of the method, in the form of a p-value and the POS-n-grams responsible for the difference, will be due mostly to the correctly tagged POS-n-grams, and thus it is capable of delivering meaningful results. Especially when the individually detected POS-n-grams are analysed and looked up in the corpus, as in that case any tagging errors that would show up in the results can be corrected by hand or left out of the analysis.

4 Tangible Results

In this section we will describe our research into the English of Finnish immigrants in Australia. We will especially examine the trigrams responsible for the difference. We have performed experiments with bigrams and with combinations of n-grams for larger n, but we restrict the discussion here to trigrams in order to simplify presentation. We mainly test their usefulness for linguistic analysis here, in order to evaluate the method.

4.1 Our Research

As noted, we applied the procedure described in section 2 to the Finnish Australian English
corpus, a corpus of interviews compiled in 1994 by Greg Watson of the University of Joensuu, Finland (Watson, 1995; Watson, 1996). The informants were all Finnish emigrants to Australia and they are classified into two groups in this report: (1) the ADULTS (group ‘a’, adult immigrants), who were 18 or older upon arrival in Australia; (2) the JUVENILES (group ‘j’, juvenile immigrant children of these adults), who were born in Finland and were all under the age of 18 when they disembarked.
The goal of this research was to use our method to detect the linguistic sources of the syntactic variation between the two groups, examining the degree of what we call syntactic 'contamination' in the English of the adult speakers, and to interpret the findings from (at least) two perspectives: universal vs. contact influence. In our reading, the notion of 'universal' is concerned not with hypotheses about Chomskyan universals and their applicability to SLA, nor with hypotheses about potential processing constraints in any detail, but rather with more general properties of the language faculty and natural tendencies in the grammar, called 'vernacular primitives' by Chambers (2003, 265-266). To explain differential usage by the two groups, we also draw upon the strategies, processes and developmental patterns that second-language learners usually evince in their interlanguage regardless of their mother tongue (Faerch and Kasper, 1983; Larsen-Freeman and Long, 1991; Ellis, 1994; Thomason, 2001). In addition to looking for 'universal' and contact influences, we also distinguish between adult immigrants and immigrant children, based on Lenneberg's (1967) critical age hypothesis, which suggests a possible biological explanation for successful second-language (L2) acquisition between age two and puberty, and which in any case predicts a difference in the expected L2 proficiency of the juveniles and the adults. Thus we also tested our idea about measuring syntactic differences by looking for such a difference. The difference itself is not remarkable, but it allowed us to verify whether the measure is functioning.

June 8, 2009

4.2 The Corpus and the Interviewees

Greg Watson interviewed the subjects between 1995 and 1998. They were farmers and skilled or semi-skilled working-class Finns who had left Finland in the 1960s at the age of 25-40, some with children. The average age of the juveniles was 6 at the time of arrival and 36 at the time of the interview, as opposed to the adults, who were, on average, 30 at the time of arrival and 59 at the time of the interview. The genders are almost equally represented in the two groups. There are sixty-one conversations with adults and twenty-seven with juveniles, each lasting approximately 65 to 70 minutes. The interviews were transcribed in regular orthography by trained language students and later checked by the compiler of the corpus. We used only those sections of the interviews which consist of relatively free conversation, i.e. a total of 305,000 word tokens. Of these, 84,000 were produced by the juveniles and 221,000 by the adults. Neither group of speakers was formally tested for their proficiency in English. By observing the data, however, we can confidently say that the average level of the adults' English is considerably lower than that of the juveniles, who had received their school education in Australia, as opposed to the adults, who were educated in Finland. The sentences of the juveniles were also substantially longer (a mean length of utterance of 27.1 tokens, as opposed to 16.3 tokens for the adults), confirming a difference in their language skills.

The rest of this description is more hypothetical and based on a similar account of a number of immigrant groups of European origin in the United States (Lauttamus and Hirvonen, 1995). It is, however, supported by the biographical interview data elicited from our informants in Australia. The adult immigrants (adults) typically went on speaking Finnish at home and carried on most of their social life in that language. They struggled to learn English, with varying success. Even the best learners usually retained a distinct foreign accent and some other foreign features in their English. But we can say that they were marginally bilingual, as most of them could communicate successfully in English in at least some situations, although their immigrant language was clearly dominant. The immigrant parents also spoke their native language to their children (the juveniles), so this generation usually learned the ethnic tongue as their first language. The oldest child might not have learned any English until he or she went to school. The younger children often started learning English earlier, from their older siblings and friends. At any rate, during their teen years the second-generation children became more or less fluent bilinguals. Their bilingualism was usually English-dominant: they preferred to speak English to each other, and it was sometimes difficult to detect any foreign features at all in their English. As they grew older and moved out of the Finnish communities, their immigrant language started to deteriorate from lack of regular reinforcement. Some gave up speaking their ethnic language altogether. In the second generation it was still common to marry within the ethnic group. But even if the spouses shared an ethnic heritage, they were usually not comfortable enough in the ethnic language to use it for everyday communication with each other. Therefore they did not speak it to their children, either. For an ethnic language to stay alive as a viable means of communication for longer than two generations would require a continuous influx of new immigrants but, as Watson (1995, 229) points out, from the late 1970s onwards immigration to Australia has been quite restricted, for all nationalities.

4.3 The Language Contact Situation

From a typological point of view, the following generalisations can be made. The basic pattern is one of maintenance of the ethnic language by the Finnish immigrant generation (adults) and a subsequent shift from Finnish to English by the second generation (juveniles). The first generation is characterised as 'marginally bilingual'; in contrast with the fluent bilinguals of the second generation, they can also be regarded as non-fluent bilinguals, or as L2 (English) learners with some degree of L1 (Finnish) interference. Here Finnish is linguistically dominant over English, whereas English is socially dominant over Finnish. It is expected that this interference from their Finnish begins with sounds (phonology) and morphosyntax, not with vocabulary. We expect this because in the English spoken by first-generation Finnish Americans, phonological and morphosyntactic patterns also often deviate from standard (American) English in the manner typical of 'learner language' or interlanguage (Pietilä, 1989; Hirvonen, 1988). Thus it is expected that the English spoken by first-generation Finnish Australians is affected only to a lesser extent in its morphosyntax. But as this research is only about morphosyntax, we will not go further into vocabulary and phonology here. We also expect the adults to show more contact-induced effects in their speech than the juveniles, since the former only temporarily shift to English, as in the case of an interview, whereas the latter have already shifted to English as their language of everyday communication.

In line with this we are also led to assume that the children of Finnish Australian immigrants represent a case of shift without interference, similar to that of 'urban immigrant groups of European origin in the United States' (Thomason and Kaufman, 1988, 120). The immigrants maintain their own ethnic languages for the first generation, while their children (and grandchildren) shift to the English of the community as a whole, with hardly any interference from the original languages. Thus even members of the first generation of immigrants (adults) demonstrate a variety of achievements, including native-like ability, while members of the second generation (juveniles) speak natively (Piller, 2002); language attrition does not wait until the third generation but begins with the first (Waas, 1996; Schmid, 2002; Schmid, 2004; Cook, 2003; Jarvis, 2003).

In addition, Clyne & Kipp (2006) note that the language groups 'with a very low shift to English' are those that have recently arrived from Southeast and East Asia and Africa, groups with different 'core values such as religion, historical consciousness, and family cohesion'. The high-shift groups tend to be ones 'for whom there is not a big cultural distance from Anglo-Australians'. Although there is no mention of Finnish Australians in Clyne & Kipp, we argue that they belong to the high-shift group. Consequently, we expect to find most of the evidence for syntactic contamination in the English of first-generation Finnish Australians, as the second generation may have already shifted to English without any interference from Finnish.
4.4 Significance and Trigrams

After applying the method we found that the groups clearly differed in the distribution of POS trigrams they contain (p < 0.0001). This means that the difference between the original two sub-corpora was in the largest 0.01% of the Monte Carlo permutations, and thus highly significant. We also find genuinely deviant syntax patterns if we inspect the individual trigrams responsible for the difference between the two sub-corpora. As noted in section 2.5, only differences between the groups can be found using our method, and thus we have no evidence about potential contamination of the juveniles' L2 acquisition relative to native speakers. Therefore we cannot yet prove or refute our hypothesis about the strength of contact influence as opposed to that of the other factors. Nevertheless we can still deduce much from the POS trigrams that contributed to the statistically significant differences that we did find in the data. It should be remembered that we only used the POS trigrams that were found to be statistically significant and typical (individual α levels of 0.05). We limited our analysis to a practically random selection of 137 trigrams from the 308 trigrams found to be significant and typical of the adult speakers (footnote 2). They were sorted by their relative typicality (see section 5.3.2 for the normalisation) and analysed using 20 random sentences from the corpus containing each of them. The findings can be described and listed as follows:

1. Overuse of hesitation phenomena (pauses, filled pauses, repeats, false starts etc.) and of parataxis (particularly with 'and' and 'but', not only at the phrasal level but also at the clausal level) by the adults as opposed to the juveniles. These arise from difficulties in speech processing in general and lexical access in particular; the adults' speech is less fluent.

2. The adults demonstrate overuse (and underuse) of the indefinite and definite articles. The first-generation speakers supply redundant definite articles, as in 'in the Finland'. It appears that hypercorrection (or overgeneralisation), common in learner language, may be a contributing factor, as Finnish has no article system.

3. Omission of the primary (copula) 'be' and of the primary verb 'be' in the progressive (present and past) is frequent in the adults as opposed to the juveniles. Learners who rely on auditory input, such as the less educated adults, leave out unstressed elements because they do not perceive them and, consequently, cannot process them.

4. Underuse of the existential (expletive) 'there' and the anaphoric 'it' in subject position is frequently attested in the English of the adults, for example in 'often wine if is some visitors', where the speaker is aiming at '...if there are some visitors'. This can be explained in terms of substratum influence from Finnish, which would assign the subject argument of the copula verb 'be' to the NP 'some visitors' and, consequently, would not mark the subject in the position before the copula.

5. The adults produce utterances where the negator 'not' is placed in pre-verbal position. This can be ascribed to Finnish substratum transfer (shift-induced interference), because Finnish always has the negative item (ei, inflected for person and number like any verb in standard Finnish) in pre-verbal position. Nevertheless, examples can be found in other non-standard varieties of English, e.g. in Spanish interlanguage English (Odlin, 1989, 1040).

6. The adults misuse the pronoun 'what' as a relative pronoun or complementiser. This can be ascribed to substratum transfer from Finnish, where 'what', in the plural, is used as an interrogative or relative pronoun. We argue that this usage of 'what' as a relative is overgeneralised on the model of Finnish, even though we know that it is found in some substandard English.

7. The adults extend the simple present (as opposed to the past tense and the progressive) to describe not only present but also past or future events. This can be partly ascribed to Finnish substratum transfer, as Finnish has no equivalent progressive forms or ways to express future events in its repertoire, and mainly uses the simple present in these functions. Pietilä (1989, 176) also reports on this in the English of first-generation Finnish Americans.

8. There is also evidence that the adults have acquired some formulae such as 'that's' and 'what's', as in 'that's is a'. Acquired as fixed phrases, they have apparently been processed as single elements. Learners often produce formulae or ready-made chunks as their initial utterances. Acquired formulae cannot be ascribed to substratum transfer, as they tend to be recurrent in any interlanguage.

Footnote 2: Originally this was the top 176, obtained using our earlier, slightly less robust normalisation.

The table below (3) shows the numbers of trigrams supporting each of the above claims:

Table 3: Trigram counts for the eight findings. The index number corresponds to the numbered findings list in the text. The 'short name' is a descriptive name that corresponds to the full table of trigrams at the end of the paper. 'Count' is the number of occurrences in the top. The '[useless]' label refers to the trigrams that were not fruitful for analysis.

Index  Short name     Count
0      [useless]      24
1      Disfluent      90
2      Articles       39
3      No-Be          1
4      No-There       3
5      Pre-Negator    3
6      Relative-What  3
7      Present        3
8      Formulae       5

It should be noted that we have the most evidence for disfluency and the over- and underuse of articles, though the other findings are also well supported. In addition, because this method also detects over- and underuse, the notion of avoidance does not imply total absence of a feature in either group.

4.5 Qualitative Analysis

In the English of the adult speakers of Finnish emigrants to Australia we have detected a number of features that we describe as 'contaminating' the interlanguage and that can be attributed to Finnish substratum transfer. These features include overuse (and underuse) of articles (2), omission of the existential 'there' (4), use of the negator 'not' in pre-verbal position (5), misuse of 'what' as a relative or complementiser (6), and overuse of the simple present to describe past and future events (7).

These findings are supported by Ellis (1994, 304-306), who argues that avoidance ('underrepresentation') and overuse ('overindulgence') can result from transfer: learners avoid using non-transparent L2 constructions, underusing them, and overuse simpler alternatives. Accordingly, the overuse of parataxis coupled with hesitation phenomena by the adults (1), for example, may partly result from their inability or unwillingness to produce subordinate constructions (hypotaxis).

On the other hand, what makes our argument for the role of contact somewhat less convincing is that many of these 'deviant' features might also be ascribed to more 'universal' properties of the language faculty, such as the hesitation phenomena (1), absence of the copula 'be' (3) and acquired formulae (8), while others may be explained in terms of both factors (such as 5, 6 and 7).

However, our findings support other empirical evidence, reviewed, for example, by Larsen-Freeman and Long (1991, 96-113) and Ellis (1994, 299-345), that shows how the learner's L1 influences the course of L2 development at all levels of language, although transfer seems to be more conspicuous in phonology, lexis and discourse than in morphosyntax. See our earlier paper (Lauttamus et al., 2007) for a more extensive qualitative analysis.

On the basis of the analysis so far, we contend that the statistical evidence obtained from our data indeed reflects syntactic distance between the two varieties of English and, consequently, aggregate effects of the difference in the two groups' English proficiency. We also confirm that the adults are (still) showing features of 'learner' language, and therefore those of 'temporary' or 'partial' shift, whereas the juveniles seem to show those of shift without interference.

These findings are in accordance with Lenneberg's (1967) critical age hypothesis, and with many others who point out that age (of onset) is a crucial determinant of successful L2 acquisition.
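As a concrete illustration of the kind of data this analysis works with, the sketch below counts POS trigrams per group with a sliding window over tag sequences. This is a minimal sketch, not our actual pipeline: the tag sequences here are invented, whereas in the real experiment the tags come from the TnT tagger.

```python
from collections import Counter

def pos_trigrams(tags):
    """All consecutive POS trigrams in one tagged sentence."""
    return [tuple(tags[i:i + 3]) for i in range(len(tags) - 2)]

# Invented toy tag sequences standing in for tagged corpus sentences.
adult_sentences = [["IN", "DT", "NNP", "VBZ", "JJ"]]   # e.g. 'in the Finland is nice'
juvenile_sentences = [["IN", "NNP", "VBZ", "JJ"]]      # e.g. 'in Finland is nice'

adult_counts = Counter(t for s in adult_sentences for t in pos_trigrams(s))
juvenile_counts = Counter(t for s in juvenile_sentences for t in pos_trigrams(s))

# Trigrams with a non-zero difference are candidates for over- or underuse.
difference = {t: adult_counts[t] - juvenile_counts[t]
              for t in set(adult_counts) | set(juvenile_counts)}
```

On real data the counts would, of course, first be normalised and tested for significance as described in section 5 before any trigram is singled out.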
4.6 Quantitative Analysis

Now we come to a more quantitative analysis, where we will examine the consequences of sorting the n-grams, and check the performance of TnT on the top n-grams to see if errors are indeed filtered out as noise. N-grams can be sorted in two ways, namely by absolute and by relative typicality. The first normalisation (the between-subjects normalisation, section 5.3.1) produces absolute data in the sense that the total numbers of occurrences of trigrams still play a role, and thus more frequent trigrams will end up at the top. The second normalisation (the between-types normalisation, section 5.3.2) normalises for frequency, and thus produces relative data, moving the most typical trigrams to the top.

We start with a scatterplot (figure 1) of the distributions produced by sorting the typical and significant trigrams by the two normalisations. Please note again that only the significant trigrams are sorted, and that exactly the same trigrams are sorted (but differently) on both lists. As can be seen in the scatterplot, the more frequent trigrams are generally less typical and vice versa. Thus the two ways of normalising really do produce differently ordered lists of trigrams. To an extent this distribution is to be expected, as it is less likely that something very frequent is confined to only one group: constructs that are frequent in a language are heard or seen more often, and are in most cases easier, and thus more likely to be overused, rather than merely being made up erroneously by many learners at the same time (and becoming frequent that way). So frequent trigrams are, as expected, in many cases less typical.

Our second graph (figure 2) lists the cumulative percentage of useful trigrams found at each rank position. This graph is much like the recall graphs shown for Information Retrieval systems. For example, for the 'absolute' data, after the highest-ranking 120 trigrams have been looked at, approximately 40% of the interesting ones in the total set of 308 have been seen. What it shows is slightly surprising: the absolute data, that is, the data normalised with just the between-subjects normalisation (the black line), is better than the relative data (normalised for frequency, the grey line). The difference is modest for the whole collection of useful trigrams, but quite a bit bigger for the useful trigrams when discarding those that have to do with hesitation phenomena (the dotted lines, '> disfluency'). When discarding the first two categories (both disfluency and articles), however, the results are more alike again. This suggests two things, namely that the overuse of articles occurs in frequent, but not very typical, trigrams, and that the same, to a lesser extent, might be true for hesitations. And while not conclusively proven, it might be a good idea to look at the absolute data first in later research, especially if many n-grams are found or when time constraints are strict.

Lastly we also hand-checked the performance of TnT for the 20 examples found with each of the analysed trigrams from the top-308 list. While doing this we took into consideration a bandwidth of three words on each side of the trigram, so the context was taken into account. The results are very promising: compared to the performance of TnT on the whole corpus, performance on the significant n-grams is up by almost 20 percentage points, 76.3% as opposed to the overall 56.5%. This means that the errors introduced by the tagger are indeed to a great extent random noise that is filtered out by our method, confirming what we argued in section 3.3. Note also that 76.3% closely approaches the ceiling of TnT's 81.2% performance on 1-grams. This means both that the method will likely do even better on cleaner data, and that even though we might still be missing something due to systematic tagging errors, it is unlikely to be very much, or to cause false overall significance. Thus the evaluation of our method via our experiment on the corpus containing the English of Finnish emigrants to Australia is promising: the method works well in distinguishing two different groups of speakers and in highlighting relevant syntactic deviations between them, all while providing a precision that can be called very reasonable.
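The cumulative 'recall of useful trigrams' shown in figure 2 is straightforward to compute from a ranked list. A sketch, with an invented ranking and invented usefulness judgements:

```python
def recall_at_rank(ranked, useful):
    """Cumulative fraction of all useful items seen after each rank position."""
    seen, curve = 0, []
    for item in ranked:
        if item in useful:
            seen += 1
        curve.append(seen / len(useful))
    return curve

# Invented example: six ranked trigram names, three judged useful on inspection.
ranked = ["t1", "t2", "t3", "t4", "t5", "t6"]
useful = {"t1", "t4", "t6"}
curve = recall_at_rank(ranked, useful)
# After the top four ranks, two of the three useful trigrams have been seen.
```

Plotting one such curve per ranking (absolute vs. relative typicality) on the same axes reproduces the comparison made in figure 2.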
5 Statistical Theory
The test we need has to be able to check whether the differences between vectors containing 8,300 elements are statistically significant, and how significant the individual differences are. It also has to be a method that will not find significance for tiny differences due to the large number of variables alone. Fortunately there are permutation tests, which can fulfil these requirements when used wisely (Good, 1995). Kessler (2001) contains an informal introduction for an application within linguistics.

Figure 1: Distribution of top trigrams, with absolute values on the X-axis (only between-subjects normalised) and relative values on the Y-axis (normalised for frequency, between-types normalised).

Figure 2: Recall of useful trigrams, both for absolute (only between-subjects normalised) and relative data (normalised for frequency, between-types normalised).

5.1 Permutation Tests

As noted, the fundamental idea behind permutation tests with Monte Carlo permutation is very simple, and its formulas are also relatively simple: we measure the difference between the two original sets in some convenient fashion, obtaining δ(A, B). We then apply the Monte Carlo (random) permutations, which means extracting two sets of the same sizes as the base sets at random from A ∪ B. We call these two random sets A1, B1, and we calculate the difference between these two in the same fashion, δ(A1, B1), recording whether δ(A1, B1) >= δ(A, B), i.e., whether the difference between the two randomly selected subsets of the entire set of observations is as extreme as or even more extreme than the original. If we repeat this process of permutations, say, 10,000 times (including the one for the original sets), then counting the number of times n that we obtain differences at least as extreme as in the base case allows us to calculate how strongly the original two sets differ from a chance division with respect to δ: if the two sets were not genuinely different, the original division into A and B was likely to the degree of p = n/10,000. Put into more standard hypothesis-testing terms, p is the p-value, i.e. the probability of seeing the difference between the two sets if there is no difference in the populations they represent. Based on this value we may reject (or retain) the null hypothesis that there is no significant difference between the two sets.

Permutation tests are quite different from parametric tests such as the (M)ANOVA and the t-test in that they make fewer assumptions and are applicable to more kinds of data, while being just as sensitive as, or even more sensitive than, their parametric counterparts. They do not require the data to be normally distributed (to follow a bell curve, or any known probability distribution for that matter), to be homoscedastic (to have regular and finite variance), nor to be collected from subjects that were randomly selected (selected by random sampling). And this is good news, as most linguistic research violates at least one or two of these requirements. Permutation tests lack these three demands because the demands are all related to being able to make inferences about the population, and permutation tests leave out any assumptions about the relationship between the data under scrutiny and the population: they just tell how likely it is that two data-sets would differ by chance. Another reason for obviating the first two of the three requirements of parametric tests is that permutation tests implicitly generate their own significance tables. That is, if one plotted the difference values obtained for all 10,000 permutations, one would get the probability distribution for the data as it is used at that time. This is also what makes them more sensitive than parametric tests such as the t-test: with those, the distribution and its matching significance table is always a rough estimation, while with permutation tests these are exactly tailored to the data. Kempthorne and Fisher remark about this: "conclusions [of the ANOVA] have no justification beyond the fact that they agree with those which could have been arrived at by this elementary method [(the permutation test)]" (Edgington, 1987, 11). So the permutation test is the truly correct one, and the corresponding parametric test is valid only to the extent that it results in the same statistical decision. The only downside of permutation tests is that they require quite some computation, but this has become much less of a problem now that fast computers are abundant and cheap. Permutation tests are also easier to understand and more transparent than parametric tests: to most non-mathematicians the three requirements and the significance tables that come with most parametric tests remain a bit opaque, if not magical, while every step of a permutation test can be fully understood and followed by laymen.

5.2 Exchangeability and Relevance

Permutation tests have only two real requirements, which are relatively straightforward. First, one needs to be alert about exactly which relationships between the data-sets are being measured by one's test and its permutations. What is important here is the exchangeability of elements under the null-hypothesis, or in other words that there is no interdependence between the elements that are being permutated. We can best illustrate this by explaining our decision to measure the difference in syntax by permutating the speakers in the two groups, and not the n-grams or the sentences, as we did previously (Nerbonne and Wiersma, 2006). If we had permutated separate n-grams, then we might — besides the differences between the groups — also be measuring the syntactic relationships between (overlapping) n-grams within sentences. And when permutating the sentences of multiple different speakers across two groups, we could also be measuring differences between the personal syntactic styles of the subjects, instead of just those differences that are caused by their being members of the two groups. So only the speakers can be considered truly independent and exchangeable. More formally speaking, 'independent' here means that there is no higher-than-chance probability-relationship among the elements (in statistical terms there should be no autocorrelation). For example, between the leaves of a single plant there is an extremely high probability that they are of a similar type, as they were all formed by the same plant. Thus when we have plants of different species standing on each end of our desk, it is not a good idea to snap off their leaves and permutate them when trying to find out whether their differences in leaf-colour could have been caused by the positions of their pots, or by chance, because the leaves of each plant were (mostly) formed by the plant's genes, not by the position of its pot.

This example also allows us to stress the important role the null-hypothesis and the alternative hypothesis play in determining what interdependence is and is not allowed. If the plant-mutilation example were not about the location of the desk-plants, but about finding out whether the colour of their leaves depends on their species (a different alternative hypothesis, with as null-hypothesis that leaves of various types grow randomly on the plants), then it would be perfectly fine to permute their leaves and find a near-zero p-value accordingly. One measures something other than, and often a lot more than, one thinks (genetic relation instead of pot position in the example), and this is why permutating elements that are not exchangeable under the null-hypothesis leads to false significance. And as noted, the interdependence between elements suggested by the alternative hypothesis (e.g. genetic relationship when that is what is being tested for) is no problem (footnote 3).

Secondly, there is an important difference between statistical significance and effect size, or, more broadly, relevance, which, while important with any statistical test, is often overlooked. A statistically significant result is not necessarily a relevant result. Significance measures the likelihood of a difference being due to chance. Thus very small but consistent differences can always be made significant by increasing the size of the dataset: if a difference is consistent over a large dataset, it is likely not a coincidence, even if the difference is very small. But on the other hand this very small difference might not be very relevant. For example, if body height were related to presentation skills, but only when averaged over millions of people, with quite some variation in between, then facing a tall presenter, or even having a class of tall pupils presenting, would not tell you very much about the quality of the upcoming presentation(s). The problem in this example is not that the relationship is not significant; it is just not big enough, and thus not relevant, as in not informative for the audience of a presentation. So even though sheer corpus size should not lead to higher-than-warranted estimates of statistical significance when using permutation tests, as can happen with Chi-square and similar classical tests, having a bigger corpus will still make it easier to find as much significance as is warranted by having more data (Agresti, 1996; Edgington, 1987). So the corpus that is used should not be too big, and one should never apply the method without keeping an eye on the possible relevance of the results for the research questions at hand (footnote 4). In short: when used wisely, permutation tests are a very suitable tool, also for finding significant syntactic differences and the POS-n-grams that contribute to this difference.
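The whole procedure of sections 5.1 and 5.2 can be sketched in a few lines. This is an illustrative sketch, not our released software: δ here is simply the sum of absolute differences between the groups' mean fraction vectors (a stand-in for the measure defined earlier), the data are invented, and the permuted units are per-speaker vectors, as the exchangeability argument above requires.

```python
import random

def delta(group_a, group_b):
    """Toy difference measure: L1 distance between the groups' mean vectors."""
    n = len(group_a[0])
    def mean(g, i):
        return sum(v[i] for v in g) / len(g)
    return sum(abs(mean(group_a, i) - mean(group_b, i)) for i in range(n))

def permutation_test(group_a, group_b, permutations=10000, seed=0):
    """Monte Carlo permutation test over speakers' fraction vectors."""
    rng = random.Random(seed)
    observed = delta(group_a, group_b)
    pooled = group_a + group_b
    extreme = 1  # the original division itself counts as one permutation
    for _ in range(permutations - 1):
        rng.shuffle(pooled)
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if delta(a, b) >= observed:
            extreme += 1
    return extreme / permutations  # the p-value

# Invented fraction vectors for three 'juvenile' and three 'adult' speakers.
juveniles = [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2], [0.55, 0.25, 0.2]]
adults = [[0.2, 0.3, 0.5], [0.1, 0.4, 0.5], [0.15, 0.35, 0.5]]
p = permutation_test(juveniles, adults, permutations=1000)
# With only six speakers there are few distinct divisions, so p stays coarse.
```

Note how corpus size enters only through the number of speakers and the stability of their vectors, which is exactly why the relevance caveat above still applies.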
5.3
Normalisations
Each measurement of difference that is part of the permutation test — whether the difference is between the original two samples or between two 3
The requirement of exchangeability under the nullhypothesis is also known as the requirement of random assignment in some of the literature 4 In our 2006 paper we formulated this a bit less clearly and conservatively. In fact we thought then that corpus size could be ignored, which is wrong.
16
Automatically Extracting Typical Syntactic Differences from Corpora
June 8, 2009
samples which arise through permutations — is applied to the collection of POS-n-gram frequencies once these have been normalised. We first describe the normalisation that is required. This normalisation is called the BE TWEEN SUBJECTS NORMALISATION . It excludes a whole range of factors that could otherwise violate the requirements of permutation tests, and it is thus quite conservative and robust. It normalizes for differences in the size of texts, instead of for variations in sentence-length, like the IN - PERMUTATION NORMALISATION described in our previous paper on the method (Nerbonne and Wiersma, 2006), which was both more complex and a bit less conservative. The second normalisation we used is called the BETWEEN - TYPES NORMALISATION , and it normalizes for differences in frequency between ngram types. Using it is optional and its purpose is mainly the elimination of frequency as a factor, allowing one to detect significance if the typical ngrams are the less frequent ones. It is also different from the normalisation that we used for the same purpose before (the BETWEEN - PERMUTATIONS NORMALISATION ). The difference being that the between-types normalisation is much simpler to calculate for the output of the between-subjects normalisation, and that part of its function is now delegated to a third normalisation. The BETWEEN - GROUPS NORMALISATION is the third normalisation, and it has a really different function from the others as its output is not used as input for a measure or a permutation-test, but is solely meant for detecting for which of the two groups a given n-gram is typical. So it can only be used after individual n-grams were selected by using one of the other normalisations. It normalises for group-size, so it can detect what constitutes over- or under-use, even if the groups are of different sizes.
5.3.1 The Between-Subjects Normalisation

The between-subjects normalisation is, as noted, applied to texts belonging to single authors or speakers. Its purpose is to make the collection of POS-n-grams that make up the texts of a speaker comparable to that of other speakers, so that no one subject has more influence on the group's vector than any other. This is also important for permutating, as it ensures that the sizes of the groups in terms of n-gram counts will not change when authors are permutated between them, so the requirement of constant group sizes is not violated. The normalisation functions, in short, by correcting for the size of each speaker's text, as measured by the number of POS-n-grams in it. It has the possible disadvantage that it is not always possible, easy or feasible to split up a corpus according to authorship. Some texts, such as those in the Bible, have multiple or contested authors. Solutions can be found for this, such as splitting up the text into more or less random sizeable chunks (for example chapters), and then treating these as belonging to different authors. Still, one should be aware of the extra things that might be measured because of such complications.

For the between-subjects normalisation we first collect from the tagger a sequence of counts c_i of POS-tag-n-grams (index i) for each subject, giving us one vector (c^s) per subject:

c^s = < c^s_1, c^s_2, ..., c^s_n >

After that we normalise for the total number of POS-n-grams produced by a given subject. We do this by calculating the sum of the subject's n-gram counts, and then dividing each of his individual n-gram counts by it. This gives us an n-gram fraction vector (f^s) for the subject:

N^s = Σ_{i=1}^{n} c^s_i

f^s = < ..., f^s_i (= c^s_i / N^s), ... >,  so that Σ_{i=1}^{n} f^s_i = 1

After we have done this for all the subjects, we have a list (F) that contains the fraction-vectors of all subjects:

F = < f^s_1, f^s_2, ..., f^s_n >

Then we use the fraction-vectors in this list as the elements we permutate in our permutation test, instead of n-grams or sentences, effectively permutating subject speakers. We permutate these subjects between two groups, which we shall refer to as juveniles ('j') and adults ('a'), as we did in our example research. Each permutation results in two lists — one for each group — containing the fraction-vectors of the subjects ending up in that group (f^sj for a subject-vector in the juveniles group). So for each permutation, including the base case, we produce two lists of fraction-vectors (F^j and F^a):

F^j = < f^sj_1, f^sj_2, ..., f^sj_n >
F^a = < f^sa_1, f^sa_2, ..., f^sa_n >

As the last step of this normalisation, for each group we sum the fraction-vectors of all subjects that ended up in that group (summing the fractions per n-gram for the different authors in the group, not across different n-grams), giving us a vector holding per-n-gram sums for each group, s^j and s^a (note that this last step is also done for all permutations):

s^j = Σ_{i=1}^{n} F^j_i
s^a = Σ_{i=1}^{n} F^a_i

The pair of single arrays it produces at each permutation (s^j and s^a) can, besides being directly usable when no further normalisation for frequency is done, also be used as input for the second normalisation, the between-types normalisation.

5.3.2 The Between-Types Normalisation

The purpose of the between-types normalisation is the removal of the influence of absolute frequencies on n-gram counts. This is useful because, as can be seen in the graph for trigrams (figure 3), the frequency of POS-n-gram types follows Zipf's law, and thus a few very frequent (relatively uninteresting) n-grams could have too much influence on the reported significance of the difference between the two groups. So this normalisation allows one to find significance if there are enough n-gram types that are typical for the two original sub-corpora, regardless of their frequency. It is similar to the second step of the subject normalisation (the fractions per author), except that the linear transformation is applied over the total count for each n-gram type, instead of over all n-grams of each author: for each POS-n-gram type i in each group (sub-corpus) g ∈ {a, j}, the summed authors' fraction count of the group, s^g_i, is divided by the total count for that type, s_i (both as provided by the subject normalisation). The outer part of the formula for t^g is:

t^j = < ..., t^j_i (= s^j_i / s_i), ... >
t^a = < ..., t^a_i (= s^a_i / s_i), ... >

The inner part of the formula, the total count for each n-gram (s_i), can be calculated for the whole vector (s) as follows:

s = < ..., s_i (= s^j_i + s^a_i), ... >

N-grams with large summed authors' fraction counts are those with high frequencies in the original sub-corpora, and the normalisation under discussion strips away the role of frequency, thus allowing us to find significance if there are typical n-grams, even if they are infrequent. The values thus normalised will be 0.5 on average when the groups are of the same size. They will be skewed in the direction of the larger group when one of the groups is bigger than the other. Therefore the output of this normalisation cannot directly be used to determine whether there is under- or over-use (as 0.3 can be over-use for a small group, and extreme under-use for a bigger one). It is primarily meant for determining the overall (corpus-wide) p-value without regard to frequency. It has no influence on the p-values reported for individual n-grams, because per n-gram it is only a linear transformation. In addition we do use it for sorting the n-grams that were found to be significant, as it will cause infrequent, but typical, n-grams to move to the top (see section 2.4). The normalisation is applied to the base case and all subsequent permutations, and the pair of arrays it produces (t^j and t^a) is ready to be used as input for a measure.

Figure 3: Distribution of the trigrams in the FAEC corpus in a log-log graph, with the ideal Zipf distribution in grey.

5.3.3 The Between-Groups Normalisation

The between-groups normalisation is applied to the output of the first normalisation (the between-subjects normalisation) and it is meant for detecting whether an n-gram is being over-used in one group or the other. The average normalised value will be 1, with cases of over-use in the group having a normalised value bigger than 1, and cases of under-use having a value smaller than 1. It is calculated by dividing the found per-n-gram count by the average value for that n-gram in that group. It works because the average value for each of these n-grams across permutations is equal to the expected value for a group of that size under the null hypothesis. More formally it is arrived at in the following way: for each POS-n-gram i in each group (sub-corpus) g ∈ {a, j}, the summed authors' fraction
count, s^g_i, is divided by the average of that count for that type in that group across all permutations, s̄^g_i (both as provided by the between-subjects normalisation). The outer part of the formula for o^g is:

o^j = < ..., o^j_i (= s^j_i / s̄^j_i), ... >
o^a = < ..., o^a_i (= s^a_i / s̄^a_i), ... >

Now for large numbers of permutations the inner part of the formula, namely the average count (s̄^g_i), can be calculated as (s^j_i + s^a_i)/2 when the groups are of equal size, and as follows when the groups differ in size:

s̄^j_i = (s^j_i + s^a_i) · N^j / (N^j + N^a)
s̄^a_i = (s^j_i + s^a_i) · N^a / (N^j + N^a)

where N^j and N^a are the number of subjects in the juveniles, resp. adults group. The values thus normalised will, as can be noted, be 1 on average across permutations, and thus, when summed for large corpora, be equal to the number of POS-n-gram types:

Σ_{i=1}^{n} o^j_i = Σ_{i=1}^{n} o^a_i = n

We firmly note again that the per-POS-n-gram values output by this normalisation are only useful for detecting over- and under-use when used together with information on per-POS-n-gram statistical significance, as provided by one of the other normalisations. First of all, it cannot be used to find typical or extreme POS-n-grams without consideration for significance, because infrequent n-grams are especially likely to have high values with respect to s̄^g_i. For example, an n-gram occurring only once, in one sub-corpus, can get a value of 1/0.5 = 2 (as it is indeed very typical for this sub-corpus), while with a count of one it clearly will not be statistically significant (moving between equally sized sub-corpora with a chance of 50% during permutations). Secondly, this normalisation is not suitable for generating per-n-gram p-values itself when group sizes differ, because then the output will not be symmetrical. To illustrate this we can look at our single n-gram again: an n-gram occurring in a smaller sub-corpus can produce very big normalised values, such as 1/0.2 = 5, while one in a bigger sub-corpus will always produce smaller values, such as 1/0.8 = 1.25. This can lead to false significance by introducing fluctuations in the normalised group sizes, and by turning two-sided measures, such as RSquare, into one-sided ones (more on RSquare in section 5.4). When these two things are kept in mind, it can be used as a convenient normalisation for assigning the detected POS-n-grams to the list of the group by which they are being over-used. Just as the previous normalisations, this normalisation is also applied to the base case and all subsequent permutations, and for each it produces a pair of arrays (o^j and o^a).

5.4 Measures

The choice of vector difference measure, e.g. cosine vs. χ², does not affect the proposed technique greatly, and alternative measures can easily be swapped in. Accordingly, we have worked with both cosine and two measures inspired by the RECURRENCE (R) metric introduced by Kessler (2001, p. 157ff.). We also call our measures R and RSQUARE. The advantage of our R and RSquare metrics is that they are transparent: they are simple aggregates that allow one to easily see how much each n-gram contributed to the difference. We also used them to calculate a separate p-value per n-gram. Our R is calculated as the sum of the absolute differences of each cell with respect to the average for that cell. If we have collected our data into two vectors (s^j, s^a), and if i is the index of a POS-n-gram, R for each of the two groups is equal, and it simply looks as follows:

R = Σ_{i=1}^{n} | s^j_i − s̄^g_i |

with the average between the two groups for each n-gram cell (s̄^g_i) being:

s̄^g_i = (s^j_i + s^a_i) / 2

The RSquare measure emphasises a few large differences more than many small ones; it squares each per-cell difference before summing, and is thus calculated as follows (with s̄^g_i the same as for R):

RSquare = Σ_{i=1}^{n} (s^j_i − s̄^g_i)²

Both R and RSquare are two-sided measures. This means that the same R(Square) value is obtained when an n-gram has more than the average value to a certain extent (say x) as when it has less than the average by this same extent (also x). So tests using them are two-tailed, which is statistically conservative. Also, both very typical and atypical n-grams are detected as significant. Thus assignment to the groups for which they are typical is done as a separate step, using the between-groups normalisation (see 5.3.3). All in all we had the best results with the RSquare measure, and thus have used it for our example research.
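How the RSquare measure and the permutation test fit together can be illustrated with a small sketch. This is a toy version under simplifying assumptions (invented fraction vectors, equal group sizes, a plain Monte Carlo permutation loop), not the ComLinToo implementation:

```python
import random

def rsquare(group_j, group_a):
    """RSquare = sum_i (s^j_i - avg_i)^2, with s^g_i the per-n-gram
    sums of the subjects' fraction vectors in group g, and
    avg_i = (s^j_i + s^a_i) / 2 (equal group sizes assumed)."""
    ngrams = group_j[0].keys()
    sj = {ng: sum(f[ng] for f in group_j) for ng in ngrams}
    sa = {ng: sum(f[ng] for f in group_a) for ng in ngrams}
    return sum((sj[ng] - (sj[ng] + sa[ng]) / 2) ** 2 for ng in ngrams)

def permutation_p(group_j, group_a, n_perm=1000, seed=42):
    """Share of permutations whose RSquare reaches the observed one."""
    observed = rsquare(group_j, group_a)
    subjects = group_j + group_a
    half = len(group_j)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(subjects)        # exchange subjects between the groups
        if rsquare(subjects[:half], subjects[half:]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Invented fraction vectors: juveniles over-use n-gram "a", adults "b".
juveniles = [{"a": 0.8, "b": 0.2} for _ in range(4)]
adults = [{"a": 0.3, "b": 0.7} for _ in range(4)]
p = permutation_p(juveniles, adults)   # small p: the groups differ
```

Because RSquare is two-sided, permuting the group labels does not change its value, and the p-value simply counts how often random reassignments of whole subjects produce a difference at least as large as the observed one.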
6 Software

Our method can easily be applied to new questions and different corpora with the help of the COMPUTATIONAL LINGUISTICS TOOLSET (the ComLinToo).⁵ It is Free Software licensed under the Affero GPL (available for free, and for modification, but without warranty, as long as derivatives remain free). It runs under Linux (and should work on Macs too), is written in Perl, and consists of about 40 separate command-line tools, grouped by various tasks such as corpus preparation, examination and sense disambiguation. For our research it was extended with the permstat and tagging tasks. We will now introduce and explain the corpus preparation, tagging and permstat tasks, but only to the extent that this is necessary for applying our method.

6.1 Step by Step

Everything starts with selecting a corpus. The thing to keep in mind here is that it either should be tagged already, or that at least a tagged corpus in the same language has to be available to train the part-of-speech tagger on. Once a corpus is selected, it should be transformed to plain-text format, with every separate text in a new file, every sentence on a new line, and words and punctuation separated by spaces (these are the wordrow files). This transformation can be done with the corpussplitter.pl (where splitting into files is needed), corpusstripper.pl and corpusrefiner.pl tools. They require a custom, but small, software library for each corpus type; these are currently available for the various International Corpus of English corpora⁶ and for the Finnish Australian English Corpus. Besides transforming the corpus, some metadata about it should also be provided, most notably a division: a set of two files, each containing a list of the names of the files that belong to one of the two groups stored below the corpusData directory. It can be created with the corpusdivider.pl tool, based on collected metadata. See the test corpus that is included with the tool set for an example. All tools in the ComLinToo also provide a short description and a detailed list of possible arguments when they are called with '-h', for example:

./corpussplitter.pl -h

Also note that most tools have default configuration values, which are set in each script's config.pl file. These are used when a command-line argument is not provided for them.

Then, when the corpus is prepared, it needs to be tagged, if it is not a tagged corpus already. For this the ComLinToo uses Thorsten Brants' TnT tagger (Brants, 2000). Sadly enough we cannot provide it with our tool set for licensing reasons, but it can be obtained separately and is free for research purposes.⁷ For tagging with TnT the corpus2tnt.pl, tagtnt.pl and tnt2tagrow.pl scripts are used. If TnT first needs to be trained on a tagged corpus, then traintnt.pl should be used before tagging. After the tagging is done, tagrow files matching the wordrow files are created. These files (one for each text) contain the tags for each sentence on a new line, with tags separated by spaces.

The tools for the tagging task, and for every other task, can also be called grouped together, with just one command line, by using the goall.pl scripts, such as tagginggoall.pl. If such a script is called with '-lt', it reports the possible tasks, and the calls to the other tools that it will make to fulfil them. In this way you can get a feel for the order in which the scripts are normally called, and for their most usual arguments. The goall.pl scripts also accept arguments, such as the corpus (with -c), or the sub-corpora division that should be used (with -d), and they pass these on to the tools they call. An example goall call is:

./tagginggoall.pl -c test -lt

And finally, when the corpus has been cleaned and is fully tagged, the permstat task can be performed. Key tools here are (besides permstatgoall.pl) textpermutator.pl and permutationstatter.pl. The former generates the actual permutations, and the latter applies one or more normalisations and a measure to them to arrive at actual test results. For the normalisations and the measures, small plugin libraries are used again. It should be noted that although both scripts use zip-compression for their intermediate output files, fairly large amounts of disk space may be required. In our set-up, with a corpus of 305,000 words and 10,000 permutations, we needed more than 10 GB per experiment. The other two tools used for the task are permutationstattabler.pl and permstatresultselector.pl. The former transforms the output of permutationstatter.pl to a tab-delimited format for easy importing into most software packages (such as SPSS and Excel), and the latter selects and sorts significant n-grams for easy inspection. The list of n-grams obtained in this way can then be looked up in the corpus (with random examples) by using the tagsamplefinder.pl tool, from the examine task. Browsing through the tools in the various task sets, and reading their descriptions, can give you further guidance on which tools may be useful for you.

⁵ The Computational Linguistics Toolset is available from http://en.logilogi.org/Wybo Wiersma/User/Com Lin Too
⁶ The ICE-corpora are available from http://www.ucl.ac.uk/english-usage/ice/
⁷ TnT is available here: http://www.coli.uni-saarland.de/~thorsten/tnt/

7 Further Research

First we will go into possible applications of the method, then we will make some suggestions for further testing and analysis, and we will conclude this section with some suggestions for improvements.

7.1 Applications of the Method

As already briefly mentioned in the introduction, the method as presented here can be applied to many data sets. It can offer quantitatively based answers to questions far beyond our data of two generations of immigrants. More specific questions could be asked, such as the time course needed for second-language acquisition, or the relative importance of factors influencing the degree of difference, such as the mother tongue of the speakers, other languages they know, the length and time of their experience in the second language, the role of formal instruction, etc. And beyond this, comparisons could be made between dialects and variants of languages (such as the various Englishes as collected in the ICE-corpora). Different discourses can also be compared, such as the language of lawyers or academics as compared to that of laymen or students. Besides all this, it might also be applied to the analysis of literary styles, such as the syntactic difference between romance and detective novels. The change of syntax through time could also be followed, as reflected in novels or newspapers published in different decades or even different centuries. In addition, it can for example be used to track traces of the grammar of a source language (say Ancient Greek) in translations (say of the Bible). In short, as long as a common tagset exists or can be devised for a corpus of comparable material, and the comparison answers one or more meaningful questions, the method can be used, and not just for purely scientific purposes. It might even lead to very concrete applications in teaching, so that second-language learners, aspiring creative writers, or journalism students can be shown which grammatical structures they — either collectively or individually — are over- or under-using compared to native speakers, famous authors, resp. acclaimed journalists. And if the method is confirmed to work for small corpus sizes and/or is improved and calibrated sufficiently, many more applications may be thought of, such as e-mail classification or author identification. More on analysis and possible improvements now.

7.2 Analysis of the Method

A first thing that could be analysed more in depth is the corpus sizes at which the method is effective in finding genuine differences. Note that ideally both a minimum and a maximum size should be established. A minimum could be found by trying the method on sub-corpora of various sizes, composed of material whose syntax is known to be different. For finding a maximum it would be necessary to apply the method to material that is known to be very similar, and then to take the biggest corpus size at which no significant difference is found. It should be noted that especially agreeing on a maximum size could be hard, as a degree of subjectivity is involved in deciding what 'very similar' entails, and it likely varies with research questions.

It might also be valuable to further analyse the various types of errors that the method might produce. Obviously there are tagging errors, and there is the slight possibility of there being problematic — that is, significance-causing — patterns in them. So far we found none in our tentative analysis, and we even found better tagger performance for the significant n-grams, as noted in sections 3.3 and 4.6. Still, having these errors even more thoroughly examined and tested would add to the trust in the method. In addition, one or more re-applications to studies where syntax differences were found by other means might be useful for additional verification.

Another opportunity for analysis might be the testing of the method with various tagsets and n-gram sizes: especially tagsets of various precisions, and larger n-grams. We have not experimented with different tagsets apart from the full and reduced ICE-tagsets, and here we have limited our (preliminary) analysis to the effect of tagset size on significance for various n-gram sizes. We found that there appears to be a bandwidth between 1- and 5-grams at which it is easier to find significance, but apart from noting decreased significance at the edges, we have not yet been able to map it conclusively.⁸ Our results are also limited, since they derive from just one corpus size and just our two ICE tagsets. In addition to quantitative analysis of the method, as proposed above, there are also more opportunities for qualitative analysis of the usefulness of the results for linguists or other researchers when using different corpus sizes and tagsets.

7.3 Improvements to the Method

Strictly speaking, the method as currently presented is only a method for data mining, and not for hypothesis testing. This is because, especially when the typical n-grams themselves are going to be used, and not just the p-value that states whether the two sub-corpora differ, no hypothesis is set before testing, and thus, strictly speaking, the results cannot be considered proven. To alleviate this theoretical problem, it would be possible — especially when there is much data at hand — to add cross-validation steps. This means that the corpus is split into two parts (of sizes, say, 20% to 80%), and then the second part is used to mine for typical n-grams, after which these are confirmed by applying the method again, but now to the first part. This process can be repeated so that all data is used for mining and confirming in turn. Doing this would make the results more robust.

Another improvement would be to add stronger protection against family-wise errors, such as protection against partial null hypothesis family-wise errors (part of the null hypothesis being true, as for separate n-grams). In the literature on neuroimaging, where comparisons between hundreds of thousands of data points also have to be made, proposals and solutions for this with regard to permutation tests have recently been put forth, such as threshold techniques (Nichols and Holmes, 2002; Nichols and Hayasaka, 2003). Applying these or similar solutions to our method could make the individual significance of all found n-grams fully certain as well.

Another small improvement would be to add start- and end-of-sentence anchors. This might make the tagger perform better, and, in addition, it will result in more fine-grained results, because when using such anchors, mid-sentence n-grams will be differentiated from those at the end or beginning of sentences, leading to very different, and maybe even more informative, results.

In addition, as already noted in section 5.4, it could be useful to experiment with various measures. It would be especially useful if a measure were found that could standardise the method in some way, or calibrate it. This would make corpus-wide comparisons in particular more useful, as we would not only have an overall p-value, but also a standardised, overall difference value. This would make it easier to compare the outcomes of different studies.

A way to improve the method in a more radical sense, especially when taggers and parsers become more accurate, could be to go beyond POS-tags, to chunking, to partial parsing, or to full parsing. Chunking attempts to recognise non-recursive syntactic constituents, '[the man] with [the red hair] in [the camel-hair jacket] waved', while partial parsing attempts to infer more abstract structure where possible: '[S [NP [the man] with [the red hair]] in [the camel-hair jacket] waved]'. Note that the latter example is only partially parsed, in that the phrase 'in the camel-hair jacket' is not attached in the tree. In general, chunking is easier and therefore more accurate than partial parsing, which, however, gives a more complete account of the latent hierarchical structure in the sentence. There is a trade-off between the resolution or discrimination of the technique and its accuracy. Going beyond POS-tags is especially feasible for cases where tagging accuracy is less of an issue, such as newspaper text or novels, as Sanders and others have shown (Sanders, 2007; Baayen et al., 1996; Hirst and Feiguina, 2007).

⁸ We found no significance for 1-grams from the reduced tagset (there was significance for the full tagset), and a slightly reduced one (but still < 0.01) for 5-grams for the full tagset.

8 Conclusion

Weinreich (1968) regretted that there was no way to "measure or characterize the total impact of one language on another in the speech of bilinguals" (1968, p. 63), and speculated that there could not be. This paper has proposed a way of going beyond counts of individual phenomena to a measure of aggregate syntactic difference. The technique proposed follows Aarts and Granger (1998) in using part-of-speech n-grams. We argue that such lexical categories are likely to reflect a great deal of syntactic structure, given the tenets of linguistic theory according to which more abstract structure is, in general, projected from lexical categories. We went beyond Aarts and Granger in showing how entire vectors of POS n-grams may be used to characterise aggregate syntactic distance, and in particular by showing how these, and individual n-gram counts, can be analysed statistically. The technique was implemented using an automatic POS-tagger, several normalisations and permutation statistics, and it was shown to be effective on the English of Finnish immigrants to Australia. We were able to detect various forms of interference of their L1 (Finnish) in the English of the adult speakers, such as over-use of articles and the placing of not in pre-verbal position, as well as some forms due to universal contact influences, such as the over-use of hesitation phenomena. The ComLinToo, the software implementing the method, including the normalisations, is freely available. It is developed to allow easy application to other data sets, and generalisation to n-grams of any size.

There are many possibilities for future research. First of all, the method can be applied to a wide range of data sets and used to answer many questions, such as the factors influencing language learning, comparisons between discourses and literary styles, and maybe even teaching. Secondly, the method could be further analysed, by testing it with various corpus, tagset and n-gram sizes, and by doing more qualitative analysis. Lastly, there are also still many opportunities to improve the method: by adding and evaluating more statistical safeguards, by experimenting with various measures, and by using chunking or parsing instead of POS-tags, especially for cleaner data. Thus, while we fall short of Weinreich's goal of assaying "total impact" in that we focus on syntax, we do take a large step in this direction by showing how to aggregate, test for statistical significance, and examine the syntactic structures responsible for the difference.
9 Acknowledgments

We are grateful to Lisa Lena Opas-Hänninen and Pekka Hirvonen of the University of Oulu, who made the data available and consulted extensively on its analysis. We also thank audiences at the 2005 TaBu Dag, Groningen; at the workshop Finno-Ugric Languages in Contact with English II, held in conjunction with Methods in Dialectology XII at the Université de Moncton, Aug. 2005; the Sonderforschungsbereich 441, "Linguistic Data Structures", Tübingen, Jan. 2006; the Seminar on Methodology and Statistics in Linguistic Research, University of Groningen, Spring 2006; the Linguistic Distances Workshop at Coling/ACL 2006 in Sydney, July 2006; and the Thirteenth International Conference on Methods in Dialectology, Leeds, Aug. 2008; and especially Livi Ruffle, for useful comments on our earlier paper.
References

Aarts, Jan and Sylviane Granger. 1998. Tag sequences in learner corpora: A key to interlanguage grammar and discourse. In Granger, Sylviane, editor, Learner English on Computer, pages 132–141. Longman, London.

Agresti, Alan. 1996. An Introduction to Categorical Data Analysis. Wiley, New York.

International Phonetic Association. 1949. The Principles of the International Phonetic Association. IPA, London.
Baayen, H., H. van Halteren, and F. Tweedie. 1996. Outside the cave of shadows. Using syntactic annotation to enhance authorship attribution. Literary and Linguistic Computing, 11.
Jarvis, S. 2003. Probing the effects of the L2 on the L1: A case study. In Cook, V., editor, Effects of the Second Language on the First, pages 81–102. Multilingual Matters, Clevedon, UK.
Bot, Kees de, Wander Lowie, and Marjolijn Verspoor. 2005. Second Language Acquisition: An Advanced Resource Book. Routledge, London.
Kessler, Brett. 2001. The Significance of Word Lists. CSLI Press, Stanford. Larsen-Freeman, D. and M.H. Long. 1991. An Introduction to Second Language Acquisition Research. Longman, London.
Brants, Thorsten. 2000. TnT - a statistical part of speech tagger. In 6th Applied Natural Language Processing Conference, pages 224 – 231. ACL, Seattle.
Lauttamus, T. and P. Hirvonen. 1995. English interference in the lexis of American Finnish. The New Courant, 3.
Chambers, J.K. 2003. Sociolinguistic Theory: Linguistic Variation and its Social Implications. Blackwell, Oxford.
Lauttamus, Timo, John Nerbonne, and Wybo Wiersma. 2007. Detecting Syntactic Contamination in Emigrants: The English of Finnish Australians. SKY Journal of Linguistics, 20.
Clyne, M. and S. Kipp. 2006. Australia’s community languages. International Journal of the Sociology of Language, 180.
Lenneberg, E. 1967. Biological Foundations of Language. John Wiley, New York.
Cook, V. 2003. Effects of the Second Language on the First. Multilingual Matters, Clevedon, U.K.
Manning, Chris and Hinrich Schütze. 1999. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge.

Coseriu, Eugenio. 1970. Probleme der kontrastiven Grammatik. Schwann, Düsseldorf.
Nelson, Gerald, Sean Wallis, and Bas Aarts. 2002. Exploring Natural Language: working with the British component of the International Corpus of English. John Benjamins Publishing Company, Amsterdam/Philadelphia.
Edgington, Eugene S. 1987. Randomization Tests. Marcel Dekker Inc., New York. Ellis, R. 1994. The study of second language acquisition. Oxford University Press, Oxford.
Nerbonne, John and Wybo Wiersma. 2006. A measure of aggregate syntactic distance. In Nerbonne, J. and E. Hinrichs, editors, Linguistic Distances, pages 82–90. ACL, Stroudsburg, PA.
Faerch, C. and G. Kasper. 1983. Strategies in interlanguage communication. Longman, London. Fillmore, Charles and Paul Kay. 1999. Grammatical constructions and linguistic generalizations: the what’s x doing y construction. Language, 75.
Nichols, Thomas and Satoru Hayasaka. 2003. Controlling the familywise error rate in functional neuroimaging: a comparative review. Statistical methods in medical research, 12.
Garside, Roger, Geoffrey Leech, and Tony McEmery. 1997. Corpus Annotation: Linguistic Information from Computer Text Corpora. Longman, London/New York.
Nichols, Thomas E. and Andrew P. Holmes. 2002. Nonparametric Permutation Tests For Functional Neuroimaging: A Primer with Examples. Human brain mapping, 15.
Good, Phillip. 1995. Permutation Tests: a practical guide to resampling methods for testing hypotheses. Springer, New York.
Odlin, Terence. 1989. Language Transfer: Crosslinguistic Influence in Language Learning. Cambridge University Press, Cambridge.
Hirst, G. and O. Feiguina. 2007. Bigrams of syntactic labels for authorship discrimination of short texts. Literary and Linguistic Computing, 22.
Odlin, Terence. 1990. Word-order transfer, metalinguistic awareness and constraints on foreign language learning. In VanPatten, B. and J. Lee, editors, Second Language Acquisition/Foreign Language Learning, pages 95–117. Multilingual Matters, Clevedon, U.K.
Hirvonen, P. 1988. American Finns as language learners. In M.G. Karni, O. Koivukangas and E.W. Laine, editors, Finns in North America. Proceedings of Finn Forum III 5-8 September 1984, pages 335–354. Institute of Migration, Turku, Finland. 25
Automatically Extracting Typical Syntactic Differences from Corpora Odlin, Terence. 2006a. Could a contrastive analysis ever be complete? In Arabski, J., editor, Crosslinguistic Influence in the Second Language Lexicon, pages 22–35. Multilingual Matters, Clevedon, U.K.
June 8, 2009
Odlin, Terence. 2006b. Methods and inferences in the study of substrate influence.
Pietilä, P. 1989. The English of Finnish Americans with Reference to Social and Psychological Background Factors and with Special Reference to Age. Turun yliopiston julkaisuja, 188.
Piller, I. 2002. Passing for a native speaker: identity and success in second language learning. Journal of Sociolinguistics, 6.
Poplack, Shana and David Sankoff. 1984. Borrowing: the synchrony of integration. Linguistics, 22.
Poplack, Shana, David Sankoff, and Christopher Miller. 1988. The social correlates and linguistic processes of lexical borrowing and assimilation. Linguistics, 26.
Ritchie, William C. and Tej K. Bhatia. 1998. Handbook of Child Language Acquisition. Academic, San Diego.
Sampson, Geoffrey. 2000. A proposal for improving the measurement of parse accuracy. International Journal of Corpus Linguistics, 5.
Sanders, Nathan C. 2007. Measuring syntactic differences in British English. In Proceedings of the Student Research Workshop, pages 1–7. Omnipress, Madison.
Schmid, M. 2002. First Language Attrition, Use and Maintenance: The Case of German Jews in Anglophone Countries. John Benjamins, Amsterdam.
Schmid, M. 2004. Identity and first language attrition: a historical approach. Estudios de Sociolingüística, 5.
Sells, Peter. 1982. Lectures on Contemporary Syntactic Theories. CSLI, Stanford.
Thomason, Sarah and Terrence Kaufman. 1988. Language Contact, Creolization and Genetic Linguistics. University of California Press, Berkeley.
Thomason, Sarah G. 2001. Language Contact: An Introduction. Edinburgh University Press, Edinburgh.
Van Coetsem, Frans. 1988. Loan Phonology and the Two Transfer Types in Language Contact. Publications in Language Sciences 27. Foris Publications, Dordrecht.
Waas, M. 1996. Language Attrition Downunder: German Speakers in Australia. Peter Lang, Frankfurt.
Watson, G. 1995. A corpus of Finnish-Australian English: a preliminary report. In Muikku-Werner, P. and K. Julkunen, editors, Kielten väliset kontaktit, pages 227–246. University of Jyväskylä, Jyväskylä.
Watson, Greg. 1996. The Finnish-Australian English corpus. ICAME Journal: Computers in English Linguistics, 20.
Weinreich, Uriel. 1968. Languages in Contact. Mouton, The Hague.
Westfall, Peter H. and S. Stanley Young. 1993. Resampling-Based Multiple Testing: Examples and Methods for p-value Adjustment. John Wiley & Sons, Inc., New York.
Table 4: All 137 inspected trigrams. The rank is the ranking of the trigram in the top 308 when sorted using the between-types normalisation. Each trigram was inspected together with three corpus samples, including the words immediately before and after it, and was classified as one of the phenomena from table 3.

Rank  Trigram
2     ??? N(com,plu) ADV(ge)
4     N(com,plu) PRON(nom) PRON(pers,sing)
5     ADV(ge) PREP(ge) INTERJEC
6     V(cop,pres,encl) INTERJEC N(com,sing)
12    PRON(pers,plu) V(intr,pres) PREP(ge)
16    N(com,sing) INTERJEC INTERJEC
17    INTERJEC INTERJEC INTERJEC
18    CONJUNC(subord) INTERJEC INTERJEC
21    N(com,sing) INTERJEC PRON(pers,plu)
22    PRON(univ,sing) PRON(univ,sing) N(com,sing)
24    CONNEC(ge) PREP(ge) N(prop,sing)
26    CONJUNC(coord) INTERJEC UNTAG
31    PRON(univ) N(com,sing) PREP(ge)
32    N(com,sing) V(cop,past) ADJ(ge)
34    CONNEC(ge) INTERJEC ADV(ge)
38    CONJUNC(coord) INTERJEC N(com,plu)
41    CONNEC(ge) INTERJEC PREP(ge)
42    ART(indef) INTERJEC N(com,sing)
43    ADV(phras) PREP(ge) ADJ(ge)
46    CONJUNC(subord) INTERJEC PREP(ge)
48    REACT INTERJEC REACT
49    PRON(dem,sing) N(com,sing) INTERJEC
50    V(intr,past) FRM FRM
53    INTERJEC REACT CONJUNC(coord)
54    PRON(dem,sing) INTERJEC N(com,sing)
56    AUX(do,pres,neg) V(montr,infin) ADV(ge)
57    PREP(ge) N(prop,sing) ???
58    INTERJEC UNTAG UNTAG
60    V(montr,infin) PRON(dem,sing) PRON(one,sing)
65    V(cop,past) ADJ(edp) N(com,plu)
66    V(montr,infin) CONJUNC(subord) INTERJEC
68    CONJUNC(subord) INTERJEC N(com,sing)
75    INTERJEC NUM(card,sing) N(com,sing)
78    N(com,plu) PRON(nom) PRON(pers,plu)
81    CONJUNC(subord) INTERJEC CONJUNC(subord)
84    INTERJEC ADV(inten) ADV(inten)
85    INTERJEC N(com,sing) V(cop,pres)
87    CONJUNC(coord) ADV(ge) CONNEC(ge)
89    PRON(pers,plu) PRON(pers,plu) V(montr,past)
90    PRON(pers,plu) V(intr,pres) ADV(ge)
91    INTERJEC CONJUNC(subord) PRON(pers)
92    PREP(ge) ART(indef) N(prop,sing)
93    CONJUNC(coord) INTERJEC N(prop,sing)
94    REACT CONJUNC(coord) PRON(pers,sing)
95    ADJ(ge) PRON(dem,sing) N(com,sing)
96    ADV(ge) CONJUNC(subord) INTERJEC
97    CONNEC(ge) INTERJEC PRON(dem,sing)
98    CONJUNC(coord) INTERJEC INTERJEC
99    PRON(dem,sing) V(cop,pres,encl) INTERJEC
100   CONJUNC(coord) INTERJEC N(com,sing)
101   CONJUNC(subord) PRON(pers,plu) V(cop,pres)
102   ADJ(ge) PRON(pers,plu) V(montr,pres)
103   ART(indef) NUM(ord) N(com,sing)
104   PREP(ge) N(com,sing) INTERJEC
105   INTERJEC PRON(pers,sing) AUX(modal,pres,neg)
106   CONJUNC(subord) PRON(pers) AUX(do,pres,neg)
108   PRON(pers,plu) AUX(modal,pres,neg) V(montr,infin)
109   PRON(pers,plu) AUX(prog,pres) ADV(ge)
110   ADV(inten) ADJ(ge) PRON(dem,sing)
111   PREP(ge) PRON(dem,sing) INTERJEC
113   INTERJEC PRON(dem,sing) N(com,sing)
114   INTERJEC ADV(ge) PRON(pers,plu)
115   ADV(ge) V(intr,ingp) ADV(ge)
116   REACT CONJUNC(coord) ADV(ge)
117   PRON(pers,plu) V(intr,pres) PRON(pers,plu)
118   ART(def) INTERJEC N(com,sing)
122   INTERJEC PRON(quant) PREP(ge)
123   PRON(ass) N(com,plu) CONJUNC(coord)
124   REACT FRM FRM
126   INTERJEC NUM(ord) N(com,sing)
127   ADV(ge) PRON(pers,plu) V(intr,pres)
128   PREP(ge) ART(indef) PREP(ge)
132   CONJUNC(coord) ADV(ge) ADJ(ge)
136   V(montr,pres) PRON(ass) N(com,sing)
140   V(intr,pres) ADV(ge) PREP(ge)
141   ART(indef) PREP(ge) ART(indef)
144   INTERJEC PREP(ge) PRON(dem,sing)
145   PRON(pers,sing) AUX(perf,pres) V(cop,edp)
146   CONNEC(ge) PRON(dem,sing) N(com,sing)
147   N(com,sing) CONJUNC(subord) INTERJEC
151   ADJ(ge) CONJUNC(coord) INTERJEC
152   PRON(pers,plu) V(montr,past) INTERJEC
153   INTERJEC PREP(ge) N(prop,sing)
154   INTERJEC N(com,sing) INTERJEC
155   INTERJEC CONJUNC(coord) ADV(ge)
156   INTERJEC ADV(ge) ADV(inten)
157   CONNEC(ge) PRON(pers,plu) PRON(pers,plu)
159   CONNEC(ge) INTERJEC PRON(pers,plu)
160   ADV(ge) ADV(ge) ADV(inten)
161   V(intr,infin) CONJUNC(coord) INTERJEC
162   PREP(ge) INTERJEC N(com,sing)
164   INTERJEC ADV(inten) ADJ(ge)
165   CONJUNC(coord) INTERJEC PREP(ge)
166   PRON(pers,plu) V(cxtr,pres) PRON(pers,sing)
168   INTERJEC CONJUNC(subord) PRON(pers,plu)
169   N(prop,sing) CONJUNC(coord) INTERJEC
170   PRON(dem,sing) V(cop,pres,encl) PRON(univ)
173   ART(def) N(com,sing) INTERJEC
174   N(com,sing) CONJUNC(coord) INTERJEC
176   N(com,sing) PRON(nom) PRON(pers,plu)
178   CONNEC(ge) INTERJEC CONNEC(ge)
180   CONJUNC(coord) CONJUNC(coord) INTERJEC
181   INTERJEC ADV(ge) PRON(pers,sing)
182   INTERJEC CONJUNC(coord) INTERJEC
183   CONJUNC(coord) INTERJEC PRON(pers,plu)
184   AUX(modal,pres) V(montr,infin) PRON(dem,sing)
185   INTERJEC INTERJEC N(com,sing)
186   CONNEC(ge) CONJUNC(subord) PRON(pers,plu)
187   INTERJEC PRON(pers,sing) PRON(pers,sing)
191   CONNEC(ge) INTERJEC CONJUNC(coord)
198   CONNEC(ge) INTERJEC PRON(pers,sing)
201   INTERJEC PRON(dem,sing) V(cop,pres,encl)
202   CONJUNC(coord) INTERJEC CONJUNC(subord)
204   N(com,sing) INTERJEC N(com,sing)
208   ADV(ge) ADV(ge) ADV(ge)
210   INTERJEC CONNEC(ge) INTERJEC
212   PRON(pers,sing) V(cop,past) ADJ(edp)
213   PRON(dem,plu) N(com,plu) CONJUNC(coord)
215   PRON(dem,sing) ADJ(ge) N(com,sing)
216   CONJUNC(coord) INTERJEC CONNEC(ge)
219   CONJUNC(coord) INTERJEC ADV(ge)
221   INTERJEC N(com,plu) N(com,plu)
222   CONJUNC(coord) INTERJEC PRON(dem,sing)
224   INTERJEC ADJ(ge) N(com,sing)
226   INTERJEC N(com,sing) CONJUNC(coord)
228   ADV(ge) CONJUNC(coord) PRON(pers,plu)
233   INTERJEC N(com,sing) PREP(ge)
239   INTERJEC FRM FRM
240   PRON(pers,sing) V(cop,pres,encl) INTERJEC
241   PRON(pers,plu) V(montr,pres) ART(indef)
243   INTERJEC N(com,sing) N(com,sing)
247   INTERJEC CONJUNC(subord) PRON(pers,sing)
253   PRON(pers,plu) V(montr,past) PRTCL(to)
261   INTERJEC N(prop,sing) N(prop,sing)
263   CONJUNC(coord) INTERJEC CONJUNC(coord)
269   NUM(card,sing) N(com,sing) PREP(ge)
276   N(com,plu) CONJUNC(coord) INTERJEC
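To illustrate the unit of analysis listed in table 4, the following sketch extracts POS-tag trigrams from a tagged utterance, together with the words immediately before and after each occurrence. This is a hypothetical illustration, not the authors' software; the function name `tag_trigrams` and the data layout are assumptions made for the example.

```python
# Hypothetical sketch: extract POS-tag trigrams with surrounding context,
# as listed in table 4 (rank/pre/trigram/post). Not the authors' code.

def tag_trigrams(tagged):
    """tagged: list of (word, tag) pairs for one utterance.
    Returns a list of (pre_word, trigram_tags, trigram_words, post_word)
    tuples, one per trigram window; context words are "" at the edges."""
    rows = []
    for i in range(len(tagged) - 2):
        window = tagged[i:i + 3]
        tags = tuple(t for _, t in window)    # the tag trigram itself
        words = tuple(w for w, _ in window)   # the words it covers
        pre = tagged[i - 1][0] if i > 0 else ""
        post = tagged[i + 3][0] if i + 3 < len(tagged) else ""
        rows.append((pre, tags, words, post))
    return rows

# Example modelled on the rank-12 trigram ("and we stay in ..."):
utterance = [("and", "CONJUNC(coord)"), ("we", "PRON(pers,plu)"),
             ("stay", "V(intr,pres)"), ("in", "PREP(ge)"),
             ("Finland", "N(prop,sing)")]
for pre, tags, words, post in tag_trigrams(utterance):
    print(pre, "|", " ".join(tags), "|", " ".join(words), "|", post)
```

In a corpus study, the counts of such tag trigrams per speaker group would then feed the aggregate comparison described in the paper.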