The automatic recognition of verb patterns: A feasibility study

Oliver Mason and Susan Hunston
University of Birmingham, United Kingdom
Patterns describe the syntactic behaviour of lexical items by specifying their local environment. This paper reports on a pilot study to automatically recognise such patterns in text. The study has been successful in producing a system that identifies patterns using only limited linguistic knowledge. It has also raised several issues which will have to be dealt with in future work.

Keywords: syntax, parsing, pattern grammar, pattern recognition
1. Introduction
In this paper we use the term ‘pattern’ to refer to an approach to language description that prioritises the lexical items in a language and their grammatical dependencies (Francis 1993; Francis et al. 1996; Hunston & Francis 1998, 1999). Patterns are expressed as sequences of elements, each element comprising a lexical item, word class, group, or clause. Recognising patterns in running text automatically, that is, by computer, is not a trivial task, but it is an essential step towards large-scale analysis of textual data. Patterns are a useful approach to describing language for a variety of purposes; one of the primary objectives of this work is to advance the general analysis of language by computer. This would enable a number of further, more applied investigations, for instance in the areas of language analysis, the evaluation of patterns, and language processing:

– Once patterns can be reliably recognised by computer it is possible to gather information about the distributional characteristics of patterns, i.e. to assess the relative frequencies of different patterns in various registers. This would complement the information about a restricted set of patterns in Biber et al. (1999: 649–654, 663–722, 756–758).
– Annotating textual data with patterns allows us to evaluate the comprehensiveness of current descriptions of English in terms of pattern, both in terms of how complete the pattern inventory is for any particular word and how many words can be described through their patterns in a given text (analogous to the basic vocabulary which covers around 90% of a normal text).
– Identification of patterns would also facilitate further applications of a pattern approach to English, such as information identification and extraction. Patterns could also be integrated with a local-grammar approach (e.g. Gross 1993; Hunston & Sinclair 2000).
We report here the first stages of a pilot project designed to assess the feasibility of identifying patterns automatically in English language data. The patterns we are dealing with for this pilot are those dependent on verbs. We start with a more detailed description of what verb patterns are, followed by the computational processing aspect of the project. Then we look at the results of the recognition procedure and present an evaluation of the successes and problems we encountered. Finally, we outline future plans. The examples in this paper are taken from the Bank of English corpus (University of Birmingham / HarperCollins publishers).
2. Verb patterns

Although words of all classes can be described in terms of their patterned behaviour, it is verbs that lend themselves to the most comprehensive description, and it is with verbs only that this pilot study is concerned. Essentially, verb patterns characterise the possible complementation patterns of a verb as a sequence of elements. This approach to the grammar of the verb involves only a very shallow analysis, and contrasts with functional analyses of verbal behaviour that identify Subject, Object and Complement clause elements (e.g. Quirk et al. 1985; Karlsson et al. 1995), or case or participant roles (e.g. Fillmore 1969; Halliday 1994). The elements that comprise verb patterns may be grouped as follows:

– The pattern includes a clause element, e.g. V that (verb + that-clause), as in Her uncle had suggested that she work for
his company during her gap year; or V n wh- (verb + noun group + wh-clause), as in I asked her whether it was difficult to go on collecting without her.
– The pattern includes one or more group or word class elements, e.g. V n (verb + noun group), as in There is no danger in eating some saturated fat; V n adj (verb + noun group + adjective/adjective group), as in Jude kicked the door open; or V adv (verb + adverb), as in They ate well at lunch time.
– The pattern includes one or more specific lexical items, e.g. V as n (verb + as + noun group), as in She resigned as an MP; V n from n (verb + noun group + from + noun group), as in . . . to release him from his contract; or V poss way prep/adv (verb + possessive + way + prepositional phrase or adverb), as in I somehow talked my way out of a beating.
In the 1995 Collins Cobuild English Dictionary (Sinclair et al. 1995), the entry for each sense of each verb was annotated to show the patterns occurring with that sense in the Bank of English corpus. Francis et al. (1996) collected lists of all the verbs occurring with each pattern. As described elsewhere, these publications supported Sinclair’s observation that pattern and meaning comprise a unified phenomenon, as do grammar and lexis (Sinclair 1991; Francis 1993; Hunston & Francis 1999). For our purposes in this paper, however, it is sufficient to note the availability of a description of the complementation patterns of about 5000 of the most frequent verbs in the Bank of English.

All the work that has been carried out so far on the identification of verb patterns has been done manually, that is, by lexicographers and grammarians examining concordance lines to identify patterns and the meanings with which they are associated. This has proved sufficient to compile the reference publications mentioned above, but further large-scale quantitative studies cannot proceed unless the computer takes over the task of identifying patterns in a corpus of running text. It is to this process of identification that we now turn. At this point we would like to stress that pattern analysis is not directly related to work on subcategorisation frames (e.g. Brent 1991). Though both approaches cover similar areas of linguistic description, patterns are more general and not restricted to verbs. Furthermore, they are more specific in the way the actual syntactic environment of a word is described.
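To make the notion of a pattern as a sequence of elements concrete, the sketch below shows one possible machine-readable encoding. It is purely illustrative and is not the representation used in the present system, which stores the patterns in the literal form found in the dictionary (as described below); the "lex:" prefix and the element labels are our own assumed conventions.

# A hypothetical encoding of verb patterns as element sequences (sketch only).
# Elements are either group/clause labels ("n", "adj", "that", "wh", "to-inf",
# "poss", "prep/adv") or literal lexical items, marked here with a "lex:" prefix.
# The pattern names follow Sinclair et al. (1995) and Francis et al. (1996).

PATTERNS = {
    "decide": [
        ["V"],                    # V        (the verb on its own)
        ["V", "that"],            # V that   (I decide I hate the countryside ...)
        ["V", "wh"],              # V wh     (decide whether ...)
        ["V", "to-inf"],          # V to-inf (decide to act)
        ["V", "lex:on", "n"],     # V on n   (decided on the boat)
    ],
    "talk": [
        ["V", "lex:to", "n"],                  # V to n
        ["V", "poss", "lex:way", "prep/adv"],  # V poss way prep/adv
    ],
}

def is_literal(element):
    # Literal lexical items carry the "lex:" prefix in this sketch.
    return element.startswith("lex:")

In a fuller version, each sense of a verb would hold its own pattern list, so that the correspondence between sense and pattern remains available for later word sense disambiguation.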
3. Shallow parsing

As noted above, the various constituent elements used in the description of patterns are on different linguistic levels (clauses, groups, words). Some elements can be directly mapped on to the input text, for example explicitly specified lexical items. Others, namely groups and clauses, require more sophisticated processing. It is particularly important for the pattern recogniser to be able to identify noun groups (ranging from a single noun or pronoun to an expanded group with determiner and modifiers) and finite clauses, as these occur in so many different patterns. Thus, before the patterns themselves are identified, the corpus has to undergo a form of shallow parsing. By ‘shallow’ we refer to the level of completeness of the structural description achieved: we are only interested in the identification of constituent elements, not in the full syntactic structure of the sentences (or phrases) in question. As a result, the parsing process is considerably simplified and more robust, i.e. it can deal with a wide range of language material, unlike any parser that attempts a full analysis of unrestricted input.

As a pre-processing step the text is tagged with a part-of-speech tagger (Qtag, see Tufis & Mason 1998). This probabilistic tagger assigns a set of likely word class labels to every input token and helps us to identify the verbs whose patterns we are trying to recognise, together with their syntactic environment. Then a noun group recogniser is used (as described in Mason & Uzar 2000) to mark up noun groups. Noun groups are comparatively limited in their variability, and are thus easily recognised. The recogniser used is based on a transition network which describes possible sequences of part-of-speech labels or literal words (Winograd 1983). A more elaborate version of this recogniser has been adapted to also label verb groups and potential clauses. For this, an exhaustive search for all possible groups or phrases is performed, and those found are entered in a chart. This chart will contain a number of potential readings of the sentence which might only be valid ‘locally’, i.e. would not make grammatical sense when looking at the whole sentence. They are the result of ambiguities, mainly caused by multiple potential word class assignments. As a result the chart contains more constituents than actually exist, but the subsequent processing steps will filter out all those which are not possible. However, a small number of genuine ambiguities remain which cannot be resolved on a purely syntactic level. At this stage we are not worried about those; on the contrary, we want the parser to identify as many potential constituents as possible, as the pattern
recogniser can only operate on elements identified at this stage. Limiting the number of elements here might cause the program not to identify patterns later on, whereas superfluous constituents are simply ignored afterwards. This does allow us to delay decisions on PP-attachment: if a preposition is not required by a pattern it will be attached to a preceding noun group; otherwise it will become part of the pattern.

The shallow parser based on the group/phrase recogniser operates in a ‘layered’ fashion: first, the basic constituents are identified, which are unlikely to contain others embedded in them; after that, a number of passes are made through the chart, grouping the already identified constituents together to form larger elements. For example, a noun group followed by a verb group (and possibly a number of further noun groups) will be treated as a candidate for a clause. It might not in fact be a clause in the sentence in question, but if a verb pattern expects to find a clause in that position, we can assume that the grouping was valid. This allows us to identify subordinate clauses even if they are not introduced by a conjunction or relative pronoun. Subordinate finite clauses are easy for the parser to identify if they have a specified indicator, such as that, but less easy if they do not. For example, in

(1) If you decide you want to get pregnant, ...
the correct pattern of the verb decide is V that (verb followed by that-clause), but the indicator which might be used to mark this, the word that, is absent. In this case the shallow parser identifies potential clauses, such as you want to get pregnant, by looking for a simple noun-verb sequence. This approach works fine as long as the exact extent of the element is not important: we can discover the presence of a clause, but not identify all its component parts. However, at this stage we are only interested in the presence of a clause, and subsequent analysis could process the clause more thoroughly if required. At this stage in the annotation process, then, we have identified some groups, clauses and individual words, but not yet patterns. In other words, we have not specified the dependence of the groups etc. on specific verbs. Thus the parser itself is independent of our theoretical approach, and any kind of syntactic disambiguator could now be used to find the correct path through the chart of possible constituents. In our approach, the next stage is to identify the patterns in the parsed text.
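As a rough illustration of the layered approach just described, the sketch below marks noun-group and verb-group spans over part-of-speech tags and then enters noun group + verb group sequences into the chart as clause candidates. The tag set and the grouping rules are simplifying assumptions; the actual recogniser uses a transition network over Qtag's output (Mason & Uzar 2000).

from typing import List, Tuple

NG_TAGS = {"DET", "ADJ", "NOUN", "PRON", "NUM"}   # assumed simplified tag set
VG_TAGS = {"VERB", "AUX", "TO"}

def find_groups(tagged: List[Tuple[str, str]]):
    """First layer: mark maximal noun-group and verb-group spans [start, end)."""
    spans, i = [], 0
    while i < len(tagged):
        tag = tagged[i][1]
        if tag in NG_TAGS:
            label, allowed = "NG", NG_TAGS
        elif tag in VG_TAGS:
            label, allowed = "VG", VG_TAGS
        else:
            i += 1
            continue
        j = i
        while j < len(tagged) and tagged[j][1] in allowed:
            j += 1
        spans.append((label, i, j))
        i = j
    return spans

def find_clause_candidates(spans):
    """Second layer: a noun group immediately followed by a verb group is
    entered in the chart as a potential (sub)clause; over-generation is
    deliberate, as unused candidates are simply ignored later."""
    return [("CLAUSE", ng[1], vg[2])
            for ng, vg in zip(spans, spans[1:])
            if ng[0] == "NG" and vg[0] == "VG" and ng[2] == vg[1]]

tagged = [("you", "PRON"), ("want", "VERB"), ("to", "TO"),
          ("get", "VERB"), ("pregnant", "ADJ")]
chart = find_groups(tagged)
chart += find_clause_candidates(chart)
# chart now holds NG(you), VG(want to get), NG(pregnant) and a clause
# candidate spanning 'you want to get'.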
4. Pattern recognition

Once the text has been analysed in terms of potential groups and clauses, the next step of the analysis can begin: recognising the patterns themselves in the data. In this section we describe in more detail how this has been done. A list of the verbs and their patterns has been extracted from Sinclair et al. (1995). Some additional patterns have been added from the later volume (Francis et al. 1996), which provides more complete coverage, though not of all the verbs in the dictionary.1 The patterns are stored in the literal form in which they are found in the dictionary, to make it easier for (human) editors to add further patterns at a later stage if necessary. Each meaning of each verb, with its associated patterns, is listed separately, so that we keep open the option of using the pattern recogniser as a possible tool for word sense disambiguation, where there is a direct correspondence between sense and pattern.

When a verb is encountered in the input stream, it is lemmatised and located in the list of verbs and patterns. All available patterns for all its different meanings are then retrieved from the list and the matching process begins, using an exhaustive search which filters out those patterns whose individual components cannot be found in the text. The remaining patterns are then ranked according to a weighting (which is described in more detail below), and the highest scoring pattern is selected. Once a sentence has been processed, it is printed in tabular format; Figure 1 gives an example, showing only the pattern of the verb decide which applies to the sentence.
                V       to-inf
Should  the UN  decide  to act  ,  it  has  two  choices  .

Figure 1
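The lookup-and-match step just described, together with the ranking criteria discussed below under Multiple patterns (more elements first, then more literal lexical items, then more input tokens matched), might be sketched as follows. The chart format and element labels carry over from the earlier sketch and are assumptions rather than the authors' implementation; lemmatisation is omitted, and matching of individual elements is reduced to a simple chart lookup.

def match_element(element, chart, position):
    """Return the end of a constituent that could realise `element` at
    `position`, or None. Mapping chart labels (noun group, clause candidate,
    literal word) onto pattern elements is glossed over here: the sketch
    simply assumes chart entries already carry pattern-element labels."""
    for label, start, end in chart:
        if start == position and label == element:
            return end
    return None

def match_pattern(pattern, chart, verb_end):
    """Match the post-verb elements left to right; return the number of
    input tokens consumed, or None if any element cannot be found."""
    pos = verb_end
    for element in pattern[1:]:          # pattern[0] is the verb slot "V"
        end = match_element(element, chart, pos)
        if end is None:
            return None
        pos = end
    return pos - verb_end

def rank_key(pattern, tokens_used):
    """Ranking criteria: more elements, then more literal items, then more tokens."""
    literals = sum(1 for e in pattern if e.startswith("lex:"))
    return (len(pattern), literals, tokens_used)

def best_pattern(patterns, chart, verb_end):
    """Keep only the patterns whose components are all found, and pick the
    highest-ranked survivor."""
    candidates = []
    for pattern in patterns:
        used = match_pattern(pattern, chart, verb_end)
        if used is not None:
            candidates.append((rank_key(pattern, used), pattern))
    return max(candidates)[1] if candidates else None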
The tabular form of representation is chosen because it facilitates the representation of overlapping patterns that occur when a clause contains more than one verb.2 Figure 2 demonstrates the phenomenon of ‘pattern flow’ (Hunston & Francis 1999: 207). Pattern flow occurs when an item that is a component of one pattern is also the starting-point of another pattern. In Figure 2, for example, the verb want is part of the that-clause dependent on decide, but it also begins the pattern V to-inf. This phenomenon can be viewed hierarchically, with one clause shown as embedded inside another.
           V        that
                          V      to-inf
                                 V        adj
If   you   decide   you   want   to get   pregnant

Figure 2

Figure 3 [the same clause as in Figure 2, laid out hierarchically, with each embedded clause shown as an element within the pattern above it]
Figure 3 shows the hierarchical interpretation of the same clause as in Figure 2. Alternatively, the clause may be viewed as a linear phenomenon, as in Figure 2, in which case what is seen to be important is the beginning of each component, but not its end (see Brazil 1995 for a discussion of linear grammar, and Hunston & Francis 1999: 241–244 for its application to pattern analysis). In terms of pattern identification, Figures 2 and 3 represent the same analysis, but Figure 3 shows a traditional hierarchical representation, with one clause realising an element in another clause, whilst Figure 2 shows a linear representation, in which the components of one pattern are no longer significant once a new pattern has begun. In terms of language theory, a linear representation is more novel, and therefore more interesting. It is consistent with Tognini-Bonelli’s injunction that corpus investigations should inspire new ways of looking at language, rather than simply confirming old ones (Tognini-Bonelli 2001). It does, however, pose problems for text mark-up systems based on SGML or XML, which presuppose that language is structured hierarchically and which therefore require that the end of an element be marked in addition to its beginning. In effect, such markup requires3 that a corpus be parsed using a hierarchical structure, whereas our aim is to recognise patterns without necessarily doing a complete (phrase-structure) parse of the corpus. For this preliminary study we have avoided the mark-up problem by not actually marking up running text; instead we have only looked at isolated sentences, for which we have generated annotation in a table format similar to the one shown above. This is fine for visual inspection, but in order to facilitate
automatic processing, a different solution needs to be found. This also raises the issue of pattern boundaries, as it is not always possible to identify unambiguously where a pattern component ends (e.g. with a potential clause). This information is not important if the aim is just to recognise patterns (e.g. for quantitative analysis of pattern distributions), but might be more relevant for further processing of texts in areas such as information extraction. In summary, we use only limited linguistic knowledge, namely the syntactic structure of the basic phrases recognised in the input stream. No further grammatical information (e.g. on verb subcategorisation) is used, apart from the patterns themselves.
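For machine processing, one conceivable way around the nesting requirement of SGML/XML, not adopted in this pilot, is standoff annotation in which each pattern instance records only token offsets; overlapping instances can then coexist without any containment relation. The record format below is purely illustrative.

# Hypothetical standoff records for the clause in Figure 2
# (token indices: 0=If 1=you 2=decide 3=you 4=want 5=to 6=get 7=pregnant);
# element spans are (label, start, end) over token positions, end exclusive.
pattern_instances = [
    {"verb": "decide", "pattern": "V that",
     "elements": [("V", 2, 3), ("that", 3, 8)]},
    {"verb": "want", "pattern": "V to-inf",
     "elements": [("V", 4, 5), ("to-inf", 5, 8)]},
    {"verb": "get", "pattern": "V adj",
     "elements": [("V", 6, 7), ("adj", 7, 8)]},
]

Since such records do not nest, an approximate end position for a clause element does not affect any other annotation, which suits the linear view taken here.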
5. Evaluation

5.1 Method

In order to evaluate our preliminary pattern recognition program, we applied it to 100 instances of the verb decide, taken at random from the Bank of English corpus.4 We chose this verb because it is a good example of a verb with a number of patterns, some involving clauses and some involving prepositional phrases. The patterns of the verb, i.e. the patterns that the program had available to allocate to the instances, were taken from Sinclair et al. (1995). The results were evaluated in the sense that the pattern for each instance was marked as ‘correctly identified’ or ‘incorrectly identified’. In the case of incorrect identification, the correct pattern was noted. A pattern was judged to be correctly identified if the tabular representation showed the beginning of each element in the pattern; there was no requirement for the complete element to be shown. For example, the representation in Figure 4 was said to be correct, even though the table did not identify the whole of the that-clause, only its beginning. In other words, in terms of the discussion above, we took a linear rather than a hierarchical approach to pattern.
                  V        that
About now ,   I   decide   I hate the countryside even more than ever .

Figure 4
From a methodological point of view the ideal evaluation procedure would be to annotate a sample corpus for testing with the patterns of some or all of the verbs occurring in it, and then to compare it (automatically) with the same sample processed by the system. We could then quantify its performance in terms of precision and recall: recall would indicate how many verbs were assigned a pattern by the system, and precision would rate how many times the correct pattern was chosen. Precision can usually be kept high by discarding all the difficult cases; this, however, would drive recall down. In an ideal world both precision and recall would be maximised, but unfortunately one often has to compromise and trade off a high value of one measure against a lower score in the other.

5.2 Results

From the 100 lines the recogniser assigned the correct pattern in 85 cases. Of the remaining 15, there was one pattern that was identified by the evaluator but which was not listed in Sinclair et al. (1995) (decide between), six were in non-canonical form (see below), and eight were cases where the wrong pattern had been assigned. The various apparent causes of erroneous pattern recognition are discussed below. A further evaluation of the pattern approach is reported in Mason (2004), based on a different mechanism for recognising pattern instances. Those results, from studying the randomly chosen verbs blend and link, are very promising, with even better performance than the algorithm described in the current paper.

5.3 Intervening words and pattern ambiguity

One problem for the recogniser is that of intervening words. These are words and phrases that occur between one element of the pattern and the next, as in this example:

(2) Mehari said yesterday he would decide next week whether the jury . . .
The pattern recogniser erroneously identified the pattern of decide in this example as V instead of V wh. The adverbial group next week stopped the system from recognising the wh-clause, as it was only looking for it immediately after the verb. Most of the examples that fall into this category include phrases that are adverbial modifiers, which can be placed at virtually any position within
a sentence. Fortunately, these cases can be dealt with fairly easily, by marking them as adverbial groups and making the pattern recogniser skip them during the matching process. This might require some ‘deeper’ analysis, for example to identify date expressions, a fairly straightforward task given current NLP techniques.

Intervening words turn out to be an important issue when investigating apparent ambiguity in grammar patterns. It could be argued that automatic pattern recognition is unlikely to be successful because verb + preposition combinations are inherently ambiguous. For example, the sequence They decided on the boat could illustrate the pattern V on n, with the meaning ‘They decided that they would travel by boat’. Alternatively, it could illustrate the pattern V, followed by an adverbial prepositional phrase, meaning ‘They took a decision, and they happened to be on the boat at the time’. Biber et al. (1999: 36), for instance, use an argument similar to this as a rationale for using manual annotation of their corpus. It certainly might appear that a program that simply looks for decide followed by on, without further manual correction, is likely to get many false hits. Sinclair (1991: 104–105), however, argues that ambiguity is much rarer in practice than it is in theory, and that where a single sequence of words has two possible meanings, one will be much more frequent than the other. One of the hypotheses to be tested in larger-scale versions of the study reported here would be that ambiguity will (or will not) cause a substantial problem for the recogniser.

As a preliminary investigation, 100 instances of decided on were selected at random from the Bank of English corpus and examined manually to see whether or not they realised the pattern V on n. Excluding passives (14 instances), 75 (87%) of the remaining 86 instances were examples of the target pattern. The remainder (13%) were instances of prepositional phrases beginning with on intervening in other patterns. Four instances were of the type decided on [date] to/that/whether . . . , and seven were of the type decided on the spot/instant / on a whim / on grounds of taste . . . There were no examples of the second decided on the boat type mentioned above. As long as the recogniser could identify passives (see below) and intervening adverbials, it is unlikely that substantial ambiguity would result in practice, though it would remain a theoretical possibility. It can be expected that further studies will come up with similar results: in many places structural ambiguities are possible, but in practice they are not realised. Otherwise communication would be unnecessarily complicated.
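The fix suggested at the beginning of this section, skipping intervening adverbial groups during matching, could be added to the matcher sketched earlier along the following lines; the ADVG chart label is an assumption.

def skip_adverbials(chart, position):
    """Advance past any adverbial group starting at `position` (e.g. 'next
    week' in example (2)) before the next pattern element is matched."""
    moved = True
    while moved:
        moved = False
        for label, start, end in chart:
            if start == position and label == "ADVG":
                position = end
                moved = True
    return position

In the match_pattern sketch above, pos would be passed through skip_adverbials before each call to match_element.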
5.4 Multiple patterns

Most verbs have more than just one pattern. According to Sinclair et al. (1995), decide, for example, has twelve patterns distributed among five senses. A list of patterns extracted from the whole of Sinclair et al. (1995) comprises 20 051 patterns for 4819 word types. (This list, however, counts separately patterns for different senses of a word, even if the pattern itself is the same for more than one sense. By this count, the five senses of decide have seventeen patterns between them.) The verbs with the largest number of patterns are take (156), go (148), and get (142). These verbs are classed as ‘super headwords’ in Sinclair et al. (1995) and have a large number of distinct senses and sub-senses, which often share identical patterns. In Sinclair et al. (1995) as a whole, there are 26 words with 50 or more patterns, 359 with 10 or more, 1109 with 5 or more, and 3396 with 2 or more. Interestingly, the distribution is not as one would expect from Zipf’s law (Zipf 1949): there are only 423 words with a single pattern (and therefore also a single meaning), whereas one would expect there to be many more of these words than there are of the ‘two or more’ meaning words. The most likely explanation for this phenomenon is that it is an artefact of the way the dictionary was created. For example, senses might be split up to make them easier to explain to a learner, or rare meanings could have been omitted so that only one sense of a polysemic word appears. One reason is certainly that, as a learner’s dictionary, it deals only with the most frequent words, which can be expected to have higher polysemy (see Köhler 1986) and therefore multiple patterns, omitting a large number of low-frequency single-sense/single-pattern words. The tail end of the distribution, so to speak, has been cut off by editorial decision. As a consequence, a number of rare patterns are missing from the list. This should not, however, be a problem in practice, since these patterns will by definition not occur very often. It is mentioned here as a reminder that all secondary language resources, however reliable, have to be assessed to ascertain how well they model the way language behaves.

With so many words having multiple patterns it does not come as a surprise that, quite often, more than one pattern appears to achieve a match for a given instance, as shown in Figure 5. The verb decide has both the pattern V that and the pattern V n. In this example, from the computer’s point of view, their place can be identified either as the subject of a that-clause with omitted that, or simply as a noun complementing the verb decide. The recogniser finds both alternative interpretations,
                     V        that
                     V        n
Modern women   may   decide   their place is at home with the children .

Figure 5
but V that is ranked higher (as indicated by the fact that it is in a higher row in Figure 5). The higher-ranking pattern can be identified by the following two criteria: a longer matching pattern is chosen before a shorter one (with length defined as number of elements), and a pattern with more lexical items is chosen before one which contains word categories (specific as opposed to more general). If two pattern candidates have the same number of elements (as in the example above), the one which uses more input tokens is preferred (so V that wins 5:2 over V n). In the evaluation this heuristic selected the correct pattern in all except one of the cases.

5.5 Non-canonical patterns

One problem that we could not solve during this pilot study is that of non-canonical patterns (Francis et al. 1996: 611–615). A non-canonical pattern is one where the word order does not follow the prototypical sequence. For example, the verb talk occurs with, among others, the pattern V to n (I haven’t talked to him today). This pattern may also, however, be passive (children who are talked to . . . ) or may occur as part of another structure that ‘moves’ part of the pattern (to whom I talked; who I talked to; she was easy to talk to). An example from the current study is:

(3) The question a director has to decide is how...,
where the direct object of decide has been moved into the theme-slot for emphasis. In order to cope with this example, the sentence pattern N1-N2-V could be interpreted as a ‘transformation’ of N2-V-N1, and the pattern could be matched that way. However, one has to be careful that this does not introduce too many other mistakes as a side effect. A more complex case of a non-canonical pattern is:

(4) What women and men have to decide is exactly what, in the 1990s, is a father for?
Here we have a repeated wh-pronoun which has been moved to the front of the sentence. The appropriate pattern would be V wh, but the interspersed is exactly makes it very hard to find this pattern. The canonical form of this example would be (not attested):

(5) Women and men have to decide what exactly, in the 1990s, a father is for.
With this form it would be no problem for the system to correctly spot V wh as the pattern in question. It might be possible to infer this from the sequence Wh-N-have-to-V, but again one has to be cautious not to relax constraints too much. One encouraging fact about non-canonical patterns is that their range is rather limited (Francis et al. 1996: 611–615). There are not very many constructions (such as clefts or relative clauses) which disturb the ‘normal’ order of a pattern. This indicates that it should not be an unsolvable problem. Quantitatively this is not a major problem either; as reported above, 6% of the encountered patterns of decide were in non-canonical form. Similar results have been achieved by Mason (2004), where performance was improved by adding ‘transformed’ rules dealing with the passive voice version of the respective pattern. This can be done automatically in a pre-processing step applied to the pattern list, reducing the overall number of problematic cases.

5.6 Tagging errors

A problem of a rather different kind that came up during the development of the system was that of tagging errors. The tagger used for this study is wholly automatic; its output has not been manually checked, and it is an important principle of our work that human involvement should be avoided in the process as much as possible (Sinclair 1992). It is quite common for English words to be verb-noun homographs, which means that the tagger can easily assign the wrong tag. This is especially true in the case of verbs, which seem to get wrongly tagged as nouns more often than nouns get tagged as verbs, probably because nouns usually occur within a more restrictive context (following determiners/adjectives). Obviously this wrong tagging mainly applies to the base form or the third person singular of the verb, but it can also happen to -ing forms, if they are frequently used as nominalisations (e.g. making). This is not a problem in the case of decide, which can never be a noun, but it can lead to structural ambiguities that can result in choosing an incorrect interpretation.
Missing out verbs is of course fatal for our recogniser, as we would then not look for their associated patterns. However, having superfluous verbs is not as serious a problem, as a noun wrongly tagged as a verb would not normally occur in a verbal pattern within a sentence. This means that our system can spot false verbs more easily (due to their environment) than it can false nouns (which should have been tagged as verbs). Thus, in order to increase recall, the tagger was set up to pass on the two most likely tags instead of simply the single best tag. This leads to an increase in ambiguities that the recogniser has to deal with, but it also results in a higher recall without much impact on precision. Only in those cases where a verb can stand on its own (the pattern V) do we end up with a genuine ambiguity that cannot be resolved by the pattern matcher, as no further information in terms of complementation patterns is available to reveal that there is a noun that has been tagged as a verb.

One remaining ambiguity causing problems is the contraction ’s. In the fragment . . . decide the party’s policy for . . . , the recogniser decided that this was a contraction of is, and marked it as a clause candidate which got interpreted as a that-clause (“. . . decide that the party is policy for . . . ”). From a purely syntactic point of view, i.e. ignoring the lexical items involved, this was a valid interpretation, but as the party is policy for has no recognisable meaning, the interpretation was obviously wrong. This in fact is a genuine ambiguity that cannot be resolved on a purely syntactic level, and thus this type of misinterpretation is unavoidable, regardless of which tagger is used for assigning the part-of-speech labels. It would be possible to restrict the range of constructions accepted as candidate clauses, but that would then lead to missing potential that-complements where that has been omitted. This again is a case of the precision/recall trade-off: loosening the requirements for recognising potential clauses increases recall but reduces precision. The choice ultimately depends on the purpose of the analysis or application, and whether it would be preferable to identify correct patterns only, at the cost of missing out some.
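The recall-oriented tagging set-up described above, keeping the two most likely tags per token, can be sketched as follows; the probability interface is hypothetical and does not correspond to Qtag's actual API.

def top_two_tags(tag_probs):
    """Keep the two most likely tags for a token instead of only the best one,
    so that a verb reading mis-ranked below a noun reading remains visible to
    the chart builder and the pattern matcher."""
    ranked = sorted(tag_probs.items(), key=lambda kv: kv[1], reverse=True)
    return [tag for tag, _ in ranked[:2]]

# e.g. 'making' might receive {"NOUN": 0.55, "VERB": 0.45}; both tags are kept,
# and readings that fit no verb pattern are filtered out later.
print(top_two_tags({"NOUN": 0.55, "VERB": 0.45}))   # ['NOUN', 'VERB']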
6. Conclusion

Despite limited resources we were able to demonstrate through a pilot study that it is indeed possible to detect patterns automatically in open text. There are still a number of unresolved problems, some of which require quite a lot
more effort on the computational pre-processing side. We also need to evaluate the system on a larger scale, which really requires a means of automatic precision/recall analysis against an annotated sample corpus. This, however, requires a suitable mark-up scheme, and it remains to be seen whether XML-based schemes can be used for overlapping non-hierarchical structures without too much processing overhead. Another question is that of completeness. As mentioned, Sinclair et al. (1995), the primary source of the patterns used in the system, is a learner’s dictionary and thus restricts itself to the more frequent words. This means that the inventory available to the pattern recogniser is necessarily incomplete, and we must find a way of coping with the ‘knowledge acquisition bottleneck’: what we want to avoid is the laborious manual addition of further patterns; ideally, we would find a way of automatically analysing a text sample and getting the system to suggest a number of potential patterns for each verb.

A larger-scale project is therefore feasible, and we would argue that it is desirable. As mentioned in the introduction, there are a number of possible applications of work of this kind. One such application is the quantification of patterns appearing in different registers. Biber et al. (1999: e.g. 674, 698, 749–750) calculate the distribution of a number of different complementation types across a variety of registers. They note, for example, that non-finite complement clauses (that is, to-infinitive clauses and ‘-ing’ clauses dependent on verbs, nouns and adjectives) occur most frequently in written registers such as news reporting, fiction and academic prose, whereas finite complement clauses (that-clauses and wh-clauses dependent on verbs, nouns and adjectives) are more frequent in conversation (Biber et al. 1999: 749). Such findings are useful starting points in the investigation of style. Identification of a more comprehensive set of complementation patterns would allow those investigations to be more productive. Corpora consisting of texts from different time periods, and those consisting of language produced by different kinds of speakers (e.g. adults or children, native speakers of English or learners), would also yield useful information if run through this kind of program. It is possible that one of the mechanisms of language change is the alteration of patterns co-occurring with particular lexical items (Hunston & Francis 1999: 97–98). The acquisition of patterns is one of the processes of language development that children and learners go through.

A second application, which has been developed by Mukherjee (2001), is the investigation of the various patterns associated with the same lexical item. Mukherjee, for example, compares instances of the verb provide occurring with
three patterns: V n for n, V n with n, and V n to n. This investigation allows Mukherjee to suggest contextual constraints on the selection of one alternative pattern rather than another. A corpus in which more patterns were identified could allow a test of the hypothesis that similar constraints operate across many pattern pairs.

On a more ambitious scale, automatic pattern recognition would facilitate the identification and extraction of information in and from a corpus, using the mechanism of local grammar (Barnbrook & Sinclair 1995; Allen 1999; Hunston & Sinclair 2000; Woodward 2002). In a local grammar, specific meaning elements are mapped on to the elements in a grammar pattern. For example, as noted by Woodward (2002), the verb distinguish has the pattern V n from n, as in the example:

The search for explanations distinguishes science from other human activities.
Other verbs, such as differentiate, separate, mark off and set apart, have the same pattern and meaning elements. A program that recognises these verbs and this pattern could also identify the two items being compared (science and other human activities) and the distinguishing feature (the search for explanations) in each case. If pattern elements can be recognised in a large, open corpus, so can the meaning elements.

Finally, on a more theoretical level, we have argued above that the feasibility of an automatic pattern recogniser depends on the hypothesised low level of ambiguity in actual, as opposed to invented, language data. The evaluation of output from an automatic pattern recogniser would test that hypothesis.
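As an illustration of how pattern elements could be mapped onto meaning elements in a local grammar, the sketch below handles the V n from n pattern of distinguish-type verbs discussed above; the role labels paraphrase that discussion, and everything else is an assumption rather than an existing implementation.

# Mapping the elements of a matched 'V n from n' instance onto the meaning
# elements of a local grammar of difference (cf. Woodward 2002).
DIFFERENCE_VERBS = {"distinguish", "differentiate", "separate",
                    "mark off", "set apart"}

def extract_difference(subject, n1, n2):
    """Given the groups matched as the Subject and the two n elements of
    'V n from n', return the corresponding meaning elements."""
    return {
        "distinguishing feature": subject,
        "item compared": n1,
        "item compared with": n2,
    }

print(extract_difference("The search for explanations",
                         "science", "other human activities"))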
Notes

1. Because of the very large number of verbs with the simple patterns V (verb followed by no dependent noun group, clause or preposition) and V n (verb followed by a noun group), only the most frequent are included in the Francis et al. (1996) study.

2. The current study deals only with verbs. It could easily be extended to include the patterns of other word classes, as represented in Sinclair et al. (1995) and in Francis et al. (1996). In that case, the table would show pattern overlap whenever another word (not just a verb) with a pattern was encountered.

3. It is of course technically possible to represent a linear overlapping analysis using SGML-based markup, but that would remove the basic advantages of the markup.
4. One hundred concordance lines were selected automatically by a random selection program, which selects every nth instance of a word from all the instances in the corpus.
References

Allen, C. (1999). A local grammar of cause and effect. Unpublished MA dissertation, University of Birmingham.
Barnbrook, G., & Sinclair, J. (1995). Parsing Cobuild entries. In J. Sinclair, M. Hoelter & C. Peters (Eds.), The Languages of Definition: The Formalisms of Dictionary Definitions for Natural Language Processing. Studies in Machine Translation and Natural Language Processing (pp. 13–58). Luxembourg: European Commission.
Biber, D., Johansson, S., Leech, G., Conrad, S., & Finegan, E. (1999). Longman Grammar of Spoken and Written English. London: Longman.
Brazil, D. (1995). A Grammar of Speech. Oxford: Oxford University Press.
Brent, M. (1991). Automatic Acquisition of Subcategorization Frames from Untagged Text. In Proceedings of the 29th Meeting of the ACL (pp. 209–214). Berkeley, CA.
Fillmore, C. J. (1969). Toward a Modern Theory of Case. In D. A. Reibel & S. A. Schane (Eds.), Modern Studies in English (pp. 361–375). New Jersey: Prentice Hall.
Francis, G. (1993). A Corpus-driven Approach to Grammar: Principles, Methods and Examples. In M. Baker et al. (Eds.), Text and Technology (pp. 137–156). Amsterdam: Benjamins.
Francis, G., Hunston, S., & Manning, E. (1996). Collins Cobuild Grammar Patterns 1: Verbs. London: HarperCollins.
Gross, M. (1993). Local Grammars and their Representation by Finite Automata. In M. Hoey (Ed.), Data, Description, Discourse (pp. 26–38). London: HarperCollins.
Halliday, M. A. K. (1994). An Introduction to Functional Grammar. 2nd edition. London: Arnold.
Hunston, S., & Francis, G. (1998). Verbs observed: A corpus-driven pedagogic grammar. Applied Linguistics, 19, 45–72.
Hunston, S., & Francis, G. (1999). Pattern Grammar: A corpus-driven approach to the lexical grammar of English. Amsterdam: Benjamins.
Hunston, S., & Sinclair, J. (2000). A Local Grammar of Evaluation. In S. Hunston & G. Thompson (Eds.), Evaluation in Text: Authorial Stance and the Construction of Discourse (pp. 75–100). Oxford: OUP.
Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (Eds.). (1995). Constraint Grammar: A language-independent system for parsing unrestricted text. Berlin: Mouton de Gruyter.
Köhler, R. (1986). Zur Linguistischen Synergetik: Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
Mason, O. (2004). Automatic Processing of Local Grammar Patterns. In Proceedings of CLUK 2004 (pp. 166–171). University of Birmingham.
Mason, O., & Uzar, R. (2000). NLP meets TEFL: Tracing the Zero Article. In Proceedings of PALC’99 (pp. 105–115). Lodz.
Mukherjee, J. (2001). Principles of pattern selection. Journal of English Linguistics, 29, 295–314.
Quirk, R., Greenbaum, S., Leech, G., & Svartvik, J. (1985). A Comprehensive Grammar of the English Language. London: Longman.
Sinclair, J. (1991). Corpus Concordance Collocation. Oxford: Oxford University Press.
Sinclair, J. (1992). The automatic analysis of corpora. In J. Svartvik (Ed.), Directions in Corpus Linguistics. Proceedings of the Nobel Symposium 82, Stockholm, 4–8 August 1991 (pp. 379–397) (= Trends in Linguistics. Studies and Monographs 65). Berlin / New York: Mouton de Gruyter.
Sinclair, J. et al. (1995). Collins Cobuild English Dictionary. London: HarperCollins.
Tognini-Bonelli, E. (2001). Corpus Linguistics at Work. Amsterdam: Benjamins.
Tufis, D., & Mason, O. (1998). Tagging Romanian Texts: A Case Study for QTAG, a Language Independent Probabilistic Tagger. In Proceedings of the First International Conference on Language Resources & Evaluation (LREC) (pp. 589–596). Granada, Spain, 28–30 May 1998.
Winograd, T. (1983). Language as a Cognitive Process. Reading, MA: Addison-Wesley.
Woodward, R. (2002). And now for something completely different: a corpus-driven local grammar of ‘difference’. Unpublished MA dissertation, University of Birmingham.
Zipf, G. K. (1949). Human Behaviour and the Principle of Least-Effort. Cambridge, MA: Addison-Wesley.