A Corpus-Trained Parser for Systemic-Functional Syntax

David Clive Souter

Submitted in accordance with the requirements for the degree of Doctor of Philosophy.

The University of Leeds
School of Computer Studies
August 1996

The candidate confirms that the work submitted is his own and that appropriate credit has been given where reference has been made to the work of others.


Abstract This thesis presents a language engineering approach to the development of a tool for the parsing of relatively unrestricted English text, as found in spoken natural language corpora. Parsing unrestricted English requires large-scale lexical and grammatical resources, and an algorithm for combining the two to assign syntactic structures to utterances of the language. The grammatical theory adopted for this purpose is systemic functional grammar (SFG), despite the fact that it is traditionally used for natural language generation. The parser will use a probabilistic systemic functional syntax (Fawcett 1981, Souter 1990), which was originally employed to hand-parse the Polytechnic of Wales corpus (Fawcett and Perkins 1980, Souter 1989), a 65,000 word transcribed corpus of children’s spoken English. Although SFG contains mechanisms for representing semantic as well as syntactic choice in NL generation, the work presented here focuses on the parallel task of obtaining syntactic structures for sentences, and not on retrieving a full semantic interpretation. The syntactic language model can be extracted automatically from the Polytechnic of Wales corpus in a number of formalisms, including 2,800 simple context-free rules (Souter and Atwell 1992). This constitutes a very large formal syntax, but one which still contains gaps in its coverage. Some of these are accounted for by a mechanism for expanding the potential for co-ordination and subordination beyond that observed in the corpus. At the same time, however, the set of syntax rules can be reduced in size by allowing optionality in the rules.
Alongside the context-free rules (which capture the largely horizontal relationships between the mother and daughter constituents in a tree), a vertical trigram model is extracted from the corpus, controlling the vertical relationships between possible grandmothers, mothers and daughters in the parse tree, which represent the alternating layers of elements of structure and syntactic units in SFG. Together, these two models constitute a quasi-context-sensitive syntax. A probabilistic lexicon also extracted from the POW corpus proved inadequate for unrestricted English, so two alternative part-of-speech tagging approaches were investigated. Firstly, the CELEX lexical database was used to provide a large-scale word tagging facility. To make the lexical database compatible with the corpus-based grammar, a hand-crafted mapping was applied to the lexicon’s theory-neutral grammatical description. This transformed the lexical tags into systemic functional grammar labels, providing a harmonised probabilistic lexicon and grammar. Using the CELEX lexicon, the parser has to do the work of lexical disambiguation. This overhead can be removed with the second approach: the Brill tagger trained on the POW corpus can be used to assign unambiguous labels (with over 92% success rate) to the words to be parsed. While tagging errors do compromise the success rate of the parser, these are outweighed by the search time saved by introducing only one tag per word. A probabilistic chart parsing program integrating the reduced context-free syntax and the vertical trigram model with either the SFG lexicon or the POW-trained Brill tagger was implemented and tested on a sample of the corpus. Without the vertical trigram model, and using CELEX lexical look-up, results were extremely poor, with combinatorial explosion in the syntax preventing any analyses being found for sentences longer than five words within a practical time span.
The seemingly unlimited potential for vertical recursion in a context-free rule model of systemic functional syntax is a severe problem for a standard chart parser. However, with the addition of the Brill tagger and vertical trigram model, the performance is markedly improved. The parser achieves a reasonably creditable success rate of 76%, if the criterion for success is liberally set at finding at least one legitimate SF syntax tree among the first six produced for the given test data. While the resulting parser is not suitable for real-time applications, it demonstrates the potential for the use of corpus-derived probabilistic syntactic data in parsing relatively unrestricted natural language, including utterances with ellipted elements, unfinished constituents, and constituents without a syntactic head. With very large syntax models of this kind, the problem of multiple solutions is common, and the modified chart parser presented here is able to produce correct or nearly correct parses among the first few it finds. Apart from the implementation of a parser for systemic functional syntax, the re-usable method by which the lexical look-up, syntactic and parsing resources were obtained is a significant contribution to the field of computational linguistics.
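The two syntax models described above can be illustrated with a small sketch. In the following (in Python; the function and the tuple representation of trees are my own illustrative inventions, not the thesis's actual extraction tools), context-free rules and vertical trigrams are collected from a single parse tree:

```python
# Sketch of extracting the two syntax models from a parse tree held as
# (label, children) pairs, where each child is either a sub-tree or a
# word string. Illustrative only, not the thesis's extraction software.

def extract_models(tree, grandmother=None, rules=None, trigrams=None):
    if rules is None:
        rules, trigrams = [], []
    label, children = tree
    daughters = [c[0] for c in children if isinstance(c, tuple)]
    if daughters:                                   # skip purely lexical nodes
        rules.append((label, tuple(daughters)))     # context-free rule (horizontal)
    for child in children:
        if isinstance(child, tuple):
            if grandmother is not None:             # vertical trigram
                trigrams.append((grandmother, label, child[0]))
            extract_models(child, label, rules, trigrams)
    return rules, trigrams

# POW-style analysis of the one-word utterance WHY:
tree = ('Z', [('CL', [('AWH', [('QQGP', [('AXWH', ['WHY'])])])])])
rules, trigrams = extract_models(tree)
# rules:    [('Z', ('CL',)), ('CL', ('AWH',)), ('AWH', ('QQGP',)), ('QQGP', ('AXWH',))]
# trigrams: [('Z', 'CL', 'AWH'), ('CL', 'AWH', 'QQGP'), ('AWH', 'QQGP', 'AXWH')]
```

Counting how often each rule and trigram occurs across the whole corpus then yields the probabilistic models.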


Contents

Abstract  ii
Contents  iii
Tables and Illustrations  vi
Acknowledgments  vii
Preamble  viii

Chapter 1. Introduction  1
1.1 Aims.  1
1.2 Parsing Natural Language: An Example.  1
1.3 Setting the Parsing Scene.  3
1.4 Definitions.  7
1.4.1 Natural Language.  7
1.4.2 The Lexicon.  7
1.4.3 Competence and Performance Grammar.  8
1.4.4 The Grammatical Description.  8
1.4.5 The Grammatical Formalism.  9
1.4.6 Systemic Functional Grammar.  10
1.4.7 Language Corpora.  11
1.4.8 Parsing.  11
1.5 Application and Domain.  13
1.6 Scope.  14
1.6.1 Input Format.  15
1.6.2 Output Format.  15
1.7 Original Contributions to the Field.  16
1.8 The Structure of the Thesis.  18

Chapter 2. Background to the Thesis.  20
2.1 Corpus-Based Computational Linguistics.  20
2.1.1 The Intuition-Based Approach to Developing a Grammar.  21
2.1.2 The Corpus-Based Approach to Developing a Grammar.  22
2.1.3 Selecting a Parsed Corpus.  24
2.1.4 The Polytechnic of Wales Corpus.  25
2.1.4.1 Transcription.  26
2.1.4.2 Syntactic Analysis.  26
2.1.4.3 Corpus Format.  26
2.1.5 The Edited Polytechnic of Wales Corpus.  28
2.2 Systemic Functional Grammatical Description and Formalism.  29
2.2.1 Systemic Grammar in NL Generation.  32
2.2.2 Systemic Grammar in NL Parsing.  34
2.2.3 Problems for a SFG Parser.  37
2.2.4 Systemic-Functional Grammar in the POW Corpus.  38
2.2.5 Choosing a Grammar Formalism and its Effect on Parsing.  44
2.3 Lexical Resources for Corpus-Based Parsing.  47
2.3.1 Corpus-Based Tag Assignment.  48
2.3.2 Dictionaries and Morphological Analysers.  51
2.3.3 Lexical Databases.  52
2.3.4 Choice of Lexical Resource.  54
2.4 Parsing Techniques.  55
2.4.1 Rule-Based Parsing.  57
2.4.1.1 Shift-Reduce Parsing.  58

2.4.1.2 Chart Parsing.  58
2.4.1.3 Probabilistic Chart Parsing.  62
2.4.2 Probabilistic Parsing.  65
2.4.2.1 Probabilistic Language Models.  65
2.4.2.2 Search Techniques.  67
2.4.3 Selection of a Parsing Technique.  69

Chapter 3. Developing the Resources for Parsing.  71
3.1 Developing a Probabilistic Systemic Functional Syntax.  71
3.1.1 Inconsistencies in the Corpus.  73
3.1.2 Distribution and Frequency of the Rules.  73
3.1.3 Coverage of the Rules.  76
3.1.4 Editing the Corpus.  76
3.1.5 Collapsing a Syntax Model using Optionality.  77
3.1.6 Expanding a Syntax Model using Co-ordination.  81
3.2 Developing a Probabilistic Lexicon for Systemic Functional Grammar.  85
3.2.1 Inadequacies of the POW Corpus Word List.  86
3.2.2 Lexical Aims and Policy Decisions.  87
3.2.3 Selection of Lexical Resources.  88
3.2.4 The CELEX English Database.  88
3.2.5 Converting the CELEX Lexicon.  90
3.2.6 Testing the Lexicon.  93
3.2.7 Rapid Lexical Lookup.  95
3.2.8 Lexical Probabilities.  96
3.3 Training the Brill Tagger.  96
3.4 Conclusions.  97

Chapter 4. A Probabilistic Chart Parser for Systemic Functional Syntax.  99
4.1 Preliminary Test using Parser Version 5.  101
4.2 An Improved Parsing Algorithm: Version 6.  105
4.2.1 Incorporating Rules with Optional Daughters.  105
4.2.2 Incorporating Rules with Co-ordinating Daughters.  107
4.3 Version 7: Combining Context-Free and Vertical-Trigram Rules.  110
4.3.1 The Vertical Trigram Model.  111
4.3.2 Limiting Tree Depth.  112
4.3.3 Stopping the Parser.  112

Chapter 5. Parser Testing and Evaluation.  114
5.1 The Test Data.  114
5.2 Test 1: Prototype context-free parser with CELEX look-up.  116
5.2.1 Lexical Look-Up.  118
5.2.2 Limitations of the Syntax Formalism.  120
5.2.3 Near Misses.  122
5.3 Test 2: Brill tagging and a context-sensitive chart parser.  125
5.3.1 Test 2: Parser Efficiency.  128
5.3.2 Test 2: Lexical Tagging Results.  129
5.4 Formal Evaluation.  131
5.5 Conclusions.  135

Chapter 6. Improvements to the Parser.  137
6.1 Improving the Lexicon.  137
6.1.1 Disambiguated Tag Probabilities.  137
6.1.2 Refining the CELEX to POW Syntax Mapping.  138

6.1.3 Handling Recurrent Word Combinations.  138
6.1.4 Improving the Brill Tagger.  139
6.2 Improving the Grammatical Formalism.  140
6.2.1 A Probability Matrix for Optional Rules.  140
6.2.2 Grammatical Coverage.  140
6.3 Improving the Probabilistic Chart Parser.  141
6.3.1 Efficiency Checks.  141
6.3.2 Applying Multi-Word Edges in a Chart.  142
6.3.3 Restricting Unnecessary Rule Application.  143
6.3.4 Controlling the Agenda.  143
6.3.5 Feature-Based Parsing.  145
6.3.6 Semantic Solution Pruning.  146
6.4 Conclusion.  147

Chapter 7. Conclusions.  148
7.1 Lexical Resources.  148
7.2 Grammatical Resources.  149
7.3 Parser Implementation and Testing.  150
7.4 The Research Method.  151
7.5 The Last Word.  153

References  155

Appendices  165
Appendix 1. Sample Fragments of Parsed Corpora.  166
Appendix 1.1 Lancaster/Leeds Treebank.  166
Appendix 1.2 Nijmegen Corpus (CCPP).  167
Appendix 1.3 Polytechnic of Wales Corpus.  168
Appendix 1.4 Susanne Corpus.  169
Appendix 1.5 IBM/Lancaster Spoken English Corpus.  170
Appendix 2. A Brief Description of the POW Corpus.  171
Appendix 3. Systemic-Functional Syntax Categories in the POW Corpus.  172
Appendix 4. A Mapping from LDOCE to POW SF Syntax Tags.  175
Appendix 5. A Fragment of a Context-Free SF Syntax Maintaining the Distinction between Filling and Componence.  179
Appendix 6. A Fragment of a Context-Free SF Syntax Ignoring the Distinction between Filling and Componence.  180
Appendix 7. A Fragment of a Vertical Trigram Model from the POW Corpus.  181
Appendix 8. Rule and Word-Wordtag Frequency Distribution in the POW Corpus.  182
Appendix 9. A Prototype Competence Systemic Functional Syntax.  183
Appendix 10. Brill Tagger Context Rules Learned from POW.  186
Appendix 11. General Lexical Tagging Rules Used by the Brill Tagger for Untrained Words.  187
Appendix 12. The Reduced EPOW Filling Grammar.  188
Appendix 13. The 100 Most Frequent Word-Wordtag Pairs in the EPOW Lexicon.  189
Appendix 14. Pocock and Atwell's Weight-Driven Chart Parser.  190
Appendix 15. Parser Version 7: Test Results.  191


Tables and Illustrations

Figure 1. Sample of transcribed speech from the Polytechnic of Wales Corpus.  1
Figure 2. Parse Trees for the Utterances in Figure 1.  3
Figure 3. A Graphical Representation of a Parse Tree.  12
Figure 4. A Sample Section of a POW Corpus File.  27
Figure 5. A fragment of a system network for MOOD in English.  30
Figure 6. A Sample Section of GENESYS Output.  33
Figure 7. Section of Lispified LDOCE.  51
Figure 8. Building a Chart.  60
Figure 9. The 20 Most Frequent Word-Wordtag Pairs in the POW Corpus.  63
Figure 10. The 20 Most Frequent Context-Free Rules in the POW Corpus.  63
Figure 11. A Fragment of Probabilistic RTN from EPOW.  66
Figure 12. Ambiguous Analyses of the Sentence Fight on to save art.  68
Figure 13. Rules which occur only once (6047 out of 8522).  74
Figure 14. Syntax Rule Reduction Using Optionality.  77
Figure 15. A Probabilistic QQGP Syntax.  80
Figure 16. A Reduced Probabilistic QQGP Syntax.  80
Figure 17. Co-ordination Likelihood in the EPOW Corpus.  82
Figure 18. Variants of the Word-Stem brick in the POW Lexicon.  85
Figure 19. Singleton Word/Word-tag Co-occurrences in the POW Corpus Lexicon.  86
Figure 20. A Fragment of our CELEX English Lexicon.  89
Figure 21. Structure of our CELEX Lexicon according to Grammatical Category.  90
Figure 22. A Fragment of the Reformatted CELEX SFG Lexicon.  93
Figure 23. Testing the CELEX SFG Lexicon with the EPOW Word List.  93
Figure 24. A Trivial Test of Parser Version 5.  103
Figure 25. Trivial Output from a Weight-Driven Chart Parser (version 6).  108
Figure 26. Tree Depth in the POW Corpus.  113
Figure 27. The Test Samples.  116
Figure 28. Test 1: Results.  118
Figure 29. Data from a Suspended Six-Word Parse.  121
Figure 30. Near Misses.  122
Figure 31. Test 2: Results.  127
Figure 32. Evaluation of test results on 19 ‘seen’ and 6 unseen test utterances.  133


Acknowledgments I would like to thank my supervisor, Eric Atwell, for his generous support, insight and direction throughout my period of part-time study. Without him I would never have become a corpus linguist, and would still be a John-likes-Marian. I am also indebted to my wife Clair, who has only ever known me as a Ph.D. student, and my mother, for their continuing encouragement. I would also like to thank my erstwhile colleagues on the COMMUNAL project at Cardiff, and Robin and Mick for collecting the POW corpus in the first place. I am most grateful to the School of Computer Studies and its staff, who persuaded me to embark on a part-time Ph.D. in the first place, and have never let me forget it. Special thanks go to my CCALAS colleagues over the years, particularly Eric Atwell, Tim O'Donoghue, Rob Pocock, John Hughes, George Demetriou and Uwe Jost. Finally, I would like to thank my external examiners, Robin Fawcett and Willem Meijs, for their useful comments and criticisms, which resulted in this revised thesis.


Preamble This project falls within the domain of language engineering, a relatively young discipline in which (usually large-scale) linguistic resources are harnessed using artificial intelligence and computer science techniques in order to provide intelligent tools allowing humans to interact with and exploit computers using natural language. As such, it will not offer a great new insight into linguistic theory, nor will it necessarily advance the theoretical bounds of computer science. Instead, its originality will lie in a combination of elements: the selection and development of linguistic resources (a natural language corpus and a lexical database of English), the extraction of lexical and syntactic knowledge from these resources into machine-tractable formalisms, their harmonisation under one grammatical description (systemic functional grammar), and finally the integration of the lexical and syntactic models with a parsing algorithm. Of particular importance is the unrestricted nature of the spoken language in the corpus, which makes the language model very large, and demands the ability to handle the common features of speech, such as ellipted items, false starts, replacement items and unfinished sentences. This sets the present enterprise apart from parsing work based on competence grammars of the language, which are usually manually developed using expert linguistic intuition, and often rely on the notion of a syntactic constituent containing a linguistic head. It is only when computational linguists try to accommodate the full range of relatively unrestricted natural language as it is performed that the potential of NL systems will finally be commercially realised.

The chosen grammatical description is systemic functional grammar (SFG), and specifically the version of systemic functional syntax used in the annotation of the Polytechnic of Wales corpus. SFG has been used quite widely as a model of natural language generation within the broader context of social interaction. Some researchers have worked explicitly on the task of systemic-functional syntax parsing within the confines of the syntactic coverage of a related NL generation system, such as that of the Penman project (Kasper 1988) and, in the COMMUNAL project, O’Donoghue’s Vertical Strip Parser (1993) and Weerasinghe’s Probabilistic On-line Parser (1994)1. The syntactic coverage of such parsers has been deliberately constrained in these cases by learning the syntax model from what the generator could generate. The parser’s output was intended to be the input of a semantic interpreter which, once implemented, would produce semantic representations compatible with those in the generator, opening the way to a complete SFG-based NL interface. With such an application in mind, there is little point in expanding the coverage of a syntactic parser beyond the potential of the generator (unless the interpreter were able to collapse semantically related syntactic analyses onto just the semantic representations that the generator could produce). The present project, however, does not so constrain itself, and can therefore analyse a wider range of structures. The corpus-trained syntax models which are presented here should provide useful source material to the developers of systemic-functional (and other) NL generation systems. One disadvantage of using the corpus-based syntax, though, is that it is not directly compatible with that being used in the COMMUNAL NL generator (even though they were both authored by Robin Fawcett); the syntactic node labels are slightly different, and the corpus version does not include participant roles. In this project, therefore, I am not attempting to accommodate the syntax of the related NL generator. I am working with a wider, arguably still richer model.

1 Although Weerasinghe does draw on the POW corpus for his probabilistic model, the syntax itself is derived from the NL generator. O’Donoghue produced an earlier Simulated Annealing Parser based on the POW corpus (see Atwell et al. 1988, Souter and O’Donoghue 1991), but his thesis work focuses on compatibility with the COMMUNAL generator GENESYS.

It should also have become clear that I consider the process of semantic interpretation to be beyond the scope of my project. When I refer to parsing, I will mean purely syntactic parsing with respect to the corpus-trained model, and not include the process of semantic interpretation (as some systemicists do). SFG provides a description and formalism at both the syntactic and semantic level. Therefore the use of the term SFG can be seen to encompass more than just a syntactic grammar, putting it at odds with linguists who would use the word grammar more exclusively for a purely syntactic model, such as a set of context-free rules. I will therefore (try to remember to) refer to my corpus-trained ‘grammar’ as a SF syntax, and reserve the term SFG for when I especially wish to refer to the full syntactic and semantic description.

The ethos of the work presented here is to allow the method to be re-used for other corpora and syntactic descriptions, rather than be tied exclusively to the one description. There are some areas where specific POW-corpus SF syntax modules have inevitably been included, but these have deliberately been kept to a minimum. There are many competing grammatical theories and descriptions, both in the field of corpus linguistics, and more general theoretical linguistics, none of which can be said to have universal support as yet. I will present a generally re-usable method, but implement it for one particularly rich (and therefore quite tricky) description.


Chapter 1. Introduction. 1.1 Aims. It is the aim of this thesis to produce a reliable method for assigning syntactic structure to sentences of relatively unrestricted English, as found in spoken corpora of the language. This process, which is called parsing, is one of the key components in many computer applications which require natural language processing (NLP) of some kind. Apart from building a particular implementation of a parser for the syntax of systemic functional grammar (SFG: Fawcett 1981) as found in the Polytechnic of Wales corpus (Fawcett and Perkins 1980, Souter 1989b), I will argue that, given adequate lexical and corpus resources, the same method can be adopted to develop parsers for other grammatical descriptions.

1.2 Parsing Natural Language: An Example. By way of introduction to the process of parsing, Figures 1 and 2 show the sort of input and output a parser for relatively unrestricted English might be expected to handle. The input would be transcribed spoken text, which ideally (but not invariably) would consist of separate sentences. The sample in Figure 1 is part of a spoken text produced by a twelve year old boy (PG) in conversation with two others while building some LEGO, and was chosen at random from a corpus of 65,000 words of spoken English called the Polytechnic of Wales (POW) Corpus.1

Figure 1. Sample of transcribed speech from the Polytechnic of Wales Corpus.

(1) PG: WHY
(2) PG: WHAT 'S THE POINT
(3) PG: YOU PUT THESE ON FOR WINDOWS
(4) AW: you don't have to
(5) SM: won't be long
(6) PG: IT 'S EASIEST MIND
(7) AW: I know something easy. Build a garage.
(8) PG: FANTASTIC
(9) SM: or something like a skyscraper
(10) PG: THIS WORKED OUT IT WON'T FIT
(11) SM: Go on. We can always move it along can't we.
(12) PG: WILL THAT ONE FIT IN BY-THERE
(13) PG: COME ON LETS GET GOING
(14) PG: I CAN'T EVEN

1 The samples are taken from the corpus file 12abpspg.

The text in Figure 1 has been orthographically transcribed from recordings of the spoken data. While this fragment of speech hardly represents a typical interaction one might imagine between a human and a computer, it does contain a range of utterance types we would want a computer to be able to deal with: queries (1, 2, 11b, 12), statements (3-7a, 10), exclamations (8), and commands (7b, 11a, 13). It also includes examples of syntactic phenomena we would want our parser to be able to handle: juxtaposed sentences2 (10, 13), ellipted (missing) words (5, 7, 9) and unfinished sentences (4, 14), which are a particular problem for many parsing programs. I have selected this spoken text as an example because it illustrates such difficulties. In doing so I do not mean to preclude the parser from working on written corpora or adult English, which contain their own different problems of more complex grammatical structure and longer utterances, but these language varieties will not be the primary material the parser should handle.

Each sentence in the POW Corpus has been syntactically analysed manually, so that each word is labelled with a syntactic category (appearing immediately to the word's left), and the words grouped into phrases and clauses, forming a tree structure of nested labelled brackets (see Figure 2).

2 Juxtaposed sentences are those with no explicit separator between them. It is a matter of linguistic debate whether (10) is a single sentence consisting of two clauses, or two separate sentences. In a spoken corpus, such separation is hopefully marked by prosodic features such as pauses and intonation contours. Briscoe (1994; 97-8) refers to this as the chunking problem.


Figure 2. Parse Trees for the Utterances in Figure 1.

(1) [Z [CL [AWH [QQGP [AXWH WHY]]]]]
(2) [Z [CL [CWH [NGP [HWH WHAT]]] [OM 'S] [S [NGP [DD THE] [H POINT]]]]]
(3) [Z [CL [S [NGP [HP YOU]]] [M PUT] [C [NGP [DD THESE]]] [CM [QQGP [AX ON]]] [A [PGP [P FOR] [CV [NGP [H WINDOWS]]]]]]]
(6) [Z [CL [S [NGP [HP IT]]] [OM 'S] [C [QQGP [AX EASIEST]]] [AF MIND]]]
(8) [Z [CL [EX FANTASTIC]]]
(10) [Z [CL [S [NGP [DD THIS]]] [M WORKED] [CM [QQGP [AX OUT]]]] [CL [S [NGP [HP IT]]] [OMN WON'T] [M FIT]]]
(12) [Z [CL [OM WILL] [S [NGP [DD THAT] [HP ONE]]] [M FIT] [CM [QQGP [AX IN]]] [C [QQGP [AX BY-THERE]]]]]
(13) [Z [CL [M COME] [CM [QQGP [AX ON]]]] [CL [O LETS] [M GET] [C [CL [M GOING]]]]]
(14) [Z [CLUN [S [NGP [HP I]]] [OMN CAN'T] [AI EVEN]]]
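The labelled-bracket notation in Figure 2 is regular enough to be read mechanically. The following sketch (in Python; the function name and the (label, children) representation are my own, not part of the thesis) converts a POW-style bracketed string into a nested structure:

```python
def parse_brackets(s):
    """Parse a POW-style labelled-bracket string, e.g.
    "[Z [CL [AWH [QQGP [AXWH WHY]]]]]", into a nested
    (label, children) tuple, where each child is either
    a sub-tree or a plain word string."""
    tokens = s.replace("[", " [ ").replace("]", " ] ").split()
    pos = 0

    def parse_node():
        nonlocal pos
        assert tokens[pos] == "["
        pos += 1
        label = tokens[pos]          # first token after '[' is the node label
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a word
                pos += 1
        pos += 1                     # skip the closing ']'
        return (label, children)

    return parse_node()


tree = parse_brackets("[Z [CL [AWH [QQGP [AXWH WHY]]]]]")
# tree == ('Z', [('CL', [('AWH', [('QQGP', [('AXWH', ['WHY'])])])])])
```

Such a reader is all that is needed to recover the trees of Figure 2 from the flat corpus files.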

These manually parsed trees might be treated as the desired solutions a parser should find and produce as output (provided the experts who produced the trees agree that they are correct). However, in arriving at these analyses, the manual annotators may have had access to the original recordings (or information derived from them), which would have provided intonation and other contextual cues leading the annotators to select one analysis from perhaps many other readings. It is typical in natural language for a sentence to be syntactically ambiguous, but the hearer will (usually unconsciously) use prosodic, semantic and pragmatic information, as well as knowledge of the world, to select the most likely interpretation. The parser being developed here will not have access to such information. It will therefore be the parser's job to provide the permissible syntactic structures (with respect to a particular syntactic model) and no more, for subsequent semantic and pragmatic pruning by further modules of a NLP system, or by a human post-editor. Consequently, each parse tree in Figure 2 should be viewed as one of perhaps many legitimate analyses which a syntactic parser should produce.

1.3 Setting the Parsing Scene. Why is it so important to be able to assign syntactic structure to a sentence accurately? A reasonable analogy within the computing world would be to ask why it is important to be able to compile a program. The compiler parses the program from some programmer's file, interpreting it as a set of instructions to perform on some given data. It is only by parsing the program that it can then successfully carry out the instructions.

In the context of natural language input to a computer, the language utterance is analysed syntactically either because that is the goal of the computer system, or more commonly because the utterance is to be interpreted as a command, a query or a statement, and the meaning of the utterance is to be determined. In the latter case, the form of the utterance is found by the parser, and the form will dictate what the computer then does with its semantic content: add it to a database, perform some query on a database, or perform some other action such as opening a file. This kind of interaction presumes that the human is using natural language to access a data or knowledge base stored on computer. The advantages to be gained from using language in this way are that the user does not need to learn a specific database query language or operating system to make the computer act. Furthermore, if the language input mode is spoken, rather than written, it saves the user from needing to type, or click on a mouse, and hence will generally be quicker, and leave the user's hands free for other tasks.

There are, however, many other applications in which natural language is parsed, which promise to revolutionise the way we interact with computers and each other. Speech recognition devices, in which the user employs ordinary spoken language which is transformed into written text by the computer, commonly use a parser to help decide between the alternative possible mappings from acoustic signal to lists of words. If a speech recogniser has to choose between the following two mappings from signal to written text, the syntactic analysis will hopefully force it to select the first:

I'm hoarse because I scream I'm horse because ice cream
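As a rough sketch of how such a choice might be made automatically, the following toy substitutes a simple word-bigram language model for the syntactic analysis described above (a deliberately cruder stand-in); the probabilities are invented purely for illustration:

```python
# Toy illustration: scoring competing speech-recognition hypotheses
# with a word-bigram model. The probabilities below are invented for
# illustration; a real system would estimate them from a corpus.
from functools import reduce

bigram_prob = {
    ("i'm", "hoarse"): 0.01,  ("hoarse", "because"): 0.2,
    ("because", "i"): 0.3,    ("i", "scream"): 0.05,
    ("i'm", "horse"): 0.0001, ("horse", "because"): 0.001,
    ("because", "ice"): 0.01, ("ice", "cream"): 0.4,
}

def score(sentence):
    """Product of bigram probabilities over the word sequence;
    unseen bigrams get a small floor probability."""
    words = sentence.lower().split()
    return reduce(lambda p, pair: p * bigram_prob.get(pair, 1e-6),
                  zip(words, words[1:]), 1.0)

h1 = "I'm hoarse because I scream"
h2 = "I'm horse because ice cream"
assert score(h1) > score(h2)   # the well-formed hypothesis wins
```

A syntactic parser plays an analogous role, but scores whole structures rather than adjacent word pairs.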

One of the anticipated NLP applications which has so far proved elusive is a general purpose machine translation or interpreting system between any pair of natural languages. One part of such a system might be to take the input sentence and parse it, passing the output of the parser to a semantic interpreter. It has long been assumed by many logicians that “the meaning of a complex expression should be a function of the meaning of its parts” (Allwood et al 1983; 130), the compositionality principle which is often attributed to Gottlob Frege (see for example Frege 1952). The function referred to, by which the meanings of the parts of a sentence are combined to form the meaning of the whole, is often taken to be the syntactic structure. The meaning of, say, a noun phrase is first determined from the phrase's subparts according to the structure of the noun phrase, and then combined with the meaning of other phrases in the same clause, and finally other clauses, to produce a meaning of the sentence. This method is commonly adopted when arriving at a semantic representation for a sentence (see for example Dowty et al 1981, Gazdar et al 1985), and used not only in machine translation, but in other applications such as text abstracting and summarising, and NL interfaces (as described above).
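The compositional method just outlined can be made concrete with a toy example of my own (not drawn from the thesis): word meanings are sets of entities or functions over such sets, and phrase meanings are built bottom-up under the control of the syntactic structure:

```python
# A toy illustration of the compositionality principle: each word
# denotes a set of entities or a function over such sets, and the
# meaning of each phrase is computed from the meanings of its parts,
# with the syntactic structure dictating which meaning applies to
# which. The lexicon here is my own invention, not from the thesis.

dog    = {"fido", "rex"}               # [[dog]]: the set of dogs
barks  = {"fido", "rex", "cerberus"}   # [[barks]]: the set of barkers
sleeps = {"felix"}                     # [[sleeps]]: the set of sleepers

def every(noun):
    # [[every]]: combines first with a noun (NP structure), then
    # with a verb-phrase meaning (clause structure).
    return lambda vp: noun <= vp       # subset test

def some(noun):
    return lambda vp: bool(noun & vp)  # non-empty intersection

# "every dog barks": build the NP meaning, then apply it to the VP.
print(every(dog)(barks))    # True
print(some(dog)(sleeps))    # False
```

The nesting of the function applications mirrors the nesting of the parse tree, which is precisely why a correct syntactic analysis matters to the semantic interpreter.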

Other applications in which a syntactic analysis module is presumed are grammar and style checkers, where the purpose of the program is to point out to an author (perhaps non-native) that the construct they have used is, according to the system's grammar, unacceptable, or at least idiosyncratic.

Although parsers are usually associated with the analysis and interpretation of text, there is a further application in the domain of text-to-speech generation. Text-to-speech synthesizers are already commercially available. In early products the speech produced was quite intelligible, but sounded unnatural, primarily because of the lack of varying intonation and other prosodic features such as pauses for breath at the end of a tone unit. Recent systems have achieved more natural sounding speech by parsing the text, and applying an intonation and stress pattern to the generated speech according to the structure of the sentence.

Although many of these much heralded applications for natural language processing are beginning to be realised, many are doing so by restricting the complexity of the problem. For example, a speech recognition device might require the speaker to train the system to recognise his individual speech, and to leave a slight gap between each word (Dragon Dictate is one example). A machine translation system might be developed for just a limited domain between two languages, for instance translating weather forecasts between English and French (METEO, TAUM Montreal).

It is fair to say that researchers in NLP have for a long time confined their efforts to producing relatively modest working prototypes with restricted lexical and grammatical capabilities. One of the reasons for their limited success is the fact that they have imposed somewhat tight boundaries on the range of words and constructs the user can produce. With ‘toy’ lexicons and grammars of probably fewer than a hundred words and rules, fast, efficient parsers have been developed using simple algorithms. Of course, such parsers fail frequently when they come across a word or syntactic structure not described in the lexicon or grammar, but perform admirably within the chosen sublanguage. It is only when speakers (or writers) are given a free rein with their linguistic creativity that NLP systems will graduate from prototypes to a robust and mature technology, and go on to have the profound impact on society, commerce and industry which has long been envisaged. It is in this context that the so-called language industry is emerging.

In the last eight to ten years, however, this problem of handling realistic unrestricted English text has begun to be addressed by more than just a limited number of interested research groups (Sampson 1990). Several of the theories and techniques that have been developed for the small prototype systems are coming under close scrutiny by researchers faced with all the complexity and size of the English language, as observed in the large collections of a variety of English texts which I am referring to as corpora.

Several major questions arise when considering the parsing of unrestricted English:

• What sort of lexicon and grammatical formalism is capable of describing the complex syntactic relations of a natural language like English? Do existing grammatical theories capture many or all of the phenomena found in real use of spoken and written natural language, such as ellipsis, unfinished sentences, and discontinuous, repeated and replacement elements?

• What sort of parsing techniques are suitable for the very large grammars and lexicons that describe unrestricted English? Can the rule-based techniques used for competence grammars be adapted easily, or are purely probabilistic methods the only realistic alternative?

• Is it possible to produce a parser which combines the simplicity and efficiency of rule-based parsing with the robustness of its probabilistic counterparts?

Some researchers are also questioning the abstract or so-called rational approach to grammar and lexicon development, in which rules are written intuitively using expert linguistic knowledge (what Chomsky called competence in the language), preferring instead an empirical approach guided by the actual performance of the language (Lyons 1970; 38-39). A grammar which stems from one of these two approaches may be called either a competence grammar or a performance grammar.


The research presented here attempts to answer some of these questions, and produce a solution to the problem of wide-coverage parsing of unrestricted English. This solution falls squarely within the performance rather than the competence paradigm of grammatical development.

1.4 Definitions. In the preceding sections, I have tried to give a general introduction to the parsing problem, without providing any formal definition of technical terms. I will now assume the following definitions:

1.4.1 Natural Language. Parsing occurs with respect to a particular natural language. A natural language is distinguished from mathematical, logical or programming languages by being articulated by native human speakers, and typically containing ambiguity and vagueness. A natural language is itself defined by a lexicon and grammar for that language. The lexicon and grammar may be referred to together as the lexico-grammar. The grammar may be divided into components specifying the structure of words (morphology) and sentences (syntax), and the meanings of words and sentences (semantics).

1.4.2 The Lexicon. The lexicon theoretically lists all the wordforms (words) in the language, but in practice usually contains a larger or smaller subset of the morphemes in the language. Morphemes are the smallest semantically significant parts of words, and are referred to as stems and affixes. For instance, the word trains can be split into two morphemes: train and s. The suffix s is so called because it follows the stem. Since the morpheme train is syntactically ambiguous (it can be a noun or a verb), the suffix here could be either the plural ending for a noun or the third person singular ending for a verb. A word is broken down into its component morphemes by a process called morphological analysis, and a program which does this is called a morphological analyser. A lexicon which consists of only morphemes is usually supplemented by a set of regular rules of morphology, which allow wordforms to be constructed from or broken down into their morphemes. If a wordform cannot be derived from the regular morphological rules for the language, it is deemed to be irregular and stored in the lexicon as a separate morpheme. A typical product of lexical look-up and/or morphological analysis for a parser is the part-of-speech (syntactic tag/category) for a word (or, more rarely, a group of words treated as one item). Where a word is syntactically ambiguous, a lexicon should provide all the alternative parts-of-speech, but a syntactic tagging program (tagger) usually chooses the most likely single tag with respect to the surrounding words (and their tags).
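The interaction between a morpheme lexicon and regular suffixation rules can be sketched as follows. This is a toy illustration only, not the analyser developed in this thesis: the lexicon, suffix table and tag names are invented for the example.

```python
# A minimal affix-stripping morphological analyser over a toy
# morpheme lexicon. Real lexical resources (e.g. the CELEX-derived
# lexicon discussed later) are, of course, far larger.

LEXICON = {
    "train": {"noun", "verb"},   # syntactically ambiguous stem
    "lady":  {"noun"},
}

# suffix -> list of (tag of stem it attaches to, tag of resulting wordform)
SUFFIXES = {
    "s":  [("noun", "plural-noun"), ("verb", "3sg-verb")],
    "ed": [("verb", "past-verb")],
}

def analyse(wordform):
    """Return all (stem, suffix, tag) analyses for a wordform."""
    analyses = []
    # the whole word may itself be a stem listed in the lexicon
    for tag in LEXICON.get(wordform, set()):
        analyses.append((wordform, "", tag))
    # otherwise try stripping each known regular suffix
    for suffix, rules in SUFFIXES.items():
        if wordform.endswith(suffix):
            stem = wordform[: -len(suffix)]
            for stem_tag, word_tag in rules:
                if stem_tag in LEXICON.get(stem, set()):
                    analyses.append((stem, suffix, word_tag))
    return analyses

print(analyse("trains"))
# yields both the plural-noun and the 3sg-verb reading of train + s
```

The ambiguity of trains thus falls out automatically: both readings are returned, and it is left to a tagger or parser to choose between them in context.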

1.4.3 Competence and Performance Grammar. The grammar (as has been suggested in section 1.3) can be defined in two different ways: In one view, a competence grammar describes how the wordforms of the language can be combined to produce grammatical sentences acceptable to the native speakers of the language. A competence grammar is an idealised encapsulation of what native speakers consider to be acceptable and meaningful word combinations in the language, and is usually created by introspection on the part of the speaker.

In the second view, a performance grammar describes all the ways in which the wordforms in the language can be combined by a speaker when uttering the language. A speaker usually produces an utterance for some communicative purpose, such as conveying some meaning. An utterance may be spoken or written, and may consist of one or more sentences, or only part of a sentence.

The relationship between performance and competence grammars is that the performance grammar reflects the way the language is used, and so must account for performance features such as interruptions, repetitions, slips of the tongue and mistakes, whereas the competence grammar should reflect what the speaker would ideally produce without these ‘imperfections’.

1.4.4 The Grammatical Description. A grammar contains a unique description of the language. The description includes labels for wordforms which behave similarly, classifying them into groups. The labels are commonly called parts of speech, such as noun, verb, preposition and adjective, but will be referred to here as word tags, and also as terminal categories (since they are the labels on the ends of the branches in a parse tree, and are attached to the wordforms themselves). The grammar description also specifies recurring combinations of word tags, called constituents, which, in an individual grammar, may be referred to as clusters, units, groups or phrases. These constituents may themselves be further combined into higher-level groupings called clauses, which finally combine to make up a sentence. The labels for such constituents are called non-terminal categories. The label for the sentence constituent itself is called the root category. A description for unrestricted natural language is likely to contain many tens or even hundreds of grammatical categories, depending on the level of delicacy the writer of the grammar wishes to identify. The finest-grained description would have a separate label for each wordform, which led Halliday (1961; 267) to refer to the lexicon (lexis) as the most delicate grammar. In practice, grammatical descriptions tend to vary between a few dozen and up to around 200 categories. In some descriptions, the categories themselves are not treated as indivisible atomic items, but as combinations of syntactic features.

The constituent structure describes the form of the sentence, but a second level of description, capturing the function of different parts of the sentence, is employed in some grammars. The string of words in (15) has the form of a noun phrase, but it may function differently depending on its position and relationship with other words in the sentence. In (16) the noun phrase acts as subject of the main verb. In (17) it acts as the verb's object, or complement, and in (18) it acts as the possessor in an enclosing noun phrase.

(15) Two fat ladies
(16) Two fat ladies wanted a photo
(17) I photographed two fat ladies
(18) I took the two fat ladies' photo

1.4.5 The Grammatical Formalism. The relationship between the categories is captured by a particular grammatical formalism. Several formalisms can theoretically be employed for any one description, although it is typical for one formalism to be associated with one description (and be referred to simply as a grammar), usually because the description will have been created for a particular purpose, such as teaching the language structure to native or non-native learners, linguistic research, automatic parsing of sentences, or generation of sentences. Examples of formalisms that have been employed to capture grammatical relations in sentence analysis are finite-state rules, context-free and context-sensitive phrase structure rules, and unrestricted rewrite rules. Each of these formalisms varies in its generative power according to a hierarchy commonly referred to as the Chomsky hierarchy, and as the power increases, generally the complexity of the formalism does so too. Not all sentences permitted by a context-free grammar would be permitted by a finite-state grammar, for example. For a more detailed introduction see, for example, (Lyons 1970; 47-82) or (Chomsky 1957). It has long been the goal of linguists to discover the least powerful formalism which would still adequately cover the sentence structures of a natural language. In models of sentence generation, other formalisms are adopted. For instance, systemic grammar employs system networks to specify the semantic and syntactic options a speaker has. Each choice may be associated with a realisation rule, which partially specifies the sentence structure or a lexical element.
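The difference in generative power can be illustrated with a toy, non-linguistic example. The language a^n b^n (n a's followed by exactly n b's) is context-free but not finite-state: a finite-state device can check only the pattern "some a's then some b's", whereas matching the counts needs the stack implicit in recursion. The sketch below is purely illustrative; the function names are my own.

```python
# Finite-state vs context-free recognition of the toy language a^n b^n.

def finite_state_accepts(s):
    """Accepts a+b+ : one or more a's followed by one or more b's.
    A finite-state device cannot count, so it cannot require the
    numbers of a's and b's to match."""
    state = 0                       # 0: start, 1: reading a's, 2: reading b's
    for ch in s:
        if state == 0 and ch == "a":
            state = 1
        elif state == 1 and ch == "a":
            pass
        elif state in (1, 2) and ch == "b":
            state = 2
        else:
            return False
    return state == 2

def context_free_accepts(s):
    """Accepts exactly a^n b^n via the context-free rule S -> a S b | a b."""
    if s == "ab":
        return True
    return len(s) > 2 and s[0] == "a" and s[-1] == "b" \
        and context_free_accepts(s[1:-1])

print(finite_state_accepts("aabbb"), context_free_accepts("aabbb"))
# prints: True False  (the finite-state device over-accepts)
```

The finite-state recogniser wrongly accepts the unbalanced string, while the context-free one rejects it: a miniature demonstration of why the position of a formalism in the Chomsky hierarchy matters for coverage.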

Although I have used the general term grammar in my distinctions between competence and performance, and between description and formalism, in my development of a grammatical model for use in parsing I will focus on the syntactic component of such grammars.

1.4.6 Systemic Functional Grammar. If one's goal is to be able to parse unrestricted English, it is necessary either to create one's own grammatical description or select one from those ‘on offer’. The broad description chosen for this project is systemic functional grammar as defined by Fawcett (1981), and specifically its syntactic description as exemplified in the POW corpus. Systemic functional grammar (SFG) has developed from the linguistic traditions of J. R. Firth and Michael Halliday, and is primarily a grammar of language generation. That is to say it specifies how to generate acceptable sentences of the language from a set of semantic choices (technically referred to as systems). This might seem a bizarre choice in the light of this project’s aims in language analysis. However, a variant of the syntactic component of SFG has been applied in the manual analysis of the Polytechnic of Wales Corpus (Fawcett and Perkins 1980), and it is seen as desirable precisely because variants of the same description have been used for both analysis and generation. This decision is explained further in sections 1.5 and 2.2.


1.4.7 Language Corpora. A set of utterances which have been produced by the speakers of a language may be collected as a corpus. A corpus is a strategic collection of texts, spoken or written, which attempts to represent the language as a whole. Corpora may also be collected to represent restricted usage of the language (sublanguages), such as that produced by children, or that used in legal, commercial or business environments, for example. A corpus is distinguished from an archive of texts by the fact that it has been strategically collected as a representative sample, rather than randomly assembled. We have already seen an example taken from a spoken corpus, in which the recorded speech was transcribed orthographically into a written form. Written corpora or orthographically transcribed spoken corpora which have received no annotation will be referred to as raw corpora. In some cases, however, the wordforms in a corpus will have been tagged with terminal grammatical categories, and will then be referred to as a tagged corpus. When a tagged corpus has further been annotated with full grammatical analyses for each sentence, it is called a parsed or analysed corpus or treebank.

1.4.8 Parsing. Parsing is the assigning of one or more syntactic analyses to an utterance of the language, and a program which achieves this is called a parser. An analysis can be represented as a parse tree, which unambiguously captures one particular structure for an utterance (for examples see Figure 2). An utterance may possibly be assigned more than one analysis, in which case the utterance is considered to be syntactically ambiguous with respect to the lexico-grammar.
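As a toy illustration of syntactic ambiguity and likelihood ranking (using invented categories, rules and probabilities, not the POW/SFG description used in this thesis), the classic prepositional-phrase attachment ambiguity can be scored by multiplying the probabilities of the rules each tree uses:

```python
# Two competing parses for "I saw the man with the telescope",
# scored against an invented probabilistic grammar.

RULE_PROB = {
    ("S",  ("NP", "VP")): 1.0,
    ("NP", ("PRO",)):     0.3,
    ("NP", ("D", "N")):   0.5,
    ("NP", ("NP", "PP")): 0.2,   # PP attached inside the noun phrase
    ("VP", ("V", "NP")):  0.7,
    ("VP", ("VP", "PP")): 0.3,   # PP attached to the verb phrase
    ("PP", ("P", "NP")):  1.0,
}

def tree_score(tree):
    """A tree is (label, child, ...); a pre-terminal is (tag, word).
    The score of a tree is the product of its rule probabilities."""
    label, *children = tree
    if isinstance(children[0], str):          # (tag, word) leaf
        return 1.0
    p = RULE_PROB[(label, tuple(c[0] for c in children))]
    for child in children:
        p *= tree_score(child)
    return p

the_man = ("NP", ("D", "the"), ("N", "man"))
with_tel = ("PP", ("P", "with"), ("NP", ("D", "the"), ("N", "telescope")))

verb_attach = ("S", ("NP", ("PRO", "I")),
               ("VP", ("VP", ("V", "saw"), the_man), with_tel))
noun_attach = ("S", ("NP", ("PRO", "I")),
               ("VP", ("V", "saw"), ("NP", the_man, with_tel)))

# Both are well-formed parses of the same utterance; scoring
# orders the list of analyses most likely first.
ranked = sorted([verb_attach, noun_attach], key=tree_score, reverse=True)
```

With these invented probabilities the verb-attachment reading scores 0.01575 against 0.0105 for the noun-attachment reading, so it would be returned first; this is the sense in which a probabilistic parser produces an ordered list of analyses rather than a single one.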

In the case where the utterance contains words or constructs not described in the lexico-grammar, the utterance cannot be assigned a well-formed parse tree, and is termed ungrammatical. However, we should distinguish here between an utterance which is ungrammatical with respect to a theoretically complete lexico-grammar, and one which is ‘ungrammatical’ because it cannot be analysed according to one implementation of a lexicon and grammar. Such an implementation will inevitably be incomplete, because it perhaps does not contain a rare word or construct. In practice, then, an ‘ungrammatical’ utterance may indeed be perfectly acceptable. A parser for unrestricted English will minimise the possibility that an utterance is incorrectly classified as ungrammatical either because it contains wordforms not in the lexicon, or combinations of syntactic categories not described by the syntactic component of the grammar. The identification of semantically anomalous sentences is not seen as part of the parser’s work, but is part of semantic interpretation.

Parse trees can be represented by nested labelled brackets, as shown in Figure 2. The relationship between a constituent and its subparts may instead be represented numerically, as example (2) in the original numerical format of the parsed POW corpus shows. (The bracketed form is repeated here for comparison).

(2a - bracketed) [Z [CL [CWH [NGP [HWH WHAT]]] [OM 'S] [S [NGP [DD THE] [H POINT]]]]]
(2b - numerical) Z CL 1 CWH NGP HWH WHAT 1 OM 'S 1 S NGP 2 DD THE 2 H POINT

The structure is represented graphically in Figure 3.

Figure 3. A Graphical Representation of a Parse Tree.

Z
  CL
    CWH
      NGP
        HWH
          WHAT
    OM
      'S
    S
      NGP
        DD
          THE
        H
          POINT

In example (2), the category label Z is the root, representing the sentence itself. The root label contains a single clause, labelled CL, and we refer to the relation between the two as that of mother and daughter. A mother is said to dominate its daughter, or daughters. The daughters of the clause in example (2) are CWH, OM and S. The reason the numerical representation is used in the POW Corpus is to capture discontinuities in the daughters, i.e. cases where the daughters do not directly follow each other in the sentence, but are interrupted by another constituent. Discontinuities can be represented more elegantly in the numerical format than in the bracketed one; in a graphical representation they would require crossing lines between the categories.
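For continuous structures such as (2a), the nested labelled bracket notation can be decoded mechanically into a nested list structure. The sketch below is illustrative only (it is not the thesis's own implementation, and does not handle the discontinuities just discussed); the function name is my own.

```python
# Reading '[Z [CL ...]]' bracket notation into nested lists.
import re

def read_brackets(text):
    """Parse nested labelled brackets into lists like ['Z', ['CL', ...]]."""
    # tokens are '[', ']', and runs of non-bracket, non-space characters
    tokens = re.findall(r"\[|\]|[^\[\]\s]+", text)
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "["
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != "]":
            if tokens[pos] == "[":
                children.append(node())       # a daughter constituent
            else:
                children.append(tokens[pos])  # a wordform under its tag
                pos += 1
        pos += 1                              # consume the closing ']'
        return [label] + children

    return node()

tree = read_brackets(
    "[Z [CL [CWH [NGP [HWH WHAT]]] [OM 'S] [S [NGP [DD THE] [H POINT]]]]]")
```

The resulting list structure makes the mother-daughter relations directly traversable: tree[0] is the root label Z, and each sublist is a dominated constituent.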

Having introduced some of the terminology of parsing, the context in which the parser is expected to work will now be described.

1.5 Application and Domain. The parser is designed as a general-purpose tool, rather than with any single potential application in mind. However, during the early period of this research, I was working on phase 1 of the COMMUNAL Project (COnvivial Man Machine Understanding through NAtural Language)3, directed by Eric Atwell at Leeds and Robin Fawcett at Cardiff, on the development of a parser which is expected to interact with a semantic interpreter, belief system and NL generator, as an interface to a knowledge-based system. As a consequence, this will be assumed to be a potential application for the parser, although I will not be attempting to link the parser explicitly to the generator’s syntactic description, nor to limit its coverage to that displayed by the generator4. A further project in which the parser may yet find a home is AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models)5, being conducted by Eric Atwell, Clive Souter and John Hughes at Leeds University. One of AMALGAM's aims is to produce a multi-treebank (a single corpus parsed according to several grammatical descriptions), and consequently there is a need for a general-purpose parser which can be adapted to work with such a variety of grammatical descriptions. The use of the parser being developed here in the AMALGAM project will depend on its adaptability, efficiency and accuracy.

3. The COMMUNAL Project phase 1 was jointly sponsored by the Defence Research Agency's Speech Research Unit (then RSRE Malvern), ICL and Longman (UK). See, for examples of Leeds work, (Atwell and Souter 1988b, Atwell et al 1988, Souter and Atwell 1988a, 1988b, Souter and O’Donoghue 1991).

4. Note that this approach to COMMUNAL as a possible application differs from the perspective of O'Donoghue (1993) and Weerasinghe (1994), who were developing parsers directly tied to a specific version of the COMMUNAL grammar. Their approach was essentially to assume that the COMMUNAL grammar constituted a well-defined competence model of language, and to parse this well-bounded model.

5. AMALGAM is sponsored by the UK government's Engineering and Physical Science Research Council (EPSRC), grant number GR/J53508. See for example (Atwell et al 1994).

The domain of the parser (as distinguished from its application) specifies whether it is aiming to analyse a particular sublanguage of English. Perhaps rather ambitiously, no restrictions are being placed at the outset on either lexical or syntactic coverage. Put another way, the variety of English we hope to handle will be general, in as much as the lexical resources the parser will use will not be drawn from a limited technical domain, and the range of constructs the syntactic grammar will describe will be extremely wide, as found in the POW corpus.

1.6 Scope. As has been suggested in sections 1.2 and 1.3, producing parse trees may not be the final goal of an individual NLP system (although there are some applications whose goal is primarily to produce such analyses, such as that of the AMALGAM project). The input to the parser may have come via other processes, such as speech recognition, and the output may be passed on to, say, a semantic interpreter or machine translation system. I will therefore delineate just where I expect the parser's work to begin and end, and specify what kind of text may be handled as input.

The parser will not be specially adapted to handle lattices of potential strings of words that might be produced by a speech recogniser. It will parse one utterance at a time, but will assign a score representing the likelihood of each analysis, so that alternative analyses can be ordered and compared. The parser will not be explicitly linked to a semantic interpreter for SFG or any other grammatical description. O'Donoghue (1990, 1991a, 1991e and 1994) has demonstrated a method for incorporating parse trees (virtually in the form to be produced here)6 into a systemic semantic interpreter called REVELATION, which is still the one assumed by the COMMUNAL project.

6. O'Donoghue's interpreter expects that the parse trees will include labels for SFG's participant roles, whereas the output of the present parser will not contain these. It is being assumed that the process of assigning such roles could be achieved automatically with a suitably large annotated lexicon for main verbs, developed according to Fawcett and Tucker's (1987) principles.

1.6.1 Input Format. The input to the parser will be English language plain ASCII text in a string of words separated by blank spaces. It will not need to have been tagged grammatically with parts of speech, as the parser will incorporate a lexical look-up phase. Like many others before me, I will however assume that the input has been preprocessed into a format with the following characteristics:

• Characters are in lower case except for the word I and the first letter of any proper nouns: utterances should not begin with upper case letters, unless they fall into the aforementioned categories.

• Sentence punctuation will have been removed, leaving only apostrophes and hyphens within words. Apostrophes in the morphemes 's, 'd, 'll etc. representing the enclitic forms of the verbs be, have and some modals, as well as those indicating possession (eg. the boy's, the boys'), should be separated from the noun phrase they are attached to by a blank space. Other apostrophes, such as those in the shortened negative morphemes of isn't, don't, won't and past tense endings baa'd, ski'd etc., should remain unseparated, as should those in monomorphemic wordforms such as e'en, ne'er and ma'am.

• Numbers appearing in the input should be written alphabetically, as, say, three hundred rather than 300.
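The preprocessing conventions above can be sketched as a simple filter. This is an illustrative sketch under stated assumptions, not the actual preprocessor: the clitic and exception lists are partial, proper-noun detection is omitted, and the bare possessive apostrophe of the boys' is not handled.

```python
# A toy preprocessor for the input conventions listed above.

ENCLITICS = ("'s", "'d", "'ll", "'re", "'ve", "'m")        # to split off
KEEP_WHOLE = {"ma'am", "e'en", "ne'er", "baa'd", "ski'd"}  # monomorphemic etc.

def preprocess(utterance):
    words = []
    for w in utterance.split():
        w = w.strip('.,!?;:"')                # remove sentence punctuation
        if w == "I":
            words.append(w)                   # I keeps its capital
            continue
        lower = w.lower()
        if lower in KEEP_WHOLE or lower.endswith("n't"):
            words.append(lower)               # isn't, don't etc. stay whole
        else:
            for clitic in ENCLITICS:
                if lower.endswith(clitic) and len(lower) > len(clitic):
                    # separate the enclitic by a blank space
                    words.extend([lower[: -len(clitic)], clitic])
                    break
            else:
                words.append(lower)
    return " ".join(words)

print(preprocess("What's the point?"))   # prints: what 's the point
```

Note that the separated 's in the output corresponds directly to the OM constituent in example (2): the preprocessing conventions are chosen to match the tokenisation of the POW corpus.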

1.6.2 Output Format. The output resulting from the parser being applied to a single utterance will be a (possibly empty) list of parse trees, ordered most likely first, in which the sentence structure will be captured by nested brackets. The syntactic categories in the parse trees will be those used in the systemic functional grammar found in the Edited Polytechnic of Wales Corpus (see section 2.1), and so will include both units and elements of structure (roughly corresponding to formal and functional labels respectively), but not participant roles. Although the structure of such trees is quite difficult for the human reader to interpret without close inspection (and bracket counting!), it does have several advantages. This format is amenable to direct display by the POPLOG programming environment's library program showtree, which produces more easily interpreted graphical tree representations. Such a facility would be important for a human selector/post-editor preparing a multi-treebank in the context of the AMALGAM project. Bracketed trees have the advantage that they are compact for storage purposes, and most easily utilised as list structures in AI programming languages. Bracketed trees are also the starting point for storing a parsed corpus in the Nijmegen Linguistic DataBase, which provides parsed corpus browsing and search facilities (van Halteren and van den Heuvel 1990, Souter 1992). Finally, they are also probably the nearest to a standard tree representation (see Souter 1993).

1.7 Original Contributions to the Field. Before describing the original contribution of this work to the field of computational linguistics, I would like to make clear the relationship it bears to that of colleagues who have been working alongside me over the past seven or eight years. The initial 18-month period of this research was spent working with the COMMUNAL team at Leeds and Cardiff, on a workplan devised by Robin Fawcett and Eric Atwell. In such a position, it is quite difficult to separate out one's own research path from that of the project as a whole, and the early attempts at extracting a systemic functional grammar from the Polytechnic of Wales Corpus come under the auspices of COMMUNAL, as do investigations into morphological analysis and parsing using simulated annealing.

It has also been a problem (albeit a pleasant one) to be working part-time for my Ph.D. while Tim O'Donoghue was doing so full-time in the same subject area, often in close collaboration (see for example Souter and O'Donoghue 1991). Similarly, Rob Pocock was for 18 months a full-time research associate on the Leeds Speech Oriented Probabilistic Parsing (SOPP) Project. Rather than reinventing the wheel, the present research has, with due acknowledgment, built on some of their findings. I am grateful in this respect for Eric Atwell's guidance in keeping his research students working on separate but related paths.

After initially working quite closely with colleagues at Cardiff, the last six years of research have been conducted more or less independently of Robin Fawcett and his team, with only occasional communication, usually for clarification of some point in the grammar or corpus. Fawcett's research student Ruvan Weerasinghe has during this period also been developing a systemic-functional syntactic parser, which resulted in the publication of his thesis (Weerasinghe 1994) within a few weeks of my first submission. My own research had up to that point been conducted independently of his. However, in the last 18 months, I have benefitted from the insights and experience he gained working primarily with the more up-to-date versions of the COMMUNAL syntax, as well as in capturing systemic-functional syntax dependencies in a probabilistic model.

Whereas at the outset of this project followers of the empirical or performance paradigm were still very much in the minority, it has become increasingly common in the last five years or so for parsers to take account of the likelihood of a syntactic structure. Simply producing a working probabilistic parser is no longer unusual. However, the parser developed here is original in its language engineering approach, having the following combination of elements:

It will produce trees annotated with a rich grammatical description, rather than the skeletal coarse-grained grammar which has been characteristic of much other corpus-based statistical parsing work (Lari and Young 1990, de Marcken 1990, Magerman and Marcus 1991, Pereira and Schabes 1992, Bod 1995).

Rather than using a limited hand-fashioned lexicon, or even a corpus word list, two alternative lexical resources will be developed and employed: (i) a large-scale, 60,000-wordform English lexicon (CELEX) has been transformed into a format compatible with a corpus-based grammar, and (ii) a POW-corpus-trained version of Eric Brill’s tagger (Brill 1992, 1993, 1994), both of which give the parser very good lexical tagging support.

Since Winograd’s early work (1972), there has been a significant increase in attempts to use a systemic framework for syntactic analysis, instead of the more standard NL generation (Atwell et al (1988), Kasper (1988, 1989)7, O'Donoghue (1993), Weerasinghe (1994) and O’Donnell (1994)). With the exception of the Leeds COMMUNAL work described in (Atwell et al 1988 and Souter and O’Donoghue 1991), each may potentially be linked to a corresponding SFG-based NL generator, achieving the desirable aim of using the same grammatical description in both language interpretation and generation. Kasper's parser, however, required the manual creation of some phrase-structure rules and the transformation of the Nigel generator grammar into a different formalism, Functional Unification Grammar (FUG: see Kay 1985). O’Donnell’s work is similarly tied to the Nigel grammar. The syntactic models for O'Donoghue and Weerasinghe's parsers are extracted automatically from a large sample of NL generator output8, which limits their lexical and grammatical coverage to that of the generator. None of these has been developed using the syntactic coverage of an unrestricted language corpus such as POW, or has incorporated any significant lexical resources. In each of these cases, the grammar will therefore only contain the structures which have been intuitively designed into the generators by their authors, i.e. they will be competence grammars.

7. Kasper was then working at the University of Southern California Information Sciences Institute (ISI), in the team developing a parser for their Nigel Grammar.

Instead, the current parser contains a performance grammar, derived from a large genuine sample of unrestricted spoken English (the POW Corpus), including the grammatically problematic features of ellipsis, repetition and unfinished sentences. More importantly, the probabilities taken from such a corpus will be realistic, rather than artificial. Indeed, the range of grammatical constructs and their frequencies can be used to inspire the production of a generator’s grammar, since the corpus coverage (currently, at least) goes beyond what the generator can handle.

Consequently, the parser being developed here is original in that it is derived from unrestricted English (albeit of children aged six to twelve), and in its very wide lexical and grammatical coverage.

1.8 The Structure of the Thesis. The remainder of the thesis will be organised in the following manner: (Where relevant work of mine has already appeared in published form, I provide a reference).

Chapter 2 will present the background to the use of corpora in computational linguistics (Souter 1989b, 1990c, Souter and Atwell 1992, 1993), the chosen systemic functional syntax model (Souter 1990a), lexical look-up in NLP systems (Souter and Atwell 1988b), and parsing algorithms (Atwell et al 1988, Souter and O'Donoghue 1991).

Chapter 3 will explain how the grammatical and lexical resources were developed for the parser (Atwell and Souter 1988a, Souter 1990a, b, Souter 1993).

8. The COMMUNAL NL generator, GENESYS (Fawcett and Tucker 1989), so called because it generates systemically, can be set to generate sentences randomly, to produce an artificial corpus or ark (Souter 1990a).

Chapter 4 discusses the chosen parsing algorithm, and how it was modified and improved to accommodate a systemic-functional syntactic model.

Chapter 5 presents an evaluation of the performance of two versions of the parser when tested with seen and unseen data, and discusses the nature of the parser’s failures, its (in)efficiency and restrictions imposed by current hardware.

Chapter 6 describes improvements which might be made to the parser with respect to the development method, the linguistic data, and its efficiency.

Finally, in chapter 7 conclusions are drawn on the parser development process, how generally transferable it is to other types of grammar and corpora, and whether the original aims have been met successfully.


Chapter 2. Background to the Thesis. Comments on related work will be divided into sections for each of the key elements contributing to the parsing method: (i) corpus-based computational linguistics, (ii) syntactic description and formalism (which will itself depend on the availability of fully analysed corpora), (iii) associated lexical resources for the syntactic model, and (iv) suitable parsing techniques.

2.1 Corpus-Based Computational Linguistics. In chapter 1, the distinction between competence and performance grammar was introduced. This section will explain why the choice was made to adopt a performance syntactic model for parsing unrestricted English, and to derive such a model from a corpus.

There are essentially two options for the grammarian or computer scientist faced with the task of developing a large-scale grammar for a natural language:

1. Use native-speaker linguistic intuition to create the rules of the grammar;
2. Study parsed corpora as the inspiration for common and uncommon grammatical structures which occur in the language.

Although they will be described separately below, in practice the competence and performance approaches do not diverge as widely as has just been suggested. When trying to build a large competence grammar, the grammarian will tend to look for inspiration in example sentences; conversely, when annotating a corpus, a prototype competence grammar or a handbook of case-law examples will tend to be used. The main difference in the end result tends to be the size of the grammar, and its formalism. A competence grammar will be expressed in one particular formalism, whereas the parsed corpus will yield syntactic information for several formalisms. Typically the formalisms in which it is possible to extract a corpus-based grammar are less powerful than that found in a competence grammar. The former will normally contain atomic categories related through finite-state or context-free phrase structure grammars, but the latter may include a variety of enhancements to context-free grammar, such as categories as sets of features which are combined by unification, feature percolation constraints, metarules, and transformations.

2.1.1 The Intuition-Based Approach to Developing a Grammar. Chomsky's Transformational Grammar (Akmajian and Heny 1975, Radford 1981), which was prevalent in the 1960s and 70s, employed a phrase-structure rule component and a set of powerful transformational rules. More recently, grammarians have been concerned that the formalism should be psychologically plausible, and easy to process with computer programs. The result has been a whole host of feature-augmented context-free grammars. In the mid to late 1980s, the grammar most in vogue was Generalised Phrase Structure Grammar (GPSG: Gazdar et al 1985), which replaced atomic category labels with sets of syntactic features, supplemented phrase structure rules with metarules, and, in common with a number of other grammars, had recourse to unification of syntactic features. The UK government's Alvey Programme sponsored a large-scale project to produce a natural language toolkit (ANLT) for English containing a lexicon, morphological analyser, grammar and parser (Grover et al 1989, Pulman et al 1989, Philips and Thompson 1987). A close variant of GPSG was the chosen grammar for their project. The ANLT represents probably the largest and most advanced computer implementation of a competence grammar which is currently publicly available in the UK1. When expanded to its object grammar, it contains over 1,000 phrase-structure rules. Despite the obvious achievements which have been made in the ANLT project, the authors of its grammar are aware of some structures it does not describe, and assume that there are still more of which they are unaware (Grover et al 1989; 44).

This is a serious problem with the competence approach. Unless the language being covered is very restricted, the grammarian's intuition is likely to prove inadequate: it is idiolectal, being based on only one speaker's linguistic experience, and many structures in the language are likely to be omitted simply because they did not come to the speaker's mind. The advantage of competence grammars is that the grammatical relations are usually made explicit and written in a formal fashion, making them computationally tractable.

1 Other large-scale competence grammars exist, e.g. COMMUNAL and SRI's Core Language Engine, but these are not publicly available with user support provided.

2.1.2 The Corpus-Based Approach to Developing a Grammar. The second option, using corpora, stands a better chance of providing a comprehensive inventory of language structures, because it can be collected from several speakers over a period of time performing the language in different contexts. The problem with corpora, though, is that it is easy to assume they are exhaustive, when they are only as exhaustive as the corpus sample is representative of the language as a whole. A second problem with corpora comes precisely from their ‘performance’ nature. They contain language in all its natural beauty, warts and all! The mistakes and short cuts we make when using the language are all to be found in corpora, as well as some errors introduced by the process of collection itself, such as transcription, typing or optical character recognition errors. A further problem with the grammatical information in raw corpora is that it is only implicit. Fortunately, several corpora have already been annotated grammatically with word tags, and some even with full parse trees. This has been done either entirely manually, or with an automatic program based on a limited competence grammar or a manually annotated subsection of the corpus. In both cases, extensive proof-reading of the annotation is necessary to try to eliminate errors, and inevitably even the proof reading can be imperfect.

However the end product, a parsed corpus, contains a large sample of the language analysed according to a specific grammatical description, whose grammatical coverage usually far exceeds that possible using the competence approach.
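The sense in which a parsed corpus makes its grammar explicit can be sketched in a few lines of code. Assuming trees are held as simple (label, children) pairs (a hypothetical in-memory format of my own, not the notation of any particular corpus), every mother with her ordered daughters yields one context-free rule, and counting occurrences gives the rule frequencies on which a probabilistic grammar can be based:

```python
from collections import Counter

def extract_rules(trees):
    """Read context-free rules, with counts, off a collection of parse trees.
    Trees are (label, children) tuples; a leaf is a node with no children,
    so rules whose right-hand side is a single word are lexical rules."""
    rules = Counter()
    def walk(node):
        label, children = node
        if not children:
            return
        rules[(label, tuple(child[0] for child in children))] += 1
        for child in children:
            walk(child)
    for tree in trees:
        walk(tree)
    return rules
```

Rule counts of this kind, relativised to the frequency of each mother category, are one route from an annotated corpus to a probabilistic grammar.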

Because of the enormous effort involved in annotating and post-editing text, only a handful of parsed corpora of English have been created at research sites in the UK and elsewhere. Those which are publicly available are:

1. The Gothenburg corpus (Ellegård 1978): 128,000 words of written American English, from the Brown corpus, analysed using a form of dependency grammar, including function labels;

2. The Nijmegen (CCPP) corpus (Keulen 1986): 130,000 words of written and spoken British English, including fiction and non-fiction texts, and 10,000 words of transcribed tennis commentaries, analysed using a context-free grammar derived from (Quirk et al 1972);

3. The Lancaster/Leeds treebank (Sampson 1987a): 45,000 words of written British English, from the LOB corpus, analysed using a specially devised surface-level phrase-structure grammar compatible with the CLAWS word-tagging scheme (Garside 1987);

4. The LOB corpus treebank (Leech and Garside 1991): 144,000 words of written British English, a subset of the LOB corpus which was automatically parsed and manually post-edited using a parsing scheme slightly more coarse-grained than the Lancaster/Leeds treebank;

5. The Polytechnic of Wales corpus (Fawcett and Perkins 1980, Souter 1989): 65,000 words of children's spoken British English, hand-parsed using Fawcett's systemic functional grammar (Fawcett 1981), which includes formal and functional labels;

6. The Penn treebank (Marcus and Santorini 1991, Marcus et al 1993): 3 million words of written American English, automatically parsed using Don Hindle's Fidditch parser (Hindle 1983), with a simple skeletal scheme consisting of 36 terminals, 12 punctuation markers and 14 non-terminals. This scheme is currently being redesigned to allow a more delicate annotation;

7. The Susanne corpus (Sampson 1994): 128,000 words of written American English, a reworking of the Gothenburg corpus into a more accessible and usable resource, analysed by a team at Leeds University using an enhanced version of the Lancaster/Leeds treebank scheme.

The first five of these are described by Sampson (1992) in his consumer guide to the analysed corpora of English. In addition, the following two parsed corpora are used in academic research at Lancaster (and Leeds) and Nijmegen, but are not publicly available:

1. The IBM/Lancaster Spoken English Corpus (Knowles and Lawrence 1987): 50,000 words of spoken British English recorded from BBC radio broadcasts, transcribed orthographically, phonetically and prosodically, and parsed using a scheme referred to as skeletal parsing. The scheme is so called because it is marginally less delicate than that used in the Lancaster/Leeds treebank and involves building a skeleton tree structure, to be labelled with a fairly basic list of categories. The SEC forms part of a larger private enterprise in corpus annotation between IBM and Lancaster University, which has produced a parsed corpus of 3 million words, called The Skeleton Treebank (Leech and Garside 1991).

2. The TOSCA corpus (Oostdijk 1991): 1.5 million words of written British English, of which at least 250,000 words have been semi-automatically parsed using extended affix grammar, which includes formal and functional labels.

The uses of parsed corpora are reviewed in (Souter and Atwell 1994), and a selection of samples from some of the aforementioned corpora (nos. 1, 3, 4, 5 and 6) is included in Appendix 1. Parsed corpora tend to be small, only in the region of 50,000 to 150,000 words, compared to the vast raw corpora now being collected, many of which contain several tens of millions of words (for example ICE, the British National Corpus, the Bank of English, the ACL Data Collection Initiative, and the European Corpus Initiative). These larger collections are driven by the respective aims of collecting international varieties of English, providing an inventory for lexicographic development, and assembling archives for computational linguistic research. Despite their relative poverty from the viewpoint of lexical coverage, parsed corpora are (by definition) grammatically rich, providing the best available resource for a parser whose scope is unrestricted English.

2.1.3 Selecting a Parsed Corpus. Large, fully annotated corpora are still relatively few and far between, because of the tremendous manual effort required to perform the grammatical annotation. The annotators' response to the size of the task has been to follow one of two paths: either to use a skeletal manual parsing scheme, which achieves more rapid progress and ultimately a larger unrefined corpus, or to use a very detailed scheme and be resigned to ending up with a small but highly refined annotated corpus. In the former category are the IBM/Lancaster treebank and the ACL Data Collection Initiative's Penn treebank, part of which is available on CD-ROM. In the latter category are the Lancaster-Oslo/Bergen (LOB) corpus treebank, the Polytechnic of Wales (POW) corpus, the Nijmegen corpus, and the Gothenburg and Susanne corpora (both annotated portions of the Brown corpus). Since the grammatical description of the corpora in the former category is only coarse-grained, these will be eschewed in favour of a richer description. In contrast to the other parsed corpora, at the outset of this project the POW corpus had not been developed or ‘adopted’ by a research team investigating corpus-based parsing; it was originally collected as a resource for the study of child language development. Consequently, it has been chosen as the basis for an integrated lexicon, grammar and parser. Further reasons for this choice of corpus and grammatical description are my familiarity with SFG from work on the COMMUNAL project, the fact that the grammar has been extended to handle features of unrestricted spoken English, that it contains both formal and functional labels, and that it has the potential to be reversible between NL interpretation and generation.

2.1.4 The Polytechnic of Wales Corpus. This section will introduce the background to the POW corpus, including an explanation of its notation and format. A brief summary of this information was published in the Lancaster Survey of English Machine-Readable Corpora (Taylor et al 1991) and is reproduced in Appendix 2. A short handbook was written to accompany the distributed version of the corpus (Souter 1989), both of which are available from the International Computer Archive of Modern English (ICAME) at Bergen2. The corpus was originally collected between 1978 and 1984 for a child language development project to study the use of various syntactico-semantic constructs in children between the ages of six and twelve. A sample of approximately 120 children in this age range from the Pontypridd area in South Wales was selected, and divided into four cohorts of 30, each within three months of the ages 6, 8, 10, and 12. These cohorts were subdivided by sex (B,G) and socio-economic class (A,B,C,D). The latter was achieved using details of:

• the ‘highest’ occupation of both the parents of the child, or of one of them in single-parent families;

• the educational level of the parents.

The children were selected in order to minimise any Welsh or other second language influence. The above subdivision resulted in small homogeneous cells of three children. Recordings were made of a play session with a Lego brick building task for each cell, and of an individual interview with the same ‘friendly’ adult for each child, in which the child's favourite games or TV programmes were discussed.

2.1.4.1 Transcription. The first ten minutes of each play session, commencing at a point where normal peer group interaction began (i.e. when the microphone was ignored), were transcribed by 15 trained transcribers, as were the interviews. Transcription conventions were adopted from those used in the Survey of Modern English Usage at University College London, and in a similar project at Bristol. Intonation contours were added by a phonetician to produce a hard-copy version, and the resulting transcripts were published in four volumes (Fawcett and Perkins 1980). A short report on the project was also published (Fawcett 1980).

2 The International Computer Archive of Modern English, Norwegian Computing Centre for the Humanities, P.O. Box 53, Universitetet, N-5027 Bergen, Norway, or e-mail [email protected] for more information.

2.1.4.2 Syntactic Analysis. Again, ten trained analysts were employed to parse the transcribed texts manually, using Fawcett's version of Systemic-Functional Grammar (SFG). The SF syntax used in the analysis handles phenomena such as raising, dummy subject clauses and ellipsis. Despite thorough checking, some inconsistencies remain in the text, owing to several people having worked on different parts of the corpus. SFG in general, and the particular SF syntax found in the POW corpus, are described in more detail in section 2.2. The parsed version is available in machine-readable form but does not contain any of the prosodic information included in the paper version.

2.1.4.3 Corpus Format. The resulting parsed corpus consists of approximately 65,000 words3, in 11,396 (sometimes very long) lines, each containing a parse tree. The corpus of parse trees fills 1.1 Mb. There are 184 files, each with a reference header which identifies the age, sex and social class of the child, and whether the text is from a play session or an interview. The corpus is also available in wrapround form with a maximum line length of 80 characters, where one parse tree may take up several lines, but this makes it difficult to distinguish between numbers used for sentence reference and those which specify syntactic structure. The four-volume transcripts can be supplied by the British Library Inter-Library Loans System.

A short portion of the corpus in its original form is included in Figure 4.

3 Earlier papers quote the size of the corpus as being approximately 100,000 words. The latest automatic extraction of a wordlist from the machine-readable corpus shows it to be just over 65,000 words, but this figure can only be approximate. Noise in the original typing of the corpus, in the form of omissions of category labels or of the spaces between such labels and the words in the text, makes it difficult to give an accurate figure. The difference between the two totals is almost certainly the difference between the total for the recorded spoken texts and the total for those which have been hand-parsed.


Figure 4. A Sample Section of a POW Corpus File. **** 58 1 1 1 0 59 6ABICJ (filename) 1) [FS:Y...] Z 1 CL F YEAH 1 CL 2 S NGP 3 DD THAT 3 HP ONE 2 OM 'S 2 C NGP 4 DQ A 4 H RACING-CAR 2) Z CL 1 S NGP 2 DD THAT 2 HP ONE 1 OM 'S 1 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H TRUCK 3) [HZ:WELL] Z 1 CL 2 S NGP HP I [RP:I] 2 AI JUST 2 HAD 2 C NGP 3 DQ A 3 MO QQGP AX LITTLE 3 H THINK 1 CL 4 & THEN 4 S NGP HP I 4 M THOUGHT 4 C CL 5 BM OF 5 M MAKING 5 C NGP 6 DD THIS 6 HP ONE 4) Z 1 CL 2 S NGP HP I 2 AI JUST 2 M FINISHED 2 C NGP 3 DD THAT 3 HP ONE 1 CL 4 & AND 4 S NGP HN FRANCIS 4 M HAD 4 C NGP 5 DD THE 5 H IDEA 5 Q CL 6 BM OF 6 M MAKING 6 C NGP 7 DQ A 7 RACINGCAR 5) [FS:THEN-I] Z CL 1 & THO 1 S NGP HP I 1 M MADE 1 C NGP DD THIS 6) Z CL 1 & THEN 1 S NGP HP FRANCIS 1 OX WAS 1 AI JUST 1 X GOING-TO 1 M MAKE 1 C NGP HP ONE 1 A CL 2 B WHEN 2 S NGP H YOU 2 M CAME 2 CM QQGP AX BACK 2 CM QQGP AX IN 7) [NV:MM] Z 1 CL F NO [FS:FRAN...] 1 CL 2 S NGP HP WE 2 M HAD 2 C NGP 3 DQ AN 3 H IDEA 3 Q CL 4 BM OF 4 M MAKING 4 C NGP 5 DQ FOUR 5 H THINGS 8) Z 1 CL F YEAH 1 CL 2 S NGP HP I 2 M PLAYED 2 C PGP 3 P WITH 3 CV NGP HP IT 2 A PGP 4 P AT 4 CV NGP H HOME 9) Z CL F YEAH

The tree notation employs numbers rather than the more traditional bracketed form to define mother-daughter relationships, in order to capture discontinuous units. The number directly preceding a group of symbols refers to their mother. The mother is itself found immediately preceding the first occurrence of that number in the tree. In Figure 4, the first tree shows a sentence (Z) consisting of two daughter clauses (CL), as each clause is preceded by the number 1. The long lines have been folded manually here for ease of reading. The first number in each tree is a sentence reference, after which I have inserted a closing bracket, ")", for ease of reading; these brackets do not appear in the corpus itself. All alphabetic characters are in upper case, except in the sentence references, which have occasionally been subdivided into 24a, 24b, etc., where what was initially analysed as one sentence was, on checking, re-analysed as two (or more).
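The numbering convention can be made concrete with a short sketch (my own illustration; it handles only plain category and word tokens, ignoring sentence references and the bracketed material described below). A number selects the current mother, registering her on first occurrence as the most recently created node, while an unnumbered symbol is attached under the current node and then itself becomes the point of attachment for what follows:

```python
def parse_pow(tokens):
    """Build a nested (label, children) tree from a POW-style numbered parse,
    e.g. the tokens of 'Z 1 CL F YEAH 1 CL 2 S NGP ...'. On its first occurrence
    a number names the most recently created node as mother; on every occurrence
    it resets attachment to that mother. Any other token becomes a new node."""
    top = ("TOP", [])
    mothers = {}          # number -> mother node
    current = top
    for tok in tokens:
        if tok.isdigit():
            mothers.setdefault(tok, current)   # first occurrence: current node is the mother
            current = mothers[tok]
        else:
            node = (tok, [])
            current[1].append(node)
            current = node                     # unnumbered symbols chain downwards
    return top[1][0]
```

Applied to the first tree of Figure 4, this yields a Z node with two CL daughters, the second of which contains the S, OM and C elements.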

Occasionally, when the correct analysis for a structure is uncertain, the one given is followed by a question mark. Cases where unclear recordings have made word identification difficult are treated similarly. Apart from the syntactic categories and the words themselves, the only other symbols in the tree are three types of bracketing:

• square [NV...], [UN...], [RP...], [FS...], for non-verbal, unclear/unfinished, repetition, false start, pragmatic element, etc.;

• round (...) for ellipsis of items recoverable from previous text;

• angle <...> for ellipsis of items not so recoverable, e.g. in rapid speech.

Filenames indicate precisely which age (6,8,10,12), social class (A,B,C,D), sex (B,G) and recording situation (play-session (PS) or interview (I)) is involved, followed by the child's initials. Hence, the text sample in Figure 4 is from file 6ABICJ, involving a six year old, of social class A, who is a boy, in an interview, with initials CJ.
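Assuming file references follow exactly the layout just described (age, class, sex, situation, initials, with no other variants), they can be decoded with a short sketch:

```python
import re

# Decoder for POW file references such as "6ABICJ", assuming exactly the
# layout described in the text; any deviating reference is rejected.
POW_REF = re.compile(r"^(6|8|10|12)([ABCD])([BG])(PS|I)([A-Z]+)$")

def decode_ref(ref):
    """Return the recording metadata encoded in a POW filename, or None."""
    m = POW_REF.match(ref)
    if m is None:
        return None
    age, social, sex, situation, initials = m.groups()
    return {"age": int(age),
            "class": social,
            "sex": {"B": "boy", "G": "girl"}[sex],
            "situation": {"PS": "play session", "I": "interview"}[situation],
            "initials": initials}
```

For example, decode_ref("6ABICJ") recovers the six-year-old class-A boy interviewed in the Figure 4 sample.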

2.1.5 The Edited Polytechnic of Wales Corpus. Because the original POW corpus contains various typographical and syntax errors which still remain after extensive proof-reading, the automatic extraction of its grammar and lexicon is hampered, as described further in chapter 3. Consequently, Tim O'Donoghue has created an edited version of the corpus, which has become known as EPOW (O'Donoghue 1991b, 1991c). O'Donoghue used semi-automatic post-editing programs which searched for word and category patterns that broke the rules for a legitimate parse tree, and then amended them. He also ran a spelling checker over the words in the corpus, to find examples of typographical errors. The end product of his work is a much ‘cleaner’ resource, which is used as the training data for the parser under development here.

2.2 Systemic Functional Grammatical Description and Formalism. The aim of parsing relatively unrestricted English has necessitated a corpus-based, performance approach to grammar development, rather than the intuition-based competence approach. The choice of a parsed corpus goes hand in hand with the choice of a grammatical description, so the two must be made together, unless a new corpus-parsing venture is to be embarked upon. I will now consider what effect the choice of systemic functional syntax will have on parsing, and how this compares with other grammatical descriptions and formalisms. I will begin, however, by describing systemic functional grammar itself.

Systemic functional grammar differs markedly from many other language models in that it is a grammar which models the way language is produced, rather than the way language is interpreted. The syntactic structures with which we shall be dealing in virtually all of the rest of the thesis are not the centre of the language model at all, but are a product (via a process called realisation) of a series of semantic choices which are formalised in a system network, which is the heart of the model and is intended to represent the meaning potential of the language. The language model focuses on the choice of what sentence to generate, rather than how to understand a sentence that someone has already produced. Useful introductions to SFG are found in (Fawcett 1980, 1981, 1984, Halliday 1985, Butler 1985, Fawcett et al 1993, O'Donoghue 1993, Weerasinghe 1994). The SFG model of language we shall be dealing with here is that developed by Fawcett from the earlier work of Halliday, and has provided both the syntactic description found in the POW corpus, and the large grammar being built into the NL generator of the COMMUNAL project, called GENESYS (described in more detail in section 2.2.1).

SFG is heavily focused on semantic choices (called systems) made in generation, rather than the surface syntactic representations needed for parsing. When generating a sentence using systemic grammar, a number of choices of features are made in a system network. These vary from broad semantic and syntactic choices such as whether a sentence will be positive or negative, active or passive, to highly specific choices which select individual lexical items. Hence the grammar is often referred to as a lexico-grammar, since, in generation, the lexical, syntactic and semantic choices are all linked together. The choices may be linked together by disjunctive or conjunctive logic (choice of one feature may exclude or require the choice of others). Progress through the large networks may necessitate disjunctive or conjunctive entry (the prior selection of one or more features before the next can be chosen). A small fragment of a (simplified) system network for mood in English is illustrated in figure 5, taken from (Weerasinghe and Fawcett 1993). This example illustrates conjunctive and disjunctive systems (see the key), but not conjunctive and disjunctive entry conditions.

Figure 5: A fragment of a system network for MOOD in English (from Weerasinghe and Fawcett 1993). [The original is a two-dimensional network diagram which cannot be reproduced here. Approximately: MOOD opens a choice including information and proposal-for-action. Information divides into giver (realised [S,O or S,M]: Ivy has read it.) and seeker; seekers divide into polarity seekers (realised [O,S]: Has Ivy read it?), new-content seekers (What has Ivy read?), confirmation seekers (Hasn’t Ivy read it?) and others. Proposal-for-action divides into directive and request: directives are either simple-dir, subdivided into unmarked-sd (Read it!) and pressing-sd (Do read it!), or addressee-identified (You read it!); requests (subdivided into direct-rd and indirect-rd) include by-appeal-to-ability (Can/could you read it?) and by-appeal-to-willingness (Will/would you read it?). Further options are proposal-for-action-by-self-and-addressee (Let’s read it.) and proposal-for-action-by-self (Shall I read it?). A key distinguishes disjunctive (OR) from conjunctive (AND) systems, and downward arrows mark the points where realisation rules apply.]

Some of the choices cause realisation rules to become active, indicated by a downward-pointing arrow in Figure 5. The realisation rules are normally numerically referenced and stored separately. In Figure 5 the overall effect of just two rules has been informally shown, reflecting the placing of the Operator (O) before the Subject (S) (with features [information], [seeker] and [polarity] chosen) or after it (with [information] and [giver] selected). In the GENESYS implementation of this and other networks, each choice point has a probability attached to it, expressing the linguist’s intuition of the relative frequency of the feature and allowing the generator to be set to produce sentences randomly. (It is here that the evidence from a parsed corpus such as POW can be useful in guiding the probabilities to be set, for systems directly affecting syntax and lexis, at least. It is the ability of the generator to work randomly which has enabled researchers such as O’Donoghue (1993) and Weerasinghe (1994) to generate a large sample of output: an artificial parsed corpus, or ark.) Realisation rules can have several effects, including:

1. causing the network to be re-entered after the current pass through it has finished; this typically occurs to specify a functional role (realised as a (sub)constituent of a clause) after the clause itself has been created;

2. causing some part of the structure of the sentence to be created, for example by specifying the ordering of the subject nominal group and the verb operator (auxiliary);

3. causing a particular syntactic category to be expounded as a particular lexical item.
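The flavour of probabilistic choice in a system network can be sketched as follows. The feature names echo Figure 5, but the network fragment and all the probabilities are invented for illustration, and realisation rules are ignored entirely:

```python
import random

# A toy probabilistic system network: each system maps an entry feature to a
# list of (feature, probability) options. Fragment and probabilities invented.
NETWORK = {
    "MOOD":        [("information", 0.8), ("proposal-for-action", 0.2)],
    "information": [("giver", 0.7), ("seeker", 0.3)],
    "seeker":      [("polarity", 0.5), ("new-content", 0.3), ("confirmation", 0.2)],
}

def traverse(entry, rng):
    """One pass through the network: at each system, choose a feature with its
    attached probability, descending while the choice opens a further system."""
    chosen = []
    current = entry
    while current in NETWORK:
        feats, weights = zip(*NETWORK[current])
        current = rng.choices(feats, weights=weights)[0]
        chosen.append(current)
    return chosen
```

Each call returns one selection expression, e.g. ['information', 'seeker', 'polarity']; attaching corpus-derived frequencies in place of the invented weights is the kind of tuning the parenthetical remark above envisages.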

These components of the GENESYS lexico-grammar can be illustrated as follows (Fawcett et al 1993; 121):

[The original diagram cannot be reproduced here. It shows the lexico-grammar mediating between the SEMANTICS and the outputs: for each semantic unit, a pass through the SYSTEM NETWORK yields a selection expression of semantic features; the REALISATION COMPONENT, drawing on POTENTIAL STRUCTURES, then produces, for each syntactic unit, a syntax tree, items, and intonation/punctuation, with re-entry to the network to fill roles.]

The output of such a lexico-grammar is a parse tree, the structure of which is then deleted to leave the sentence itself. The leaves (words) are finally processed by the generation system into either a punctuated sentence or an utterance with an intonation pattern. Since the focus of this thesis is on parsing and not generation, we will not give a full-blown account of this generation process, nor of other examples of the many systems in the network or their associated realisation rules. Suffice it to say that SFGs such as that used in GENESYS represent some of the largest competence grammars yet constructed by computational linguists (Fawcett et al 1993; 119). A good example following the process of generation through system networks and realisation rules for English personal pronouns is given in (Fawcett 1988b).

Whereas in language interpretation the grammar and lexicon tend to be stored as separate entities, in systemic generation they are stored together in the system network. However, the syntactic description can be viewed independently of this formalism in the trees the generator produces, or indeed in the Polytechnic of Wales corpus. The syntactic part of (an early variant of) the description is also informally presented in (Fawcett 1981), which acts as a case law of example sentences and their structures, and which would have been invaluable in hand-parsing the corpus. Were I explicitly trying to link the output of my parser with the COMMUNAL generator, then “the generative model that is the core component of (the) SFG (would have to be) the source of the information that (would be) built into the parser” (Fawcett 1994; 398, my parentheses). Instead, however, I will consider the syntactic evidence obtainable from the annotated POW corpus to be the linguistic data (in their own right) on which the parser should depend, since they represent a performance rather than a competence model.

2.2.1 Systemic Grammar in NL Generation. Several well-known implementations of systemic grammar generators have been produced. The first was Davey's Proteus (Davey 1974, 1978), which generated discourse for the game of noughts and crosses, and drew inspiration from earlier work by Winograd (1972). The second, which contains the Nigel grammar, was created by the Penman project at the University of Southern California's Information Sciences Institute and has been used to generate both English and Japanese (Mann 1983, Matthiessen and Bateman 1991). A further generator, developed at Edinburgh and derived from Nigel, is Patten’s SLANG (Patten 1988). Houghton and Isard (1987) employed a systemic functional model for their FRED and DORIS system, in which two robots plan how to be together in the same room! GENESYS, a separate implementation of a NL generator containing a very large systemic grammar, has been produced by the COMMUNAL team at Cardiff (Fawcett and Tucker 1990, Fawcett 1990, Fawcett, Tucker and Lin 1993). A sample of sentences produced by GENESYS (prototype 1.5) is shown in Figure 6. (The latest ‘midi’ and ‘maxi’ versions produce a much greater variety of sentences and structures.) The figure shows the syntactic structures along with the sentences generated; the leaves of the trees (the words themselves) are also shown separately to aid the reader.

Figure 6. A Sample Section of GENESYS Output.

[Z [Cl [S/Af [ngp [dq some] [vq of] [dd that] [h paper]]][Xc isn't] [Xp be+ing] [M cook+ed] [e .]]] [Z [Cl [S/Af [ngp [h what]]] [Xf is] [G going_to] [Xc be] [M die+ing] [e ?]]] [Z [Cl [S/Ca [ngp [dq some] [vq of] [ds [qqgp [dd the] [a best]]] [vs of] [meme [qqgp [a happy]]] [h question]]] [Xr hasn't] [M glow+ed] [Ati [ngp [h today]]] [e .]]]

[Z [Cl [Xr have] [S/Ag [ngp [ds [qqgp [dd the] [a best]]]]] [M damage+ed] [C2/Af [ngp [h itself]]] [Ama [qqgp [a how]]] [e ?]]] [Z [Cl [O don't] [M be] [C2/At [qqgp [a easy]]] [e !]]] [Z [Cl [Xf are] [S/Ag [ngp [ds [qqgp [dd the] [a sad+est]]] [vs of] [h that]]] [G going_to] [Xc be] [M stand+ing] [Cm2 about] [e ?]]] [Z [Cl [S/Af [ngp [h [genclr [g its]]]]] [M kiss+s] [Ama [qqgp [dd the] [a best]]] [e .]]] [Z [Cl [Xr have] [S/Ca [ngp [h the_Chancellor_of_the_Exchequer]]] [M been] [C2/At [qqgp [a good]]] [e ?]]] [Z [Cl [S/Af [ngp [ds [qqgp [dd the] [a best]]] [vs of] [h it]]] [Xc was] [Xp be+ing] [M broken] [C2 [pgp [p by] [cv/Ag [ngp [h who]]]]] [Ama [qqgp [a fast]]] [e .]]]

some of that paper isn't be+ing cook+ed . what is going_to be die+ing ? some of the best of happy question hasn't glow+ed today . have the best damage+ed itself how ? don't be easy ! are the sad+est of that going_to be stand+ing about ? its kiss+s the best . have the_Chancellor_of_the_Exchequer been good ? the best of it was be+ing broken by who fast ?
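The deletion of structure that turns the bracketed trees of Figure 6 into the word strings beneath them can be sketched as follows. This is my own sketch, relying only on the convention visible in the figure that a category label is the first token after each opening bracket:

```python
import re

def leaves(tree_string):
    """Strip a GENESYS-style bracketed tree down to its leaf items. A token
    immediately following '[' is a node label and is skipped; every other
    non-bracket token is a leaf (word, inflection or punctuation)."""
    tokens = re.findall(r"\[|\]|[^\s\[\]]+", tree_string)
    words, prev = [], None
    for tok in tokens:
        if tok not in "[]" and prev != "[":
            words.append(tok)
        prev = tok
    return " ".join(words)

print(leaves("[Z [Cl [O don't] [M be] [C2/At [qqgp [a easy]]] [e !]]]"))
# → don't be easy !
```

Applied to each tree in Figure 6, this reproduces the corresponding word string shown above.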

These early examples of GENESYS generator output illustrate some of the key differences between the parse trees produced by the generator and those found in the POW corpus. Firstly, these have yet to be passed through a process converting them into the finished spoken or written output, so morphological rules have not yet been applied to damage + ed to produce damaged, for example4. Secondly, on the face of it, some of this output appears semantically anomalous, in the same way as Chomsky’s famous colourless green ideas sleep furiously. The generator, in normal use, would produce output in response to a previous utterance, in line with some overall conversational plan, and would be subject to higher semantic constraints. Some of these examples are also syntactically imperfect, needing the application of subject-verb agreement control, for example. More significantly, though, from the point of view of training material for a parser, the tree structures contain an extra level of labelling, called participant roles, which is conflated onto the nodes for elements of structure within the clause or lower down the tree. In the tree

[Z [Cl [S/Af [ngp [ds [qqgp [dd the] [a best]]] [vs of] [h it]]] [Xc was] [Xp be+ing] [M broken] [C2 [pgp [p by] [cv/Ag [ngp [h who]]]]] [Ama [qqgp [a fast]]] [e .]]]

the labels S/Af and cv/Ag indicate a subject acting as the affected element of the main verb, and a completive being the agent causing the action, in this case of something being broken. Such participant roles were not included in the labelling of the POW corpus, so they cannot be automatically extracted in a syntactic formalism trained on the corpus. The trees also contain a less delicate syntactic description than that found in the corpus (although obviously this difference is becoming less significant as the generator grammar grows). Nevertheless, the GENESYS trees also have one further important property: they are all well-formed with respect to headedness, a point we shall return to in section 2.2.4, where we see that the same property certainly isn’t guaranteed for POW corpus trees.

4 It so happens that all of these examples result from the [written] feature having been chosen in the spoken/written system of the network.

2.2.2 Systemic Grammar in NL Parsing. We focus here on the use of SFG in parsing. Non-SFG approaches are described in section 2.4.

Parsing in SFG should ideally be part of a wider process of interpretation: taking a sentence and finding the set of semantic features which would have been chosen to generate it. Theoretically at least, it may be possible to derive the semantic features directly from the words in the sentence to be interpreted, given a mechanism for searching the system networks and realisation rules. However, most researchers have assumed that it is sensible first to find one or more syntactic structures for a sentence before mapping those structures onto the relevant semantic choices. Systemic functional syntax is normally represented in the form of realisation rules associated with choices in the system network of semantic features. Both of these formalisms have to be hand-crafted in a SFG model, and are not immediately amenable to standard parsing techniques. One might therefore be tempted to conclude that the goal of unrestricted NL parsing and the use of SFG were in some way at odds with each other, since one could only interpret as much as one could generate: the range of syntactic structures in a corpus of unrestricted English is likely to be considerably broader than that hand-crafted into the system network. The solution to this dilemma in the present work is to allow wider syntax than, say, the COMMUNAL generator can currently handle, and to assume that the generator builders will expand the lexico-grammar to become more and more comprehensive, approaching the coverage of a corpus, as Fawcett (1992; 23) proposes.

Despite its clear focus on language generation, several computational implementations of systemic grammar for interpretation do exist. The earliest and best known of these is probably Winograd's SHRDLU (Winograd 1972), which could understand commands and queries in a simple blocks world. More recently, Kasper has developed a parser for the Nigel grammar which uses the Functional Unification Grammar formalism (Kasper 1988, 1989). The Nigel grammar has also been central to the systemic interpretation work of Patten (1988) and O’Donnell (1994). These researchers tend to focus on the difficult problem of semantic interpretation using system networks and realisation rules in reverse, as does O’Donoghue (1994). If we separate the process of syntactic parsing from semantic interpretation, Fawcett and Tucker's COMMUNAL SF syntax has been used in the syntactic parsers of O'Donoghue (1991d, 1993) and Weerasinghe (1994).

O'Donoghue's parser uses a vertical strip grammar, which captures the relations between a leaf (word) and its successive parent nodes working up through the tree to the root (sentence) node, instead of traditional phrase structure rules, which relate an ordered horizontal set of daughters to one mother. The grammar is automatically extracted from an artificial corpus of English produced by GENESYS. The parser associated with the grammar, the vertical strip parser, achieves correct results 81% of the time on 1,000 test sentences which were generated by GENESYS (like those contained in the trees in Figure 6), but not used for the grammar extraction. The parser's success is limited to sentences which contain structures of the same depth (in terms of vertical strips) as those in the training material, and it employs a lexicon of 940 items extracted from GENESYS PG 1.5 output.
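The notion of a vertical strip can be illustrated with a short sketch. The tree encoding and the example labels below are invented for illustration; they are not O'Donoghue's actual data format.

```python
# Sketch of vertical strip extraction (illustrative encoding, not
# O'Donoghue's actual data structures). A node is (label, children);
# a leaf is (label, word). Each strip links a word to the chain of
# labels running up to the root.

def vertical_strips(tree, path=()):
    label, children = tree
    path = path + (label,)
    if isinstance(children, str):          # leaf: children is the word itself
        return [(children, path)]          # path is root-to-leaf; reverse to
    strips = []                            # read the strip leaf-to-root
    for child in children:
        strips.extend(vertical_strips(child, path))
    return strips

# "maybe" as a one-word modal adjunct: Z -> CL -> AM -> maybe
tree = ("Z", [("CL", [("AM", "maybe")])])
print(vertical_strips(tree))
# [('maybe', ('Z', 'CL', 'AM'))]
```

In a strip grammar the path tuples themselves, rather than mother-daughter rules, become the units of the syntactic model.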

Weerasinghe's probabilistic on-line parser (POP) has been carefully developed (like O'Donoghue's) to act as a robust syntactic parser able to keep pace with the grammar used in the COMMUNAL generator. His results are very good (85% exact match success on a test comparable to O'Donoghue's). His parser uses a modified chart parsing algorithm, with a combined probabilistic model derived from both the POW corpus and GENESYS output. The syntactic model, however, is obtained solely from GENESYS output. This level of success is achieved by a combination of modifications to a standard chart parser using a context-free grammar. Firstly, the syntactic model consists of a horizontal component capturing the probability of transition from one element to another (akin to the linear precedence model in GPSG, but with added probabilities). There is a vertical component capturing the likelihood of a particular mother for any daughter (akin to a probabilistic immediate dominance model in GPSG). The lexicon consists of only 355 items extracted from the GENESYS 'maxi' lexico-grammar. The parsing algorithm itself has been modified to control explosion in the agenda caused by large-scale grammars. This has been achieved by a one-word look-ahead before proposing a new chart hypothesis, by ordering the agenda according to likelihood, but favouring wider-spanning edges over those containing fewer words, and (crucially) by relying on the appearance of a head for each constituent before such a constituent can be built (Weerasinghe 1994; 110-11). These very promising results have yet to be replicated with large-scale lexicons and unconstrained conversational dialogue input.
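The agenda-ordering heuristic just described (prefer wider-spanning edges, then more likely ones) can be sketched as follows. The edge representation and the probabilities are invented for illustration; this is not Weerasinghe's implementation.

```python
import heapq

# Illustrative sketch of agenda ordering in a chart parser: prefer
# edges spanning more words, breaking ties by higher probability.
# Edges are (start, end, label, probability) tuples, invented here.

def agenda_key(edge):
    start, end, label, prob = edge
    return (-(end - start), -prob)        # wider first, then more likely

agenda = []
for edge in [(0, 1, "S", 0.9), (0, 3, "CL", 0.2), (1, 3, "C", 0.6)]:
    heapq.heappush(agenda, (agenda_key(edge), edge))

best = heapq.heappop(agenda)[1]
print(best)                               # (0, 3, 'CL', 0.2): widest span wins
```

Ordering the agenda this way biases the search towards completing large constituents early, at the cost of occasionally pursuing an unlikely wide edge before a likely narrow one.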

At Leeds, a series of other parsing attempts based on a genuine corpus-trained syntax model taken from the POW corpus began in 1988, when Atwell and Souter (1988a) attempted to load a POW Definite Clause Grammar into POPLOG Prolog, but found that memory limitations prevented the built-in parser-generator function from being applied to such a large grammar. Then, in the first phase of the COMMUNAL project (in which Leeds were partners with the Cardiff team), a simulated annealing parser called the RAP was developed, using a stochastic finite state automaton derived from POW as an evaluation function for the putative trees proposed by annealing (Atwell et al 1988, Souter 1989a, b, Souter and O'Donoghue 1991). The resulting parser was never fully tested, since it was extremely slow, and unreliable on sentences containing long-distance dependencies. However, O'Donoghue later trained the annealing parser (by then re-named the Dynamic Annealing Parser, DAP) on some GENESYS output, and achieved exact match success rates of around 30% on tests for 'unseen' generator output (O'Donoghue 1993; 114). This forms the background to the present parser development. It is important to remember that the purpose of parsing need not solely be to progress towards a more robust systemic interpreter; parsers have many other uses in which a systemic description may be profitable. We now look at some of the difficulties a rich systemic-functional grammar presents for parsing programs.

2.2.3 Problems for a SFG Parser. From the point of view of parsing, several problematic features of SFG should be highlighted. Firstly, the grammatical description includes not only formal categories (units), but also functional labels (elements of structure) and a further set of labels for the participant roles which are attached to the main verb in a clause, such as the agent who performs an action, and its affected entity, or the carrier of an attribute, or the location of an entity or action. When a constituent of a sentence is given a label specifying its form, it may therefore also be given a functional label and a participant role (depending on the particular constituent). As we have seen, the examples in Figure 6 include the participant role labels Ag, Af and Ca, which are conflated onto node labels for elements of structure. The designer of a parsing program for SFG has to decide where to derive the grammatical model from, and whether to apply these different layers of labelling on separate nodes or have them conflated onto one level.

The fact that the grammar was developed for generation may have an adverse effect on parsing. For instance, a lexical item may, for the purpose of generation, be labelled differently according to the larger construct it is part of. Such multiple labellings exacerbate the problem of ambiguity in parsing. In recent versions of SFG, different labels would be given to the wordform of in examples (1) to (3).

(1) Some of the men
(2) Part of the cake
(3) A picture of John

A clear distinction between terminal and non-terminal categories is not maintained in SFG. A non-terminal category which most frequently labels some higher-level constituent in the sentence can exceptionally be employed as a terminal category if the lexical item it labels cannot be productively combined with other items within that constituent. For example, the label AM (modal adjunct) would normally be given to each of examples (4) to (6).

(4) Quite possibly
(5) Almost certainly
(6) Maybe

(4) and (5) would be given the constituent structures in (7), whereas (6) would have the simple structure of a terminal category (8), since the word maybe cannot be productively combined in a modal adjunct. Ideally, a large-scale lexicon should reflect this fact, but most lexical resources (other than a lexicon extracted from the POW corpus) fail to capture these distinctions, so alternative solutions have to be devised.

(7a) [AM [T quite] [AX possibly]]
(7b) [AM [T almost] [AX certainly]]
(8) [AM maybe]

2.2.4 Systemic-Functional Grammar in the POW Corpus. From the point of view of parsing, the most comprehensive and explicit description of SF syntax is found in the POW corpus. The version of SF syntax used in analysing the POW corpus was produced in the early 1980s, and was therefore a forerunner of that used for the COMMUNAL project's generator. The latter is constantly being amended and refined by Robin Fawcett and Gordon Tucker. The corpus grammar is preferred to that contained in the generator because it is stable, offers the possibility of extracting an authentic probabilistic grammar and lexicon, and as yet contains a wider range of constructs than the generator can produce. However, the POW corpus does not contain participant roles.

As a reader not familiar with systemic linguistics may have already discovered, the terminology of SFG is quite different to that of generative grammars such as Transformational Grammar or Generalised Phrase Structure Grammar. In the corpus, a syntax tree is characterised by having two alternating types of category labels. The first are called elements of structure, such as Subject (S), Complement (C), Adjunct (A), head (h), modifier (mo) and qualifier (q). Note that, in a hand-analysis, capital letters are used for elements of clause structure, and lower case letters for elements of group (and cluster) structure. In the machine-readable version of the corpus, capitals are used throughout. Elements of structure are filled by the second type of category, i.e. units: elements of clause structure are filled either by subordinate clauses, by groups (cf. phrases in TG or GPSG) such as the nominal group (ngp), prepositional group (pgp) and quantity-quality group (qqgp), or by clusters such as the genitive cluster (gc). Terminal elements of structure are expounded by lexical items. The top-level symbol is Z (sigma in the hand-written form) and is invariably filled by one or more clauses (Cl). Trees tend to be fairly flat, but richly labelled, immediately below the clause level, notably because of the absence of a predicate or verb phrase constituent. (This has a direct effect on the size and shape of the formal grammar which can be extracted from the parsed corpus, as illustrated in chapter 3.) Some areas have a very elaborate description, eg: there are 15 types of adjuncts, six types of modifiers, nine different determiners, and ten auxiliaries. Other categories are relatively coarse, eg: main-verb (M), head (h), and apex (ax). (The apex occurs in a quantity-quality group, and is typically expounded by an adverb or adjective.) A comprehensive list of all the categories used in the hand parsing of the POW corpus is given in Appendix 3, with details of whether each symbol is used as a non-terminal or terminal category (or both), and some example lexical items which expound the terminal categories. The main componence relationships found in the corpus will now be exemplified in turn.
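The extraction of componence rules from bracketed analyses such as (7a) and (8) above can be sketched as follows. The bracket reader mirrors the notation used in this chapter, but the code is an illustrative sketch, not the extraction software actually used on the POW corpus.

```python
# Sketch: extract mother -> daughters componence rules from the
# bracketed notation used above, e.g. [AM [T quite] [AX possibly]].
# A minimal reader; assumes well-formed, space-separated input.

def read_tree(tokens):
    assert tokens.pop(0) == "["
    label, children = tokens.pop(0), []
    while tokens[0] != "]":
        if tokens[0] == "[":
            children.append(read_tree(tokens))
        else:
            children.append(tokens.pop(0))     # lexical item
    tokens.pop(0)                              # discard "]"
    return (label, children)

def rules(tree, found=None):
    found = [] if found is None else found
    label, children = tree
    daughters = [c[0] if isinstance(c, tuple) else None for c in children]
    if any(d is not None for d in daughters):  # skip purely lexical expansions
        found.append((label, tuple(d for d in daughters if d is not None)))
    for c in children:
        if isinstance(c, tuple):
            rules(c, found)
    return found

toks = "[ AM [ T quite ] [ AX possibly ] ]".split()
print(rules(read_tree(toks)))
# [('AM', ('T', 'AX'))]
```

Counting how often each extracted rule occurs across the whole corpus yields the observed frequencies used in the probabilistic figures below.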

The Clause

The clause displays the greatest range of structural variation in the corpus. The declarative main clause right so tomorrow you are just going to wake up your father and I before eight o'clock might be labelled with the following constituents:

CL:  FR     &    A         S    O    A     X         M     CM   C                   A
     right  so   tomorrow  you  are  just  going to  wake  up   your father and I   before eight o'clock

where FR = frame, & = linker, A = adjunct, S = subject, O = operator, X = one or more auxiliaries, M = main verb, CM = main verb completing complement, and C = complement. This by no means exhausts the potential lists of daughters in a clause, as Appendix 3 shows. Similar variations are possible for other clauses, for imperatives, interrogatives and subordinate clauses, with most of the constituents being optional, some being able to be repeated at several positions, and some being mutually exclusive. Weerasinghe (1994; 111) states that the head of the clause in SF syntax is the main-verb (M), and uses this to build well-formed clause edges in his parser. The main verb is certainly the semantic head (being responsible for the pattern of participant roles in a clause) and is certainly responsible for some syntactic patterns such as potential complementation, but one may argue that the operator, as the first auxiliary, performs the job of syntactic head (being the element which agrees with the subject in person and number), or indeed the complete set of auxiliaries and modals, since they show the tense and mood of the clause. If we explore the evidence found in the POW corpus, and generously take the operator, main verb or any auxiliary to be a possible head daughter for the clause, we obtain some interesting figures for headless clauses found in the corpus.

Mother    Alternative Heads
CL        M, O, OM, OMN, ON, OX, OXN, X, XM, XMN, XN

Of all the componence rules for clauses found in the corpus, 28.6% of the tokens are found to be without such a head. Several of these will describe the structure of a clause containing only a formula, such as yes or no. Some others will be the result of ellipted verbs or auxiliaries, whilst others belong to neither of these categories. Most of the headless clauses are those containing only a subject, a complement, an adjunct or even just a conjunction.
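The token-weighted calculation behind such figures can be sketched as follows. The rule frequencies below are invented for illustration, so the output is a toy value, not the 28.6% POW figure.

```python
# Token-weighted proportion of clause rules lacking a head, where the
# head set is generously taken as {M, O, OM, ...}. Rule frequencies
# here are invented; the 28.6% figure in the text comes from POW.

CLAUSE_HEADS = {"M", "O", "OM", "OMN", "ON", "OX", "OXN", "X", "XM", "XMN", "XN"}

def headless_share(rules):
    """rules: list of (daughters, frequency) pairs for one mother unit."""
    total = sum(freq for _, freq in rules)
    headless = sum(freq for daughters, freq in rules
                   if not CLAUSE_HEADS.intersection(daughters))
    return headless / total

rules = [(("S", "M", "C"), 60),     # headed
         (("S", "O", "M"), 20),     # headed
         (("F",), 15),              # formula only: headless
         (("S", "C"), 5)]           # ellipted verb: headless
print(f"{headless_share(rules):.1%}")   # 20.0%
```

The same function, with a different head set, yields the corresponding figures quoted below for nominal, preposition and quantity-quality groups.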

The Nominal Group

A (particularly productive) nominal group and some of the greatest of the English world cup footballers who gained more than fifty caps might consist of the following daughters:

NGP:  &    DQ    VO  DS            VO  DD   MO       MOTH       H            Q
      and  some  of  the greatest  of  the  English  world cup  footballers  who gained more than fifty caps

where & = linker, DQ = quantifying determiner, VO = of, DS = superlative determiner, DD = deictic determiner, MO = one or more modifiers, MOTH = one or more thing-modifiers, H = head, and Q = one or more qualifiers. Again the variation is much greater than shown here, permitting pronouns and name-like heads of the nominal group. If we again look at the corpus, taking the set of possible NGP heads to be {H, HN, HP, HPN, HSIT, HWH}, we find that 9.6% of componence rule tokens appear without a head. The headless nominal groups tend to be just determiners, modifiers, or both.

The Preposition Group

The other units in Fawcett's SF syntax for POW show somewhat less variation. The preposition group and straight to the top maximally involves four possible daughters:

PGP:  &    T         P   CV
      and  straight  to  the top

where & = linker, T = temperer, P = preposition and CV = completive. The first two of these elements are optional, and there is an alternative label for the head (PM) where it is a main verb completing preposition (for prepositional verbs). One might assume that both the preposition and completive were compulsory, and certainly the preposition itself, but taking the head of a PGP to be either P or PM, we find 3.9% of componence rules for PGP occurring without such a head in the corpus. The headless preposition groups contain just completive NGPs (i.e. the prepositional complement on its own).

The Quantity-Quality Group

This unit covers two constituents often dealt with separately in other grammatical descriptions, those of adjective groups and adverbial groups, since the potential structure of the two is the same, despite their different functions as modifiers, qualifiers and adjuncts, for example. The structure of and very fast indeed at running is superficially the same as would be obtained by replacing the word fast (adj) with quickly (adv):

QQGP:  &    T     AX    FI      SC
       and  very  fast  indeed  at running

Here, the labels are & = linker, T = temperer, AX = apex, FI = finisher and SC = scope. The use of AX to label the heads of (quantity-quality groups filling) modifiers, qualifiers and adjuncts introduces a great deal of syntactic ambiguity into the parser, which could be avoided if the tag labelling had been distinct in each of these cases. However, in SFG the emphasis is on classifying functional roles in generation (at the expense of ease of syntactic analysis, in this case). The apex (AX) is the head of the QQGP, with alternative forms for superlative/comparative (AXT) and wh-variants (AXWH). If we treat all three of these as possible heads when exploring the corpus, we find only 0.4% of QQGP componence rule tokens are lacking a head. The headless quantity-quality groups contain temperers, scopes and finishers, each on their own or in combination.

The Genitive Cluster

The genitive cluster is used in SFG to express belonging or possession, which, in English, is found with the possessor being a full nominal group, or being pronominal, such as:

GC:  PS    G    OWN
     John  's   own   (favourite)

GC:  (PS)  G    OWN
           My   own   (favourite)

Here the labels are PS = possessor, G = genitive element and OWN = owner. Note that in the case of a pronominal, the possessing nominal group is subsumed within the genitive element. In the case of nested possessors, the PS element is filled by a further GC. The genitive element is the head of the cluster.

The Text Unit

The final unit found in the POW corpus is the text unit, which labels citations, such as he said 'shut up'. There is little variation in this rare structure, with the TEXT label consisting of a subsentence label, Z. For example, he said go home may be bracketed as:

[Z [CL [S he] [M said] [C [TEXT [Z [M go] [C home]]]]]]

Notice that this structure automatically licenses an alternative analysis for almost any sentence, in which the whole sentence fills a TEXT unit because it is being uttered as a response to a question such as What did he say?

To conclude this description of the SF syntax found in the POW corpus, and in particular the data on headless units, we will estimate the proportion of the whole corpus which consists of utterances containing at least one constituent unit without a head. The percentage figures I have given are obtained by searching through all the componence rules extracted from the corpus (for more on this syntactic formalism see section 3.1). Each rule has an observed frequency (which can be represented as a percentage), and by adding together the percentages for all clause rules without heads, we get an overall figure of 28.6%. It is not straightforward to associate this result directly with the number of sentences in the corpus which are 'ill-formed' in this way. Every tree would have to be checked individually, to see if it matched the list of headless rules. We can assume, though, that there will be some examples of sentences which contain more than one headless constituent (co-ordinated clauses, for example). Consequently, we might estimate that about 25% of all sentences have at least one headless clause. However, some of the headed clauses would also contain headless subconstituents, so a figure of between 25% and 30% of all sentences is a fair estimate of the number of sentences missing at least one head in some part of their substructure. This finding has significant repercussions for any method of parsing which relies on the notion of syntactic head, including the Alvey Natural Language Tools parser, and that developed by Weerasinghe.
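The per-tree check described above can be sketched as a recursive scan. The head sets are abbreviated versions of those discussed in this section, and the tree encoding is invented for illustration.

```python
# Sketch of the per-tree check: does a parse tree contain at least one
# constituent lacking a head daughter? Head sets below are abbreviated
# versions of those discussed in the text; trees are (label, children)
# with a leaf encoded as (label, word).

HEADS = {"CL":  {"M", "O", "OM", "X", "XM"},
         "NGP": {"H", "HN", "HP"},
         "PGP": {"P", "PM"}}

def has_headless(tree):
    label, children = tree
    if isinstance(children, str):          # leaf: nothing more to check
        return False
    daughters = [c[0] for c in children]
    if label in HEADS and not HEADS[label].intersection(daughters):
        return True
    return any(has_headless(c) for c in children)

# A verbless clause: a bare complement answer such as "the top".
tree = ("CL", [("C", [("NGP", [("DD", "the"), ("H", "top")])])])
print(has_headless(tree))   # True: the clause has no verbal head
```

Running such a scan over every tree in the corpus would give the exact sentence-level figure that the 25-30% estimate approximates.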

2.2.5 Choosing a Grammar Formalism and its Effect on Parsing. Several different grammatical descriptions and formalisms have been devised by linguists, logicians and computer scientists. As yet, there is no consensus as to which most adequately and elegantly captures all the complexity of unrestricted natural language. It is likely that this situation will persist, as grammars tend to be developed for varying purposes, most notably either for language interpretation or for language generation. However, two endeavours which are attempting to introduce an element of competition into grammar and parser development are the US DARPA sponsored Message Understanding Conference, and the Limerick Workshop on Industrial Parsing of Software Manuals. In the former, research teams are asked to automatically parse a set of unseen, randomly chosen texts, to extract pre-defined fields of semantic content, and their results are graded and published. In the latter, the focus has been more on syntactic than semantic content. One of the main findings of such endeavours has been that it is very difficult to compare different parsing schemes in an objective manner, and that there is still little agreement as to what a parser should produce.

Having chosen the SFG contained in the POW corpus as a grammatical description, there remains the selection of a formalism for the grammar. Bound up in this decision is the choice of parsing algorithm, since different parsing algorithms work with different formalisms.

The grammar description can be extracted from the parsed corpus automatically in any formalism which is compatible with the way the parse trees in the corpus have been represented. Typically this will be limited to using a finite-state grammar, context-free grammar, or perhaps a vertical strip grammar (examples of each are given in chapter 3). None of the parsed corpora mentioned in section 2.1 have been annotated with category labels which are complexes of features, rather than being atomic (although many of the atomic labels are constructed from two or three letters which include the grammatical information some features contain). As a consequence, the corpus-based grammar will very possibly be less powerful (in terms of the Chomsky hierarchy) than its counterparts in the competence paradigm.

Geoffrey Sampson, who was involved in both the Leeds/Lancaster Treebank and the Susanne corpus-annotation projects, has argued that evidence from such corpora suggests that the number of rules needed to describe just the noun phrases in the Leeds/Lancaster treebank is open-ended (Sampson 1987b). Certainly, if the unique rules extracted from hand-parsed corpora are expressed in a simple context-free phrase-structure rule formalism, they number several thousand (Atwell and Souter 1988a, Souter 1990) and are by no means exhaustive. Taylor, Grover and Briscoe (1989), in response to Sampson, argue that it is the very simple nature of the formalism which causes such open-endedness. They claim that, given rules which capture generalisations such as recursion, and categories expressed as sets of features rather than atomic labels, the number of rules needed to describe English noun phrases (and, we are given to assume, English grammar in general) can be reduced to a much smaller, finite set. One example in the POW corpus where this is not the case is in the handling of agreement (between subject and main verb, for example). The POW corpus parse trees do not explicitly mark person and number agreement between categories, so introducing a feature which enabled this to happen, although desirable from the parsing viewpoint, would not reduce the number of categories in the grammar.

If we are to automatically take advantage of the grammatical information contained in the POW corpus (or any other parsed corpus), we will not be able to use non-atomic category labels, as these were not included in the original analysis. Taylor et al.'s grammar, derived from the Alvey Natural Language Toolkit (Grover et al. 1987), was manually constructed and successively modified in the light of the LOB corpus data. Unfortunately, no such formal grammar was created during the hand parsing of the POW corpus, and certainly not one which allowed for recursion in the rules and non-atomic category labels. This is because the aim of the POW corpus compilation was the study of child language development, and not natural language processing. Nevertheless, we are able to extract very large syntactic formalisms which are amenable to parsing, in the form of finite state models, sets of context-free rules, and dominance rules such as vertical trigrams and vertical strips. However, it is not possible to extract the principal SFG formalism which is amenable to NL generation, a system network, directly from the corpus text. The syntactic formalisms which are amenable to parsing are not necessarily at odds with the SFG model for generation; they merely have a different focus. They focus purely on the formal (and some functional) aspects of the SFG model, without addressing the semantic basis of SFG.

I am somewhat cautious as to the ultimate value of manually building a large rule-based syntax model, as there will always be some new sentences which contain structures not catered for in the grammar, so any parser using such a grammar will hardly be robust. Defenders of the competence paradigm would argue that corpora have gaps too, which is a valid point. But the gaps in corpora will almost certainly be fewer. Furthermore, the endeavours of the TOSCA group working under the direction of Jan Aarts at Nijmegen University, Holland, have been to incrementally build such a rule set using Extended Affix Grammar (EAG: Aarts and Oostdijk 1988), with reference to their TOSCA corpus. Their grammar now consists of several thousand rules, and parsing frequently results in several tens or even hundreds of ambiguous analyses. The choice as to which analyses are the "right" ones must appeal to syntactic, semantic and pragmatic levels of information. Taylor et al. (1989: 258) also came across this kind of large-scale ambiguity problem. Rather than manually search through all the ambiguous parses for the semantically "correct" one, they decided only to manually apply the rules in the grammar to check that the semantically correct analysis could potentially be found. They also assume that spurious analyses may be filtered out by some sort of semantic component.

The scale of this multiple ambiguity problem lends weight to a final reason for adopting a corpus-based formalism. Corpus-based grammars provide the opportunity for recording the frequency of a wordform or construct. Such frequencies can be used to modify the search strategy of the parsing algorithm chosen for the grammar, and consequently order by likelihood the ambiguous analyses a large-scale grammar produces. If experiments show that (one of) the most likely parse(s) is the correct one, then it will not be necessary to have the parser produce all possible solutions. One implementation of a parser which adopts this approach using the POW corpus SFG exists: the Realistic Annealing Parser (see Atwell et al 1989, Souter and O'Donoghue 1991), which is described in section 2.4.

So far in this chapter I have argued in favour of a corpus rather than intuition-based approach to computational linguistics, and in particular for the use of parsed corpora as a source of grammatical information for parsing. As the primary source of linguistic data I have chosen the Polytechnic of Wales corpus, and the systemic functional description it contains, since there is more truth in the corpus than in an artificial corpus of generator output. Next, I will consider the lexical facilities that a wide-coverage corpus-based parser might need.

2.3 Lexical Resources for Corpus-Based Parsing. The development of lexicons for natural language processing has in many ways followed the same path as the development of grammars. In small-scale systems, researchers were content simply to choose a core list of words they would like to be able to deal with, and hand-craft the lexicon entries with phonological forms, syntactic categories, semantic fields and representations, etc. The lexicon would perhaps be adequate for the few sentences the researcher was interested in, but useless for anyone concerned with unrestricted English. In the present project, we would ideally like to be able to provide, for any wordform in the language, appropriate grammatical tags from the terminal categories in the POW corpus, and preferably a probability measure of the occurrence of the wordform with each particular tag. For instance, given the wordform bricks, the lexical lookup process which initialises the parser might return

[[73 BRICKS H][1 BRICKS M]]

which would provide the parser with the information that bricks can be a noun (H) or main verb (M) with frequencies of 73 and 1 respectively. Such frequencies could be turned into probabilities by dividing the frequency of occurrence with a particular tag by the total frequency of occurrence with any tag (in this case 74). In a speech recognition application, this probability should then be multiplied by the probability of the wordform occurring in the language, which can be estimated from a raw corpus. However, in the current application, I use only tag probabilities, since the nature of the input is not in question.

There are at least three approaches one might adopt to the provision of a large-scale lexicon for robust parsing. Firstly, we could attempt to extract a list of wordform/wordtag correspondences with their frequencies from the chosen corpus itself, or use a corpus-trained probabilistic tagger akin to the constituent likelihood automatic word-tagging system (CLAWS) (Leech et al 1983, Atwell et al 1984). Alternatively, we could use a traditional dictionary-style morpheme list and a morphological analyser to strip off affixes before looking up a word. As a third option we could list all the morphological variants in the lexicon itself without taking advantage of the regular rules of English morphology, and thereby make the lexicon much larger. In practice, each of these options has its advantages and disadvantages. One problem common to them all is how to handle wordforms not covered by the lexicon5, such as proper nouns, neologisms, compounds and idioms, akin to the problem of grammatical undergeneration.

2.3.1 Corpus-Based Tag Assignment. Currently, wordlists with disambiguated frequency information can only be obtained from grammatically tagged or fully annotated corpora. These lexicons tend to be larger than could easily be produced by hand (see section 3.2 for an example from the POW corpus), but are still not really adequate for a project aiming to handle unrestricted English. They have the advantage that they do provide disambiguated frequencies for wordforms in the corpus, and consequently can be used as lexicons for prototype probabilistic parsers which do not have pretensions of unrestricted lexical coverage. The POW corpus, which contains 65,000 words, yields a lexicon of 4,618 unique words with syntactic categories and their disambiguated frequencies. To obtain a larger wordlist would require an extremely large SFG tagged corpus (which sadly doesn't exist).

5 The CLAWS approach has a set of affix rules it uses to help tag assignment, rather than lexical look up. However, a lexicon is used for the cases when the affix rules fail, and for idioms.

An alternative to straightforward lexical look-up from a corpus-derived lexicon is a tagging program based on the same SF syntax model. Two approaches to building such a tagger exist, based either on a probabilistic model or on co-occurrence rules. A probabilistic method was used to produce the 1 million word tagged LOB corpus (Johansson et al 1986), which resulted from a semi-automatic approach to word tagging, called the constituent likelihood automatic word-tagging system (CLAWS). A portion of the Brown corpus which had first been grammatically tagged by a rule-based program, and then corrected by hand, was used to extract a probabilistic model of the relations between word tags in context. The program was then able to indicate unambiguously what the grammatical tag should be for each new word in the corpus, achieving 95-96% accuracy. Remaining errors were corrected by a manual post-editing phase to create a completely tagged corpus. The 1 million word tokens are examples of around 50,000 word types, which is a sizeable lexicon, although this does include many proper nouns and other items such as punctuation marks which would not normally be contained in a traditional dictionary.

The main aim of the CLAWS project was, however, not to produce a lexical look up procedure for a parser, but to create a tagging program which could also be re-used on other corpora (Garside 1987). To be used as a first step in SFG parsing, the tagger would need to be retrained to deal with the SFG grammar, by extracting tag co-occurrence frequencies from the POW corpus. Although an early version of CLAWS is now publicly available, it has not to my knowledge been designed to automatically accommodate other tagging schemes. Alternatively, the output of CLAWS as it stands might be mapped onto SFG categories. These options would probably reduce its accuracy, depending on how bound CLAWS is to the type of published, written language the LOB corpus contains.

A further probabilistic (Markov model) tagger has been developed by Church (1988), trained on the tagged Brown corpus, and employing corpus-based bigrams and trigrams, as well as lexical probabilities. The PARTS tagger has reported accuracy rates of between 95% and 99%, depending on text type and evaluation measure.
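Bigram tagging of this kind can be sketched with a minimal Viterbi search. All probabilities below are invented toy figures, not values trained on the Brown corpus, and the two-tag tagset is purely illustrative.

```python
# Minimal Viterbi sketch of bigram tagging in the style of Church's
# PARTS tagger: combine lexical probabilities P(word|tag-like score)
# with tag transition probabilities. All numbers are invented.

def viterbi(words, tags, trans, lex):
    best = {t: (lex.get((words[0], t), 0.0), [t]) for t in tags}
    for w in words[1:]:
        new = {}
        for t in tags:
            p, prev = max((best[s][0] * trans.get((s, t), 0.0), best[s][1])
                          for s in tags)
            new[t] = (p * lex.get((w, t), 0.0), prev + [t])
        best = new
    return max(best.values())[1]

tags = ["H", "M"]                      # noun-like head vs main-verb tags
trans = {("H", "M"): 0.4, ("H", "H"): 0.3, ("M", "H"): 0.5, ("M", "M"): 0.1}
lex = {("dogs", "H"): 0.9, ("dogs", "M"): 0.1,
       ("bark", "M"): 0.7, ("bark", "H"): 0.3}
print(viterbi(["dogs", "bark"], tags, trans, lex))   # ['H', 'M']
```

A trigram model extends the same search by conditioning each transition on the two preceding tags rather than one.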

Recently, two different teams have developed tagging programs able to surpass (the lower end of) these success rates, using context rules, with limited use of probabilities. In Helsinki, a formalism for describing English grammar called ENGCG (English Constraint Grammar; see for example Karlsson et al 1995) has been built. This model is essentially a hand-crafted approach using linguistic knowledge, in the form of a large lexicon and morphological rules. As well as assigning parts of speech to each word, a skeletal parse outlining noun phrase structure is performed, and some functional elements such as a clause's subject can be recognised. The same formalism is being used for English, Finnish, Swedish, German and Basque. It is not clear how one could make use of this tagger without also subscribing to its tagging scheme.

Another recent alternative to CLAWS, which can be trained on the POW corpus data, is the Brill tagger (Brill 1992, 93, 94). Instead of using a purely stochastic model of the co-occurrence of parts of speech, Brill's model learns a small set of context rules from a tagged corpus, using a technique called transformation-based error-driven learning; the result is a hybrid between a rule-based and a purely stochastic model. The method consists of:

(i) extracting a lexicon from part of a manually tagged training corpus, which he refers to as the truth;
(ii) stripping the tags from the truth, to make a raw corpus;
(iii) making a first guess at the best tag for each word by choosing the most frequent one from the lexicon;
(iv) comparing the guessed tags to the truth, to calculate an error rate;
(v) generating all possible transformation rules (from a set of about six templates). An example template is 'change tag a to tag b when the preceding word is tagged y and the following word is tagged z'; an example transformation rule resulting from this might be 'change the tag from noun to verb if the previous tag is modal';
(vi) applying each of the many rules in turn, and evaluating by comparing the updated tags to the truth;
(vii) selecting the rule that removes the most errors, then going back to (iii) and repeating until no new rule reduces the error rate, or a preset threshold is reached.

Brill originally trained his model to tag the Penn Treebank, with success rates of 96-97% for single tag assignment (just in excess of CLAWS), and up to 99% success when the evaluation criteria are relaxed to allow an average of 1.5 tags per word. The original tagger described here now also contains facilities to permit tagging of words not found in the training corpus (although with a lower success rate of 85%), and to permit re-training on other tagged corpora. John Hughes has trained the Brill tagger on the POW corpus, among others (Hughes and Atwell, forthcoming). Examples of the context rules and the lexical tagging rules produced by this training process are found in Appendices 10 and 11. The POW corpus is relatively small compared to other tagged corpora used by Brill (of up to 1 million words), so we may expect a reduction in tagging success rate. In particular, its lexical coverage is very small, so the rules for tagging unknown words will be based on a limited training corpus.

The Brill tagger assigns just one tag per lexical item, so any tagging error will have a significant impact on parsing. Equally, it has no mechanism for dealing with multi-word lexical items, unless these are tokenised together by means of hyphens. Consequently, we may wish to explore alternative lexical look-up techniques in parallel, such as using dictionary material.

2.3.2 Dictionaries and Morphological Analysers

Machine-readable dictionaries (MRDs) such as the Longman Dictionary of Contemporary English (LDOCE: Procter 1978) and the Oxford Advanced Learner's Dictionary (OALD: Hornby 1974) are lexical resources which offer some hope to computational linguists interested in robust parsing, but until very recently they have been developed using the lexicographer's competence rather than a large-scale observational survey of the language. One dictionary which has resulted from the analysis of a large corpus (of 20 million words) is COBUILD (Sinclair 1987), but unfortunately the machine-readable version is not, to my knowledge, freely available for academic research.

Substantial reformatting and re-organisation of an MRD is often necessary before it becomes a lexicon tractable for NLP work (Atwell 1987, Boguraev and Briscoe 1987, 1989, Wilks et al 1988), but this work undoubtedly saves time compared to compiling one's own lexicon of the same size. An example of the reformatted, bracketed LDOCE lexicon is shown in Figure 7.

Figure 7. Section of Lispified LDOCE.

((abate) (1 A0001600 !< a *80 bate) (3 E!"beIt) (5 v !