INVESTIGATING THE RELATIONSHIP BETWEEN GRAMMARS AND TREEBANKS FOR NATURAL LANGUAGES

Fei Xia

A DISSERTATION in Computer and Information Science

Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2000

Professor Martha Palmer and Professor Aravind Joshi
Supervisors of Dissertation

Gallier
Graduate Group Chair

COPYRIGHT
Fei Xia
2000

Abstract

INVESTIGATING THE RELATIONSHIP BETWEEN GRAMMARS AND TREEBANKS FOR NATURAL LANGUAGES

Fei Xia

Supervisors: Professor Martha Palmer and Professor Aravind Joshi

Grammars and Treebanks are both useful resources for NLP applications. Grammars can either be built by hand or extracted from Treebanks, whereas Treebanks require human annotation. In this proposal, we address several issues concerning grammar development, Treebank development, and the relation between grammars and Treebanks. The framework we chose for grammars is Lexicalized Tree Adjoining Grammar (LTAG), where a grammar consists of hundreds of elementary trees. We first introduce two systems which we have built to facilitate grammar development. The first one (called LexOrg) is aimed at solving problems in creating hand-crafted grammars, namely, the redundancy caused by the reuse of substructures in elementary trees and the lack of explicitly expressed generalizations over the trees. LexOrg takes several types of specifications as the input and combines them to automatically generate a grammar. The system can be further extended to include language-independent specifications that can be tailored to specific languages by eliciting linguistic information from native informants, thus partially automating the grammar development process. We have used the system to generate two LTAG grammars, one for English and the other for Chinese. The second system (called LexTract) extracts LTAGs from Treebanks and converts Treebanks into a format that can be used to train statistical parsers directly. In addition to creating Treebank grammars and training material for parsers and Supertaggers, LexTract can also detect certain types of annotation errors, which makes it a valuable tool for corpus annotation. Furthermore, LexTract can retrieve data from Treebanks which can be used to test theoretical linguistic hypotheses, such as the Tree-locality Hypothesis in the LTAG formalism. Finally, since the core of LexTract is totally language-independent, LexTract can be applied to various Treebanks in different languages. This provides quantitative support for an investigation into the universal versus stipulatory aspects of different languages.

Building a large high-quality Treebank is difficult. In the second part of the proposal, we address two major challenges we encountered while building a 100-thousand-word Treebank for Chinese. The first challenge is the creation of word segmentation, POS tagging, and bracketing guidelines. We describe our methodology for guideline preparation and the approaches we adopted for particular issues in each set of guidelines. The second challenge is ensuring annotation consistency and accuracy. In order to achieve high quality, we have used LexTract to detect annotation errors automatically.

We propose three tasks for the next stage. First, we plan to extend LexOrg to generate trees for modification and coordination relations. Second, we are going to apply LexTract to Treebanks for other languages. Third, we will exploit the possibilities of combining the strengths of both systems (LexOrg and LexTract) to produce high-quality grammars and Treebanks.

Contents

Abstract

1 Introduction
  1.1 Issues and approach
    1.1.1 Raw data ⇒ Grammars
    1.1.2 Grammars ⇔ Treebanks ⇒ NLP tools
    1.1.3 Raw data ⇒ Treebanks
    1.1.4 Linguistic analyses
  1.2 Chapter summaries

2 Overview of LTAG
  2.1 Basics of the LTAG formalism
    2.1.1 Elementary trees
    2.1.2 Two operations
    2.1.3 Derived trees and derivation trees
    2.1.4 Multi-anchor trees
    2.1.5 Feature structures
    2.1.6 Properties of LTAGs
  2.2 Multi-component TAGs (MC-TAGs)

3 Towards Semi-automatic Grammar Development
  3.1 Redundancy in LTAGs
    3.1.1 Tree templates and lexicon
    3.1.2 Tree families
    3.1.3 Substructures shared in templates
  3.2 LexOrg: an LTAG development system
    3.2.1 Input to the system: three types of specifications
    3.2.2 Tree generation from the specification
  3.3 Building grammars
  3.4 Eliciting language-specific information
  3.5 Comparison with other work
    3.5.1 Becker's Meta-rules
    3.5.2 DATR system
    3.5.3 Candito's use of partial tree descriptions
  3.6 Proposed work
    3.6.1 Building other trees
    3.6.2 Inferring language specifications from Treebanks

4 Extracting LTAGs from Annotated Corpora
  4.1 Overview of LexTract
  4.2 Overview of the English Penn Treebank
  4.3 The form of target LTAGs
  4.4 Treebank-specific information
    4.4.1 Tagset table
    4.4.2 Head percolation table
    4.4.3 Argument table
    4.4.4 Modification table
  4.5 Extracting an LTAG from a Treebank
    4.5.1 Stage 1: Fully bracketing the ttrees
    4.5.2 Stage 2: Building etrees
  4.6 Filtering out implausible etrees
  4.7 Converting ttrees to derivation trees
  4.8 Building multi-component tree sets
  4.9 Some special cases
    4.9.1 Coordination
    4.9.2 Empty categories
    4.9.3 Punctuation
    4.9.4 Predicative auxiliary trees
  4.10 Comparison with other work
    4.10.1 Srinivas' heuristic approach
    4.10.2 Neumann's lexicalized tree grammars
    4.10.3 Chen & Vijay-Shanker's approach
  4.11 Proposed work

5 Applications of LexTract
  5.1 Treebank grammars as stand-alone grammars
    5.1.1 Two Treebank grammars
    5.1.2 Coverage of Treebank grammars
  5.2 Treebank grammars being combined with other grammars
    5.2.1 Methodology
    5.2.2 Stage 1: Extracting templates from Treebanks
    5.2.3 Stage 2: Matching templates in the two grammars
    5.2.4 Stage 3: Classifying unmatched templates
    5.2.5 Stage 4: Combining two grammars
    5.2.6 Summary
  5.3 Treebank grammars as sources of CFGs
    5.3.1 Comparison with other CFG extraction algorithms
  5.4 Lexicons as training data for Supertaggers
  5.5 MC sets for testing the Tree-locality Hypothesis
    5.5.1 Stage 1: Finding "non-local" examples
    5.5.2 Stage 2: Classifying "non-local" examples
    5.5.3 Stage 3: Studying "non-local" constructions
  5.6 Proposed Work
    5.6.1 Training a statistical LTAG parser
    5.6.2 Error detection
    5.6.3 Language comparison

6 Chinese Penn Treebank
  6.1 Project Inception
  6.2 Annotation Process
    6.2.1 Word segmentation and POS tagging
    6.2.2 Syntactic bracketing
  6.3 Methodology for Guideline Preparation
  6.4 Segmentation Guidelines
    6.4.1 Notions of word
    6.4.2 An experiment
    6.4.3 Tests of wordness
  6.5 POS Tagging Guidelines
    6.5.1 Criteria for POS tagging
    6.5.2 Syntactic distribution and POS tagset
  6.6 Syntactic Bracketing Guidelines
    6.6.1 Representation scheme
    6.6.2 Syntactic constructions
    6.6.3 Ambiguities
  6.7 Quality Control
  6.8 Proposed work

7 Summary and Proposed Work
  7.1 Summary
  7.2 Proposed Work

Bibliography

List of Tables

3.1 Major features of English and Chinese grammars
3.2 Settings for relative clauses in four languages
4.1 Treebank tags which appear in this chapter
4.2 Algorithm for finding the head-child of a node
4.3 Algorithm that marks a node as either argument or modifier
4.4 Algorithm for fully bracketing ttrees
4.5 Algorithm for building etrees from a fully bracketed ttree
4.6 Algorithm for building derivation trees
4.7 Algorithm for determining whether the coindexation between a pair of nodes is tree-local
5.1 Tags in PTB are converted to tags used in the XTAG grammar
5.2 Two extracted grammars from PTB
5.3 The top 40 words with the highest numbers of Supertags in G2
5.4 The types of unknown (word, template) pairs in PTB section 23
5.5 Matched templates and their frequencies
5.6 Matched templates when certain annotation differences are disregarded
5.7 Classifications of 289 unmatched templates
5.8 CFGs derived from extracted LTAGs
5.9 Supertagging results based on three different conversion algorithms
5.10 Numbers of tree sets and their frequencies in PTB
5.11 Classification of 999 extended MC sets that look non-tree-local
6.1 Comparison of word segmentation results from seven groups
6.2 A method for creating POS guidelines

List of Figures

1.1 Relations between grammars, Treebanks and NLP tools
1.2 The new relations between grammars, Treebanks and NLP tools after adding LexOrg and LexTract
2.1 The substitution operation
2.2 The adjunction operation
2.3 Elementary trees, derived tree and derivation tree for underwriters still draft policies
2.4 Two derivation trees for a derived tree
2.5 Multi-anchor trees
2.6 The substitution operation with features
2.7 The adjunction operation with features
2.8 Features for the subject-verb agreement
2.9 Cross-serial dependencies in Dutch
2.10 Trees for the wh-question what does John like
2.11 Trees for the wh-question what does Mary think Mike believes John likes
2.12 Tree-local MC-TAG
3.1 Elementary trees, templates and lexicon
3.2 A tree family
3.3 The framework of the system
3.4 Two subcategorization frames for the verb buy
3.5 The passive LRR
3.6 Some subcategorization blocks selected by the frame (NP0 V NP1)
3.7 Transformation block for extraction
3.8 The possible meta-blocks for relative clause
3.9 The blocks for relative clauses in four languages
3.10 Templates for modification and coordination
4.1 Architecture of LexTract
4.2 The Treebank annotation for the sentence Supply troubles were on the minds of Treasury investors yesterday, who worried about the flood.
4.3 Two LTAGs which can generate the same ttree
4.4 The three forms that extracted etrees should have
4.5 The notions of head in X-bar theory, GB-theory and our target grammar
4.6 Lexical items percolate from heads to higher projections
4.7 The effect of fully bracketing a ttree
4.8 The etree set is a partition of the ttree
4.9 The extracted etrees from the ttree
4.10 A frequent, incorrect etree template
4.11 The LTAG derivation trees for the sentence
4.12 The ttree as a derived tree
4.13 Etrees for co-indexed constituents
4.14 The coindexation between two nodes may or may not be tree-local
4.15 Two ways to handle a coordinated VP in the sentence John bought a book and read it four times
4.16 Handling a sentence with ellipsis: (Mary came yesterday.) John did too
4.17 Handling a sentence with wh-movement from adjunct positions
4.18 Etrees with punctuation marks
4.19 A sentence with quotation marks
4.20 An example in which a predicate auxiliary tree should be factored out: the person who Mary believed bought the book
4.21 Two alternatives for the verb believed when there is no long-distance movement
4.22 The etree for gerund in the XTAG grammar
4.23 An example where a predicate auxiliary tree should not be factored out: the person who believed Mary bought the book
5.1 Template types and template tokens in G1
5.2 The trees for pure intransitive verbs and ergative verbs in XTAG t-match the tree for all intransitive verbs in G2
5.3 Templates in XTAG with expanded subtrees t-match the one in G2 when the expanded subtrees are disregarded
5.4 The compound trees in XTAG t-match the one in G2 when the expanded subtrees in the former are disregarded
5.5 An example of c-match
5.6 Templates for adjectives modifying nouns
5.7 CFG rules derived from an etree
6.1 The first phase: segmentation and POS tagging
6.2 The second phase: bracketing and data release
6.3 Words, POS tagset and positions
6.4 Three alternatives for defining the mappings f1 and f2
6.5 Accuracy and inter-annotator consistency during the second pass
7.1 The relations between grammars, Treebanks and NLP tools
7.2 The new relations with the links provided by LexOrg and LexTract
7.3 The new relations after adding the proposed extensions of LexOrg and LexTract

Chapter 1

Introduction

With the rapid growth of the information highway and the global economy, a vast amount of information is available from numerous resources in different languages. This provides a golden opportunity for the NLP field to apply its techniques to various applications. In many of these applications, such as machine translation, parsing, and text generation, grammars and Treebanks are very useful. Grammars (including lexicons) provide a concise representation of natural language, and they include phonological, morphological, syntactic, and even semantic information. Treebanks, which are collections of naturally-occurring text with phrase structures, provide data for deriving statistical information that can be used to train various NLP tools, such as part-of-speech (POS) taggers and parsers.

The relations between raw data, grammars, Treebanks and NLP tools (such as parsers) are shown in Figure 1.1. According to this diagram, grammars can be divided into two types: hand-crafted grammars and Treebank grammars. Hand-crafted grammars are manually built by grammar developers, who analyze constructions in raw data and incorporate their analyses as rules in the grammars. Treebank grammars are automatically extracted from Treebanks. The extraction process takes little human effort, as the burden of linguistic analysis and quality control is shifted from grammar developers to Treebank developers. A Treebank can be solely annotated by human annotators, or be processed by some NLP tools (such as POS taggers and parsers) first and then corrected by annotators. Dashed lines in Figure 1.1 indicate connections that may or may not exist for some particular systems.

For instance, an NLP parser may use a grammar, a Treebank, raw data, or any combination of these three. Grammars can be built on various frameworks, such as Context-free Grammars (CFGs), Head-driven Phrase Structure Grammars (HPSGs), and Categorial Grammars (CGs). The framework we use is Lexicalized Tree-adjoining Grammar (LTAG), a brief introduction of which can be found in Chapter 2.

Figure 1.1: Relations between grammars, Treebanks and NLP tools

In this proposal, we will address several issues concerning this diagram:

raw data ⇒ grammars: Given a collection of raw data, how should the data be analyzed to make a grammar? What should a grammar look like, and how can it be built in a consistent and efficient way?

grammars ⇔ Treebanks: Given a Treebank, how can a Treebank grammar be extracted from it? In the opposite direction, could grammars help Treebanks in any positive way? Also, if, for a particular language such as English, both hand-crafted and Treebank grammars exist, how can these two types of grammars be compared? How can they be combined to generate a new grammar with better quality and wider coverage?

Treebanks ⇒ NLP tools: Certain NLP tools, such as LTAG parsers, cannot use Treebanks directly. Instead, the phrase structures in Treebanks must first be converted to a format these NLP tools can use. What kind of information in a Treebank should be converted, and how should the conversion be done?

raw data ⇒ Treebanks: Two of the major challenges in building a large high-quality Treebank are the preparation of guidelines and quality control. What should be included in the guidelines? And what steps should be taken to ensure the Treebank is of high quality and consistency?

To tackle these issues, we have built two systems, LexOrg and LexTract. As can be seen from the diagram below, LexOrg and LexTract play a central role in defining precisely the relationships between Treebanks and grammars and bringing them closer together. LexOrg takes a set of language specifications and automatically generates an LTAG, thus reducing the redundancy which is common in large-scale LTAGs. The second system, LexTract, has several functions: first, it extracts LTAGs from Treebanks; second, it converts Treebanks into a form that can be used to train LTAG parsers and Supertaggers (Srinivas, 1997) directly; third, it decomposes rules in Treebank grammars into a set of language specifications, which in turn can be used by LexOrg to generate an LTAG. In addition to these two systems, we have also built both a 100K-word Chinese Treebank and a hand-crafted grammar for Chinese independently. Each of these issues and our approach is described in more detail below.

Figure 1.2: The new relations between grammars, Treebanks and NLP tools after adding LexOrg and LexTract


1.1 Issues and approach

1.1.1 Raw data ⇒ Grammars

The first issue is grammar development: what should a grammar look like, and how can it be built (or extracted from a corpus) in a consistent and efficient way? Until recently, most LTAG-based grammars have been hand-crafted. An LTAG normally consists of hundreds of elementary trees. As the size of the grammar grows, developing and maintaining those grammars manually faces two major problems. First, the reuse of substructures in many elementary trees creates redundancy: to make certain changes in the grammar, all the related trees have to be manually checked. The process is inefficient and cannot guarantee consistency (Vijay-Shanker and Schabes, 1992). Second, the generalization over elementary trees is not expressed explicitly. As a result, from the grammar itself (i.e., a set of thousands of elementary trees), it is difficult to grasp the characteristics of a particular language, to compare languages, and to build a grammar for a new language given existing grammars for other languages.

We have developed a system, called LexOrg, which is aimed at solving those problems. It requires the grammar developers to state the linguistic information explicitly in the form of three types of specifications: subcategorization frames, lexical redistribution rules (LRRs), and partial tree descriptions. To produce the grammar, our system takes those specifications as the input and combines them to automatically generate the elementary trees. The system can be further extended to include language-independent specifications that can be tailored to specific languages by eliciting linguistic information from native informants, thus partially automating the grammar development process. We have used the system to develop two LTAG grammars, one for English and the other for Chinese.
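To give a flavor of how one type of specification operates, the sketch below applies a passive lexical redistribution rule to a transitive subcategorization frame. This is a toy version written for exposition, not LexOrg's actual input format; the frame and rule are simplified to their core effect.

    # A minimal sketch of applying a lexical redistribution rule (LRR) to a
    # subcategorization frame.  A frame is a tuple of argument/anchor symbols;
    # this simplified passive LRR drops the subject (NP0) and promotes the
    # object (NP1) to subject position.

    def passive_lrr(frame):
        """Map an active frame such as (NP0 V NP1) to its passive
        counterpart (NP1 V); return None if the rule does not apply."""
        if frame == ("NP0", "V", "NP1"):
            return ("NP1", "V")
        return None

    print(passive_lrr(("NP0", "V", "NP1")))   # -> ('NP1', 'V'), e.g. for 'buy'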

1.1.2 Grammars ⇔ Treebanks ⇒ NLP tools

Grammars can be roughly divided into two types according to the way they are constructed: hand-crafted grammars are built either manually or semi-automatically using a grammar development system such as LexOrg, whereas Treebank grammars are extracted automatically from Treebanks. With respect to NLP tasks such as parsing, each type of grammar has its strengths and weaknesses.

Hand-crafted LTAGs use rich representations (such as feature structures) and tend to be precise. However, using this kind of grammar for statistical supervised parsing faces a major challenge; namely, it requires a grammar-dependent corpus, which is annotated according to the elementary trees in the grammar. Building such a corpus is not only time-consuming and labor-intensive,[1] but also difficult to maintain, because every time the grammar is revised, the annotation of the corpus has to be modified as well. Given the fact that most statistical parsers (e.g., (Collins, 1997; Goodman, 1997; Charniak, 1997; Ratnaparkhi, 1998)) have already been trained and tested on Treebanks (such as the Penn Treebank (Marcus et al., 1994)) which are not annotated based on a particular hand-crafted grammar, it is difficult to compare the performance of those parsers with an LTAG parser trained on a grammar-dependent corpus, because the training and testing data are different.

[1] In general, there are two ways to build such a corpus. One is to use an LTAG parser to generate all possible parses for each sentence in the corpus, and then have linguistic experts select the correct one from hundreds of parses. Another way is to have linguistic experts who are very familiar with the grammar manually create a parse tree for each sentence in the corpus. Both ways are costly and time-consuming.

Treebank grammars, on the other hand, have certain advantages over hand-crafted grammars. Firstly, given a Treebank and an extraction tool, Treebank grammars can be extracted with little human effort.[2] Secondly, the source Treebank is not directly grammar-dependent: different parsers can be trained and tested on the same corpus and their performances can be easily compared. Thirdly, various extraction methods can extract Treebank grammars based on certain assumptions about the forms of the target grammars and particular interpretations of the annotations in the Treebanks. Changing those assumptions and interpretations, which can be done by simply modifying a few lines of code or a couple of tables, and re-running the extraction tools on the same corpus can yield different Treebank grammars. This flexibility allows different grammars to be extracted and compared with respect to their usefulness to grammar developers and statistical parsers. Fourthly, a Treebank grammar is guaranteed to cover the Treebank from which it is extracted, and given a Treebank grammar and a new Treebank, it is easy to calculate how much of that Treebank is covered by the grammar (a sketch of this calculation follows the task list below). However, a Treebank grammar may include incorrect structures due to annotation errors in the original Treebank. It also lacks sophisticated representations such as the feature structures used in hand-crafted LTAGs. The central question is how to combine the strengths of both types of grammar and produce a new grammar with better coverage and higher quality.

[2] Notice that much effort went into the hand annotation of the Treebank in the first place.

We have developed a system (called LexTract) which extracts LTAGs from Treebanks and also converts Treebanks into a format that can be used to train statistical LTAG parsers directly. After an introduction to the system and the extraction algorithms, we present five types of tasks for LexTract and some experimental results on the first three tasks:

- LexTract can be used to extract LTAGs and CFGs from Treebanks. We will also report the preliminary results of combining a Treebank grammar extracted from the Penn English Treebank with the XTAG grammar (XTAG-Group, 1998), which is a hand-crafted large-scale English grammar.

- LexTract produces lexicons and derivation trees which are useful for training Supertaggers and parsers. We have re-trained Srinivas' Supertagger (Srinivas, 1997) with satisfactory results.

- LexTract can retrieve data from Treebanks which can be used to test theoretical linguistic hypotheses, such as the Tree-locality Hypothesis in the LTAG formalism (Xia and Bleam, 2000).

- LexTract can detect certain types of annotation errors, and this makes it a valuable tool for corpus annotation. We are in the process of using LexTract in our Chinese Penn Treebank Project (Xia et al., 2000) for the final clean-up.

- The core of LexTract is totally language-independent, and therefore it can be applied to various Treebanks in different languages. This provides quantitative support for an investigation into the universal versus stipulative aspects of different languages.
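As noted above, the coverage of a new Treebank by a Treebank grammar is easy to compute once extraction has been done. A minimal sketch, assuming templates are represented by names and that the extraction step (LexTract's job, not shown) has already produced the grammar's template types and the new Treebank's template tokens; the template names below are hypothetical:

    # Coverage = fraction of template tokens in the new Treebank whose
    # template type appears in the extracted grammar.

    def coverage(grammar_templates, treebank_tokens):
        covered = sum(1 for t in treebank_tokens if t in grammar_templates)
        return covered / len(treebank_tokens)

    grammar = {"alpha_NXN", "alpha_nx0Vnx1", "beta_vxARBvx"}
    tokens = ["alpha_NXN", "alpha_nx0Vnx1", "alpha_NXN", "beta_sPUs"]
    print(coverage(grammar, tokens))   # -> 0.75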

1.1.3 Raw data ⇒ Treebanks

Treebanks are a necessary prerequisite for running LexTract and any supervised NLP tools such as POS taggers and parsers. Building a large high-quality Treebank is a difficult task. In the past two years, we have been developing a 100-thousand-word Treebank for Chinese. During the process, we have encountered many challenges. We will address two major challenges and our responses to them.

Guideline preparation

The first challenge is the creation of word segmentation, POS tagging and bracketing guidelines. The task is especially hard for Chinese due to the nature of the language. First, Chinese writing does not have a natural delimiter between words, and the notion of word is very hard to define. Second, Chinese has very little, if any, inflectional morphology. Words are not inflected with number, gender, case, or tense. This fuels the debate in Chinese NLP communities on whether the POS tags should be based on meaning or on syntactic distribution. Third, there are many open questions in Chinese syntax. To further complicate the situation, Chinese, like any other language, is under constant change. With its long history, a seemingly homogeneous phenomenon in Chinese (such as the long and short bei-constructions) may be, in fact, a set of historically related but syntactically independent constructions (Feng, 1998). Last, Chinese is widely spoken in areas as diverse as China, Hong Kong, Taiwan, and Singapore. There is a growing body of research in Chinese natural language processing, but little consensus on linguistic standards along the lines of the EAGLES initiative in Europe. In Chapter 6, we describe our methodology for guideline preparation. We also address particular issues in each set of guidelines and the approaches we adopted.

Quality control

The second challenge to providing a high-quality Treebank is ensuring annotation consistency and accuracy. Carefully documented guidelines, linguistically trained annotators, and annotator support tools are prerequisites to creating a high-quality corpus with acceptable production rates. In addition, we also adopt several measures to ensure annotation quality:

- The annotation is divided into two phases: a sentence is segmented into words and marked with POS tags in the first phase, and bracketing structures are added in the second phase. In each phase, the whole Treebank is annotated at least twice, with the second annotator correcting the output of the first annotator.

- A lexicon of (word, POS tag) pairs is compiled out of the Treebank and manually checked. This process has revealed many POS tagging errors.

- About 22% of the corpus is double re-annotated.[3] The results of the double re-annotation are compared, and any discrepancies are carefully examined and the annotation is revised. The corrected, reconciled annotation is considered the Gold Standard, and each of the two original re-annotations is then compared with it and with the other, to provide a measure of individual annotator accuracy and inter-annotator consistency (see the sketch after this list).

- We are running LexTract on the whole corpus, which has detected additional errors that were missed by other methods.

[3] Double re-annotation means both annotators re-annotate the same files from the output of the first pass.
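The accuracy and consistency measures from the double re-annotation step can be sketched as follows. This is a toy token-level version; the actual evaluation also covers word segmentation and bracketing structures, and the tag sequences below are hypothetical (using Chinese Treebank POS tags):

    # Token-level agreement between two aligned annotations of the same file.

    def agreement(seq1, seq2):
        same = sum(1 for a, b in zip(seq1, seq2) if a == b)
        return same / len(seq1)

    gold  = ["NN", "VV", "NN", "DEC"]   # reconciled Gold Standard
    ann_a = ["NN", "VV", "NN", "DEG"]   # annotator A's re-annotation
    ann_b = ["NN", "AD", "NN", "DEC"]   # annotator B's re-annotation

    print(agreement(ann_a, gold))    # annotator A's accuracy      -> 0.75
    print(agreement(ann_b, gold))    # annotator B's accuracy      -> 0.75
    print(agreement(ann_a, ann_b))   # inter-annotator consistency -> 0.5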

1.1.4 Linguistic analyses

Previously, we introduced two methods for grammar development: one (LexTract) extracts grammars automatically from bracketed corpora; the other (LexOrg) semi-automatically builds grammars with input from native speakers. Both will speed up the grammar development process. Nevertheless, they cannot replace the linguistic study of the language. Running LexTract to extract grammars requires a high-quality Treebank. As just mentioned, one challenge in building such a Treebank is the creation of a set of annotation guidelines. The process of creating guidelines is very similar to manually creating a grammar, in the sense that the guideline designers (or the grammar developers) have to decide what analyses should be adopted for linguistic phenomena, based on their knowledge of morphology and syntax. For LexOrg, grammar developers need to study the input from the native speaker and decide what kinds of descriptions should be used for each phenomenon.

In the past few years, we have been developing a Chinese LTAG grammar using both LexOrg and LexTract. In Chapter 7, we will introduce some syntactic constructions in Chinese, such as the ba-construction and the bei-construction. We will describe the analysis we adopt for each construction and briefly explain the rationale behind each chosen analysis. Also included is a classification of verbs according to the syntactic frames in which they can occur. In the final thesis, we will also provide treatments for word classes other than verbs.

1.2 Chapter summaries

The proposal is organized as follows:

Chapter 2: In this chapter, we introduce the LTAG formalism and its main properties. We discuss the relevance of those properties to the representation of natural languages. We also introduce an extension of the formalism, namely, multi-component TAGs, which will be used in later chapters.

Chapter 3: In this chapter, we present the grammar development system (called LexOrg). The system takes three types of specifications of a language (namely, subcategorization frames, lexical redistribution rules, and blocks), and automatically generates an LTAG grammar by combining these specifications. The system can be further extended to include language-independent structures that can be tailored to specific languages by eliciting language-specific information from native speakers, thus partially automating the grammar development process. So far, we have used the system to build a grammar for English and another for Chinese. We also compare our approach with other related work including DATR (Evans et al., 1995), Candito's system (Candito, 1996) and meta-rules (Becker, 1994).

Chapter 4: In this chapter, we present the system (called LexTract) which extracts an LTAG grammar from an existing Treebank and converts the Treebank into a format that can be used to train statistical parsers directly. We compare the system with other grammar extraction algorithms including Srinivas' heuristic approach (Srinivas, 1997), Neumann's lexicalized tree grammars (Neumann, 1998), and Chen & Vijay-Shanker's algorithms (Chen and Vijay-Shanker, 2000).

Chapter 5: In this chapter, we present several types of applications for LexTract and related experimental results. First, we report the results of comparing a Treebank grammar with the XTAG grammar. Second, we have re-trained Srinivas' Supertagger and compared the results with the results based on other extraction algorithms. Last, we show that LexTract can retrieve data from Treebanks to test theoretical linguistic hypotheses, such as the Tree-locality Hypothesis in the LTAG formalism.

Chapter 6: In this chapter, we discuss two challenges in creating a Chinese Treebank, namely, creating annotation guidelines and ensuring annotation accuracy, and our responses to them.

Chapter 7: This chapter will summarize the work we propose for the next stage. There are mainly three tasks: the first one is to extend LexOrg to generate modification trees and coordination trees; the second one is to run the experiments described in Chapter 5 on two other Treebanks, one for Chinese and the other for Korean, and compare the results from these three Treebanks; the last one is to develop a language model for a statistical LTAG parser, train and test the parser with the converted data provided by LexTract, and compare the results with other state-of-the-art parsers.


Chapter 2

Overview of LTAG

There are various grammar frameworks proposed for natural languages: Context-free Grammars (CFGs), Head Grammars (HGs), Head-driven Phrase Structure Grammars (HPSGs), Combinatory Categorial Grammars (CCGs), and so on. For a discussion of the relations among these formalisms, see (Weir, 1988; Kasper et al., 1995), among others. We take Lexicalized Tree-adjoining Grammars (LTAGs) as representative of a class of lexicalized grammars. LTAG is appealing for representing various phenomena in natural languages due to its linguistic and computational properties. In the last decade, LTAG has been used in various aspects of natural language understanding (e.g., parsing (Schabes, 1990; Srinivas, 1997), semantics (Joshi and Vijay-Shanker, 1999; Kallmeyer and Joshi, 1999; Kipper et al., 2000) and discourse (Webber and Joshi, 1998; Webber et al., 1999)) and a number of NLP applications (e.g., machine translation (Palmer et al., 1998), information retrieval (Chandrasekar and Srinivas, 1997), generation (Stone and Doran, 1997; McCoy et al., 1992), and summarization (Baldwin et al., 1997)).

In this chapter, we introduce the LTAG formalism and its main properties. Due to the large amount of work based on the LTAG formalism, our introduction is not intended to be comprehensive. Instead, we will focus on the aspects that are most relevant to this thesis. For a more comprehensive discussion of the formalism, see (Joshi et al., 1975; Joshi, 1985; Joshi, 1987; Joshi and Schabes, 1997). This chapter is organized as follows. Section 2.1 gives a brief introduction to the basic LTAG formalism. Section 2.2 introduces an extension of the LTAG formalism, namely, Multi-component TAG (MC-TAG).

Figure 2.1: The substitution operation

Figure 2.2: The adjunction operation

2.1 Basics of the LTAG formalism

LTAGs are based on the Tree Adjoining Grammar (TAG) formalism developed by Joshi, Levy, and Takahashi (Joshi et al., 1975; Joshi and Schabes, 1997).

2.1.1 Elementary trees

The primitive elements of an LTAG are elementary trees. An LTAG is lexicalized, as each elementary tree is associated with at least one lexical item (called the anchor of the tree) on its frontier. The elementary trees are minimal in the sense that all and only the arguments of the anchor are encapsulated in the tree. The elementary trees of LTAG possess an extended domain of locality: the grammatical constraints are stated over the elementary trees, and are independent of all recursive processes. There are two types of elementary trees: initial trees and auxiliary trees. Each auxiliary tree has a unique leaf node, called the foot node, which has the same label as the root. In both types of trees, leaf nodes other than anchors and foot nodes are called substitution nodes.

2.1.2 Two operations

Elementary trees are combined by two operations: substitution and adjunction. In the substitution operation (Figure 2.1), a substitution node in an elementary tree is replaced by another elementary tree whose root has the same label as the substitution node. In an adjunction operation (Figure 2.2), an auxiliary tree is inserted into another elementary tree. The root and the foot nodes of the auxiliary tree must match the node label at which the auxiliary tree adjoins. The resulting structure of the combined elementary trees is called a derived tree. The combination process is recorded as a derivation tree.

Figure 2.3: Elementary trees, derived tree and derivation tree for underwriters still draft policies

In Figure 2.3, the four elementary trees in (a) are anchored by words in the sentence underwriters still draft policies. α1, α2 and α3 are initial trees, while β1 is an auxiliary tree. Foot and substitution nodes are marked by * and ↓, respectively. To parse the sentence, α1 and α3 substitute into the nodes NP0 and NP1 in α2 respectively, and β1 adjoins to the VP node in α2, thus forming the derived tree in (b). The arrows between the elementary trees illustrate the combining process, which is also expressed in the derivation tree in (c).[1]

[1] In a derivation tree, a dashed line is used for an adjunction operation and a bold line for substitution. The number within square brackets is the address of the node at which the substitution/adjunction operation took place. The number is useful when there is more than one node with the same label in an elementary tree. For the sake of simplicity, from now on we will drop these numbers from derivation trees.
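To make the two operations concrete, here is a minimal Python sketch that rebuilds the derived tree of Figure 2.3. The node representation is our own simplification for exposition; feature structures, node addresses, and the derivation-tree record are omitted:

    # Nodes carry a label, children, and (for anchors) a word.

    class Node:
        def __init__(self, label, children=None, word=None):
            self.label, self.children, self.word = label, children or [], word

    def substitute(subst_node, initial_root):
        """Replace a substitution node by an initial tree with the same root label."""
        assert subst_node.label == initial_root.label and not subst_node.children
        subst_node.children, subst_node.word = initial_root.children, initial_root.word

    def adjoin(node, aux_root, foot):
        """Adjoin an auxiliary tree at `node`: the subtree below `node`
        moves under the auxiliary tree's foot node."""
        assert node.label == aux_root.label == foot.label
        foot.children, foot.word = node.children, node.word
        node.children, node.word = aux_root.children, aux_root.word

    def show(n):
        if n.word:
            return f"({n.label} {n.word})"
        return "(" + n.label + " " + " ".join(show(c) for c in n.children) + ")"

    # alpha2, anchored by 'draft'; beta1, anchored by 'still', with foot VP*.
    np0, np1, foot = Node("NP"), Node("NP"), Node("VP")
    vp = Node("VP", [Node("V", word="draft"), np1])
    alpha2 = Node("S", [np0, vp])
    beta1 = Node("VP", [Node("ADVP", [Node("ADV", word="still")]), foot])

    substitute(np0, Node("NP", [Node("N", word="underwriters")]))
    substitute(np1, Node("NP", [Node("N", word="policies")]))
    adjoin(vp, beta1, foot)
    print(show(alpha2))
    # (S (NP (N underwriters)) (VP (ADVP (ADV still)) (VP (V draft) (NP (N policies)))))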

2.1.3 Derived trees and derivation trees

Unlike in Context-free Grammars, the derived trees and derivation trees in the LTAG formalism are not identical; that is, several derivation trees may produce the same derived tree.

NP N

S

α : 2

α3 :

S VP

NP V

VP

NP N

V

VP

can John

John

α (can) 2

VP V α 1 (John)

VP

can

(d) derivation tree for G1

(b) G 1

V α 1: swim

α 3 (swim)

swim

α4: NP

α 4 (swim)

β1 : S

VP

(a) derived tree N

NP

John

VP

V

V

can

α 1 (John)

β1(can)

VP* (e) derivation tree for G2

swim (c) G 2

Figure 2.4: Two derivation trees for a derived tree tree. For instance, in Figure 2.4, the derived tree in (a) can be produced by combining either the elementary trees in (b) or the ones in (c). The corresponding derivation trees are shown in (d) and (e). This property becomes relevant in Chapter 4, where we will discuss a system which, given derived trees, automatically constructs elementary trees and derivation trees.

2.1.4 Multi-anchor trees An elementary tree is normally anchored by a unique lexical item, but multi-anchor trees are used in a number of cases. Two of them are shown in Figure 2.5(a) and 2.5(b). The rst one is for idioms such as kick the bucket, which means someone dies, and the second one is for expressions with light verbs, such as take a walk.2 In each case, the multi-anchors form the predicate. By having multi-anchors, each tree can be associated with semantic representations directly, which is an advantage of the LTAG formalism. Notice that the sentence He kicked the bucket now will have two correct parses, one for the idiomatic meaning, the other for the literal meaning. Since only a handful elementary trees in a LTAG have multi-anchors, from now on, we will assume each tree has a unique anchor unless speci ed otherwise. Notice that the determiner in Figure 2.5(a) is a co-anchor, whereas it is not part of the tree in Figure 2.5(b). This is because in the former the determiner has to be present and it has to be the word the, whereas in the latter the determiner is optional, and it can be any determiner (e.g. take a walk, take several walks). 2

14

NP0

NP 0

VP V

NP 1

kick

S

S

S

D

N

the

V

NP1

take

N

bucket

(a) idioms sem: die(NP0 )

NP0

VP

VP V

NP 1

kick/take

walk (b) light verbs

(c) transitive verbs

sem: walk(NP0 )

sem: kick(NP0, NP1 ) take(NP0, NP1 )

Figure 2.5: Multi-anchor trees tr br

Y

X

X

=> t

Y

Y

t U tr br

Figure 2.6: The substitution operation with features

2.1.5 Feature structures

In an elementary tree, a feature structure is associated with each node (Vijay-Shanker, 1987). This feature structure contains information about how the node interacts with other nodes in the tree. It consists of a top part and a bottom part. When two elementary trees are combined, the feature structures of corresponding nodes from the two trees are unified, as shown in Figures 2.6 and 2.7. For a derived tree to succeed, the top and bottom features for each node in the tree must unify. Features are used to specify linguistic constraints.

Figure 2.7: The adjunction operation with features

Figure 2.8: Features for the subject-verb agreement

For instance, in Figure 2.8 an <agr> feature is introduced to enforce subject-verb agreement in English. Let X.t (X.b, resp.) denote the top (bottom, resp.) feature structure of the node X. A feature equation such as X.t:<f1> = Y.b:<f2> means the feature f1 in X.t and the feature f2 in Y.b must have the same value. In α1 and α3, N.t:<agr> = NP.b:<agr>, as both are marked with the same index. Similarly, the <agr> feature propagates from V.t to VP.b in α2, and NP0.t:<agr> = VP.t:<agr>. Given a grammar which consists of only these three trees, it will correctly accept the sentence John likes dogs and reject the sentence Dogs likes John, because for the latter sentence the value of NP0.t:<agr> in α2 is 3rdsg (third person singular), which comes from the verb likes, and the value of NP0.b:<agr> is 3rdpl (third person plural), which comes from the subject dogs. As a result, the unification of the top and bottom features of NP0 fails, indicating that the subject-verb agreement constraint has been violated. From now on, we will not show the feature structures in elementary trees unless it is necessary.
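A minimal sketch of the unification step, using flat feature dictionaries; real LTAG feature structures are re-entrant and recursive, so this covers only the atomic-value case:

    # Unification of two flat feature dictionaries; fails (returns None)
    # on a clash of atomic values.

    def unify(f, g):
        out = dict(f)
        for key, val in g.items():
            if key in out and out[key] != val:
                return None
            out[key] = val
        return out

    # Subject-verb agreement as in Figure 2.8, for *'Dogs likes John':
    np0_top    = {"agr": "3rdsg"}   # propagated from the verb 'likes'
    np0_bottom = {"agr": "3rdpl"}   # propagated from the subject 'dogs'
    print(unify(np0_top, np0_bottom))   # -> None: the sentence is rejected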

2.1.6 Properties of LTAGs

LTAG is a constrained mathematical formalism, not a linguistic theory. It is appealing for representing various phenomena, especially syntactic phenomena, in natural languages due to its linguistic and computational properties, some of which are listed below:

- Lexicalized grammar: A grammar in the LTAG formalism is fully lexicalized in the sense that lexical items are associated with a list of elementary trees in the grammar. It has generally been agreed that lexicalized grammars are preferred over non-lexicalized grammars for NLP tasks such as parsing (Schabes, 1990).

- Extended Domain of Locality (EDL): Every elementary tree must contain all and only the arguments of the anchor; thus, elementary trees provide extended locality over which the syntactic and semantic constraints can be specified. This is in contrast with CFGs, where the arguments of the predicate may appear in separate rules. For example, the subject and the object of a transitive verb appear in two rules: S → subject VP and VP → V object.

- Generative capacity: Recent research (Bresnan et al., 1982; Higginbotham, 1984; Shieber, 1984) has found that CFG is not expressive enough to handle a number of syntactic constructions. In this respect, LTAG is appealing because it is more powerful than CFG, but only "mildly" so (Joshi et al., 1975; Joshi, 1985; Joshi, 1987). For instance, as shown in (Joshi, 1985), LTAG, but not CFG, can handle cross-serial dependencies in Dutch. An example of cross-serial dependencies and its treatment in LTAG are shown in Figure 2.9.[3]

- Constrained formalism: LTAG is a constrained mathematical formalism. As Kroch (1987) argues, "the exploration of restrictive mathematical formalisms as metalanguages for natural language grammars can produce results of value in empirical linguistics."

- Polynomial parsability: The parsing time for an LTAG is O(n^6), where n is the sentence length.

Figure 2.9: Cross-serial dependencies in Dutch

[3] Both the example and the LTAG in Figure 2.9 come from (Joshi, 1985). We changed the grammar slightly to be consistent with current conventions.

Figure 2.10: Trees for the wh-question what does John like

2.2 Multi-component TAGs (MC-TAGs)

A number of extensions of LTAG have been proposed to handle constructions that cause problems for basic LTAG. In this section, we discuss one of them, namely Multi-component TAG (MC-TAG), as it will be used in Chapter 5. MC-TAGs were proposed to handle various types of syntactic movement that are difficult for basic LTAG. In basic LTAG, syntactic movement is restricted to a single elementary tree; that is, the filler and the gap should appear in the same elementary tree. For example, in Figure 2.10, the filler NP and the gap NP1 are both in α3. Substituting α1 into NP and α2 into NP0 and adjoining β1 into S will yield the correct parse for the simple wh-question what does John like. A sentence with long-distance movement, such as what does Mary think Mike believes John likes, is handled similarly, as shown in Figure 2.11. This type of movement is "unbounded" in the sense that there can be an unbounded number of verbs between the filler and the gap in the sentence, but the treatment in the LTAG formalism is still elementary-tree-bound in the sense that the gap and the filler are in the same elementary tree.

This analysis runs into problems when the constituent is extracted from an adjunct. In Figure 2.12(a), the NP trace and the filler NP cannot be in the same elementary tree, because the PP (the parent of the NP trace) is a modifier of the VP and therefore should not be part of the elementary tree for the verb stay. An MC-TAG is required to handle such cases. MC-TAG extends basic LTAG so that the adjunction operation is defined on elementary tree sets rather than on single trees.

Figure 2.11: Trees for the wh-question what does Mary think Mike believes John likes

Figure 2.12: Tree-local MC-TAG

Weir (1988) gives four ways of defining the adjunction operation. Two of them are commonly used in the literature. One is called tree-local MC-TAG, which requires the elementary trees in a multi-component set (MC set) to be adjoined into distinct nodes of a single elementary tree. The other is set-local MC-TAG, which requires the elementary trees in an MC set to be adjoined into distinct nodes of trees in another MC set. The extraction from a PP in Figure 2.12 can be easily handled by tree-local MC-TAG, where β1 and β2 form an MC set and both adjoin to nodes in α1, forming the derived tree in (a). Weir (1988) has shown that tree-local MC-TAG does not change the generative capacity of the LTAG formalism, while set-local MC-TAG does. The former has been used for wh-movement and extraposition (Kroch, 1989), while the latter is used to handle clitic climbing in Spanish (Bleam, 1994).
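The tree-locality condition itself is easy to state operationally. A minimal sketch, assuming each use of an MC set is recorded as a (component, host elementary tree, host node) triple:

    # An MC set is used tree-locally if all of its components attach to
    # distinct nodes of one and the same elementary tree.

    def is_tree_local(attachments):
        hosts = {host for (_, host, _) in attachments}
        nodes = [(host, node) for (_, host, node) in attachments]
        return len(hosts) == 1 and len(set(nodes)) == len(nodes)

    # Figure 2.12: beta1 and beta2 both adjoin into alpha1 -> tree-local.
    print(is_tree_local([("beta1", "alpha1", "S"),
                         ("beta2", "alpha1", "VP")]))   # -> True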

To summarize this chapter, LTAG is a tree-generating system. It is appealing for the representation of various phenomena in natural languages due to its linguistic and computational properties. An LTAG consists of a set of elementary trees, which are combined via two operations (substitution and adjunction) to form a parse tree (derived tree) for a sentence. To handle various phenomena in natural language, a number of extensions of basic LTAG have been proposed. MC-TAG is one of them, and is aimed at handling various types of syntactic movement.

Chapter 3

Towards Semi-automatic Grammar Development

Manually building a grammar is difficult and time-consuming. In this chapter we present a system that automatically generates an LTAG grammar from an abstract specification of a language. Our approach uses language-independent specifications that can be tailored to specific languages by eliciting linguistic information from native informants, thus partially automating the grammar development process. In addition to providing obvious benefits with respect to performing maintenance and ensuring consistency, we believe this approach has exciting potential for partially automating the LTAG development process. We have found that the abstract specification that lends itself most readily to automatic tree generation also corresponds closely to a division into language-independent and language-dependent properties.

The chapter is organized as follows: Section 3.1 lists the types of redundancy in LTAGs. Section 3.2 describes an LTAG development system (LexOrg) we developed. Section 3.3 gives an overview of two grammars generated by our system. Section 3.4 presents a process that tailors language-independent structures to a specific language. Section 3.5 compares our system with related work. Section 3.6 proposes two tasks as future work.


Figure 3.1: Elementary trees, templates and lexicon

3.1 Redundancy in LTAGs

In this section, we will discuss three types of redundancy among elementary trees in an LTAG. The first two types have been eliminated by reorganizing the grammar (XTAG-Group, 1998), while eliminating the last type is the focus of this chapter.

3.1.1 Tree templates and lexicon

As mentioned before, each elementary tree is anchored by a lexical item. If the lexical item is removed from the tree, the remaining part is called a tree template, or template for short. An elementary tree is equivalent to a (lexical-item, template) pair. Many words can anchor the same templates, and this is the source of the first type of redundancy. For instance, all intransitive verbs can anchor α1 in Figure 3.1(b).1 In order to avoid storing these templates more than once, we divide an LTAG into two parts: the first is a set of tree templates, and the second is a lexicon, which associates words with the templates they can anchor.

Another factor we need to consider is word inflection. For instance, the verb swim has four inflected forms swims/swimming/swam/swum, and all of them plus the base form swim can anchor the template α1 in Figure 3.1(b). To eliminate this type of redundancy, we split the lexicon into two parts: a syntactic database, which associates a stem with its templates, and a morphological database, which associates an inflected word with its stem and a list of features, as shown in Figure 3.1. This strategy of storing an LTAG as a set of templates and a lexicon was introduced by the XTAG grammar (XTAG-Group, 1998) and was later adopted by other grammars, such as the LTAGs for French (Abeille et al., 2000), Chinese, and Korean. The XTAG grammar is a large-scale English grammar, which has been developed at the University of Pennsylvania since the early 1990s. It has 1004 tree templates. The lexicon lists templates for 40 thousand inflected words. In total, there are conceptually 1.8 million elementary trees. The templates of the grammar are built by hand, while the lexicon is extracted from two dictionaries (Oxford Dictionary for Contemporary Idiomatic English and Oxford Advanced Learner's Dictionary) and then manually checked.

1 The anchor of a template is marked by ⋄.
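As an aside, the two-database lookup just described can be made concrete with a small Python sketch; the entries mirror Figure 3.1, but the database layout and the function name are our own illustration, not LexOrg's actual interface.

```python
# A toy rendering of the storage scheme in Figure 3.1: templates are stored
# once, and two small databases map an inflected word to the templates it
# can anchor. The entries follow Figure 3.1; the layout is illustrative.

morph_db = {  # inflected word -> (stem, feature)
    "walk": ("walk", "base"),
    "swum": ("swim", "ppart"),
    "can": ("can", "base"),
    "apples": ("apple", "3rdpl"),
}

syn_db = {  # stem -> names of templates the stem can anchor
    "walk": ["alpha1"],
    "swim": ["alpha1"],
    "can": ["beta1"],
    "apple": ["alpha2"],
}

def elementary_trees(word):
    """Return (word, template) pairs: the elementary trees for this word."""
    stem, feature = morph_db[word]
    return [(word, template) for template in syn_db[stem]]

print(elementary_trees("swum"))   # [('swum', 'alpha1')]
```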

3.1.2 Tree families

A word can anchor many templates. For instance, a transitive verb can anchor all the templates in Figure 3.2. Among these templates, the top five are for the active voice and share the subcategorization frame (NP0 V NP1), while the bottom three are for the passive, with a subcategorization frame of (NP1 V). The two frames are closely related, and for each frame there are separate templates for declarative clauses, wh-questions, relative clauses, etc. It would be very redundant to list all these templates for every transitive verb in the lexicon. To avoid this type of redundancy, we group tree templates into tree families, where a tree family is defined as a set of templates with the same or related subcategorization frames. Now in the lexicon we only need to associate each word with the appropriate tree families. For instance, the templates in Figure 3.2 form a tree family, and every transitive verb will select this tree family. The 1004 templates in the XTAG grammar are grouped into 53 tree families.


Figure 3.2: A tree family (templates (a)-(e) are for the active frame (NP0 V NP1); templates (f)-(h) are for the passive frame (NP1 V))

3.1.3 Substructures shared in templates

The last type of redundancy is caused by the reuse of tree substructures in many templates in one or more tree families. Figure 3.2 shows a few templates in the tree family for a transitive verb such as buy in English. Each individual template includes two types of grammatical information. One is the subcategorization frame: buy takes two arguments in trees (a)-(e), and one in trees (f)-(h). The other is the transformational information: templates (a) and (f) share the structure for declaratives; templates (b), (c) and (g) share the structure for wh-movement; and (d), (e) and (h) share the structure for relativization.

As the size of the grammar grows, developing and maintaining these templates by hand faces two major problems. First, the reuse of tree substructures in many templates creates redundancy. To make certain changes in the grammar, all the related templates have to be manually checked. The process is inefficient and cannot guarantee consistency (Vijay-Shanker and Schabes, 1992); for a discussion of other approaches that address this issue (Becker, 1994; Evans et al., 1995; Candito, 1996), see Section 3.5. Second, the underlying linguistic information is not expressed explicitly. As a result, from the grammar itself (i.e., hundreds of templates plus the lexicon), it is hard to grasp the characteristics of a particular language, to compare languages, and to build a grammar for a new language given existing grammars for other languages. We have built a system, called LexOrg, to solve these problems.

3.2 LexOrg: an LTAG development system

The framework of LexOrg is shown in Figure 3.3. Conceptually, each verb has a lexical semantic representation, which includes the thematic roles that the verb has. Those thematic roles can be realized in various ways in syntax, resulting in different but related subcategorization frames. A subcategorization frame specifies how many arguments a predicate takes and their relative positions with respect to the predicate. When this information is combined with descriptions of transformations such as wh-movement, templates are generated by the system. Our current system does not include the lexical semantic structure or the mapping from that structure to subcategorization frames. Instead, we assume there is a canonical subcategorization frame for each verb, and other frames can be derived from it by applying Lexical Redistribution Rules (LRRs). In Figure 3.3, the inputs to LexOrg are marked by * and in bold font. The output of the system is a set of templates.

To be more specific, LexOrg requires the grammar developers to state the linguistic information about a predicate as three types of specifications: subcategorization frames, lexical redistribution rules (LRRs), and tree descriptions. Tree descriptions are also called blocks in our system. Blocks are further divided into subcategorization blocks and transformation blocks according to their functions. Our system takes those specifications as input and combines them to automatically generate the templates.

3.2.1 Input to the system: three types of specifications

The three types of specifications are defined more precisely below.

Subcategorization frames: A subcategorization frame specifies the category of its anchor, the number of its arguments, each argument's category, and other information such as feature equations.

Figure 3.3: The framework of the system (the inputs to LexOrg, marked with *, are the canonical subcategorization frame (NP0 V NP1), the LRR (NP0 V NP1) => (NP1 V), subcat-blocks, and transformation blocks for declaratives, wh-movement, and relative clauses; the output is a tree family)

(a) (NP0 V NP1)        (b) (NP1 V)

Figure 3.4: Two subcategorization frames for the verb buy

Lexical Redistribution Rules (LRRs): LRRs specify the relations between subcategorization frames. An LRR is a pair of subcategorization frames. It can be seen as a function that takes a subcategorization frame as input and generates a new frame as output. For example, the LRR shown in Figure 3.5 creates the subcategorization frame (NP1 V) when it is applied to the frame (NP0 V NP1).

(NP0 V NP1) => (NP1 V)

Figure 3.5: The passive LRR
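To make the rule format concrete, here is a minimal Python sketch that treats a frame as a tuple of labels and the passive LRR of Figure 3.5 as a frame-to-frame function; the encoding is our own simplification, not LexOrg's internal representation.

```python
# A subcategorization frame is sketched as a tuple of labels, with "V"
# marking the anchor's position; an LRR maps one frame to another frame.
# This mirrors the passive LRR in Figure 3.5; the encoding is illustrative.

def passive_lrr(frame):
    """(NP0 V NP1) => (NP1 V): drop NP0 and promote NP1 to subject."""
    if frame == ("NP0", "V", "NP1"):
        return ("NP1", "V")
    return None   # the rule does not apply to other frames

canonical = ("NP0", "V", "NP1")
print(canonical, "=>", passive_lrr(canonical))
# ('NP0', 'V', 'NP1') => ('NP1', 'V')
```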


(a) is_main_frame   (b) subj_is_NP   (c) anchor_is_V   (d) anchor_has_obj   (e) obj_is_NP

Figure 3.6: Some subcategorization blocks selected by the frame (NP0 V NP1)

Blocks: Blocks are tree descriptions specified in a logical language patterned after (Rogers and Vijay-Shanker, 1994). A block specifies the categorial labels of nodes, feature value assignments, and structural relationships between nodes. There are four types of structural relations: dominance, immediate dominance (i.e., parent), strict dominance, and precedence. Blocks are similar to elementary trees except that the former can leave some information unspecified. For example, when x dominates y, the number of intermediate nodes between x and y is unspecified. Elementary trees can be seen as combinations of blocks in which all the structural relations between each pair of nodes are totally specified.

Blocks are divided into two types according to their functions: subcategorization blocks and transformation blocks. The former describe the structural configuration incorporating the various arguments of a subcategorization frame in their base positions.2 Some of the subcategorization blocks used in the development of the English grammar are shown in Figure 3.6. Dotted lines, solid lines, and dash-dotted lines denote the dominance, immediate dominance, and strict dominance relations, respectively. An arc (^) between nodes indicates that the precedence order of the nodes is unspecified. For each node, the label outside the parentheses is the name of the node, and the one enclosed in parentheses is the syntactic category of the node. A node can have different node names, but it has exactly one syntactic category. For instance, in Figure 3.6, the block in (a) describes the spine of a clause. While in most cases the verb will be the anchor, we do not equate the anchor with the verb in this block. This allows for the analysis used in the XTAG grammar, where a noun or an adjective can serve as an anchor in the templates for small clauses.

2 Here base positions roughly correspond to the positions in deep structure in GB-theory.
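The following Python sketch shows one way a block could be represented as data: node categories plus explicit relation sets. The class layout and the rendering of subj_is_NP are a rough illustration of ours, loosely based on Figure 3.6(b), and not LexOrg's format.

```python
# A block as a partial tree description: nodes carry at most one syntactic
# category, and the four structural relations are listed explicitly as pairs
# of node names. This is a rough, illustrative rendering, not LexOrg's format.

from dataclasses import dataclass, field

@dataclass
class Block:
    categories: dict = field(default_factory=dict)  # node name -> category
    parent: set = field(default_factory=set)        # immediate dominance (x, y)
    dominates: set = field(default_factory=set)     # dominance: path length unspecified
    strictly_dominates: set = field(default_factory=set)
    precedes: set = field(default_factory=set)      # precedence between sisters

# Roughly the subj_is_NP block of Figure 3.6(b): a Subject node of category
# NP is a child of URoot and precedes the node named V,Anchor.
subj_is_np = Block(
    categories={"Subject": "NP"},
    parent={("URoot", "Subject")},
    precedes={("Subject", "V,Anchor")},
)
print(sorted(subj_is_np.categories.items()))
```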


Figure 3.7: Transformation blocks for extraction ((a) extraction; (b) relative clause)

The transformation blocks are used for various transformations such as wh-movement.3 Figure 3.7(a) depicts our representation of phrasal extraction. This can be specialized to give the blocks for wh-movement, relative clause formation, etc. For example, a relative clause, as in Figure 3.7(b), is defined by further specifying that the ExtRoot modifies an NP node. Wh-movement is the same as phrasal extraction except that the node NewSite has a +wh feature, which is not shown in Figure 3.7(a).

3 These transformation blocks do not encode rules for modifying trees, but rather describe the properties of a particular syntactic construction.

3.2.2 Tree generation from the specification

Once we introduce the notion of LRRs, a tree family can be defined as the set of elementary trees with the same "canonical" subcategorization frame.4 Given a subcategorization frame f and sets of LRRs and blocks, the system takes several steps to generate the tree family for that frame:

1. Derive subcategorization frames:

Apply sequences of LRRs to f and generate a set F of related subcategorization frames.

2. Select subcategorization blocks:

For each frame fi in F, select a set of subcategorization blocks SBi.5

4 As mentioned before, if we introduce the notion of lexical semantic representation, we don't have to assume the existence of a canonical subcategorization frame. Rather, we can define a tree family as the set of elementary trees with the same underlying lexical semantic representation.

5 The system uses a default mapping from the information in the frame to the names of subcategorization blocks. For example, if the anchor in the frame takes an NP subject, the system will select the blocks pred_has_subj and subj_is_NP. The default mapping can be easily modified.


3. Combine with transformation blocks:

For each subset TBj of the transformation blocks,6 combine it with SBi to form a new set of blocks Bi,j.

4. Generate trees:

For each Bi,j, generate the set of templates that are consistent with the tree description in Bi,j. If there is more than one template, choose the ones with the minimal number of nodes.7

6 Theoretically, the system can try all the subsets of transformation blocks. However, some subsets are incompatible in that their combinations will fail to produce any elementary tree. For example, a template may use the block for wh-movement or the block for relative clauses, but it will never use both at the same time. The system can rule out those combinations in the tree generation stage, but for the sake of efficiency, instead of letting the system try those combinations and fail, the grammar developer can partition the transformation blocks into several sections such that the blocks within the same section are incompatible. The system will take the partition and only try the subsets in which each block comes from a different section.

7 When blocks are combined, nodes with the same names are merged. Nodes with different names can also be merged if, after merging them, the resulting tree description is still viable. A tree description is viable if there exists at least one template that is consistent with that description.

Given the frame (NP0 V NP1) in Figure 3.4(a), the LRR in Figure 3.5, and the blocks in Figures 3.6 and 3.7, the system will generate the same trees as the ones in Figure 3.2. For instance, the tree in Figure 3.2(h) is automatically generated by first applying the LRR to the frame (NP0 V NP1) to get the frame in Figure 3.4(b), then choosing the subcategorization blocks in Figure 3.6, and finally combining them with the relative-clause block in Figure 3.7(b).
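The control flow of these four steps can be sketched in Python as follows. The helper functions and the block inventory are trivial stand-ins of our own (the real system combines tree descriptions), so the sketch only illustrates the enumeration, not the tree mathematics.

```python
from itertools import combinations

# Trivial stand-ins for LexOrg's real machinery, so the control flow runs
# end to end; these helpers are assumptions of this sketch, not real APIs.
def passive_lrr(frame):
    # (NP0 V NP1) => (NP1 V)
    return ("NP1", "V") if frame == ("NP0", "V", "NP1") else None

SUBCAT_BLOCKS = {  # frame -> names of subcategorization blocks it selects
    ("NP0", "V", "NP1"): ["is_main_frame", "subj_is_NP", "anchor_is_V",
                          "anchor_has_obj", "obj_is_NP"],
    ("NP1", "V"): ["is_main_frame", "subj_is_NP", "anchor_is_V"],
}

def generate_minimal_templates(block_names):
    # Stand-in for combining tree descriptions and keeping minimal trees:
    # here a "template" is just the set of block names that built it.
    return [frozenset(block_names)]

def generate_tree_family(frame, lrrs, trans_blocks):
    family = []
    # Step 1: derive related frames by applying LRRs to the canonical frame.
    frames = {frame} | {f for f in (lrr(frame) for lrr in lrrs) if f}
    for fi in frames:
        sb = SUBCAT_BLOCKS[fi]                        # Step 2
        for k in range(len(trans_blocks) + 1):        # Step 3: each subset of
            for tb in combinations(trans_blocks, k):  # transformation blocks
                family += generate_minimal_templates(sb + list(tb))  # Step 4
    return family

fam = generate_tree_family(("NP0", "V", "NP1"), [passive_lrr],
                           ["wh-movement", "relative-clause"])
print(len(fam), "templates generated")
```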

3.3 Building grammars

We have used our system to develop grammars for English and for Chinese. The major features of the two grammars are summarized in Table 3.1. The specifications of LRRs and blocks in our system highlight the similarities and differences between languages. For example, both languages have relative clauses and passives. As a result, both grammars have similar LRRs or blocks for these phenomena. For phenomena that occur in only one language, only that language will have the corresponding LRRs or blocks, such as the argument-drop block in Chinese, and the dative-shift LRR and the gerund block in English.


                        English                           Chinese
LRR examples            passive, dative-shift, ergative   short bei-const, causative, ergative
trans block examples    wh-question, relativization,      topicalization, relativization,
                        gerund                            arg-drop
# LRRs                  6                                 8
# subcat blocks         34                                24
# trans blocks          8                                 15
# tree families         43                                35
# templates             638                               482

Table 3.1: Major features of English and Chinese grammars

3.4 Eliciting language specific information

We have just described the procedure used to automatically build a grammar by combining subcategorization frames, LRRs, and blocks. It presumes that the user provides this information to the system. Defining such information from scratch for a new language is easier than building all the templates by hand, but it can still be difficult and time-consuming. Bracketed corpora such as the English Penn Treebank (Marcus et al., 1994) can be used for extracting this information, but quite often such corpora are not available for low-diffusion languages.

A central hypothesis in the field of formal syntax is the existence of a universal grammar. The hypothesis says that the differences among languages can be captured by different settings of a parameter list. Based on this hypothesis, we have extended our system to include language-independent structures. This would allow us to couple these language-independent structures with an interface which would elicit language-dependent details from a native speaker. These language-dependent details instantiate certain parameter settings and thus generate blocks tailored to the specific language.

Figure 3.8: The possible meta-blocks for relative clauses ((a) the complementizer treated as an adjunct; (b) COMP as the functional head of CP)

                                    English   Portuguese   Chinese   Korean
position of NPFoot?                 left      left         right     right
overt wh-movement?                  yes       yes          no        no
has overt RelPron?                  yes       yes          no        no
RelPron can be dropped?             yes*      yes*         -         -
position of COMP?                   left      left         right     suffix
COMP can be dropped?                yes*      yes*         yes*      no
COMP and RelPron co-occur?          no        no           -         -
COMP and RelPron both be dropped?   yes*      no           -         -

Table 3.2: Settings for relative clauses in four languages

The grammar developer may still need to add additional details to these blocks, but the development time should be shortened significantly. In this section, we illustrate how transformation blocks can be built by this process; other types of specifications can be elicited similarly.

To build a transformation block, we start with the definition of the corresponding phenomenon, which is language-independent. For example, a relative clause can be roughly defined as an NP modified by a clause in which one constituent is extracted (or co-indexed with an operator). We build a tree description (for clarity, we call it a meta-block) according to the definition. Notice that the exact shape of the meta-block often depends on the theory. For example, both meta-blocks in Figure 3.8 are consistent with the definition of a relative clause. The former follows the XTAG grammar in treating the complementizer (COMP) as an adjunct, while the latter is similar to a phrase structure in GB-theory where COMP is the functional head of CP.

Figure 3.9: The blocks for relative clauses in four languages ((a) English and Portuguese with a relative pronoun; (b) English and Portuguese without a relative pronoun; (c) Chinese and Korean)

The meta-blocks must be general enough to be language-independent. Next, the system will recognize the parts of the meta-block that are not fully specified and prompt the user for answers. Then the system adds this information to the meta-block so that it is tailored to the target language. Meta-blocks plus language-specific information will then form the transformation blocks for that language. For the meta-blocks in Figure 3.8, Table 3.2 lists the questions about these meta-blocks that should be asked, together with the correct answers for the four languages. In a relative clause, a relative pronoun (RelPron) occupies the position marked by NewSite. If we adopt the meta-block in Figure 3.8(a), only the first four questions would be needed, and LexOrg would produce the corresponding blocks in Figure 3.9. If we adopt the meta-block in Figure 3.8(b), all the questions should be used.8

Several points are worth noting. First, the settings of some parameters follow from higher-level generalizations, and some parameters are related. For example, the position of NPFoot follows from the head position in that language. Korean is an SOV language, so we can infer the position of the NPFoot without asking native speakers. Second, the settings of the parameters provide a way of measuring the similarity between languages. According to the settings, Chinese is more similar to Korean than to English with respect to relative clauses.

A word of caution is also in order. The construction of the meta-blocks, the questions for the meta-blocks, and the correct answers to the questions all require some degree of linguistic expertise. Also, certain language-specific details cannot be easily expressed as yes-no questions. For example, the answers marked with * are true only under certain conditions which need more specification; e.g., in English, COMP and RelPron can be dropped at the same time only when the relativized NP is not the subject.

8 The blocks for a relative clause in English and Portuguese are the same, as shown in Figures 3.9(a) and 3.9(b), but English and Portuguese differ in one aspect: when the moved constituent is not the subject, in English both COMP and NewSite are optional, whereas in Portuguese one of them must be present. The difference is captured by features which are not shown in the figures.
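As a rough illustration of the elicitation step, the sketch below turns a few of the Table 3.2 answers into a specialized relative-clause block; the dictionary keys follow the table, while the specialization logic and the block fields are simplified assumptions of ours.

```python
# Elicited answers to (some of) the Table 3.2 questions instantiate the
# underspecified parts of a meta-block. The keys follow Table 3.2; the
# specialization logic and block fields are simplified assumptions of ours.

answers_chinese = {
    "position of NPFoot": "right",
    "overt wh-movement": "no",
    "has overt RelPron": "no",
}

def specialize_relative_clause(meta_block, answers):
    """Return a copy of the meta-block tailored by the elicited answers."""
    block = dict(meta_block)
    # NPFoot on the right means the clause precedes the NP it modifies.
    if answers["position of NPFoot"] == "right":
        block["order"] = ["ExtRoot", "NPFoot"]
    else:
        block["order"] = ["NPFoot", "ExtRoot"]
    # With no overt relative pronoun, NewSite is filled by an empty operator.
    block["NewSite"] = ("RelPron" if answers["has overt RelPron"] == "yes"
                        else "empty operator")
    return block

meta_block = {"name": "relative-clause", "modifiee": "NP"}
print(specialize_relative_clause(meta_block, answers_chinese))
```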

3.5 Comparison with other work

The redundancy caused by shared structures among templates has been observed by others, who have proposed various means of dealing with it.

3.5.1 Becker's Meta-rules

A meta-rule (Becker, 1994) connects two templates at a time. For instance, a meta-rule mapping from a base declarative tree to an imperative tree describes how the declarative must be changed to get the imperative; the rule encodes a differential description of the two trees, showing how they are related. By contrast, using tree descriptions, there will be a tree description that both the declarative and imperative trees share, and additional tree descriptions that tell how to specialize the common description into either the declarative tree or the imperative tree. There are several differences between meta-rules and LexOrg.

- Meta-rules can be non-additive, allowing structure to be erased during a mapping, so an active tree can be mapped to the corresponding passive tree, erasing the object NP. In LexOrg, LRRs can be non-additive, but blocks are strictly additive.

- Meta-rules require the user to be quite explicit about both the old information and the new information. For example, with both passive and ergative, the deep object (NP1) is moved to subject position. Both of these meta-rules have to erase the subject-verb agreement features between NP0 (the deep subject) and V and add the same features between NP1 and V, even though the agreement has nothing to do with passive or ergative but is part of the relation between the surface subject and the VP. In LexOrg, the change of subcategorization frames (e.g., removing NP0 and moving NP1) is expressed as an LRR, while the subject-verb agreement is expressed as a block, which is used by declaratives, wh-movement, etc.

- Since meta-rules are independent, they cannot inherit from common ancestors. For example, the wh-extraction and relative-clause extraction meta-rules have to be defined completely separately, although both involve similar movement, as shown in Figure 3.7. In LexOrg, the blocks for wh-movement and relative clauses will both inherit information from the extraction block.

3.5.2 DATR system

Evans, Gazdar and Weir (Evans et al., 1995) also discuss a method for organizing the trees in a TAG hierarchically, using an existing lexical representation system, DATR (Evans and Gazdar, 1989). There are two major differences from our approach: 1) our blocks are constrained to be strictly monotonic, whereas DATR allows non-monotonicity; 2) DATR can directly capture only local relationships between nodes in a tree (such as the parent-child relationship or precedence) and must use feature equations to simulate other tree relations. This means that an abstract concept like dominance can only be specified by spelling out explicitly all of the different possible path lengths for every possible dominance relationship. Whereas in some ways the DATR implementation is more powerful, the lack of natural tree relations results in somewhat cumbersome tree descriptions. With the development of large-scale grammars for different languages in both systems, one could do a more thorough assessment of the relative ease with which cross-linguistic generalizations and variations could be stated.

3.5.3 Candito's use of partial tree descriptions

There are similarities between our approach and Candito's. Both systems are built upon the basic ideas expressed in (Vijay-Shanker and Schabes, 1992) for organizing trees hierarchically and for using tree descriptions that encode substructures found in several trees. In both approaches there is a separation of the different types of linguistic generalizations. Candito's hierarchy, for instance, has three separate dimensions: 1) the hierarchy of the different canonical subcategorization frames; 2) the redistribution of arguments/functions; and 3) the syntactic realizations of arguments/functions. Clearly there is a parallel between her subcategorization dimension and our subcategorization blocks, between her redistribution dimension and our LRRs, and between her realization dimension and our transformation blocks.

However, there are two major differences between the two systems. The first difference is reflected in how Candito uses her dimensions in generating the trees. Her system does not consider every possible combination as ours does. Instead, it imposes explicit conditions on how the classes appearing in the hierarchy can be combined, based on which dimension they are in. For example, one condition states that only a terminal node (a leaf node of a hierarchy) of the second dimension can be used in constructing a tree. Therefore, two redistributions (such as passive and causative) can be used in a single tree only when a new passive-causative terminal node is first created manually. Since, as stated above, the order in which redistributions are applied affects the outcome, the Italian grammar has terminal nodes for passive, causative, passive-causative, and causative-passive. In contrast, our approach automatically considers all possible applications of LRRs and discards those that are incompatible.

Another difference has to do with the application of transformations. In Candito's third dimension, each terminal node specifies the way a specific function is realized. Therefore, each argument/function in a subcategorization frame requires a different terminal node for every different syntactic realization. For example, the to-PP object (i.e., the PP whose head is the preposition to) of a ditransitive has a different terminal node for the canonical position, for wh-extraction, for relative clauses, etc. Similarly for the by-PP. This means the definition of terminal nodes in the third dimension must correspond to the possibilities defined in the other dimensions, and is not a truly independent task as it is in our system. Our transformation blocks can be defined completely independently of the LRRs (or indeed, in some cases, of the language being generated). The system will automatically consider the combination of the wh-extraction block with all the subcategorization blocks defining the possible arguments (regardless of whether a phrase is an NP, a to-PP, or a by-PP).

There are similar types of redundancy forced by other conditions that are used in Candito's system. One must note, however, that these conditions do express explicit linguistic intuitions and can be quite useful in guaranteeing that only combinations which are possible in a language are considered, which makes for efficient tree generation. While we have chosen the approach of allowing the system to freely explore all combinations and checking which of them are compatible, this can potentially place a burden on the grammar developer to ensure that any combination not possible in the language will turn out to be incompatible via feature or structure specifications in the blocks. So far we have never needed to introduce special features to rule out combinations; the inherent characteristics of the blocks themselves provide sufficient constraints on compatible combinations.

3.6 Proposed work

In summary, we have presented an LTAG development system that shows interesting promise with respect to semi-automating the grammar development process. The LTAG generation is driven by an abstract specification of different types of linguistic information: subcategorization frames, blocks, and lexical redistribution rules. An appropriate elicitation process can glean this linguistic information from the user, thus allowing the system to semi-automatically begin the definition of the abstract specification and to actively support the user during the development process. The abstract level of representation for the grammar both necessitates and facilitates an examination of the linguistic assumptions. This can be very useful for gaining an overview of the theory that is being implemented and for exposing gaps that remain unmotivated and need to be investigated. Grammar development then becomes an interactive process between the system and the language expert, with the system assisting in the precise definition of the linguistic categories and then highlighting areas that need further definition. Once an LTAG is built by the system, all the information processing technologies developed for LTAG become readily available.

The system has potential benefits for building translation tools as well as for grammar development, since the language-dependent properties of a language will be clearly specified and can be contrasted with those of other languages.


Figure 3.10: Templates for modification and coordination ((a) the modification template β1; (b) two ways to handle coordination: β2, or β3 plus α1)

By focusing on syntactic properties at a higher level, our approach allows new opportunities for the investigation of how languages relate to themselves and to each other. We propose the following two tasks for future work.

3.6.1 Building other trees

There are two types of templates which are not covered by our current system. The first type is the modification template, such as the one in Figure 3.10(a). To generate such a template, the subcategorization blocks need to be combined with a block that specifies the modification relations.9

The second type of templates not produced by the current system are the ones for coordination. There are several ways to handle coordination; two of them are shown in Figure 3.10(b). In the first approach (see β2), VP2 is expanded and its head is the anchor of the whole tree. In the second approach (see β3 and α1), the conjunction is the anchor of β3, while α1 will substitute into VP2 during parsing. In either case, the subject of VP2 is missing from one template. This is different from non-coordination templates. The XTAG grammar adopts the first approach (i.e., β2), and the templates are generated on the fly while parsing (Sarkar and Joshi, 1996); that is, they are not part of the 1004 templates in the grammar. For LexOrg, either we can follow the XTAG grammar and let the parser generate them on the fly, or we need to incorporate a new module into LexOrg to generate these templates.

9 The trees for relative clauses can be seen as modification templates. In that case, we can extend the definition of transformation blocks to include modification blocks.

3.6.2 Inferring language specifications from Treebanks

The input to LexOrg is a set of language specifications such as LRRs, subcategorization frames, and blocks. This linguistic information can be elicited from native speakers, as discussed in Section 3.4. As large-scale Treebanks become available in a number of languages, it would be ideal if these specifications could be extracted from the Treebanks. We believe this can be done in two steps: first, an LTAG is extracted from a Treebank; second, the templates in the LTAG are decomposed into a set of language specifications. We have completed the first step, as will be discussed in the next chapter, and we are still working on the second step. In addition, LexOrg can produce the set of templates for an LTAG, but it cannot produce the lexicon. In the next chapter, we will introduce a new system that extracts both the lexicon and the template set from Treebanks.


Chapter 4

Extracting LTAGs from Annotated Corpora

Hand-crafted LTAGs (including the ones created by LexOrg) use rich representations (such as feature structures) and tend to be precise. However, using this kind of grammar for statistical supervised parsing has a major problem: it requires a "grammar-dependent" corpus. That is, the corpus has to be annotated according to the elementary trees in the grammar. Building such a corpus is not only time-consuming and labor-intensive,1 but also difficult to maintain, because every time the grammar is revised, the annotation of the corpus has to be modified as well. Given that most statistical parsers (e.g., (Collins, 1999; Ratnaparkhi, 1998)) have already been trained and tested on general-purpose corpora such as the Penn Treebank (Marcus et al., 1994),2 it is hard to compare the performance of those parsers with an LTAG parser because the training and testing data are different.

Treebank grammars, which are extracted from annotated corpora, have certain advantages over hand-crafted grammars. Firstly, given a source corpus and an extraction tool, Treebank grammars can be extracted with little human effort, as the burden of linguistic analysis and quality control is shifted from grammar developers to Treebank developers. Secondly, Treebank grammars include statistical information, which is useful for statistical parsing. Thirdly, Treebank grammars are extracted based on certain assumptions about the forms of the target grammars and on particular interpretations of the annotations in Treebanks. Changing those assumptions and interpretations, which can be done by simply modifying a few lines of code or a couple of tables, and re-running the extraction tools on the same corpus will yield different Treebank grammars. This flexibility allows different grammars to be extracted and compared with respect to their usefulness to grammar developers and statistical parsers. Fourthly, a Treebank grammar is guaranteed to cover the Treebank from which it is extracted, whereas estimating the coverage of a hand-crafted grammar is difficult. On the downside, Treebank grammars often include implausible rules due to annotation errors in the source corpora. They also lack the rich representations that are commonly used in hand-crafted grammars. We will discuss possible ways to combine both types of grammars in the next chapter.

There has been much work on extracting Context-Free Grammars (Shirai et al., 1995; Charniak, 1996; Krotov et al., 1998) and LTAGs (Srinivas, 1997; Neumann, 1998; Xia, 1999; Chen and Vijay-Shanker, 2000) from large corpora. In this chapter, we describe our grammar extraction system, called LexTract. The system not only extracts LTAGs and CFGs from Treebanks, but also converts the Treebanks into a set of derivation trees that can be used directly by LTAG parsers. In addition, LexTract can retrieve data which can be used to test theoretical linguistic hypotheses, such as the Tree-locality Hypothesis. Last but not least, LexTract can detect certain types of annotation errors in a Treebank.

This chapter is organized as follows: Section 4.1 is an overview of the LexTract system. Section 4.2 is a brief introduction to the English Penn Treebank, which we use to test LexTract. Section 4.3 describes the form of the grammars that LexTract aims to extract from Treebanks. Section 4.4 lists the corpus-specific information that is part of the input to LexTract. Section 4.5 describes the extraction algorithm that LexTract uses. Section 4.6 presents a method to filter out implausible templates from the extracted grammar. Section 4.7 gives an algorithm to convert each bracketed structure into a derivation tree. Section 4.8 demonstrates the process for building the multi-component sets from derivation trees. Section 4.9 lists several constructions that require special treatment. Section 4.10 compares LexTract with other grammar extraction methods. Section 4.11 summarizes the chapter and proposes work for the future.

1 In general, there are two ways to build such a corpus. One is to use an LTAG parser to generate all possible parses for each sentence in the corpus and then have linguistic experts select the correct one from hundreds of parses. The other is to have linguistic experts who are very familiar with the grammar manually create a parse for each sentence in the corpus. Both ways are costly and time-consuming.

2 By general-purpose, we mean that the corpora are not annotated according to some pre-existing grammar.

4.1 Overview of LexTract

We have built a system, called LexTract, for grammar extraction. The architecture of LexTract is shown in Figure 4.1. In this chapter, we will discuss the core components of the system, which are marked in bold in the diagram, and leave the discussion of LexTract's applications to the next chapter.

Figure 4.1: Architecture of LexTract (the diagram shows modules for extracting LTAGs from Treebanks, filtering out implausible etrees, building derivation trees, building MC tree sets, and reading context-free rules off etrees, together with applications such as training CFG parsers, Supertaggers, and statistical LTAG parsers, estimating and improving coverage, testing the tree-locality hypothesis, and detecting Treebank annotation errors)

4.2 Overview of the English Penn Treebank

The English Penn Treebank (PTB for short) is widely used in the NLP community for training and testing statistical tools such as POS taggers and parsers. It is also the first Treebank on which LexTract was tested. We will, therefore, use examples from PTB throughout this chapter to demonstrate how LexTract works.

PTB was developed at the University of Pennsylvania in the early 1990s. It includes about 1 million words of text from the Wall Street Journal annotated in Treebank II style (Marcus et al., 1994). Its tagset has 85 syntactic labels (48 POS tags, 27 syntactic category tags, and 10 tags for empty categories) plus 30 function tags. Each bracketed item is labeled with one syntactic category, zero to four function tags, and reference indices if necessary. For details, please refer to (Santorini, 1990) and (Bies et al., 1995). The meanings of the tags appearing in this chapter are listed in Table 4.1. Figure 4.2 is a simple example that will be used throughout this chapter. For the sake of clarity, from now on, we will call the bracketed structures in the Penn Treebank ttrees, and the elementary trees in LTAGs etrees.

( (S (NP-SBJ (NN Supply) (NNS troubles))
     (VP (VBD were)
         (PP-LOC-PRD (IN on)
                     (NP (NP (DT the) (NNS minds))
                         (PP (IN of)
                             (NP (NP (NNP Treasury) (NNS investors))
                                 (SBAR (-NONE- *ICH*-2))))))
         (NP-TMP (RB yesterday))
         (, ,)
         (SBAR-2 (WHNP-1 (WP who))
                 (S (NP-SBJ (-NONE- *T*-1))
                    (VP (VBD worried)
                        (PP-CLR (IN about)
                                (NP (DT the) (NN flood)))))))
     (. .) ))

Figure 4.2: The Treebank annotation for the sentence Supply troubles were on the minds of Treasury investors yesterday, who worried about the flood.

4.3 The form of target LTAGs

As mentioned in Chapter 2, the LTAG formalism is a framework, not a linguistic theory. Various linguistic theories can be implemented in the LTAG framework. The LTAG formalism does not impose constraints on the shapes of the etrees as long as the etrees satisfy a set of basic requirements, such as each etree having at least one lexical anchor node.


POS tags:
  CC   conjunction
  DT   determiner
  IN   preposition or subordinating conjunction
  NN   noun, singular or mass
  NNP  proper noun, singular
  NNS  noun, plural
  PRP  pronoun
  RB   adverb
  VBD  verb, past tense
  WP   wh-pronoun

Syntactic tags:
  NP      noun phrase
  PP      prepositional phrase
  VP      verb phrase
  S       simple declarative clause
  SBAR    embedded clause
  WHNP    wh-noun phrase
  -NONE-  empty category

Function tags:
  CLR  closely related
  LOC  locative
  PRD  predicate
  SBJ  surface subject
  TMP  temporal

Types of empty categories:
  *T*    trace for A'-movement
  *ICH*  interpret constituent here
  *EXP*  "trace" in it-extraposition
  0      zero complementizer

Table 4.1: Treebank tags which appear in this chapter

As a result, given a corpus C, there may be more than one LTAG that can generate C.3 For example, both G1 and G2 in Figure 4.3 are legitimate LTAG grammars and both can generate the ttree in Figure 4.3(a). To ensure that the target grammar, i.e., the grammar to be extracted, is linguistically sound, we require that any target grammar have the following properties. First, each phrase in an etree should have a head, which determines the main properties (such as the syntactic category) of the phrase it belongs to. The notion of head is similar to the ones in X-bar theory, GB-theory, and HPSG, among others. A head X0 projects to X1, X2, and so on. The head and its projections form a spine. Treebanks often use different labels for the head and its projections, such as VB, VP, and S in PTB; S → VP → VB is a spine with the VB as the head. The second necessary property of the target grammar is that each etree in the grammar should fall into one of three types according to the relations between the anchor of the etree and the other nodes in the etree, as shown in Figure 4.4:

3 An LTAG G is said to be able to generate a corpus C if each ttree in C can be generated by combining the etrees in G with substitution and adjunction operations.


Figure 4.3: Two LTAGs which can generate the same ttree ((a) derived tree; (b) G1; (c) G2; (d) derivation tree for G1; (e) derivation tree for G2)

Figure 4.4: The three forms that extracted etrees should have ((a) spine-etree; (b) mod-etree; (c) conj-etree)

- spine-etree for a predicate-argument relation: The etree is formed by a spine Xm → Xm-1 → ... → X0 and the siblings of the nodes on the spine. X0 is the anchor of the etree, and nodes which are not on the spine are arguments4 of X0.

- mod-etree for a modification relation: The root of the etree has two children; one is a foot node with the label Wq, and the other node, Xm, is a modifier of the foot node. The modifier node is further expanded into a spine-etree whose head X0 is the anchor of the whole mod-etree.

- conj-etree for a coordination relation: In a conj-etree, the children of the root are two conjoined constituents and one conjunction.5 One conjoined constituent is marked as the foot node, and the other is expanded into a spine-etree whose head is the anchor of the whole tree. Structurally, a conj-etree is the same as a mod-etree except that the root has one extra conjunction child.

In most cases, spine-etrees are initial trees, while mod-etrees and conj-etrees are auxiliary trees. In spine-etrees the anchor is the head of the whole phrase, while in mod-etrees it is not. The similarity between these forms and the rules in X-bar theory is obvious: a spine-etree can be seen as a tree which combines the first and the third types of rules in X-bar theory (see Figure 4.5(a)). Similarly, a mod-etree incorporates all three types of rules in X-bar theory. A spine-etree is also very similar to a phrase structure in GB-theory, shown in Figure 4.5(b). There are two types of heads in GB-theory: lexical heads (e.g., verbs) and functional heads (e.g., complementizers). LexTract allows its users to decide the head of a particular phrase. This information is given to LexTract in the form of a head percolation table, which will be explained next.

4 If the user of LexTract does not want to treat functional heads (such as complementizers) as adjuncts, those functional heads will also appear as the sisters of nodes on a spine-etree.

5 Some Treebanks treat both words in both ... and ... (either ... or ..., and so on) as conjunctions. As a result, the root of the corresponding conj-etree will have four children: two XPs and two CCs. LexTract recognizes this kind of conj-etree. However, for the sake of simplicity, we omit this detail from all the algorithms in this chapter.
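To keep the three forms straight, here is a toy Python encoding of the constraints just listed; the class names and fields are our own illustration, not LexTract's data structures.

```python
# A toy encoding of the three etree forms in Figure 4.4. The spine is the
# list of labels from the root down to the anchor's POS tag; a mod-etree's
# root and foot share a label; a conj-etree adds a conjunction child.
# Class names and fields are illustrative, not LexTract's actual types.

from dataclasses import dataclass

@dataclass
class SpineEtree:                  # predicate-argument relation
    spine: list                    # e.g., ["S", "VP", "VB"]
    arguments: list                # sisters of the nodes on the spine
    anchor: str                    # lexical anchor at the bottom of the spine

@dataclass
class ModEtree:                    # modification relation
    root_and_foot: str             # the root's label, repeated on the foot node
    modifier: SpineEtree           # the modifier expands into a spine-etree

@dataclass
class ConjEtree:                   # coordination relation
    root_and_foot: str
    conjunction: str               # e.g., "CC"
    conjunct: SpineEtree           # the non-foot conjoined constituent

likes = SpineEtree(spine=["S", "VP", "VBP"], arguments=["NP", "NP"], anchor="likes")
print(likes.spine, likes.anchor)
```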


Figure 4.5: The notions of head in X-bar theory, GB-theory and our target grammar ((a) the rules in X-bar theory: (1) XP → YP X′, (2) X′ → X′ WP, (3) X′ → X YP; (b) a phrase structure in GB-theory, with a functional projection (CP headed by C) above a lexical projection (VP headed by V))

4.4 Treebank-specific information

The previous two sections described the source corpus (i.e., PTB) and the form of the target grammar. There are two important differences between the Treebank and the target grammars in terms of what information is made explicit. Firstly, the anchors of etrees are the heads of spine-etrees, but heads are not marked in the corpus. Secondly, the target grammar distinguishes arguments from modifiers: arguments are sisters of nodes on the spines of etrees, while modifiers are sisters of the foot nodes in mod-etrees. In PTB, arguments and modifiers are not structurally distinguished (both are siblings of the nodes they depend on), mainly because making the argument/adjunct distinction is notoriously difficult. Instead, PTB supplements syntactic category tags with function tags. This allows the users of the Treebank to decide on their own what tag combinations correspond to arguments.

To make LexTract corpus-independent so that it can be easily applied to other Treebanks in other languages, we do not include Treebank-specific information or decisions in the source code of LexTract. Instead, we require the users of LexTract to provide such information in the form of several tables as input to LexTract. In this section, we will introduce the contents of those tables and explain why they are needed by LexTract.

4.4.1 Tagset table

The tagset table provides the types and attributes of each tag in the Treebank's tagset, which will be used in many modules of LexTract. This table specifies:

- the type of each tag in the tagset, namely, POS tag, syntactic tag, function tag, or empty category tag
- the tag(s) for conjunctions (e.g., CC in PTB)
- the tag(s) for unlike coordinated phrases (e.g., UCP in PTB)
- the tag(s) for elided material (e.g., *?* in PTB)
- the null element tags that mark "movement" (e.g., *T* for trace)
- the function tags that always mark arguments (e.g., SBJ for subject)
- the function tags that always mark modifiers (e.g., TMP for temporal)
- the function tags that always mark heads (e.g., PRD for predicate)
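Concretely, such a table can be thought of as a small declarative resource. The sketch below encodes a few of the PTB entries listed above as Python data; the field names and layout are our own illustration, not LexTract's file format.

```python
# A few PTB entries from the list above, encoded as a declarative tagset
# table. Field names and layout are our own illustration of the information
# the table records, not LexTract's actual file format.

TAGSET_TABLE = {
    "CC":   {"kind": "POS", "conjunction": True},
    "UCP":  {"kind": "syntactic", "unlike_coordination": True},
    "*T*":  {"kind": "empty-category", "marks_movement": True},
    "SBJ":  {"kind": "function", "always_marks": "argument"},
    "TMP":  {"kind": "function", "always_marks": "modifier"},
    "PRD":  {"kind": "function", "always_marks": "head"},
}

def always_marks(tag, role):
    """True if a function tag always marks the given role."""
    return TAGSET_TABLE.get(tag, {}).get("always_marks") == role

print(always_marks("TMP", "modifier"))   # True
```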

4.4.2 Head percolation table

Head is an important notion in the target grammar, but it is not marked in PTB. To help LexTract identify heads, the user needs to create a head percolation table. The head percolation table was introduced in a statistical parser called SPATTER (Magerman, 1995), and later used in Collins' parsers (Collins, 1997), among others. It is called a head percolation table because, for those parsers to extract lexicalized context-free rules from a Treebank, the lexical items are percolated like features from the heads to their various projections, as marked by dashed lines in Figure 4.6(a). Given a node X and its head Y in a ttree, each node on the path from X to Y is called the head-child of its parent. For example, in Figure 4.6(a), VBP is the head-child of VP, VP is the head-child of S, and NNP is the head-child of NP. LexTract uses a head percolation table to select the head-child of a node.

Figure 4.6: Lexical items percolate from heads to higher projections ((a) a ttree for John likes apples; (b) the lexicalized CFG read off the ttree: S(likes) -> NP(John) VP(likes); NP(John) -> NNP(John); VP(likes) -> VBP(likes) NP(apples); NP(apples) -> NNS(apples))

An entry in the table is of the form (x direct y1/y2/.../yn), where x and the yi are syntactic category tags, and direct is either LEFT or RIGHT. {yi} is the set of possible tags of x's head-child. For instance, the entry for VP in PTB is (VP left VP/VB/VBN/VBP/VBZ/VBG/VBD). The algorithm for selecting the head-child is given in Table 4.2 in Section 4.5.
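The full procedure is given in Table 4.2 in Section 4.5; as a preview, here is a minimal runnable Python rendering of the lookup. The VP entry is the one quoted above, while the other entries and the (label, children) tree encoding are illustrative assumptions of ours.

```python
# Head-child selection driven by a head percolation table, following the
# entry format (x direct y1/.../yn). Only the VP entry is quoted from the
# text; the S and NP entries and the tree encoding are illustrative.

HEAD_TABLE = {
    "VP": ("left", ["VP", "VB", "VBN", "VBP", "VBZ", "VBG", "VBD"]),
    "S":  ("left", ["VP"]),                        # illustrative entry
    "NP": ("right", ["NN", "NNS", "NNP", "NP"]),   # illustrative entry
}

def find_head_child(label, child_labels):
    """Return the index of the head-child among child_labels."""
    direction, candidates = HEAD_TABLE[label]
    indices = range(len(child_labels))
    if direction == "right":
        indices = reversed(indices)
    for i in indices:
        if child_labels[i] in candidates:
            return i
    # Default case, as in step (D) of Table 4.2: leftmost or rightmost child.
    return 0 if direction == "left" else len(child_labels) - 1

print(find_head_child("S", ["NP", "VP"]))   # 1: VP is the head-child of S
```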

4.4.3 Argument table

The argument/modifier distinction is not explicitly marked in the PTB. LexTract marks each sister of a head6 as either an argument or a modifier according to the tag of the sister, the tag of the head, and the position of the sister with respect to the head. An argument table informs LexTract about the types of arguments a head can take. An entry in an argument table is of the form (head_tag left_arg_num right_arg_num arg_tag1/arg_tag2/.../arg_tagn). head_tag is the syntactic tag of the head, {arg_tagi} is the set of possible tags for the head's arguments, and left_arg_num (right_arg_num, resp.) is the maximal number of arguments to the left (right, resp.) of the head. For example, the entry for IN (preposition) in PTB is (IN 0 1 NP/S/SBAR), which means a preposition can take at most one argument, which has the label NP, S, or SBAR, and the argument appears after the preposition. The algorithm for distinguishing arguments from modifiers is shown in Table 4.3 in Section 4.5.

6 The head in this section refers to either the head of a phrase or the head-child of a node.

4.4.4 Modification table

A modification table specifies the types of modifiers a constituent can take. An entry of the table is of the form (mod_tag ltag1/.../ltagn rtag1/.../rtagm), which means a mod_tag can be modified by any ltagi from the left and by any rtagi from the right. The table is useful for filtering out implausible etrees from the extracted grammar (see Section 4.6).
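Putting these tables together, the sketch below shows how an argument-table entry and the tagset table drive the argument/modifier decision that Table 4.3 spells out in full; the IN entry is the one quoted above, and the rest is our own simplification.

```python
# Argument/modifier marking driven by an argument table, anticipating
# Table 4.3. The IN entry is the one quoted in the text; the function-tag
# check and the overall encoding are our own simplification.

ARG_TABLE = {  # head tag -> (left_arg_num, right_arg_num, possible arg tags)
    "IN": (0, 1, {"NP", "S", "SBAR"}),
}
MODIFIER_FUNC_TAGS = {"TMP", "LOC"}   # function tags that always mark modifiers

def mark_sister(head_tag, sister_tag, sister_func_tags, side):
    left_n, right_n, arg_tags = ARG_TABLE[head_tag]
    if (side == "left" and left_n == 0) or (side == "right" and right_n == 0):
        return "modifier"     # the head takes no arguments on this side
    if sister_tag not in arg_tags:
        return "modifier"     # wrong category for an argument of this head
    if sister_func_tags & MODIFIER_FUNC_TAGS:
        return "modifier"     # a function tag forces the modifier reading
    return "argument"

print(mark_sister("IN", "NP", set(), "right"))      # argument
print(mark_sister("IN", "NP", {"TMP"}, "right"))    # modifier
```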

4.5 Extracting an LTAG from a Treebank

So far we have introduced two types of input to LexTract and the form of the target grammar. In this section, we discuss the extraction algorithm. The algorithm has two stages: first, a ttree is fully bracketed; second, the ttree is decomposed into a set of etrees.

4.5.1 Stage 1: Fully bracketing the ttrees

As just mentioned, PTB does not explicitly mark heads, arguments, and modifiers, whereas the target grammar does. Another difference between PTB and the target grammar is that in the target grammar each mod-etree includes exactly one modifier and each conj-etree has two coordinated phrases, while in PTB the numbers can be higher. To account for these differences, we first fully bracket the ttrees by adding intermediate nodes so that at each level one of the following holds:

(head-argument relation) there are one or more nodes: one is the head, the rest are its arguments;

(modification relation) there are exactly two nodes: one node is modified by the other;

(coordination relation) there are three nodes: two nodes are coordinated by a conjunction.

The main task at this step is to find the head-child and to mark the other children as either arguments or modifiers; the two algorithms are shown in Tables 4.2 and 4.3. The algorithm for fully bracketing a ttree is shown in Table 4.4. Given the example in Figure 4.2, repeated as Figure 4.7(a), its fully bracketed version is in Figure 4.7(b), where the nodes inserted by the algorithm are circled.7

7 The bracketing process may eliminate potential ambiguity which exists in the original ttrees, but in most cases it will not affect the extracted etrees.


Input: node N, tagset table TagsetTb, head percolation table HeadTb
Output: head-child hc of N
Algorithm: ttree-node* FindHeadChild(N, TagsetTb, HeadTb)
(A) x = syn-tag(N);  /* syn-tag(N) is the syntactic category of the node N */
(B) Find the entry (x dir y1/y2/.../yn) in HeadTb;
(C) check each child ch of N, starting from the leftmost child (or the rightmost child, according to dir) {
        if ( (syn-tag(ch) ∈ {y1, y2, ..., yn}) or
             (according to TagsetTb, one of func-tags(ch) always marks "head") )
        then hc = ch; return
    }
(D) if (dir == LEFT)  /* choose the head-child by default */
    then hc = leftmost-child(N)
    else hc = rightmost-child(N)

Table 4.2: Algorithm for finding the head-child of a node

( (S (NP-SBJ (NN Supply) (NNS troubles)) (VP (VBD were) (PP-LOC-PRD (IN on) (NP (NP (DT the) (NNS minds)) (PP (IN of) (NP (NP (NNP Treasury) (NNS investors)) (SBAR (-NONE- *ICH*-2)))))) (NP-TMP (RB yesterday)) (, ,) (SBAR-2 (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBD worried) (PP-CLR (IN about) (NP (DT the) (NN flood))))))) (. .) ))

(a) the original ttree

( (S (NP-SBJ (NN Supply) (NP (NNS troubles))) (VP (VP (VP (VBD were) (VP (PP-LOC-PRD (IN on) (NP (NP (DT the) (NP (NNS minds))) (PP (IN of) (NP (NP (NNP Treasury) (NP (NNS investors))) (SBAR (-NONE- *ICH*-2)))))) (NP-TMP (RB yesterday))) (, ,) (SBAR-2 (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VP (VBD worried)) (PP-CLR (IN about) (NP (DT the) (NP (NN flood)))))))) (. .) ))

(b) the fully bracketed ttree

Figure 4.7: The effect of fully bracketing a ttree


Input: a head-child hc, a sister sist of hc, position pos of sist with respect to hc, tagset table TagsetTb, argument table ArgTb
Output: sist is marked as either an argument or a modifier of hc
Algorithm: void MarkSisterOfHead(sist, hc, pos, TagsetTb, ArgTb)
/* sist is marked as an argument iff it can be an argument according to ArgTb and, */
/* according to TagsetTb, none of sist's function tags indicates that sist is a modifier */
(A) head_tag = syn-tag(hc);
(B) Find the entry (head_tag left_arg_num right_arg_num arg_tag1/arg_tag2/.../arg_tagn) in ArgTb;
(C) if ( ((pos == LEFT) and (left_arg_num == 0)) or
         ((pos == RIGHT) and (right_arg_num == 0)) )
    then mark sist as modifier; return;  /* hc does not have left (right, resp.) arguments */
(D) if (the tag of sist does not match any arg_tagi)
    then mark sist as modifier; return;
(E) for each (func_tag ∈ func-tags(sist))
        if (func_tag always marks modifiers according to TagsetTb)
        then mark sist as modifier; return;
(F) mark sist as argument.

Table 4.3: Algorithm that marks a node as either argument or modifier


Input: a partially bracketed ttree T, tagset table TagsetTb, head percolation table HeadTb, argument table ArgTb
Output: T is fully bracketed
Algorithm: void FullyBracketTtree(T, TagsetTb, HeadTb, ArgTb)
/* TargetList is a list of nodes the subtrees below which need to be fully bracketed */
(A) TargetList = {Root}, where Root is the root of T;
(B) for (each node R ∈ TargetList) {
        hc = FindHeadChild(R, TagsetTb, HeadTb);  /* (see Table 4.2) */
        if (one of R's children is a conjunction)
        then {  /* R's children are a list of coordinated nodes */
            use the conjunction(s) to partition the other children into m groups;
            for (each group gr) {
                if (gr has more than one node)
                then insert a node R1 with label syn-tag(R) as the new root of the nodes in the group;
                     TargetList = TargetList ∪ {R1}
                else TargetList = TargetList ∪ {ch}, where ch is the only child in gr
            }
            if (m > 2)
            then insert m - 2 new nodes with label syn-tag(R) so that each level has exactly two groups plus one conjunction
        }
        else {  /* R's children consist of a head, 0 or more arguments, and 0 or more modifiers */
            add each child of R to TargetList;
            mark each child other than hc as "argument" or "modifier";  /* (see Table 4.3) */
            if (at least one child is marked as "modifier")
            then insert new nodes {Ri} between R and hc so that at each level between R and hc exactly one of the following holds:
                 - there are exactly one new node Ri and one modifier, where Ri has the same syntactic tag as its parent, or
                 - there are a node (hc or a new node Ri) with label syn-tag(hc) and zero or more arguments of hc
        }  /* end of "if" */
    }  /* end of "for each node in TargetList" */

Table 4.4: Algorithm for fully bracketing ttrees

4.5.2 Stage 2: Building etrees

The fully bracketed ttree created in the previous step is in fact the derived tree the sentence would have if it were parsed with the target grammar. This stage essentially removes recursive structures, which will become mod-etrees or conj-etrees, from the fully bracketed ttrees and builds spine-etrees for the non-recursive structures. Starting from the root of a fully bracketed ttree, the algorithm first finds a unique path from the root to its head. It then checks each node n on the path. If a sibling of n in the ttree is marked as a modifier, the algorithm factors out the part that includes n, n's siblings, and n's parent from the ttree and builds a mod-etree (or a conj-etree if n has another sibling which is a conjunction). In this mod-etree, n's parent is the root node, n is the foot node, and n's sibling is the modifier. Next, the algorithm creates a spine-etree with the remaining nodes on the path and their siblings. It repeats the process for the sibling nodes.

In this stage, each ttree node X is split into two parts: a top part (X.t) and a bottom part (X.b). The reason for the splitting is as follows: when a pair of etrees are combined during parsing, the root of one etree is merged with a node in the other etree. Extracting etrees from a ttree is the reverse of parsing; therefore, during the extraction process, the nodes in the ttree are split into top and bottom parts.

To see how the algorithm works, let us look at an example. Figure 4.8 shows the same ttree as the one in Figure 4.7(b) except that some nodes are numbered. For the sake of simplicity, we show the top and the bottom parts of a node only when the two parts will end up in different etrees. In this figure, the path from the root S to the head IN8 is S → VP1 → VP2 → VP3 → VP4 → PP → IN. Along the path, the SBAR is a modifier of VP2; therefore, VP1.b, VP2.t, and the spine-etree rooted at SBAR are factored out and form the mod-etree #13. Trees #11 and #3 are built in a similar way. For the remaining nodes on the path, VP1.t and VP4.b are merged and the spine-etree #4 is created. Repeating this process for the other nodes will generate more trees, such as trees #1 and #2.

8 We follow the XTAG grammar in choosing the preposition as the head of a small clause. If the user of LexTract prefers to have the verb were as the head of the clause, he can simply change the tables mentioned in Section 4.4. To be more specific, the user just needs to change the entry for the function tag -PRD in the tagset table so that the tag no longer marks the head of a phrase.


Figure 4.8: The etree set is a partition of the ttree. (The figure shows the fully bracketed ttree for the example sentence, with nodes split into top (X.t) and bottom (X.b) parts and labeled with the numbers of the etrees they belong to.)

Notice that the tree #4 is broken into two parts by intervening auxiliary trees. The whole ttree yields fifteen etrees, as shown in Figure 4.9. The algorithm is given in Table 4.5. For the sake of simplicity, the algorithm is written as if nodes from a ttree are copied into etrees, which is equivalent to decomposing the ttree and gluing some components together to form etrees. The lines labeled (EC1) and (EC2) will be explained in Section 4.9.2. For each ttree node, LexTract copies the top and bottom parts either to the same etree node or to two etree nodes in two distinct etrees.

4.6 Filtering out implausible etrees

Annotation errors in ttrees will result in linguistically implausible etrees. For example, in the sentence in Figure 4.7(a), the word yesterday is incorrectly tagged as RB (adverb). As a result, one of the etrees, #11 in Figure 4.9, is created. The etree is linguistically implausible because an RB (adverb) should not be the head of an NP (noun phrase). To detect implausible etrees, LexTract has a built-in filter which decomposes each etree into several parts and checks each part. An etree is plausible if and only if every part is plausible. The etrees are decomposed in the following ways:

Input:  a ttree T, tagset table TagsetTb, head percolation table HeadTb, argument table ArgTb
Output: a set of etrees EtreeSet
Notions: Given a node x in T, x.top and x.bot are the top and bottom parts of x.
         f(x.top) (f(x.bot), resp.) is the etree node copied from x.top (x.bot, resp.).

Algorithm: etree-list BuildEtrees(T, TagsetTb, HeadTb, ArgTb)
  (A) EtreeSet = {}; R = Root(T);
  (B) Choose the head-child hc of R;
      if the head-child does not dominate any lexical word,
          then choose its sibling to be hc;                               (EC1)
      (This will ensure that no spine-etree is anchored by empty categories.)
  (C) Based on the relation between hc and its sisters, go to one of the following:
      (1) predicate-argument relation (cf. Figure 4.4(a)):
          /* build a spine-etree Ts, which is formed by a predicate and its arguments */
          find a head-path p from R to a leaf node A in T;
          for (each non-link node x on p) {
              /* a non-link node is a node whose head-child and other children
                 form a head-argument relation */
              copy x.bot to Ts;
              for (each child yi of x) {
                  copy yi.top to Ts, as f(x.bot)'s child;
                  if yi does not dominate any lexical words,
                      copy the whole subtree rooted at yi to Ts;          (EC2)
              }
          }
          mark f(A.top) as the anchor of Ts;
          EtreeSet = EtreeSet ∪ {Ts};
      (2) modification relation (cf. Figure 4.4(b)):
          /* build a mod-etree Tm, which is formed by a modifier-modifiee pair
             and a spine-etree */
          At this stage, hc should have only one sister; call it mod;
          Find a head-path p from mod to a leaf node A in the ttree;
          Build an etree Ts from the path as stated in (1);
          Copy R.bot, hc.top, mod.top to Tm;
          Create a mod-etree Tm in which f(R.bot) is the root and has two children:
          f(hc.top) and f(mod.top). f(hc.top) is the foot node;
          EtreeSet = EtreeSet ∪ {Tm};
      (3) coordination relation (cf. Figure 4.4(c)):
          /* build a conj-etree Tc. Tc is the same as Tm except that
             the root of Tc has one extra conjunction child */
          At this stage, hc should have two sisters. One is a conjunction; call it conj,
          and call the other sister mod.
          Build an etree the same as in (2) except that f(R.bot) will have three children:
          f(hc.top), f(mod.top) and f(conj.top);
          EtreeSet = EtreeSet ∪ {Tc};
  (D) Repeat steps (B)-(C) for each child x of R if x.bot has not been copied to any etree.
  (E) return EtreeSet

Table 4.5: Algorithm for building etrees from a fully bracketed ttree
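As a rough illustration of step (C)(1), the following Python sketch walks the head path of a tree and collects the spine and the argument substitution slots. The Node class and the output format are assumptions made for illustration; the real algorithm also splits each node into top and bottom parts and factors out mod-etrees and conj-etrees along the way.

```python
# A toy sketch of the spine-etree part of Table 4.5 (step (C)(1)): walk the head
# path from the root to the anchor, recording each level's sisters as
# substitution slots. It assumes modifiers have already been factored out into
# mod-etrees, so every sister on the path is an argument.
class Node:
    def __init__(self, label, children=(), head=None):
        self.label = label              # syntactic tag, e.g., "VP"
        self.children = list(children)  # child Nodes
        self.head = head                # index of the head-child; None for leaves

def extract_spine_etree(root):
    spine, subcat = [], []
    node = root
    while node.head is not None:
        spine.append(node.label)
        subcat.extend(c.label for i, c in enumerate(node.children) if i != node.head)
        node = node.children[node.head]
    spine.append(node.label)            # the preterminal dominating the anchor
    return {"spine": spine, "subcat": subcat}

# A transitive clause S -> NP VP, VP -> VBD NP yields the spine S-VP-VBD
# with two NP substitution slots:
vp = Node("VP", [Node("VBD"), Node("NP")], head=0)
s  = Node("S",  [Node("NP"), vp], head=1)
assert extract_spine_etree(s) == {"spine": ["S", "VP", "VBD"], "subcat": ["NP", "NP"]}
```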

Figure 4.9: The extracted etrees from the ttree. (The figure shows the fifteen extracted etrees #1-#15 as tree diagrams.)

spine-etree: a spine-etree is decomposed into a spine and a subcategorization frame. A spine is plausible if every (parent, head-child) pair on the spine is in the head percolation table. That is, if the syntactic label of the parent is x and (x dir y1/y2/.../yn) is an entry in the head percolation table, then the label of the head-child must be one of {yi}.9 A subcategorization frame is plausible if and only if each argument in the frame can be an argument of the anchor according to the argument table.10

mod-etree: a mod-etree is decomposed into a spine, a subcategorization frame and a modifier-modifiee pair. The first two are checked the same way as in spine-etrees. The latter, a modifier-modifiee pair, is plausible if and only if the pair appears in the modification table.

conj-etree: a conj-etree is decomposed into a spine, a subcategorization frame and a coordination pair. The first two are checked the same way as in spine-etrees. As a result of fully bracketing ttrees, the two conjoined nodes of the root have the same tags as the root; therefore, the coordination pair is always plausible.

9 In other words, a spine is plausible if no head-child is chosen by default (see line (D) in Table 4.2).
10 When LexTract fully brackets the ttree, it uses the argument table to distinguish arguments and modifiers (see Table 4.3). Therefore, the subcategorization frame in the etree is always plausible and LexTract does not need to check it again. We mention this because other extraction methods which do not use argument tables may create etrees with implausible subcategorization frames.
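The decomposition-and-check logic can be sketched in a few lines of Python. The three tables below are tiny invented fragments standing in for the real head percolation, argument, and modification tables.

```python
# A minimal sketch of the etree filter, assuming the tables are given as sets.
HEAD_OK = {("NP", "NN"), ("NP", "NNS"), ("S", "VP"), ("VP", "VBD"), ("PP", "IN")}
ARG_OK  = {("VBD", "NP"), ("VBD", "S"), ("IN", "NP")}     # (anchor POS, argument)
MOD_OK  = {("PP", "VP"), ("SBAR", "NP"), ("ADVP", "VP")}  # (modifier, modifiee)

def plausible(spine, anchor_pos, subcat, mod_pair=None):
    """An etree is plausible iff every one of its parts is plausible."""
    # every (parent, head-child) pair on the spine must be licensed
    for parent, head_child in zip(spine, spine[1:]):
        if (parent, head_child) not in HEAD_OK:
            return False
    # every argument must be licensed for the anchor
    if any((anchor_pos, arg) not in ARG_OK for arg in subcat):
        return False
    # for mod-etrees and conj-etrees, the modifier-modifiee pair must be licensed
    if mod_pair is not None and mod_pair not in MOD_OK:
        return False
    return True

# Tree #11, with an RB heading an NP, fails the spine check:
print(plausible(["NP", "RB"], "RB", []))             # False
print(plausible(["S", "VP", "VBD"], "VBD", ["NP"]))  # True
```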

(a) (NP (QP (IN about) (CD 160)) (NNS workers))     (b) the template QP → IN QP*

Figure 4.10: A frequent, incorrect etree template

This filter works well when the three tables (the head percolation table, the argument table and the modification table) are correct and complete (i.e., there are no missing entries in the tables). In that case, the etrees that are marked as implausible by the filter are definitely incorrect etrees, but not vice versa.11 That is, it is possible that some incorrect etrees are marked as plausible by the filter. This could happen, although it is unlikely, when an etree made up of plausible substructures is not linguistically sound. In our example, among the 15 etrees in Figure 4.9, tree #11 is correctly filtered out because an RB (adverb) cannot be the head of an NP according to the head percolation table. Our filter performs better from a grammar developer's point of view than a filter that uses a threshold to throw away infrequent etrees, because the latter will keep frequent but implausible etrees while throwing away infrequent but plausible ones. In fact, about 12% of the etree template12 types in PTB that are marked as implausible by our filter occur 10 or more times in the corpus, and of all the etrees that occur only once, 42.8% are marked as plausible by our filter. Many of those frequent and implausible templates are caused by part-of-speech errors in the Treebank.13 For instance, in the ttree shown in Figure 4.10(a), the adverb about is mis-tagged as a preposition (IN). As a result, the template in Figure 4.10(b) was created by LexTract. A threshold-based filter cannot filter it out because the template occurs 1832 times in the Treebank. Fortunately, our filter will mark it as implausible because it knows a preposition (IN) cannot modify a quantifier phrase (QP) according to the modification table.

11 By incorrect, we mean that the etree is not linguistically sound.
12 The term template has been defined in Section 3.1.
13 We suspect the main reason that those POS errors remain in the Treebank is that when the Treebank was bracketed by annotators, POS tagging annotation was already completed but the POS tags were not shown to the annotators; therefore the annotators were unaware that the syntactic tags they added to phrases and the existing POS tags within the phrases might be incompatible.

Figure 4.11: The LTAG derivation trees for the sentence: (a) multi-adjunction is not allowed; (b) multi-adjunction is allowed.

4.7 Converting ttrees to derivation trees

For the purpose of grammar development, a set of etrees may be sufficient. However, to train a statistical LTAG parser, derivation trees, which indicate how etrees are combined to form derived trees, are required. Recall that, unlike in CFG, the derived trees and derivation trees are not identical in the LTAG formalism. Derivation trees are also useful for finding multi-component sets, as will be discussed in the next section. In this section, we will give an algorithm for building derivation trees. First of all, there are two slightly different definitions for LTAG derivation trees. The first one adopts the no-multi-adjunction constraint, and the second does not (Schabes and Shieber, 1992). The no-multi-adjunction constraint says that when etrees are combined to form derivation trees, at most one adjunction is allowed at any node in an etree. As a result, if a phrase XP in an etree Eh has several modifiers (each of the modifiers belongs to a mod-etree), according to the first definition the mod-etrees will form a chain, with one mod-etree adjoining to Eh and the rest adjoining to one another, while according to the second definition all the mod-etrees will adjoin to Eh. The two derivation trees for the example in Figure 4.7 are in Figure 4.11. Notice that mod-etrees #3, #11, and #13 all modify #4; they form a chain as circled in Figure 4.11(a), while they are siblings in Figure 4.11(b). In the following discussion, we assume that the first definition is used.14

14 Our algorithm in Table 4.6 works for either definition, and the user can set a parameter to inform LexTract which definition he wants to use.

As mentioned in Section 4.5, given a fully bracketed ttree T and a set ESet of etrees which are extracted from T, ESet can be seen as a decomposition of T. In other words, T would be one of the derived trees for the sentence if the sentence were parsed with the etrees in ESet. Given T and ESet, there may be more than one derivation tree which generates T by combining etrees in ESet. This is because when a phrase has several modifiers, the modifiers will form a chain and the order of the modifiers on the chain is not fixed. For instance, switching the order of trees #3, #11 and #13 in Figure 4.11(a) will yield six different derivation trees. All six trees would generate the same derived tree. However, if we add the no-adjunction (NA) constraint to the foot nodes of all the auxiliary trees, as is the case in the XTAG grammar, the derivation tree would be unique. Alternatively, if we add the NA constraint to the root nodes of all the auxiliary trees, the derivation tree would be unique as well. To summarize, with the two constraints, (1) no adjunctions are allowed at the foot nodes of auxiliary trees and (2) at most one adjunction is allowed at any other node, each ttree corresponds to a unique derivation tree. The derivation tree is built in two steps: first, for each etree e extracted from T, find the etree ê which e substitutes/adjoins into.15 ê will be the parent of e in the derivation tree. We call ê the d-parent of e, where d stands for derivation trees. Second, from those (e, ê) pairs, build a derivation tree. The algorithm is given in Table 4.6. For instance, in Figure 4.8 (repeated as Figure 4.12), the algorithm first decides that #1 adjoins into #2 at NP2.b, and #2 substitutes into #4 at NP1.t, and then it generates the derivation tree in Figure 4.11(a).

15 In this section and the next section, the etrees in ESet should be regarded as etree tokens; therefore they are all distinct even if some etree tokens are of the same etree types.

4.8 Building multi-component tree sets

So far, we have described algorithms for extracting LTAGs from Treebanks and building derivation trees for each ttree. There is one type of information still missing from the extracted LTAGs, namely, coindexation. In the Treebank, reference indices are used either to mark various movements such as wh-movement or to indicate where a constituent should be interpreted.

Input:  a fully bracketed ttree T, an etree set ESet which is extracted from T,
        a function f that maps the nodes in T to nodes of etrees in ESet
Output: a derivation tree D
Notions: Two nodes Y1 and Y2 in two etrees are called buddies wrt. T iff there exists a node X
        in T such that f(X.top) = Y1 and f(X.bot) = Y2, i.e., they are copied from the top and
        the bottom parts of the same node in T.

        An etree e1 is called the t-parent of another etree e2 in T if the buddy of the root of
        e2 is in e1, i.e., e1 is on top of e2 in T when ESet is seen as a partition of T.

        ex is called a t-ancestor of ey in T if there is a sequence (e0 = ex, e1, ..., en = ey)
        of etrees where ei (0 <= i < n) is the t-parent of ei+1.

Algorithm: void BuildDerivTree(T, ESet, f)
  (A) for (each e in ESet) {
        /* find the d-parent ê of e, i.e., the etree that e substitutes/adjoins into */
        if (e is an initial tree) then {
            /* e substitutes into ê, which is immediately above e in T
               if we ignore all the mod-etrees and conj-etrees between them */
            ê is the closest t-ancestor of e in T which is either an initial tree or an
            auxiliary tree whose foot node does not dominate the root of e in T;
        }
        else {
            /* e adjoins into ê, which is immediately below e */
            Let fn be the foot node of e, and let bud be its buddy in T;
            ê is the etree whose root is bud;
        }
      }
  (B) Find eR ∈ ESet such that eR has no d-parent; make eR the root of D.
  (C) Find all the d-children of eR; make them children of eR in D.
  (D) Repeat (C) for each child of eR until D covers all the etrees in ESet.

Table 4.6: Algorithm for building derivation trees
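Steps (B)-(D) are mechanical once every etree token knows its d-parent; here is a small Python sketch, assuming the d-parents have already been computed by step (A). The dict-based representation is an assumption for illustration.

```python
# Once every etree token has a d-parent, the derivation tree is simply the
# tree induced by those parent pointers.
from collections import defaultdict

def build_derivation_tree(d_parent):
    """d_parent maps each etree token to its d-parent (None for the root);
    returns (root, children) where children maps a node to its d-children."""
    children = defaultdict(list)
    root = None
    for e, parent in d_parent.items():
        if parent is None:
            root = e                    # step (B): the etree with no d-parent
        else:
            children[parent].append(e)  # steps (C)-(D): attach d-children
    return root, dict(children)

# For the running example, #1 adjoins into #2, and #2 substitutes into #4;
# mod-etree #3 also modifies #4:
d_parent = {"#4": None, "#2": "#4", "#1": "#2", "#3": "#4"}
root, kids = build_derivation_tree(d_parent)
print(root, kids)   # #4 {'#4': ['#2', '#3'], '#2': ['#1']}
```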

Figure 4.12: The ttree as a derived tree. (The figure repeats Figure 4.8.)

For instance, in Figure 4.2 (repeated as Figure 4.13), the reference index -1 marks the wh-movement of who and the reference index -2 indicates that the relative clause who worried about the flood is interpreted as a modifier of the NP Treasury investors. The reference indices in ttrees are very useful for sentence interpretation. Therefore, we want to keep them in etrees, so that when etrees are combined the reference indices will be passed to the derived trees. A pair of co-indexed nodes (i.e., nodes with identical reference indices) in a ttree do not always map to nodes in the same etree. An important issue in the LTAG formalism is whether syntactic movement satisfies certain locality constraints. One hypothesis is that tree-local MCTAG is powerful enough to handle syntactic movement, that is, that the two etrees, one for the filler and the other for the gap, always substitute/adjoin into a single etree. This hypothesis, which we will call the Tree-locality Hypothesis from now on, has been investigated extensively in the literature (Weir, 1988; Kulick, 1998; Heycock, 1987; Becker et al., 1992; Bleam, 1994; Joshi and Vijay-Shanker, 1999; Kallmeyer and Joshi, 1999). Treebanks provide naturally-occurring data for testing this hypothesis. The strategy we employ to address this issue has two stages: first, use LexTract to find all the examples that seem to violate the hypothesis; second, classify the examples according to the underlying constructions (such as extrapositions), and determine whether each class would become tree-local if an alternative analysis for the construction is adopted. In this section, we will address the first stage, i.e., finding all the examples that seem to be "non-tree-local". In the next chapter, we will discuss all the "non-tree-local" cases found in PTB. The algorithm for finding non-tree-local examples is in Table 4.7. For each pair (X1, X2) of co-indexed nodes in a ttree T, assuming the top parts of the two nodes map to some nodes in two etrees e1 and e2 respectively, the algorithm will check whether the two etrees substitute/adjoin into a single etree.

Figure 4.13: Etrees for co-indexed constituents. Panel (a) shows the bracketed ttree:
( (S (NP-SBJ (NN supply) (NP (NNS troubles))) (VP (VP (VP (VBD were) (VP (PP-LOC-PRD (IN on) (NP (NP (DT the) (NP (NNS minds) )) (PP (IN of) (NP (NP (NNP Treasury) (NP (NNS investors) )) (SBAR (-NONE- *ICH*-2) ))))) (NP-TMP (RB yesterday))) (, ,)
(SBAR-2 (WHNP-1 (WP who) ) (S (NP-SBJ (-NONE- *T*-1)) (VP (VP (VBD worried) ) (PP-CLR (IN about) (NP (DT the) (NP (NN flood) )))))))
Panels (b) and (c) show the corresponding etrees: (b) the mod-etree containing the *ICH* node, and (c) the etree for the relative clause with the WHNP, the *T* trace, and the anchor worried.

Figure 4.14: The coindexation between two nodes may or may not be tree-local: (a) within one tree (f(X1.top) and f(X2.top) in the same etree); (b) tree-local (e1 and e2 both attach to ea); (c) not tree-local.

Input:  a ttree T, the etree set ESet which is extracted from T, a derivation tree D for T,
        the mapping f from nodes in T to nodes in ESet, and two co-indexed nodes X1 and X2 in T
Output: 1 if the coindexation between X1 and X2 is tree-local; 0 otherwise

Algorithm: int TestTreeLocality(T, ESet, D, f, X1, X2)
  (A) Let e1 be the etree that f(X1.top) belongs to;
  (B) Let e2 be the etree that f(X2.top) belongs to;
  (C) if (e1 == e2) then
        /* f(X1.top) and f(X2.top) are in the same etree, see Figure 4.14(a) */
        return 1;
  (D) Find the closest common ancestor ea of e1 and e2 in D
      (ea might be identical to e1 or e2);
  (E) Find the path p1→a from e1 to ea and the path p2→a from e2 to ea in D;
  (F) for (each pair (e, ê) on each path) {
        if (e and ê modify the same etree) mark ê;
      }
  (G) if (neither p1→a nor p2→a has unmarked etrees besides e1, e2 and ea) then
        /* e1 and e2 will form an MC set and both attach to ea, see Figure 4.14(b) */
        return 1;
      else
        /* e1 and e2 do not attach to the same tree, see Figure 4.14(c) */
        return 0;

Table 4.7: Algorithm for determining whether the coindexation between a pair of nodes is tree-local
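The path computations in steps (D)-(G) can be sketched as follows, with the derivation tree given as parent pointers. The chain_mates helper is an assumed stand-in for the test in step (F), i.e., whether an etree and its d-parent modify the same etree.

```python
# A sketch of the locality test of Table 4.7 over parent-pointer derivation trees.
def path_to_root(e, parent):
    """Chain of etrees from e up to the derivation-tree root."""
    path = [e]
    while parent[e] is not None:
        e = parent[e]
        path.append(e)
    return path

def tree_local(e1, e2, parent, chain_mates):
    """parent: d-parent pointers of the derivation tree D.
    chain_mates(e, p): True iff e and its d-parent p modify the same etree,
    i.e., they sit on the same modifier chain (step (F) marks such p's)."""
    if e1 == e2:
        return True                      # step (C): same etree
    p1, p2 = path_to_root(e1, parent), path_to_root(e2, parent)
    ancestors = set(p2)
    ea = next(e for e in p1 if e in ancestors)   # step (D): closest common ancestor
    for path in (p1[:p1.index(ea) + 1], p2[:p2.index(ea) + 1]):
        marked = {b for a, b in zip(path, path[1:]) if chain_mates(a, b)}
        for e in path:
            if e not in (path[0], ea) and e not in marked:
                return False             # step (G): an unmarked intervener
    return True

# Toy check: e1 and e2 both hang directly under ea, so the pair is tree-local.
parent = {"ea": None, "e1": "ea", "e2": "ea"}
print(tree_local("e1", "e2", parent, lambda a, b: False))  # True
```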

In Figure 4.13(a), the WHNP-1 and *T*-1 will map to nodes in the same etree in (c), so the movement can be handled by tree-local MCTAG. Nevertheless, the etrees for *ICH*-2 and SBAR-2, the ones in (b) and (c), do not adjoin to a single etree (see the derivation trees in Figure 4.11), so the "movement" cannot be handled by tree-local MCTAG according to the current annotation in the ttree.16

16 In the next chapter, we will argue that this type of example (called NP-extraposition) is not syntactic movement.

4.9 Some special cases

In this section, we will discuss four constructions that require special treatment from LexTract.

4.9.1 Coordination

In this chapter, we define a conj-etree as an etree with the conjunction represented as a substitution node, one XP as the foot node and the other XP expanded. An alternative is to treat the conjunction as the anchor, one XP as the foot, and the other XP as a substitution node. Figure 4.15 shows how these two approaches handle a VP coordination in the sentence John bought a book and has read it four times. In the first approach, the second verb read anchors the conj-etree β1, and the singleton etree α1 substitutes into the CC node in β1. In the second approach, the conj-etree β2 is anchored by the conjunction, and the etree α2 substitutes into the VP node in β2. In either approach, the conj-etree adjoins to the etree α3. Currently, we adopt the first approach for two reasons: firstly, it can easily capture the dependency between the two verbs bought and read, as shown in γ1; secondly, ideally an etree should include all the arguments of its anchor. Both β1 and α2 miss the subject of the verb read. The difference is that in the first approach the etree with the missing subject is a conj-etree, while in the second approach it is an independent spine-etree. In the future, we will try both approaches for parsing and see which one yields better performance.

Figure 4.15: Two ways to handle a coordinated VP in the sentence John bought a book and read it four times. The bracketed ttree is:
((S (NP-SBJ (NNP John)) (VP (VP (VBD bought) (NP (DT a) (NN book)) (CC and) (VP (VBP has) (VP (VBD read) (NP (PRP it)) (QP (CD four) (NNS times)))))))
(a) the first approach: the conj-etree β1 anchored by read, the singleton etree α1 for and, and the derivation tree γ1; (b) the second approach: the conj-etree β2 anchored by and, the spine-etree α2 for read, and the derivation tree γ2.

4.9.2 Empty categories

There are two places where empty categories (EC for short) such as *T* for A'-movement require special treatment. First, when the coindexations between ECs and their antecedents represent syntactic movement, they should be captured by multi-component (MC) tree sets. We have shown how this can be done in Section 4.8. Second, we do not want to create etrees anchored by ECs, because the existence of these etrees will slow down LTAG parsers. We'll address the second issue in this section. To parse a sentence, an LTAG parser first selects all the etrees anchored by the words in the sentence, and then generates parses by combining those etrees. ECs are invisible in the input sentence. If we allow them to be anchors of some etrees, the parser has to "predict" where ECs "appear" in the sentence and then select the etrees anchored by those ECs. This will complicate the parsing algorithm and slow down the parser. An exception is: if an etree with an EC anchor belongs to an MC set in which at least one of the etrees is anchored by a lexical word, the parsing efficiency will not be affected, because the MC set will not be selected unless the non-EC anchor appears in the input sentence.

The algorithm for building etrees (see Table 4.5) has taken this into account. The two lines that handle ECs are marked by (EC1) and (EC2) respectively. Let XP be the parent of an EC in a ttree, and suppose XP.top is mapped to a node Y in an etree e. There are three possible positions for Y:

- Y is a node on the spine of e. A common example is ellipsis, where a verb or a verb phrase is omitted, as in Figure 4.16(a). Without special treatment, the algorithm would build α1 and β. To avoid generating α1, which is anchored by an EC, we add the line (EC1) to the etree-building algorithm in Table 4.5. The line requires the head-child to dominate at least one lexical word. In this case, with (EC1), the algorithm will choose the ADVP, not VP3, as the head-child of VP2. As a result, the algorithm will build α2, not α1 and β. Notice that α2 could be generated by adjoining β to the VP node in α1.17

- Y is a sister of a node on the spine of e. A common example is wh-movement of an argument. LexTract will copy the whole subtree rooted at XP.bot to e, as shown in line (EC2) in Table 4.5. For instance, the subtree rooted at NP-SBJ (in bold face) in Figure 4.13(a) is copied to the etree in 4.13(c).

- Y is a modifier and maps to a sister of the foot node in a mod-etree or a conj-etree. An example of this is wh-movement of a modifier. LexTract will group the etrees for the gap and the filler into an MC set. Since at least one of the etrees in the MC set will be anchored by lexical words, the parsing efficiency will not be affected. Nothing special is required for this case.18 In Figure 4.17, the PP is a modifier of the VP and it is moved to the beginning of the sentence. The etrees for the filler and the gap form an MC set, and both etrees adjoin to α1.

17 This fact implies the ECs in this position can be handled as follows: remove (EC1) from the algorithm, but add a step at the end of the algorithm. In this step, each etree which is anchored by an EC is merged with its "neighboring" etree in the ttree.
18 In some rare cases, either there is no coindexation for the EC, or the MC set is not tree-local. If the LTAG parser does not predict the existence of ECs in the input sentence, the etree e with an EC anchor will never be selected. Nevertheless, since e does not have any substitution nodes, other etrees for the sentence will still fit together and generate parses without e. The only thing missing from those parses is the EC. Since most state-of-the-art parsers ignore ECs when calculating parsing accuracy, the missing ECs will not affect parsing accuracy.

Figure 4.16: Handling a sentence with ellipsis: (Mary came yesterday.) John did too. The bracketed ttree is:
((S (NP-SBJ (NNP John)) (VP1 (VBD did) (VP2 (VP3 (-NONE- *?*)) (ADVP (RB too))))))
(a) the ttree; (b) without (EC1): the etrees α1 (anchored by the empty category *?*) and β (the ADVP auxiliary tree for too); (c) with (EC1): the single etree α2.

Figure 4.17: Handling a sentence with wh-movement from adjunct positions: (a) a ttree (for at which hotel did you stay, with the fronted PPi coindexed with an empty PPi under the VP); (b) the extracted etrees, where the etrees for the filler and the gap form an MC set and α1 is anchored by stay.

4.9.3 Punctuation

Punctuation helps humans to comprehend sentences. It could help NLP tools as well if used appropriately. In the XTAG grammar, there are 47 templates that contain punctuation. Doran (1998)

discusses all of the templates in detail. She divides punctuation marks into three classes: balanced, structural, and terminal. The balanced punctuation marks are quotes and parentheses; structural marks are commas, dashes, semi-colons and colons; and terminals are periods, exclamation points and question marks. Figure 4.18 shows four etrees with punctuation marks. β1 is an etree for a non-peripheral NP appositive. β2 is for a sentence such as John will come, I think. The comma is the anchor of β1, but a substitution node in β2. Both etrees come from (Doran, 2000). We build β3 and β4 for balanced and terminal punctuation respectively.

Figure 4.18: Etrees with punctuation marks. (β1: an NP auxiliary tree for a non-peripheral appositive, anchored by a comma; β2: an S auxiliary tree for a parenthetical such as I think, with a comma substitution node; β3: an S auxiliary tree with a pair of quotation marks; β4: an S auxiliary tree with a sentence-final period.)

This approach is appealing when building a grammar by hand. However, automatically extracting those etrees from Treebanks is not trivial. For example, when LexTract sees a comma in a sentence, it would be difficult for the system to decide whether the comma should anchor an etree like β1 in Figure 4.18, or should be a substitution node as in β2. Another problem is that the usage of punctuation marks in real corpora is very complicated. For instance, balanced punctuation marks such as quotation marks are supposed to appear in pairs and enclose a constituent in a ttree, but they are often not annotated that way in the PTB for various reasons. For example, the sentence He said, "S1. S2. S3." (where the Si's are clauses) is broken into three ttrees in PTB, according to the positions of the three full stops. The left quotation mark belongs to the first ttree and the right quotation mark belongs to the third ttree. Another example is in Figure 4.19, where the left quotation mark is a child of the VP node, while the right quotation mark is forced to attach higher because the full stop is attached to the clause.

((S (NP-SBJ (PRP he)) (VP (VBD said) (, ,) (‘‘ ‘‘) (S (NP-SBJ (NNP John)) (VP (VBZ likes) (NP (PRP her))))) (. .) (’’ ’’)))

Figure 4.19: A sentence with quotation marks

Due to these problems, our current system does not include punctuation in the extracted grammars. However, this does not mean that the NLP tools which use our extracted grammars cannot take advantage of punctuation in the data. In other words, an NLP tool can still include punctuation marks in its language model. For example, Doran (2000) shows that including punctuation in the training and testing data for Supertagging achieved an error reduction of 10.9% on non-punctuation tokens.19 In the data she used, the templates (including the ones for punctuation marks) come from the XTAG grammar. Based on this experiment, she claims that having etrees for punctuation marks in an LTAG will improve the performance of Supertagging. We have re-trained the same Supertagger but with data extracted by LexTract. For punctuation marks, we use dummy etrees named after the punctuation (e.g., the etree for a comma will be named comma, and the one for semicolons semicolon). Interestingly, when we conduct the same experiment but with our data, the Supertagging performance for non-punctuation tokens shows roughly the same rate of improvement. From the results of both experiments, we conclude that punctuation marks are definitely useful for NLP tools such as a Supertagger, but there are three ways to take them into account: first, they can be included in the templates of an LTAG; second, they can be part of a language model; third, they can be used in both places. Which approach is best really depends on the application and the particular approach used in the NLP tools.

19 More discussion on Supertagging can be found in Section 4.10 and in the next chapter.

4.9.4 Predicative auxiliary trees

Predicative auxiliary trees, such as the one in Figure 4.20(c) for the verb believe, represent the head-argument relation, and therefore are spine-etrees. Nevertheless, the structures of the trees are recursive in that the root and one leaf node have the same label. Following (Kroch and Joshi, 1985; Kroch, 1989), the XTAG grammar treats this type of tree as an auxiliary tree to account for long-distance movement, e.g., what does Mary think Mike believes John likes (see Section 2.2).

(NP (NP (DT the) (NN person)) (SBAR (WHNP-1 (WP who)) (S (NP-SBJ (NNP Mary)) (VP (VBD believed) (SBAR (-NONE- 0) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBD bought) (NP (DT the) (NN book)))))))))

Figure 4.20: An example in which a predicative auxiliary tree should be factored out: the person who Mary believed bought the book. ((a) the bracketed ttree; (b) the etree for the relative clause, anchored by bought; (c) the predicative auxiliary tree anchored by believed, with an SBAR* foot node.)

Several things are worth noting:

- When there is no long-distance movement, it does not make much difference whether such an etree is treated as an auxiliary tree or not. For instance, to parse the sentence Mary believed John would come, either G1 or G2 in Figure 4.21 will work, although the orders between come and believed are flipped in the derivation trees. For more discussion on this topic, see (Joshi and Vijay-Shanker, 1999).

- The shape of the etree alone cannot determine whether it should be an initial tree or an auxiliary tree. For instance, the etree in Figure 4.22 comes from the XTAG grammar and is used to handle gerund phrases such as Mary having a child in the sentence Nobody heard of Mary having a child. The root of the etree is an NP because the gerund phrase has the same distribution as other NPs. In this tree, although the object of the anchor has the same syntactic tag as the root, the etree is treated as an initial tree.

- Even for ttrees which are identical except for the positions of ECs, we do not always want to factor out what looks like a predicative auxiliary tree. For example, the ttree for the noun phrase "the person who Mary believed bought the book" in Figure 4.20(a) looks identical to the ttree for the noun phrase "the person who believed Mary bought the book" in Figure 4.23(a), except that the positions of Mary and *T* are swapped. But the etrees needed to parse the two sentences are very different, as in (b) and (c) of Figure 4.20 and Figure 4.23. Only the ttree in Figure 4.20(a) needs a predicative auxiliary tree.

To conclude, it is not true that every spine-etree whose root and a leaf node have the same syntactic tag should be treated as an auxiliary tree. We'll leave the task of detecting predicative auxiliary trees for future study.

Figure 4.21: Two alternatives for the verb believed when there is no long-distance movement. (In G1, believed anchors an initial tree with an S substitution node into which the tree for come substitutes; in G2, believed anchors a predicative auxiliary tree with an S* foot node that adjoins into the tree for come. The orders of come and believed are flipped in the two derivation trees.)

Figure 4.22: The etree for gerund in the XTAG grammar. (An initial tree rooted in NP: NP → NP VP, VP → VBG NP, anchored by having.)

(NP (NP (DT the) (NN person)) (SBAR (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBD believed) (SBAR (-NONE- 0) (S (NP-SBJ (NNP Mary)) (VP (VBD bought) (NP (DT the) (NN book)))))))))

Figure 4.23: An example where a predicative auxiliary tree should not be factored out: the person who believed Mary bought the book. ((a) the bracketed ttree; (b) and (c) the extracted etrees, in which believed anchors a spine-etree and no predicative auxiliary tree is factored out.)

4.10 Comparison with other work

In this section, we compare LexTract with three other approaches to LTAG extraction.20 A difference between LexTract and these three approaches is that LexTract produces derivation trees and MC sets, whereas these tasks are not mentioned in the other approaches.

20 LexTract can produce CFGs as well. A comparison between LexTract and other CFG extraction algorithms can be found in the next chapter.

4.10.1 Srinivas' heuristic approach

A Supertagger (Joshi and Srinivas, 1994; Srinivas, 1997) assigns an etree template to each word in a sentence. The etree templates are called Supertags in this context because they include more information than part-of-speech tags. Srinivas implemented the first Supertagger, and he also built a Lightweight Dependency Analyzer (LDA) that assembles the Supertags of words to create an almost-parse for the sentence. Supertaggers have been found useful for several applications, such as information retrieval (Chandrasekar and Srinivas, 1997). To train his Supertagger, Srinivas (1997) first selects a subset of templates from the XTAG grammar, then uses heuristics to convert the whole PTB into a (word, template) sequence. His method is different from LexTract in that the set of templates in his method is chosen from a pre-existing grammar before the conversion starts, while LexTract generates the template set on the fly. His conversion program is also designed for this particular template set, and it is not very easy to port it to another template set. A third difference is that the templates in his converted data do not always fit together,

due to the discrepancy between the XTAG grammar and the Treebank annotation and the fact that the XTAG grammar does not cover all the template tokens in the Treebank. In other words, for each sentence in his converted data, combining the templates in that sentence may not always produce the correct parse.

4.10.2 Neumann's lexicalized tree grammars

Neumann (1998) describes an extraction algorithm and tests it on the PTB and a German Treebank called Negra (Skut et al., 1997). There are several similarities between his approach and LexTract: firstly, both approaches adopt notions of head and use head percolation tables to identify the head-child at each level; secondly, both decompose the ttrees from the top downwards such that the subtrees rooted by non-head children are cut off and the cutting point is marked for substitution. The main difference between the two is that Neumann's system does not distinguish arguments from adjuncts, and therefore does not factor out the majority of recursive structures with modifiers. As a result, only 7.97% of the templates in his grammar are auxiliary trees, and the size of his grammar is much larger than ours: his system extracts 11979 templates from three sections of the PTB (i.e., Sections 02-04), whereas LexTract extracts 6926 templates from the whole corpus (i.e., Sections 00-24). In addition, his paper does not mention attempts to filter out implausible templates. It is also not clear from the paper how he treats conjunctions, empty categories and coindexation, so we cannot compare the two approaches on these issues.

4.10.3 Chen & Vijayshanker's approach

Among the three approaches in this section, Chen & Vijayshanker's method (2000) is the closest to ours. Both methods use a head percolation table to find the head, and both distinguish arguments from adjuncts. Nevertheless, there are several differences. First, unlike Chen & Vijayshanker's system, LexTract explicitly creates a fully bracketed ttree, which is identical to the derived tree for the sentence. As a result, the extracted etrees can be seen as a decomposition of the fully bracketed ttree. That makes the tasks of building derivation trees and MC sets straightforward. Moreover, the mapping between the nodes

in fully bracketed ttrees and etrees makes LexTract a useful tool for Treebank annotation and error detection. Another difference is that Chen & Vijayshanker's system uses a cutoff threshold to filter out etrees, while LexTract determines the plausibility of etrees by decomposing them and automatically checking each part. The two approaches also differ in their treatments of coordination and their algorithms for making argument/adjunct distinctions.

4.11 Proposed work

In this chapter, we have outlined a system named LexTract which extracts LTAGs from Treebanks and converts the structures in the Treebank into derivation trees that can be used to train statistical parsers directly. LexTract is language- and corpus-independent in the sense that all the language- or corpus-specific information is stored in several tables which LexTract takes as input. As a result, LexTract can be applied to various Treebanks for different languages. In the next chapter, we will discuss a number of applications of LexTract. There is one remaining issue in grammar extraction, namely, for a particular NLP task such as parsing, how should certain constructions be represented in the Treebank grammar? For example, in Section 4.9, we discussed alternative representations for punctuation and coordination. For a particular parsing model, one representation may be more suitable than the others. We are going to explore these alternatives in the future.

Chapter 5

Applications of LexTract

In this chapter, we discuss applications of LexTract and report experimental results. The applications of LexTract roughly fall into five types:

- The LTAG built by LexTract is useful for grammar development (see Sections 5.1-5.3).

- The lexicon and derivation trees derived from Treebanks can be used to train statistical tools such as Supertaggers and LTAG parsers (see Section 5.4).

- LexTract can retrieve data from Treebanks to test theoretical linguistic hypotheses (see Section 5.5).

- LexTract maintains mappings between ttree nodes and etree nodes, and these mappings make LexTract a useful tool for Treebank annotation.

- LexTract is language- and corpus-independent in the sense that all the language- or corpus-specific information is stored in several tables which LexTract takes as input. As a result, LexTract can be applied to various Treebanks for different languages. This provides quantitative support for an investigation into the universal versus stipulatory aspects of different languages.

We have conducted experiments on the first three types of applications. The results are discussed in Sections 5.1-5.5. In Section 5.6, we discuss our plans for conducting experiments on the last two applications.

5.1 Treebank grammars as stand-alone grammars

The Treebank grammars extracted by LexTract can be used as stand-alone grammars for languages which do not have high-quality wide-coverage LTAG grammars.

5.1.1 Two Treebank grammars

We ran LexTract on the English Penn Treebank (PTB) and extracted two Treebank grammars. The first one, G1, uses the PTB's tagset. The second Treebank grammar, G2, uses a reduced tagset,1 where some tags in the PTB tagset are merged into a single tag, as shown in Table 5.1. The reduced tagset is basically the same as the tagset used in the XTAG grammar (XTAG-Group, 1998). We use this reduced tagset for two reasons: firstly, we want to compare LexTract with related work based on the XTAG grammar (XTAG-Group, 1998), so the two grammars should use the same tagset; secondly, the number of templates in G2 is much smaller than that in G1, which will alleviate the sparse data problem for parsing and Supertagging.

                        tags in PTB                              tags in XTAG
  adjectives            JJ/JJR/JJS                               A
  adverbs               RB/RBR/RBS/WRB                           Ad
  determiners           DT/PDT/WDT/PRP$/WP$                      D
  nouns                 CD/NN/NNS/NNP/NNPS/PRP/WP/EX/$/#         N
  verbs                 MD/VB/VBP/VBZ/VBN/VBD/VBG/TO             V
  clauses               S/SQ/SBAR/SBARQ/SINV                     S
  noun phrases          NAC/NP/NX/QP/WHNP                        NP
  adjective phrases     ADJP/WHADJP                              AP
  adverbial phrases     ADVP/WHADVP                              AdvP
  preposition phrases   PP/WHPP                                  PP

Table 5.1: Tags in PTB are converted to tags used in the XTAG grammar

The sizes of the two grammars are given in Table 5.2. In G2, a word type has 2.38 Supertags on average, whereas a word token has 27.7 Supertags.2 In contrast, on average a word type has 1.17 POS tags, whereas a word token has 2.29 POS tags.3

1 There are 41 tags in G2: 14 POS tags, 15 syntactic tags, and 12 tags for empty categories.
2 In this chapter, we will use the terms template and Supertag interchangeably. For more detail on Supertags, see Section 4.10.1 and Section 5.4. That a word has x Supertags means that the word can anchor x unique templates.

A word token has far more Supertags than POS tags because a word appearing in different syntactic structures will have different Supertags, whereas its POS tag is likely to remain the same. For example, a preposition has different Supertags when the PP headed by the preposition modifies a VP, an NP, or a clause. Table 5.3 shows the top 40 words with the largest numbers of Supertags in G2.4

             template types   etree tokens   etree types   word types   etree types     etree types
                                                                        per word type   per word token
  LTAG G1    6926             1,173,756      131,397       49,206       2.67            34.68
  LTAG G2    2920             1,173,756      117,356       49,206       2.38            27.7

Table 5.2: Two extracted grammars from PTB

5.1.2 Coverage of Treebank grammars

Figure 5.1 shows the log frequency of templates and the percentage of template tokens covered by template types in G1.5 In both cases, template types are sorted according to their frequencies and plotted on the X-axis. These two figures indicate that a small portion of template types, which can be seen as the core of the grammar, covers the majority of template tokens in the Treebank. For example, the first 100 (500, 1000 and 1500, resp.) templates cover 87.1% (96.6%, 98.4% and 99.0%, resp.) of the tokens, whereas about half (3411) of the templates each occur only once, accounting for only 0.29% of template tokens in total. Furthermore, this core of the grammar can be extracted from a fraction of the whole corpus. For instance, 1055 ttrees (about 2.1% of the corpus) can produce all of the most frequent 1500 template types.6 This implies that the size of the corpus needed for building the template set of a wide-coverage grammar can be greatly reduced if the right ttrees can be found by sampling.7

3 For POS tags, we use PTB's tagset, since all the POS taggers trained on PTB use that tagset.
4 A word may appear to have more Supertags (or POS tags) in the Treebank than it should due to annotation errors.
5 Similar results hold for G2.
6 To get the number 1055, we just count the ttrees that produce the first occurrence of those templates as the extraction process goes.

  word       # of Supertags   # of POS tags   word frequency
  in         171              6               18857
  to         122              3               27249
  and        122              5               19762
  of         117              3               28338
  on         108              3               6367
  for        107              3               9890
  put        101              6               343
  as         94               3               5268
  say        89               5               876
  is         89               2               8499
  set        82               6               331
  up         81               5               2079
  more       81               3               2339
  out        80               5               1254
  said       76               2               7132
  by         76               3               5524
  with       75               3               5170
  make       74               3               727
  made       73               2               650
  like       73               4               603
  take       72               3               510
  from       72               1               5389
  says       71               1               2431
  about      71               5               2604
  have       70               5               3777
  at         70               3               5336
  down       68               7               934
  expected   67               4               665
  over       66               4               1056
  do         64               2               1156
  pay        63               3               438
  paid       61               3               265
  sell       59               4               600
  give       58               2               274
  cut        58               6               323
  sold       56               2               478
  much       56               2               831
  get        56               2               563
  close      54               5               402

Table 5.3: The top 40 words with highest numbers of Supertags in G2

Figure 5.1: Template types and template tokens in G1. (a) Frequency of templates; (b) Coverage of templates.

              # of tags   (sw, st)   (uw, st)   (sw, ut)   (uw, ut)   total
  POS tags    48          0.44%      2.47%      0          0          2.91%
  LTAG G1     6926        5.09%      2.43%      0.31%      0.02%      7.85%
  LTAG G2     2920        4.20%      2.45%      0.10%      0.01%      6.76%

Table 5.4: The types of unknown (word, template) pairs in PTB section 23

For parsing purposes, a more important question is: how often do the unseen (word, template) pairs occur in new data? Table 5.4 shows that in G2 only 6.76% of the pairs in section 23 of the PTB are not seen in sections 2-21.8 We also list the percentage of unseen (word, POS tag) pairs in the same data for comparison. Of all the unseen (word, template) pairs in G2,9 only 1.62% (0.10% + 0.01%, divided by 6.76%) are caused by unseen templates, while the remaining 98.38% are caused by unseen words or unseen combinations. This implies that the presence of unseen templates is unlikely to have a significant impact on Supertagging or parsing. Notice that the percentage of (sw, st) is much higher than that of (uw, ut) plus (sw, ut), indicating that some type of smoothing over sets of templates (e.g., the notion of tree families in the XTAG grammar) is important for improving the parsing accuracy.

7 It goes without saying that for parsing purposes the lexical dependency information is crucial, and therefore, the larger the corpus is, the better.
8 We chose those sections because several state-of-the-art parsers (e.g., (Collins, 1997; Ratnaparkhi, 1998)) are trained and tested on those sections.
9 In the unseen pairs, the words can be unseen (uw) or seen (sw); similarly, the templates can be unseen (ut) or seen (st), so there are four combinations. (sw, st) means both the word and the template have been seen in sections 2-21, but not the pair.
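The four cells of Table 5.4 can be computed by classifying each test-set (word, template) token according to whether the word and the template occur in the training data; here is a sketch with toy data.

```python
# A sketch of the computation behind Table 5.4. The data-reading details
# are assumptions; pairs are simply (word, template-name) tuples.
from collections import Counter

def unseen_pair_profile(train_pairs, test_pairs):
    seen_words = {w for w, t in train_pairs}
    seen_templates = {t for w, t in train_pairs}
    seen_pairs = set(train_pairs)
    counts = Counter()
    for w, t in test_pairs:
        if (w, t) in seen_pairs:
            continue                       # a known pair; not counted
        key = ("sw" if w in seen_words else "uw",
               "st" if t in seen_templates else "ut")
        counts[key] += 1
    total = len(test_pairs)
    return {k: 100.0 * v / total for k, v in counts.items()}

train = [("eat", "t1"), ("eat", "t2"), ("apple", "t3")]
test  = [("eat", "t1"), ("eat", "t3"), ("pear", "t1"), ("eat", "t9"), ("pear", "t9")]
print(unseen_pair_profile(train, test))
# {('sw', 'st'): 20.0, ('uw', 'st'): 20.0, ('sw', 'ut'): 20.0, ('uw', 'ut'): 20.0}
```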

5.2 Treebank grammars being combined with other grammars

If a language already has a wide-coverage hand-crafted grammar such as the XTAG grammar, is a Treebank grammar still useful? Definitely. A Treebank grammar can help a hand-crafted grammar in two ways:

- to evaluate and improve the coverage of the hand-crafted grammar on a large corpus;

- to provide statistical information to the hand-crafted grammar.

In this section, we will concentrate on the former and leave the latter for future work. Previous evaluations (Doran et al., 1994; Srinivas et al., 1998) of hand-crafted grammars used raw data (i.e., a set of sentences without syntactic bracketing). The data are first parsed by an LTAG parser, and the coverage of the grammar is measured as the percentage of sentences in the data which get at least one parse. For more discussion on this approach, see (Prasad and Sarkar, 2000). We propose a new evaluation method that takes advantage of Treebanks and LexTract. Using LexTract, the coverage of a hand-crafted grammar can be measured by the overlap between the hand-crafted grammar and the Treebank grammar.

5.2.1 Methodology

The central idea of our evaluation method is as follows: given a Treebank T and a grammar Gh, let Gt be the set of templates extracted from T by LexTract; the coverage of Gh can be measured as the percentage of T which is covered by the intersection of Gt and Gh. One complication is that the Treebank and Gh may choose different analyses for certain syntactic constructions. That is, although some constructions are covered by both grammars, the corresponding templates in the two grammars would look very different. To account for this, our method has several stages:

1. Extract a Treebank grammar from T. Let Gt be the set of templates in the Treebank grammar.

2. Put into G't all the templates in Gt which match some templates in Gh.

3. Check each template in Gt − G't and decide whether the construction represented by the template is handled differently in Gh. If so, put the template in G''t. The coverage of Gh on T is measured as count(G't ∪ G''t)/count(Gt). The templates in Gt − G't − G''t are the ones that are truly missing from Gh.

4. Check the templates in Gt − G't − G''t and add the plausible ones to Gh.

In this section, we focus on general syntactic structures in the two grammars, not on the completeness of lexicons. Therefore, for grammar coverage we use templates instead of etrees. The method can easily be extended to compare etrees. Each step of the evaluation is described in detail as follows.
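The coverage formula in stage 3 is a simple set computation; a minimal sketch, assuming the three template sets are available as Python sets of template identifiers (the names here are hypothetical):

```python
# A sketch of count(G't ∪ G''t) / count(Gt) from stage 3.
def coverage(Gt, Gt_matched, Gt_alternative):
    """The share of Treebank templates that the hand-crafted grammar covers
    directly (G't) or under an alternative analysis (G''t)."""
    return len(Gt_matched | Gt_alternative) / len(Gt)

Gt  = {"t1", "t2", "t3", "t4", "t5"}
Gp  = {"t1", "t2"}          # G't: matched templates
Gpp = {"t3"}                # G''t: same construction, different analysis
print(coverage(Gt, Gp, Gpp))  # 0.6
```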

5.2.2 Stage 1: Extracting templates from Treebanks

We choose G2 as our Treebank grammar (see Table 5.2 in Section 5.1) and the XTAG grammar as the hand-crafted grammar. The former has 2920 templates, and the latter has 1004 templates.

5.2.3 Stage 2: Matching templates in the two grammars

To calculate the coverage of the XTAG grammar, we need to find out how many templates in G2 match some templates in the XTAG grammar. We define two types of matching: t-match and c-match.

t-match

An obvious distinction between the two grammars is that feature structures and subscripts10 are present only in the XTAG grammar, while frequency information is available only in G2. We call two trees t-match (t for tree) if they are identical barring the type of information present only in one grammar, such as feature structures and subscripts in XTAG and frequency information in G2. In Figure 5.2, the XTAG trees in 5.2(a) and 5.2(b) t-match the G2 tree in 5.2(c).

10 The subscripts on the nodes mark the same semantic arguments in related subcategorization frames. For example, an ergative verb like break can be either transitive (e.g., Mike broke the window.) or intransitive (e.g., The window broke.). The object of the transitive frame and the subject of the intransitive frame are both labeled as NP1, while the subject of the transitive frame is labeled NP0.

Figure 5.2: The trees for pure intransitive verbs and ergative verbs in XTAG t-match the tree for all intransitive verbs in G2. ((a) pure intransitive verbs in XTAG, anchored by sleep; (b) ergative verbs in XTAG, anchored by break; (c) intransitive verbs in G2, anchored by sleep/break.)

XTAG also differs from G2 in that XTAG uses multi-anchor trees to handle idioms (Figure 5.3(a)), light verbs (Figure 5.3(b)) and so on.11 In each of these cases, the multi-anchors come from the predicate. By having multi-anchors, each tree can be associated with semantic representations directly (as shown in Figure 5.3), which is an advantage of the LTAG formalism. These trees are the spine-etrees where some arguments of the main anchor are expanded. G2 does not have multi-anchor trees because semantics is not marked in the Treebank, and consequently LexTract cannot distinguish idiomatic meanings from literal meanings. Since expanded subtrees are present only in the XTAG grammar, we disregard them when comparing templates.

c-match

t-match requires two trees to have exactly the same structure barring expanded subtrees; therefore, it does not tolerate minor annotation differences between the two grammars. For instance, in XTAG, relative pronouns such as which and the complementizer that occupy distinct positions in the etree for relative clauses, whereas the Penn Treebank treats both as pronouns and therefore they occupy the same position in G2, as shown in Figure 5.5. Because the circled subtrees will occur in every tree for relative clauses and wh-movement, all these trees will not t-match their counterparts in the other grammar. Nevertheless, the two trees share the same subcategorization frame (NP V NP), the same subcategorization chain12 S → VP → V, and the same modification pair (NP, S). To capture this kind of similarity, we decompose a mod-etree into a tuple of (subcat frame, subcat chain, modification pair). Similarly, a spine-etree is decomposed into a (subcat frame, subcat chain) pair, and a conj-etree into (subcat frame, subcat chain, coordination sequence).13 Two etrees are said to c-match (c for component) if they are decomposed into the same tuples. According to this definition, the two templates in Figure 5.5 c-match.

11 For detail on multi-anchor trees, see Section 2.1.
12 A subcategorization chain is a subsequence of the spine in a spine-etree where each node on the chain is a parent of some argument(s) in the subcategorization frame. The nodes on a subcategorization chain roughly correspond to various lexical projections in GB-theory.
13 Similar decomposition is done in LexTract's template filter (see Section 4.6).


Figure 5.3: Templates in XTAG with expanded subtrees t-match the one in G2 when the expanded subtrees are disregarded. ((a) the idiom kick the bucket in XTAG, with sem: die(NP0); (b) the light-verb construction take a walk in XTAG, with sem: walk(NP0); (c) transitive verbs in G2, anchored by kick/take, with sem: kick(NP0, NP1).)

Figure 5.4: The compound trees in XTAG t-match the one in G2 when the expanded subtrees in the former are disregarded. ((a) the P+P compound because of in XTAG; (b) the Ad+P compound due to in XTAG; (c) all the prepositions in Ext-G, anchored by of/to.)


Figure 5.5: An example of c-match. ((a) the relative-clause template in XTAG, where the relative pronoun and the complementizer occupy distinct positions; (b) the corresponding template in G2, where both occupy the same position.)
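A minimal sketch of the c-match test, with templates given pre-decomposed; in LexTract the components are computed from the tree structure itself (cf. the decomposition used by the filter in Section 4.6). The tuple encoding below is an assumption for illustration.

```python
# c-match: decompose each template into its components and compare the tuples.
def decompose(etree_kind, subcat_frame, subcat_chain, extra=None):
    """Return the comparison tuple for a template."""
    parts = (tuple(subcat_frame), tuple(subcat_chain))
    if etree_kind in ("mod", "conj"):
        parts += (extra,)   # the modifier-modifiee pair or coordination sequence
    return parts

def c_match(t1, t2):
    return decompose(*t1) == decompose(*t2)

# The two relative-clause templates of Figure 5.5 share the subcat frame
# (NP, V, NP), the subcat chain S -> VP -> V, and the modification pair
# (NP, S), so they c-match even though they do not t-match:
xtag_tree = ("mod", ["NP", "V", "NP"], ["S", "VP", "V"], ("NP", "S"))
g2_tree   = ("mod", ["NP", "V", "NP"], ["S", "VP", "V"], ("NP", "S"))
print(c_match(xtag_tree, g2_tree))   # True
```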

Matching results

So far, we have defined two types of matching. Notice that neither type of matching is one-to-one. Table 5.5 lists the numbers of matched templates in the two grammars. The last row lists the frequencies of the matched G2 templates in the PTB. For instance, the second column says that 162 templates in XTAG t-match 54 templates in G2, and these 54 templates account for 54.6% of the template tokens in the PTB.

              t-match   c-match   matched subtotal   unmatched subtotal   total
  XTAG        162       314       476                528                  1004
  G2          54        133       187                2733                 2920
  frequency   54.6%     5.3%      59.9%              40.1%                100%

Table 5.5: Matched templates and their frequencies

One difference between the XTAG and the Treebank annotation is that an adjective modifies a noun directly in the former, whereas in the latter an adjective projects to an adjective phrase (AP) which in turn modifies an NP, as shown in Figure 5.6. Similarly, in XTAG an adverb modifies a VP directly, whereas in the Treebank an adverb sometimes projects to an ADVP first. If we disregard these annotation differences, the percentage of matched template tokens increases from 59.9% to 82.1%, as shown in Table 5.6. The magnitude of the increase is due to the high frequency of templates with nouns, adjectives and adverbs.

              t-match   c-match   matched subtotal   unmatched subtotal   total
  XTAG        173       324       497                507                  1004
  G2          81        134       215                2705                 2920
  frequency   78.6%     3.5%      82.1%              17.9%                100%

Table 5.6: Matched templates when certain annotation differences are disregarded

N A@

AP

N*

NP*

A@ (a) in XTAG

(b) in G2

Figure 5.6: Templates for adjectives modifying nouns

5.2.4 Stage 3: Classifying unmatched templates

The previous section shows that 17.9% of the template tokens do not match any template in the XTAG grammar. This is due to several reasons:

T1: incorrect templates in G2. These templates result from Treebank annotation errors, and therefore they are not in XTAG.

T2: coordination in XTAG. The templates for coordination in XTAG are generated on the fly while parsing (Sarkar and Joshi, 1996), and are not part of the 1004 templates. Therefore, the conj-etrees in G2, which account for 3.4% of the template tokens in the PTB, do not match any templates in the XTAG grammar.

T3: alternative analyses. XTAG and G2 sometimes choose different analyses for the same phenomenon. As a result, the templates used to handle those phenomena do not match each other by our definition.

T4: constructions not covered by XTAG. Some constructions, such as the unlike coordination phrase (UCP), parenthetical (PRN), fragment (FRAG) and ellipsis, are not currently covered by the XTAG grammar.

For the first three types, the XTAG grammar can handle the corresponding constructions, although the templates used in the two grammars look different and do not match according to our definition. To find out what constructions are not covered by XTAG, we manually classify 289 of the most frequent unmatched templates in G2 according to the reason why they are absent from XTAG. These 289 templates account for 93.9% of all the unmatched template tokens in the Treebank. The results are shown in Table 5.7, where the percentages are with respect to all the tokens in the Treebank. From the table, it is clear that most unmatched template tokens are due to alternative analyses (T3) adopted in the two grammars. Combining the results in Tables 5.6 and 5.7, we conclude that 97.2% of the template tokens in the Treebank are covered by XTAG, while another 1.7% are not. (The figure of 97.2% is the sum of the percentage of matched template tokens, 82.1% from Table 5.6, and the percentage of template tokens in T1-T3, 16.8% - 1.7% = 15.1% from Table 5.7.) Because the remaining 2416 unmatched templates in G2 have not been checked, it is not clear how many of the remaining 1.1% of the template tokens are covered by XTAG.

    type     T1      T2      T3       T4      total
    #        51      52      93       93       289
    freq    1.1%    3.4%    10.6%    1.7%     16.8%

Table 5.7: Classification of 289 unmatched templates

5.2.5 Stage 4: Combining two grammars

Simply taking the union of the two etree sets would only yield an inconsistent grammar. We propose two methods to improve the coverage of the XTAG grammar with what we found in the Treebank grammar. First, we could simply analyze the constructions in T4, build new trees for them, and add these trees to the XTAG grammar. Another possibility is to use LexOrg to generate a new grammar. This has several steps. First, LexTract decomposes the templates in both grammars into sets of substructures such as subcategorization frames, subcategorization chains, transformations and modification pairs. Second, we use LexTract's filter to automatically rule out all the implausible substructures (except for transformations) in XTAG and G2. Third, we manually check the remaining substructures; if the two grammars adopt different treatments for a certain construction, we choose one treatment and its corresponding substructures. Last, we use LexOrg to generate a new grammar from these substructures. The new grammar will be consistent and have good coverage of the Treebank. We plan to explore this possibility in the future.


5.2.6 Summary

We have presented a method for evaluating the coverage of an LTAG grammar on a Treebank. First, we use LexTract to automatically extract a Treebank grammar; second, the templates in the two grammars are matched; third, the unmatched templates in the Treebank grammar are classified to decide how many of them are due to missing constructions in the other grammar. We have used this method to evaluate the coverage of the XTAG grammar on the PTB. The results show that the XTAG grammar covers at least 97.2% of the template tokens in the PTB. This method has several advantages: first, the whole process is semi-automatic and requires little human effort; second, the coverage can be calculated at the sentence level, the template level and the sub-structure level; third, the method provides a list of templates that can be added to the grammar to improve its coverage; fourth, there is no need to parse the whole corpus, which could have been very time-consuming.

5.3 Treebank grammars as sources of CFGs

LexTract is designed to extract LTAGs, but simply reading CFG rules off the templates in an extracted LTAG will yield a CFG. For example, the etree in Figure 5.7(a) yields the two CFG rules in 5.7(b). Table 5.8 lists the sizes of the two CFGs built from the two Treebank LTAGs.

    (a) a template:           (b) CFG rules derived from (a):

            S                     (1) S  -> NP VP
           / \                    (2) VP -> VBD NP
         NP   VP
             /  \
          VBD    NP

Figure 5.7: CFG rules derived from an etree


               # of templates   # of CFG rules
    LTAG G1         6926             1524
    LTAG G2         2920              675

Table 5.8: CFGs derived from extracted LTAGs
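Reading the rules off a template is a one-pass tree walk. The sketch below is a minimal illustration in Python; the (label, children) tree encoding and the function name are our own assumptions for exposition, not LexTract's actual data structures:

    # A minimal sketch of reading CFG rules off an etree template.
    # The tree encoding is illustrative, not LexTract's internal one.

    def cfg_rules(node, rules=None):
        """Collect one CFG rule per internal node of a template.

        A node is a (label, children) pair; leaves -- substitution
        nodes, foot nodes, and the anchor -- have empty child lists
        and contribute no rules of their own.
        """
        if rules is None:
            rules = set()
        label, children = node
        if children:
            rules.add((label, tuple(child[0] for child in children)))
            for child in children:
                cfg_rules(child, rules)
        return rules

    # The template in Figure 5.7(a):
    template = ("S", [("NP", []),
                      ("VP", [("VBD", []), ("NP", [])])])

    for lhs, rhs in sorted(cfg_rules(template)):
        print(f"{lhs} -> {' '.join(rhs)}")   # S -> NP VP, VP -> VBD NP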

5.3.1 Comparison with other CFG extraction algorithms

Krotov et al. Simply reading the grammar off the parsed trees in a Treebank will create a grammar with a square-root rate of growth. Krotov proposes an algorithm to compact the derived grammar by eliminating redundant rules, i.e., the rules that can be 'parsed' (in the familiar sense of context-free parsing) using other rules of the grammar (Krotov and others, 1998). The algorithm checks each rule in the grammar in turn; if it can be parsed using other rules, the rule is removed from the grammar. The rules that remain when all rules have been checked constitute the compacted grammar. The compact grammar for the PTB has 1122 context-free rules, and the recall and precision of a CFG parser with the compact grammar are 30.93% and 19.18% respectively, in contrast to 70.78% and 77.66% for the same parser with the full grammar.

Krotov's method differs dramatically from LexTract in several ways. First, it does not use the notion of head and it does not distinguish adjuncts from arguments. Second, the process of compacting may result in different grammars depending on the order in which the rules in the full grammar are addressed. To maintain order-independence, they removed all unary and epsilon rules by collapsing them with the sister nodes; because of the frequent occurrences of empty categories and unary rules in the Treebank, we suspect this practice will make the resulting grammars less intuitive, and it might also have an effect on parsing accuracy. Third, the growth of their grammar is non-monotonic in that, as the corpus grows, the size of the grammar may go up or down, because new rules in the grammar may cause existing rules to become redundant and get eliminated. Although the size of the compact grammar might approach a limit eventually in their experiment, it is not clear to us how stable the grammar really is, considering the existence of annotation errors in the Treebank. For example, it is possible that a few bad rules (e.g. {X -> X ZP}, where ZP is any syntactic label) can ruin the whole grammar by making many good rules redundant and eliminated. They mention in their paper that they have developed a linguistic compaction algorithm that could retain redundant but linguistically valid rules. Unfortunately, the description is too sketchy for us to figure out how exactly that algorithm works.
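For concreteness, the compaction step Krotov describes can be sketched as follows. This is our own reconstruction of the idea (exponential in the worst case, and ignoring their unary/epsilon-rule handling), not their implementation:

    # A sketch of Krotov-style compaction: a rule A -> alpha is
    # redundant if alpha can be parsed back to A using the *other*
    # rules of the grammar.

    def derives(symbol, seq, rules, active=frozenset()):
        """Can `symbol` derive the symbol sequence `seq` using `rules`
        (a collection of (lhs, rhs) pairs, rhs a tuple)?"""
        if seq == (symbol,):
            return True                      # a symbol derives itself
        if (symbol, seq) in active:
            return False                     # block unary cycles
        active = active | {(symbol, seq)}
        return any(lhs == symbol and splits(rhs, seq, rules, active)
                   for lhs, rhs in rules)

    def splits(rhs, seq, rules, active):
        """Can `seq` be cut into non-empty chunks, one per rhs symbol,
        such that each rhs symbol derives its chunk?"""
        if not rhs:
            return not seq
        head, rest = rhs[0], rhs[1:]
        return any(derives(head, seq[:cut], rules, active)
                   and splits(rest, seq[cut:], rules, active)
                   for cut in range(1, len(seq) - len(rest) + 1))

    def compact(grammar):
        """Drop each rule that the remaining rules can already parse."""
        rules = set(grammar)
        for rule in sorted(grammar):         # note: order-dependent
            if derives(rule[0], rule[1], rules - {rule}):
                rules.discard(rule)
        return rules

    # The flat rule is redundant given the two smaller NP rules:
    g = {("NP", ("DT", "NN")),
         ("NP", ("NP", "PP")),
         ("NP", ("DT", "NN", "PP"))}
    print(compact(g))   # keeps NP -> DT NN and NP -> NP PP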

5.4 Lexicons as training data for Supertaggers

A Supertagger (Joshi and Srinivas, 1994; Srinivas, 1997) assigns an etree template to each word in a sentence. The templates are called Supertags because they include more information than part-of-speech tags. Srinivas implemented the first Supertagger; he also built a Lightweight Dependency Analyzer (LDA) that assembles the Supertags of the words to create an almost-parse for the sentence.

One difficulty in using Supertaggers is the lack of training and testing data. To use a Treebank for that purpose, the phrase structures in the Treebank have to be converted into (word, Supertag) sequences first. Besides LexTract, there have been two other attempts at converting the English Penn Treebank to (word, Supertag) sequences in order to train a Supertagger: Srinivas (1997) uses heuristics to map structural information in the Treebank into a set of pre-existing Supertags, and Chen & Vijay's method (2000) is similar to ours. For a comparison of these three methods, see Section 4.10.

To compare these three methods with respect to Supertagging, we use the data converted by each method to train a Supertagger. In the experiment, the Supertagger (Srinivas, 1997), the evaluation tool and the original PTB data are identical. As usual, we use Sections 2-21 for training and Section 22 or 23 for testing; we include the results for Section 22 because (Chen and Vijay-Shanker, 2000) is tested on that section and its results on Section 23 are not available. The results are given in Table 5.9. The results of Chen & Vijay's method come from their paper (Chen and Vijay-Shanker, 2000). They built eight grammars; we list the two that seem most relevant: C4 uses a reduced tagset, while C3 uses the PTB tagset. As for Srinivas' results, we re-ran his Supertagger using his data on the sections that we have chosen, because his previous results were trained and tested on different sections.

Noticeably, the results we report on Srinivas' data, 85.78% on Section 23 and 85.53% on Section 22, are lower than the 92.2% reported in (Srinivas, 1997), 91.37% in (Chen et al., 1999) and 88.0% in (Doran, 2000). There are several reasons for this. First, the size of the training data in our experiment is smaller than in his previous work, which was trained on Sections 0-24 except for Section 20 and tested on Section 20. Second, we treat punctuation marks as normal words during evaluation because, like other words, punctuation marks can anchor etrees, while he treats the Supertags for punctuation marks as always correct. Third, he used some equivalence classes during evaluation: if a word is mis-tagged as x while the correct Supertag is y, he does not consider that to be an error if x and y appear in the same equivalence class. We suspect that these Supertagging errors are disregarded because they might not affect parsing results when the Supertags are combined. For example, both adjectives and nouns can modify other nouns, and the two templates (i.e., Supertags) representing these modification relations look the same except for the POS tags of the anchors; if a word that should be tagged with one of these Supertags is mis-tagged with the other, it is likely that the wrong Supertag can still fit with the other Supertags in the sentence and produce the right parse. We did not use these equivalence classes in this experiment.

We have calculated two baselines for each set of data. The first baseline tags each word in the testing data with the most common Supertag for that word in the training data; for an unknown word, the overall most common Supertag is used. For the second baseline, we use a trigram POS tagger to tag the words first, and then tag each word with the most common Supertag for that (word, POS tag) pair.
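The first baseline is simple enough to sketch in a few lines; this is our illustration, not the Supertagger used in the experiments:

    # A minimal sketch of the first Supertagging baseline: tag each
    # word with its most frequent Supertag in the training data,
    # falling back to the overall most frequent Supertag for unknown
    # words.

    from collections import Counter, defaultdict

    def train_baseline(tagged_corpus):
        """tagged_corpus: iterable of (word, supertag) pairs."""
        by_word = defaultdict(Counter)
        overall = Counter()
        for word, tag in tagged_corpus:
            by_word[word][tag] += 1
            overall[tag] += 1
        fallback = overall.most_common(1)[0][0]
        table = {w: c.most_common(1)[0][0] for w, c in by_word.items()}
        return lambda word: table.get(word, fallback)

    # Usage: tagger = train_baseline(train_pairs); tagger("book")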

    data              # of templates   section   base1   base2   accuracy
    Srinivas'              483            23      72.59   74.24    85.78
                                          22      72.14   73.74    85.53
    our G2                2920            23      71.45   74.14    84.41
                                          22      70.54   73.41    83.60
    our G1                6926            23      69.70   71.82    82.21
                                          22      68.79   70.90    81.88
    Chen & Vijay's     2366-8996          22        -       -    77.8-78.9
      C4                  4911            22        -       -      78.90
      C3                  8623            22        -       -      78.00

Table 5.9: Supertagging results based on three different conversion algorithms

As shown in the table, the first baselines for Supertagging are quite low, in contrast to the 91% for POS tagging; this indicates that Supertagging is much harder than POS tagging. The results for the second baseline are slightly better, showing that using POS tags improves the Supertagging baselines. The Supertagging accuracy using G2 is 1.3-1.9% lower than the one using Srinivas' data. This is not surprising, since G2 is six times the size of Srinivas' tagset. Notice that G1 is twice the size of G2 and the accuracy using G1 is 2% lower. A word of caution is in order: higher Supertagging accuracy does not necessarily mean that the quality of the converted data is better, since the underlying grammars differ a lot with respect to size and coverage. A better measure would be parsing accuracy, i.e., the converted data should be fed to a common LTAG parser and the evaluation should be based on parsing results. Nevertheless, the experiments show that the (word, template) sequences produced by LexTract are useful for training Supertaggers. Our results are slightly lower than the ones trained on Srinivas' data, but our conversion algorithm has some unique properties. First, our algorithm does not use a pre-existing Supertag set; instead, it generates the Supertag set while converting the PTB data. Second, the Supertags in our converted data fit together; that is, it is guaranteed that the correct parse will be produced by an LTAG parser if the parser is given the correct Supertag for each word and is asked to produce all possible parses.

5.5 MC sets for testing the Tree-locality Hypothesis

In Section 4.8, we discussed our strategy for testing the Tree-locality Hypothesis. It has three stages. First, we use LexTract to find all the examples that seem to violate the hypothesis; we call these examples "non-tree-local", or "non-local" for short. Second, we classify the examples according to the underlying constructions (such as extraposition). Third, we determine whether each class could become tree-local if an alternative analysis for the construction were adopted. In this section, we report our experimental results on the PTB.

5.5.1 Stage 1: Finding "non-local" examples

In Section 4.8, we presented an algorithm to find all the examples that seem to violate the Tree-locality Hypothesis. We ran this algorithm on the PTB and found a number of non-local examples. Table 5.10 lists the numbers of tree sets of particular sizes. A tree set, in this context, includes the two etree templates that contain the co-indexed constituents and the etree templates that connect them in the derivation trees. Out of 3151 tree sets, 999 (31.7%) have more than three etrees (i.e., the etrees for the gap and the filler do not adjoin to the same etree), and they account for 8.7% of all the occurrences of tree sets.

    size of tree set          <= 3 (tree-local)     4      5     6    7    8    subtotal (> 3)    total
    # of tree sets (type)       2152 (68.3%)       874    94    26    4    1     999 (31.7%)      3151
    # of tree sets (token)     19994 (91.3%)      1772   102    26    4    1    1905 (8.7%)      21899

Table 5.10: Numbers of tree sets and their frequencies in the PTB

5.5.2 Stage 2: Classifying "non-local" examples

There are three possible explanations for those "non-local" examples:

- there are errors in the Treebank annotation or imperfections in LexTract, which cause the examples to be incorrectly tagged as "non-local";

- the relations between the co-indexed nodes are non-local, but they are not syntactic movement and are therefore irrelevant to the Tree-locality Hypothesis;

- they are syntactic movement and are truly non-tree-local according to the current definitions of the substitution and adjoining operations. There has been recent work on extending these definitions to account for a number of syntactic movements (Joshi and Vijay-Shanker, 1999; Kulick, 2000); it is possible that the movements found by LexTract can be handled by tree-local MCTAG under these new definitions.

We manually classified the "non-local" examples found by LexTract. The results are given in Table 5.11. Among the 999 sets, 71 are caused by obvious annotation errors, and 65 are tree-local, but our extraction algorithm does not recognize that. (The reason for the latter is that LexTract does not factor out predicative auxiliary trees, such as the one for "think", in a relative clause, as discussed in Section 4.9.4.) We divide the remaining sets into seven classes according to the underlying syntactic constructions (e.g., extraposition).

    PTB errors                      71
    LexTract errors                 65
    NP-extraposition               337
    extraction from coordination   209
    it-extraposition               176
    comparative construction        50
    of-PP                           31
    parenthetical                   30
    so ... that construction        11
    others                          19

Table 5.11: Classification of the 999 extended MC sets that look non-tree-local

5.5.3 Stage 3: Studying "non-local" constructions

An example of each construction is given in (1)-(7). Except for (2), every sentence includes a phrase in bold; according to the Treebank annotation, these phrases are modified by the "moved" constituents. We have studied these constructions and found plausible analyses for them within the domain of tree-local MCTAG. For example, we have shown in (Xia and Bleam, 2000) that the extraposition construction is not syntactic movement; therefore, the non-locality between the "filler" and the "gap" is irrelevant to the Tree-locality Hypothesis. In our final thesis, we will include our analyses for these constructions.

(1) NP-extraposition: Supply troubles were on the minds of Treasury investors t_i yesterday, [who worried about the flood]_i.

(2) Extraction from coordination: That is [a skill]_i Sony badly needs t_i and Warner is loath to lose t_i.

(3) It-extraposition: It t_i would be my inclination [to advise clients not to sell]_i.

(4) Comparative construction: Federal Express goes further t_i in this respect [than any company]_i.

(5) Of-PP: [Of all the ethnic tensions in America]_i, which t_i is the most troublesome right now?

(6) Parenthetical: [JMB officials are expected to be hired to represent the pension fund on the Santa Fe Pacific Realty board, Mr. Roulac said t_i, to insulate the fund from potential liability problems.]_i

(7) So ... that construction: The Diet doesn't normally even debate bills because the opposition parties are so often t_i opposed to whatever the LDP does [that it would be a waste of time]_i.

5.6 Proposed Work

In this chapter, we have introduced several types of applications for LexTract. First, the Treebank grammar produced by LexTract can be used as a stand-alone grammar; second, the Treebank grammar can be used to estimate the coverage of a hand-crafted grammar and to find the constructions that are not covered by that grammar; third, the Treebank grammar can be the source of a Treebank context-free grammar; fourth, the elementary tree sequences produced by LexTract for every sentence in a Treebank can be used to train Supertaggers; last, we used LexTract to find all multi-component tree sets in the English Penn Treebank. After examining all the examples that seem to violate the Tree-locality Hypothesis, we come to the conclusion that all those examples can be handled by other plausible analyses within tree-local MCTAG. In this section, we discuss three remaining issues.

5.6.1 Training a statistical LTAG parser

Supertagging can speed up parsing because an LTAG parser then only needs to consider one or a few etrees (in the case of n-best Supertagging) for each word in the sentence, instead of every etree the word can anchor. However, Supertagging accuracy does not link directly to parsing accuracy: even if the former were 100%, there may still be multiple ways to combine the etrees, resulting in multiple parses. A better way to use LexTract for parsing is to use the derivation trees produced by LexTract to train an LTAG parser directly. Derivation trees provide the frequencies with which particular etrees substitute or adjoin into other etrees.

Anoop Sarkar has built a head-corner LTAG parser (Sarkar, 2000). We are currently collaborating to build a language model for that parser and to test it with the derivation trees produced by LexTract. During this process, we will explore various treatments for coordination and punctuation, among others, to find out what kinds of Treebank grammars are most suitable for parsing. In addition, the same data provided by LexTract has been used by an LR LTAG parser (Prolo, 2000).
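The counts such a model starts from can be read directly off the derivation trees. The sketch below is our illustration; the derivation-tree encoding and the event decomposition are assumptions for exposition, not Sarkar's actual model:

    # A sketch of collecting attachment counts from derivation trees.
    # Each derivation node is assumed to be (etree_template, operation,
    # children), where operation is 'subst' or 'adjoin' relative to the
    # parent; this encoding is illustrative only.

    from collections import Counter

    def attachment_counts(derivation_trees):
        counts = Counter()
        def walk(node):
            template, _, children = node
            for child in children:
                child_template, op, _ = child
                counts[(template, op, child_template)] += 1
                walk(child)
        for tree in derivation_trees:
            walk(tree)
        return counts

    # Relative frequencies such as
    #   P(child, op | parent) = counts[(parent, op, child)] / total(parent)
    # could then serve as a starting point for the parser's model.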

5.6.2 Error detection

LexTract maintains the mappings between ttree nodes and etree nodes, so operations on etrees can be projected back to the corresponding ttrees. This makes LexTract a useful tool for Treebank annotation. For instance, LexTract can detect annotation errors in the Treebank if the errors result in implausible etrees. Recall that LexTract has a filter that marks each etree as either plausible or implausible; if an etree is marked as implausible, this implies that the corresponding nodes in the ttree are annotated improperly. We are using LexTract for the final cleanup of the Chinese Penn Treebank, and we will report the results in the final thesis.
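Schematically, the error-detection loop looks like the sketch below, where extract_etrees and is_plausible are placeholders standing in for LexTract's extraction step and plausibility filter (described in Chapter 4), not real function names from the system:

    # A schematic sketch of etree-based error detection: only the
    # projection of implausible etrees back onto ttree nodes is shown.

    def flag_annotation_errors(ttrees, extract_etrees, is_plausible):
        """Yield (ttree, ttree_nodes) pairs whose extracted etree
        fails the plausibility filter, so annotators can inspect
        those spans by hand."""
        for ttree in ttrees:
            # extract_etrees is assumed to return (etree, ttree_nodes)
            # pairs, where ttree_nodes are the treebank nodes the
            # etree was cut from
            for etree, ttree_nodes in extract_etrees(ttree):
                if not is_plausible(etree):
                    yield ttree, ttree_nodes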

5.6.3 Language comparison

The kernel of LexTract is language- and corpus-independent in the sense that all the language- or corpus-specific information is stored in several tables which LexTract takes as input. Consequently, to apply LexTract to Treebanks for different languages, the only work involved is building those tables, which should take little human effort. Once we have applied LexTract to several Treebanks for different languages, it will be interesting to compare the resulting Treebank grammars and see how similar or different they are. LexTract thus provides quantitative support for an investigation into the universal versus stipulatory aspects of different languages.


Chapter 6

Chinese Penn Treebank

Treebanks are a prerequisite for running LexTract and any supervised NLP tools such as POS taggers and parsers. However, building a large high-quality Treebank is difficult. Since late 1998, we have been developing a 100-thousand-word Treebank for Chinese. During the process, we have encountered many challenges. In this chapter, we address two major challenges and our responses to them:

- guideline preparation: preparing good guidelines for word segmentation, part-of-speech tagging and bracketing;

- quality control: ensuring inter-annotator consistency and adherence to the guidelines.

This chapter is organized as follows: Section 6.1 is an overview of the project; Section 6.2 introduces the two phases of annotation; Section 6.3 describes our methodology for guideline preparation; Sections 6.4-6.6 highlight the challenges we encountered while making the three sets of guidelines; Section 6.7 discusses our strategy for quality control.

6.1 Project Inception

With growing interest in Chinese Language Processing, numerous NLP tools (e.g. word segmenters, part-of-speech taggers, and parsers) for Chinese have been developed all over the world. However, since there is no large-scale Chinese Treebank available to the public, it is difficult to compare results and gauge progress in the field. As a first step towards addressing this issue, we have been preparing the Chinese Penn Treebank since late 1998. Our Treebank consists of 329 articles from the Xinhua newswire published in 1994-1998; the majority of these documents focus on economic developments, while the rest describe general political and cultural topics. It contains 173,981 hanzi (or 100,996 words after word segmentation). The corpus has 3,289 sentences, averaging 47 hanzi (or 27 words after segmentation) per sentence. (A sentence is anything that ends with a period, an exclamation mark or a question mark; therefore, it does not include the headline at the beginning of each article.) A preliminary version of the Treebank was released to the public in June 2000.

Our first step in assessing community interest in a Chinese Treebank was a three-day workshop on issues in Chinese language processing, which was held at the University of Pennsylvania in June 1998. The aim of this workshop was to bring together influential researchers from mainland China, Hong Kong, Taiwan, Singapore, and the United States in a move towards consensus building with respect to word segmentation, part-of-speech (POS) tagging, syntactic bracketing and other areas. The American groups included the Institute for Research in Cognitive Science and the Linguistic Data Consortium (which distributes the English Treebank) at the University of Pennsylvania, the University of Maryland, Queens College, the University of Kansas, the University of Delaware, Johns Hopkins University, Systran, BBN, AT&T, Xerox, West, Unisys and the US Department of Defense; we also invited scholars from Taiwan, Hong Kong and Singapore. The workshop included presentations of the guidelines used in mainland China and Taiwan, as well as segmenters, part-of-speech taggers and parsers. There were also several working groups that discussed specific issues in segmentation, POS tagging and the syntactic annotation of newswire text. There was general consensus at this workshop that a large-scale effort to create a Chinese Treebank would be well received, and that linguistic expertise was a necessary prerequisite to the successful completion of such a project. The workshop made considerable progress in defining criteria for segmentation guidelines, as well as addressing the issues of part-of-speech tagging and syntactic bracketing. The Penn Chinese Treebank project began shortly after the workshop was held. Our Penn Chinese Treebank website, http://www.ldc.upenn.edu/ctb, includes segmentation, POS tagging and bracketing guidelines, as well as sample files, information on our first workshop and much more.


6.2 Annotation Process

The annotation is done in two phases: the first phase is word segmentation and part-of-speech (POS) tagging, and the second phase is syntactic bracketing. In each phase, all the data are annotated at least twice, with a second annotator correcting the output of the first annotator. During the process, we have held several meetings to get feedback from the Chinese NLP community and have revised our guidelines accordingly. Figures 6.1 and 6.2 summarize the milestones of the project.

[Figure 6.1: The first phase: segmentation and POS tagging. A timeline (6/98 to 3/99) of guideline drafts and annotation passes ("draft": draft of the word segmentation and POS tagging guidelines; "pass": segmenting and POS tagging the corpus), along with the meetings held during this period: the three-day CLP workshop in Philadelphia, a meeting during ACL-98 in Montreal, and a meeting during ICCIP-98 in Beijing.]

[Figure 6.2: The second phase: bracketing and data release. A timeline (5/99 to 10/00) of bracketing guideline drafts and annotation passes ("draft": draft of the bracketing guidelines; "pass": bracketing the corpus), a meeting during ACL-99 in Maryland, the second CLP workshop in Hong Kong, and the preliminary and second releases of the Treebank.]

Our team includes two linguistics professors, three computational linguists, two annotators and several external consultants. My responsibilities include writing the segmentation and POS tagging guidelines, co-authoring the bracketing guidelines, managing day-to-day operations, organizing a number of workshops and meetings in the USA and abroad, and so on.

6.2.2 Syntactic bracketing The rst pass of bracketing started once we had nished a preliminary draft of the bracketing guidelines and it took two annotators four months. The primary goal of this pass was to identify the problems in the draft guidelines such as certain constructions that were not covered or analyses that could not be extended to account for new data. Having a corpus bracketed, albeit coarsely, also made it possible for us to use a corpus search tool that operates on bracketed sentences. With this tool we were able to pull out all the sentences in the corpus relevant to a particular construction and come up with better generalizations which are more robust in accommodating new data. After the rst pass, we did an extensive revision of the bracketing guidelines based on 99

the problems we had collected and solved during the rst pass. After training the annotators on the revised guidelines the second pass began. The emphasis of the second pass was quality control, i.e. to ensure inter-annotator consistency and annotators' adherence to the guidelines as discussed in Section 6.7. After the second pass, we plan to run LexTract to identify certain types of annotation errors, which will facilitate our nal cleanup of the corpus and nal revisions of the three sets of guidelines.

6.3 Methodology for Guideline Preparation To build a Treebank for Chinese we need three sets of guidelines | segmentation, partspeech tagging and bracketing guidelines. Making these guidelines is especially challenging because:

 Unlike Western writing systems, Chinese writing does not have a natural delimiter between words, and the notion of word is very hard to de ne.

 Chinese has very little, if any, in ectional morphology. Words are not in ected

with number, gender, case, or tense. For example, a word such as hu-mie in Chinese corresponds to destroy/destroys/destroyed/destruction in English. This fuels the discussion on whether the POS tags should be based on meaning or on syntactic distribution in Chinese NLP communities. If only the meaning is used, hu-mie should be a verb all the time. If syntactic distribution is used, the word is a verb or a noun depending on the context.

 There are many open questions in Chinese syntax. To further complicate the sit-

uation, Chinese, like any other language, is under constant change. With its long history, a seemingly homogeneous phenomenon in Chinese (such as long and short bei-construction) may be, in fact, a set of historically related but syntactically independent constructions (Feng, 1998).

 Chinese is widely spoken in areas as diverse as mainland China, Hong Kong, Taiwan,

and Singapore. There is a growing body of research in Chinese natural language 100

processing, but little consensus on linguistic standards along the lines of the EAGLES initiative in Europe.5 To tackle these issues, we adopted the following approach:

 In addition to studying the literature on Chinese morphology and syntax, we collaborate closely with our linguistics experts to work out plausible analyses for syntactic constructions.

 When there are no clear winners among several alternatives, we choose one, and annotate the corpus in a way that our annotation can be easily converted to accommodate other alternatives when needed.

 We study other groups' guidelines, such as the Segmentation Standard in China (Liu et al., 1993) and the one in Taiwan (Chinese Knowledge Information Processing Group, 1996), and accommodate them in our guidelines if possible.

 We organize regular workshops and meetings and invite experts from the United States and abroad to discuss open questions, share resources and seek consensus. We also visited China and Taiwan to present our work and ask for feedback.

 Annotators are encouraged to ask questions during the annotation process and in the second pass of bracketing randomly selected les are re-annotated by both annotators to evaluate their consistency and accuracy. Annotation errors and inter-annotation inconsistencies can reveal places in the guidelines that need revision.

In an ideal situation, guidelines would be nalized before annotation begins. However, real data from a corpus are far more complicated and subtle than examples discussed in the linguistics literature and many problems do not surface until sucient data have been annotated. In this project, we divided each phase of the annotation and guideline development into several steps: EAGLES stands for the Expert Advisory Group on Language Engineering Standards. For more information, please check out its website at http://www.ilc.pi.cnr.it/EAGLES/home.html. 5

101

1. Before the rst pass, we created the rst version of guidelines based on corpus analysis, review of the literature, and consultation with experts in treebanking and Chinese linguistics. 2. During the rst pass, these guidelines evolve gradually through the resolution of annotation diculties and annotator inconsistencies. 3. After the rst pass, the guidelines are partially nalized and when possible the corpus is automatically converted to be consistent with the new guidelines before the second pass begins; 4. In the second pass our quality control method (Section 6.7) is applied to reveal places in the guidelines that need revision. Fortunately, guideline revision at this stage has been very few. 5. After the second pass, the guidelines are nalized and the annotation is revised if necessary. In this process, through careful design of the rst version of the guidelines, no substantial changes have been made in the following versions and most revision of the annotation is done automatically by simple conversion tools. In the following three sections we discuss highlights from each set of guidelines.

6.4 Segmentation Guidelines The central issue for word segmentation is how the notion of word should be de ned.

6.4.1 Notions of word There are several di erent notions of words. (Sciullo and Williams, 1987) discusses four of them, namely, morphological object, syntactic atom, phonological word and listeme. According to (Sciullo and Williams, 1987), the syntactic atoms are the primes of syntax. They possess the syntactic atomicity properties, that is, the inability of syntactic rules to

102

analyze the contents of X 0 categories.6 Since our task is to build syntactic structures for sentences, we adopt the notion syntactic atom for our word segmentation task. Once the notion of word is de ned, the notions of ax, compound and phrase can be de ned accordingly. However, the distinction between a word and a non-word is not always clear-cut. For example, in English, pro- (which means supporting) normally can not stand alone, therefore, it is like a pre x. However, it can appear in a coordinated structure, such as pro- and anti-abortion. Based on the assumption that only words and phrases can be coordinated, it is like a word.7 Therefore, the status of pro- is somewhere between pre x and word. Making word and non-word distinction is even more dicult for Chinese for a number of reasons:

 Chinese is not written with word delimiters so segmenting a sentence into "words" is not a natural task even for a native speaker.

 Chinese has little morphological marking to ease word identi cation.  The structures within words and phrasal structures are very similar, making the distinction between words and phrases more elusive.

 There is little consensus in the community on dicult constructions which could

a ect word segmentation. The handling of resultative verb compounds, for instance, depends on the analysis of the construction, for which there is still no consensus in the linguistics community. One view on how a verb-resultative compound is formed says that a simple sentence with the compound is actually bi-clausal and the compound is formed by movement, therefore, the compound should be treated as two words. Another view believes the compound is formed in the lexicon, and therefore should be one word.

 Many monosyllabic morphemes which used to be able to stand alone become bound in Modern Chinese. The in uence of non-Modern Chinese makes it dicult to draw

Whether morphology and syntax are truly independent is still an open question (Sciullo and Williams, 1987). We will not get into detail in this thesis. 7 For example, both a- and im- are pre xes and \amoral and immoral" can not be shorted into \a- and immoral". 6

103

the line between bound morphemes and free morphemes, the notions which could otherwise have been very useful for deciding word boundaries. Not surprisingly, the de nition of word in Chinese has long been notoriously controversial in Chinese linguistic community, and word standards adopted by various research groups including the natural standard in Mainland China (Liu et al., 1993) and the word standard used by Academia Sinica in Taiwan (Chinese Knowledge Information Processing Group, 1996) di er substantially.

6.4.2 An experiment To test how well native speakers agree on word segmentation of written texts, we randomly chose 100 sentences (5060 hanzi) from the Xinhua newswire and asked the participants of the rst CLP workshop to segment them according to their personal preferences.8 We got replies from seven groups, almost all of whom hand corrected their output before sending it. Table 6.1 shows the results of comparing the output between each group pair. Here, we use three measures that are widely used to measure parsing accuracy: precision, recall, and the number of crossing brackets (Black et al., 1991).9 The experiment is similar to the one discussed in (Sproat et al., 1996) in which six native speakers were asked to mark all the places they might pause if they were reading the text We did not give them any segmentation guidelines. Some participants applied their own guideline standards for which they had automatic segmenters, while others simply used their intuitions. 9 Given a candidate le and a Gold Standard le, the three metrics are de ned as: the precision is the number of correct constituents in the candidate le divided by the number of constituents in the candidate le, the recall is the number of correct constituents in the candidate le divided by the number of constituents in the Gold Standard le, and the number of crossing brackets is the number of constituents in the candidate le that cross a constituent in a Gold Standard le. If we treat each word as a constituent, a segmented sentence is similar to a bracketed sentence and its depth is one. To compare two outputs, we chose one as the Gold Standard, and evaluated the other output against it. As noted in (Sproat et al., 1996), for two outputs J1 and J2 , taking J1 as the Gold Standard and computing the precision and recall for J2 yields the same results as taking J2 as the Gold Standard and computing the recall and the precision respectively for J1 . However, the number of crossing brackets when J1 is the standard is not the same as when J2 is the standard. For example, if the string is ABCD and J1 segments it into AB CD and J2 marks it as A BC D, then the number of crossing brackets is 1 if J1 is the standard and the number is 2 if J2 is the standard. 8

104

1 2 3 4 5 6 7

1 88/90/3 90/90/3 88/83/9 91/92/3 91/91/3 84/92/1

2 90/88/6 90/87/5 88/80/10 90/89/4 89/86/6 83/89/2

3 90/90/4 87/90/3 88/82/7 88/89/4 89/89/4 82/89/2

4 83/88/3 80/88/14 82/88/2 86/92/9 81/86/3 74/87/4

5 92/91/3 89/90/4 89/88/5 92/86/7 90/90/4 85/92/1

6 91/91/3 86/89/3 89/89/4 86/81/9 90/90/4 83/91/1

7 92/84/9 89/83/7 89/82/10 87/74/16 92/85/8 91/83/10 -

average 90/89/5 87/88/6 88/87/5 88/81/10 90/90/5 89/88/5 82/90/2

Table 6.1: Comparison of word segmentation results from seven groups aloud. In both experiments, the native speakers (or judges) were not given any speci c segmentation guidelines. Following (Sproat et al., 1996), we calculate the arithmetic mean of the precision and the recall as one measure of agreement between each output pair, and the average agreement is 87.6%, much higher than 76% in (Sproat et al., 1996). Without comparing the data in these two experiments, we do not know for sure why the numbers di er so much. One factor that might have contributed to the di erence is that the instructions given to the judges were not exactly the same: in our experiment, the judges were asked to segment the sentences into words according to their own de nitions of a word, while in their experiment, the judges were asked to mark all places they might possibly pause if they were reading the text aloud. There are places in Chinese, e.g. between a verb and an aspect marker that follows the verb, where normally native speakers do not pause but they still treat the verb and the aspect marker as two words. Another factor that might explain why the degree of the agreement in our experiment was much higher is that in our experiment all the judges were well-trained computational linguists. Some judges had their own segmentation guidelines and/or segmenters. They either followed their guidelines or used their segmenters to automatically segment the data and then hand corrected the output. In either way, their outputs should be more consistent. The fact that the average agreement in our experiment is 87.6% and the highest agreement among all the pairs is 91.5% con rms the belief that it is common for native speakers to disagree on where word boundaries should be. On the other hand, the average number of crossing brackets is only 5.4 and the lowest is 1. Furthermore, most of these crossing brackets later turned out to be caused by careless human errors. This implies that much of the disagreement is not critical and if native speakers are given good segmentation 105

guidelines, the agreement between them will improve greatly.

6.4.3 Tests of wordness So what is a word? The following tests for establishing word boundaries have been proposed in the literature: (Without loss of generalization, we assume the string that we are trying to segment is X-Y, where X and Y are two morphemes)

 bound morpheme: a bound morpheme should be attached to its neighboring morpheme to form a word when possible.

 productivity: if a rule that combines the expression X-Y does not apply generally, i.e., it is not productive, then X-Y is likely to be a word.

 frequency of co-occurrence: if the expression X-Y occurs very often, it is likely to be a word.

 complex internal structure: strings with complex internal structures should be segmented when possible.

 compositionality: if the meaning of X-Y is not compositional, it is likely to be a word.

 insertion: if another morpheme can be inserted between X and Y, then X-Y is unlikely to be a word.

 XP-substitution: if a morpheme can not be replaced by a phrase of the same type, then it is likely to be part of a word.

 the number of syllables: several guidelines (Liu et al., 1993; Chinese Knowledge

Information Processing Group, 1996) have used syllable numbers on certain cases. For example, in (Liu et al., 1993), a verb-resultative compound is treated as one word if the resultative part is monosyllabic, and it is treated as two words if the resultative part has more than one syllable.

All of these tests are very useful. However, none of them is sucient by itself for covering the entire range of dicult cases. Either the test is applicable only to limited 106

cases (e.g. the XP-substitution test) or there is no objective way to perform the test as the test refers to vaguely de ned properties (e.g. in the productive test, it is not clear where to draw the line between a productive rule and a non-productive rule). For more discussion on this topic from the linguistics point of view, please refer to (Packard, 1998; Sciullo and Williams, 1987). Since no single test is sucient, we chose a set of tests for our segmentation guidelines which includes all of the ones mentioned except for the productivity test and the frequency test. Rather than have the annotators try to memorize the entire set and make each decision from these principles, in the guidelines we spell out what the results of applying the tests would be for all of the relevant phenomena. For example, for the treatment of verbresultative compounds, we select the relevant tests, in this case the number of syllables, the insertion test, and the XP-substitution test, and give several examples of the results of applying these tests to verb-resultative compounds. This makes it straightforward, and thus ecient, for the annotators to follow the guidelines. The guidelines are organized according to the internal structure of the corresponding expressions (e.g. a verb-resultative compound is represented as V+V, while a verb-object expression is as V+N), so it is easy for the annotators to search the guidelines for needed references. The segmentation guidelines, including the comparisons between our guidelines and the ones used in China and Taiwan, can be found on our website.

6.5 POS Tagging Guidelines The central issue on POS tagging is how POS tags should be de ned and distinguished from one another.

6.5.1 Criteria for POS tagging People generally agree that a POS tagset should include the tags for nouns, verbs, prepositions, adverbs, conjunctions, determiners, classi ers, and so on, but they di er in how those tags should be de ned. Since Chinese has little, if any, in ectional morphology, POS tags are normally de ned based on either semantics or syntax. The rst view believes the 107

tags should be de ned based on meaning, whereas the second one de nes POS tags based on syntactic distribution. This issue has been debated since the 1950s (Gong, 1997) and the controversy remains. For example, a word such as hu-mie in Chinese can be translated into destroy/destroys/destroyed/destroying/destruction in English, and it is used roughly the same way as its counterparts in English. According to the rst view, POS tags should be based solely on meaning. Since the meaning of the word remains roughly the same across all of these usages, it should always be tagged as a verb. The second view says POS tags should be determined by the syntactic distribution of the word. When hu-mie is the head of a noun phrase, it should be tagged as a noun in that context. Similarly, when it is the head of a verb phrase, it should be tagged as a verb. We have chosen syntactic distribution as our main criterion for POS tagging, because this approach complies with the principles adopted in contemporary linguistics theories (such as the notion of head projections in X-bar theory and GB theory) and it captures the similarity between Chinese and other languages. One argument that is often used against this approach is as follows: since many verbs in Chinese can also occur in noun positions, using this approach will require each word has two tags, thus increasing the size of the lexicon. We do not believe this argument poses an problem for this approach. Firstly, the extra POS tag allows us to distinguish between these verbs and many other verbs (such as monosyllabic verbs) which can not occur in noun positions. Secondly, if there are generalizations about which verbs can occur in noun positions and which can not, these generalizations can be represented as morphological rules, which allow the lexicon to be expanded automatically. On the other hand, if no such generalizations exist and the nominalization process is largely idiosyncratic, it supports the view that this is a lexical phenomenon and verbs which can be nominalized should be marked by having two POS tags in the lexicon. Thirdly, the phenomenon that many verbs can also occur in noun positions is not unique to Chinese, and the standard treatment in other languages is to give them both tags.

108

6.5.2 Syntactic distribution and POS tagset The syntactic distribution of a word can be seen as a set of positions at which this word can appear, such as the argument position of a VP, the head of a VP and so on. The next step is to choose a POS tagset and then for each pair of tags nd tests that can be used to distinguish these two tags. This task can be de ned as follows (see Figure 6.3): Given a relation f between the word set W and a set of positions P in which the words can appear, nd a POS tagset T , a relation f1 between W and T , and a relation f2 between T and P , such that 10

(C1) f is a subset of f1  f2. i.e. no information should be lost in the process; (C2) f1  f2 , f should be small, i.e. over-generation should be avoid; (C3) f1, T and f2 should be small; That is, a small (word, pos) lexicon, a small tagset and less numbers of positions for each POS tag are prefered.

Notice that (C2) and (C3) are often con icting with each other. For example, if T has only one tag, i.e. every word has the same tag, f1 , T and f2 will be minimal, but f1  f2 , f will be maximal. On the other hand, if T has a distinct tag for every word, f1  f2 , f will be minimal, but T is too big. It is very dicult to solve the problem directly because: rstly, in practice, f is not given to the guideline designer and the designer has to decide for himself if a word can appear in certain positions); Secondly, the POS tagset is unknown; Thirdly, even if both the relation f and the tagset T were given, it is not clear to use what ecient method should be used to nd f1 and f2 . Due to these diculties, we adopt the approach in Table 6.2. Some observersion is in order. First, the third choice in (C) is very costly. To reduce the cost of using it, the initial tagset in (A) is prefered to be reasonably large under the condition that the distinction among the tags in the tagset is clear. The tagset we used in (A) has 11 tags (noun, verb, determiner, number, measure word, adverb, preposition, conjunction, interjection, punctuation and a tag for the rest). The tagset eventually has Recall that a relation between set A and set B is a set of (x,y) pairs, where x 2 A and y 2 B . f1  f2 is the composition of two relations, that is, (x; y) 2 f1  f2 if and if only there exists z such that (x; z) 2 f1 and (z; y) 2 f2 . 10

109

(a)

words(W)

f

w1 w2 w3 .....

positions(P) arg VP mod VP headVP ...

(b)

words(W) w1 w2 w3 .....

positions(P)

POS tags(P)

f1 ?

noun adverb verb

f2 ?

argVP modVP headVP ...

Figure 6.3: Words, POS tagset and positions 33 tags. Considering the borderline examples (i.e. the examples that might cause some tags to be split) at the early stage also helps. Second, the relation f in (C) is a superset of the (word, position) pairs in the Treebank. However, when we wrote the rst draft of the guidelines, we had not examined the whole Treebank, so we did not know what (word, position) pairs were in the Treebank. As a result, the guidelines needed to be revised during the rst pass of the POS annotation as more (word, position) pairs became available to the guideline designer. Third, each step in this method requires human decision, so we use the method totally by hand. Now that the Treebank has been fully bracketed, we could easily extract the f1 and f2 from the Treebank11 and compare them with the ones in the POS guidelines. The discrepancy between them would reveal mistakes in either the Treebank anotation or the guidelines. For example, in Figure 6.3, let the three words w1 , w2 , and w3 be hu-mie (destruct/destruction), zuo-tian (yesterday), and zheng-zh (politics/politically), respectively. The pairs in f are shown in (a), where all these words can be arguments of VPs, w2 and w3 can also modify VPs, and w1 can also be the head of a VP. Assume currently T includes three tags only: noun, adverb, and verb. f2 includes three pairs and f1 includes four pairs, 11

f1 is simply a (word, pos) lexicon, while f2 can be built by decomposing templates of the extracted

grammar (see Section 4.6).

110

(A) choose an initial tagset T such that it includes only the well-established tags (such as nouns, verbs and adverbs) and a tag that captures the rest of tags; (B) choose f2 so that it includes only the basic positions for each tag. For example, nouns can be the argument of VPs and verbs can be the heads of VPs. (C) for each (w; p) 2 f , i.e. the word w appears in p position, we need to add (w; p) to f1  f2 by choosing one or more from the following: (1) choose a tag t in T , add (w; t) to f1 ; (2) if (there exists a tag t 2 T , such that (w; t) 2 f1 ) add (t; p) to f2 ; (3) if (there exists a tag t 2 T , such that (w; t) 2 f1 )f split a tag t into two tags t1 and t2 ; add (t1 ; p) to f2 . for (each (x; t) 2 f1 and (t; y) 2 f2 ) replace t with t1 or t2 or both

g

Table 6.2: A method for creating POS guidelines as marked by solid lines in (b). Now we need to add the pairs (w2 , modV P ) and (w3 , modV P ) to f1  f2 . Figure 6.4(a)-(c) shows three possibilities (the newly added lines are in bold face):

(a) add (w2 , adverb) and (w3 , adverb) to f1; (b) add (noun, modV P ) to f2, i.e. allow nouns to modify VPs; (c) split the tags for nouns so that one type of nouns can modify VP while the other type can not.

If these three words were the only words Chinese had, then (a) would be prefered since (b) over-generates (i.e. it allows w1 to modify V P ) and (c) requires a larger tagset. However, when we consider all the nouns in Chinese, we realize that temporal nouns can modify VP directly, while the majority of the rest of nouns can not. If we adopted (a), all the temporal nouns will have two tags, resulting in a large lexicon. As for (b), it overgenerates too much. In contrast, (c) is more appealing because it does not overgenerate and most nouns (including temporal nouns) have a single tag. So we choose (c).12 Notice that in Figure 6.4(c), w2 and w3 have di erent tags although they seem to have the same syntactic distribution, There are two possible explanation for this. First, it can be argued that our simple 12

111

words(W) w1 w2 w3 .....

POS tags(T)

f1

noun adverb verb

positions(P)

f2

arg VP modVP headVP ...

(a) words(W) w1 w2 w3 .....

POS tags(T)

f1

noun adverb verb

positions(P)

f2

arg VP modVP headVP ...

(b) words(W)

w1 w2 w3 .....

POS tags(T)

f1

noun 1 noun 2 adverb verb

positions(P)

f2

arg VP mod VP head VP ...

(c)

Figure 6.4: Three alternatives for de ning the mappings f1 and f2 . For such a simple example, we have to consider all these three alternatives. To make the POS tagging guidelines, we have to consider much larger sets (W , T and P ) and the interaction among them. This process is very time-consuming. The corpus has 100,996 word tokens and 10829 word types. The number of unique (word, POS tag) pairs is 11960. Therefore the average number of POS tags per word is only 1.10 (11960 divided by 10829). Counting only the words that occur more than once, the number increases from 1.10 to 1.21.

example does not lists all the positions for these words. For example, w2 can appear before the subject, while w3 can not. Therefore, the syntactic distribution of w2 and w3 are not really identical. Another possibility is that although syntactic distribution is the primary criterion for POS tagging, other auxiliary criteria based on semantics or and morphology are still needed.

112

6.6 Syntactic Bracketing Guidelines This section discusses three issues on bracketing guidelines.

6.6.1 Representation scheme The rst issue is the choice of a representation scheme. Given that the sentences in the corpus are very long and complex, the representation scheme needs to be robust enough to be able to represent all the important grammatical relations and at the same time be suciently simple so that the annotators can follow it easily and consistently. An overly complicated scheme will slow down productivity and jeopardize consistency and accuracy. In our representation scheme, each bracket has a syntactic label and zero or more functional tags. The label indicates the syntactic category of the phrase, while the function tags provide additional information. For example, when a noun phrase such as zuo-tian/yesterday modi es a verb phrase, its syntactic label will be NP (for noun phrase) and it is given a function tag -TMP, indicating that the NP is a temporal phrase and its function is similar to that of an adverbial phrase. We also use reference indices to mark syntactic movement. Our scheme is similar to the one adopted in the Penn English Treebank (Marcus et al., 1994).

6.6.2 Syntactic constructions The second issue is the treatment of various syntactic constructions. Many of them, such as the ba-construction and the bei-construction, have been studied for decades, but there is still no consensus on how they should be analyzed. To tackle this issue, we:

 studied the linguistics literature,  attended Chinese linguistics conferences,  had discussions with our linguistic colleagues,  studied and tested our analyses on the relevant sentences in our corpus, and  used special tags to mark crucial elements in these constructions. 113

For example, the word ba in the ba-construction has been argued to be a case marker, a secondary topic marker, a preposition, a verb, and so on in the literature. Clearly, the word is di erent from other prepositions and other verbs and there is no strong evidence to support Chinese having overt case markers or topic markers. We believe the word is more like a verb than a preposition, but to distinguish it from other verbs, we assign it a unique POS tag BA and in the bracketing guidelines we give detailed instructions on how to annotate the construction. If some users of our corpus prefer to treat it as a preposition, it is easy to convert our annotation to accommodate that approach.

6.6.3 Ambiguities The third issue with respect to bracketing guidelines is the treatment of ambiguous sentences. In the guidelines, we have classi ed ambiguous sentences according to the origins of their ambiguity, and speci ed the treatment for each type. For example, a subset of Chinese adverbs can occur either before the subject or after it. When the subject is phonologically empty as a result of pro-drop or relativization, the empty subject can be marked either before the adverb or after it without much di erence in meaning and there is no syntactic evidence to favor one analysis over another. If nothing is speci ed in the guidelines and the annotator is allowed to mark the empty subject in either place, there will be inconsistency, even though both analyses are plausible. In this case we specify a "default" position for the subject and require that the empty subject be put before the adverb. By doing so we avoid the need for annotators to make individual choices, which is a potential cause of inconsistency.

6.7 Quality Control

A major challenge in providing syntactic annotation of corpora is ensuring consistency and accuracy. Carefully documented guidelines (see Sections 6.3-6.6), linguistically trained annotators, and annotator support tools are prerequisites to creating a high-quality corpus. Both of our annotators are linguistics graduate students, one of whom authored the Bracketing Guidelines and regularly participated in the meetings on Chinese syntax.

Their knowledge of linguistics in general, and of syntax in particular, is crucial to the success of the project. To support bracketing, our annotators use the bracketing interface described in (Marcus et al., 1994). In this section, we describe our strategy for quality control in the bracketing phase.

With the first pass complete and the annotation guidelines revised, we adopted a quality control method. Our goal was to accelerate the annotation of consistent and accurate data by eliminating the need for blind double re-annotation (i.e., both annotators re-annotating the same files from the output of the first pass) of the entire corpus, while still monitoring annotator consistency and adherence to the guidelines throughout the second pass. A secondary goal was to create a subset of the data (20% of the corpus) that had been double re-annotated and could serve as a Gold Standard.

Our primary tool for evaluating consistency is the Parseval software, which produces three metrics — precision, recall and the number of crossing brackets (Black et al., 1991) — and which we extended to process Chinese. Our process for evaluating consistency is as follows. First, some files from the output of the first pass are randomly selected for double re-annotation. Next, Parseval is used to compare the two independent re-annotations; any discrepancies are carefully examined and the annotation is revised. This may in turn lead to revisions of the guidelines to prevent a recurrence of similar inconsistencies. The corrected, reconciled annotation is then considered the Gold Standard, and each of the two original re-annotations is run against it and against the other, again using Parseval, to provide a measure of individual annotator accuracy and of inter-annotator consistency (a sketch of the underlying metric computation is given after Figure 6.5).

We first used this method to re-train the annotators at the beginning of the second pass, when forty files from the output of the first pass were randomly selected for double re-annotation. After that, the annotators continued to correct first-pass data, and each week two files were randomly selected, double re-annotated, and compared with the Parseval software. In this way, we continue to monitor our consistency and accuracy and to enhance the guidelines. Figure 6.5 shows the accuracy of each annotator (denoted by 1st and 2nd in the chart) compared to the Gold Standard, as well as the inter-annotator consistency, during the second pass after the re-training period.


[Figure 6.5: Accuracy and inter-annotator consistency during the second pass. The plot shows (Precision+Recall)/2, in the 90-100% range, for the 1st annotator, the 2nd annotator, and inter-annotator agreement, over weeks 0-12 of the second pass.]

The figure shows that both measures are in the high 90% range, which is more than satisfactory.
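At its core, the Parseval comparison reduces to matching labeled bracket spans between two annotations of the same sentences. The following minimal sketch shows the precision/recall computation only (we used the extended Parseval software of Black et al. (1991), not this code; the span representation is assumed, and the crossing-brackets metric is omitted for brevity):

    from collections import Counter

    def parseval(gold, test):
        """Labeled precision and recall between two annotations, each given
        as a list of (label, start, end) bracket spans; Counter handles
        duplicate spans correctly."""
        g, t = Counter(gold), Counter(test)
        matched = sum((g & t).values())  # spans present in both annotations
        return matched / sum(t.values()), matched / sum(g.values())

    # Two hypothetical re-annotations of the same five-word sentence:
    gold = [("IP", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 4)]
    test = [("IP", 0, 5), ("NP", 0, 1), ("VP", 1, 5), ("NP", 3, 4)]
    print(parseval(gold, test))  # (0.75, 0.75)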

6.8 Proposed Work

We have discussed in detail the approach we used to create a 100K-word Chinese Treebank, including the development of the guidelines for segmentation, POS tagging and syntactic bracketing, as well as our methodology for ensuring inter-annotator consistency and community involvement. We believe our methodology for guideline development and consistency checking will be applicable to monolingual text annotation for other languages as well, and we will be testing this hypothesis. The preliminary release of the Treebank was completed in June 2000, and we look forward to feedback from the community on its usefulness. Two tasks remain: first, we are using LexTract to detect annotation errors in the Treebank, and the results should be available soon; second, Academia Sinica of Taiwan has developed a 230-thousand-word Treebank and recently released 1,000 sentences/phrases from it. We will include a comparison between their Treebank and ours in the final thesis.


Chapter 7

Summary and Proposed Work

Both grammars and Treebanks are useful resources for NLP applications. In this proposal, we have addressed several issues concerning grammar development, Treebank development, and the relation between grammars and Treebanks. In this chapter, we summarize the previous chapters and propose work for the future.

7.1 Summary

As discussed in Chapter 1, the relation between grammars and Treebanks is illustrated in Figure 7.1. In Chapters 3-6, we explicitly defined several additional links between raw data, grammars, Treebanks and NLP tools. These new links are shown as solid lines in Figure 7.2 and are explained in detail below:

[Figure 7.1: The relations between grammars, Treebanks and NLP tools (e.g. parsers): grammar extraction, linguistic analyses, and quality control link grammars, Treebanks and raw data.]

[Figure 7.2: The new relations with the links provided by LexOrg (Chapter 3) and LexTract (Chapters 4 and 5), connecting language specifications, linguistic analyses, raw data, grammars, Treebanks and NLP tools.]

raw data → grammars (Chapter 3): Redundancy among the templates in an LTAG makes it very difficult to develop and maintain the templates by hand. Our LexOrg system addresses this problem by taking three types of language specification as input and automatically generating an LTAG. Our approach uses language-independent specifications that can be tailored to specific languages by eliciting linguistic information from native informants, thus partially automating the grammar development process.

Treebanks → grammars (Chapter 4): We have outlined a system named LexTract, which extracts LTAGs from Treebanks and converts the structures in a Treebank into derivation trees that can be used to train statistical parsers directly. LexTract is language- and corpus-independent in the sense that all the language- or corpus-specific information is stored in several tables that LexTract takes as input. As a result, LexTract can be applied to various Treebanks for different languages.
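For instance, head selection — one piece of language-specific knowledge that such extraction needs — can be isolated in a small table that the core algorithm merely consults. The sketch below is illustrative only; the table entries and the function are invented for exposition and are not LexTract's actual input format:

    # Hypothetical head-percolation table: for each phrase label, the child
    # labels to try, in priority order, when choosing the head child.
    HEAD_TABLE = {
        "IP": ["VP", "VV"],
        "VP": ["VV", "VP"],
        "NP": ["NN", "NP"],
    }

    def head_child_index(parent_label, child_labels):
        """Return the index of the head child of a node. All the
        language-specific knowledge lives in HEAD_TABLE; this function
        itself is language-independent."""
        for candidate in HEAD_TABLE.get(parent_label, []):
            if candidate in child_labels:
                return child_labels.index(candidate)
        return 0  # fall back to the leftmost child

    print(head_child_index("IP", ["NP", "VP"]))  # 1: the VP is the head child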

Treebanks → NLP tools (Chapter 5): We discuss a number of applications of LexTract and report our experimental results. First, Treebank grammars produced by LexTract can be used as stand-alone grammars. Second, matching the templates in a Treebank grammar against the templates in a hand-crafted grammar allows us to estimate the coverage of the hand-crafted grammar and to find the constructions that it does not cover; our experiments show that the XTAG grammar covers 97.2% of the template tokens in the English Penn Treebank (a sketch of this coverage computation is given below). Third, the Treebank grammar can be used as the source of a Treebank context-free grammar. Fourth, the elementary tree sequences that LexTract produces for every sentence in a Treebank can be used to train Supertaggers. Last, the multi-component tree sets built by LexTract are used to test the Tree-locality Hypothesis; after examining and re-analyzing all the examples in the English Treebank that appear to violate the hypothesis, we conclude that every one of them can be handled by other plausible analyses within tree-local MCTAG.

[Figure 7.3: The new relations after adding the proposed extensions (W1-W6) to LexOrg and LexTract, connecting language specifications, linguistic analyses, raw data, grammars, Treebanks and NLP tools.]
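The coverage figure in the second application above is, at bottom, a token count over matched templates. A minimal sketch of the computation follows; the template names are invented for illustration, and this is not our actual evaluation code:

    def template_coverage(treebank_template_tokens, handcrafted_templates):
        """Fraction of template tokens in the Treebank grammar that also
        occur (as template types) in the hand-crafted grammar."""
        covered = sum(1 for t in treebank_template_tokens
                      if t in handcrafted_templates)
        return covered / len(treebank_template_tokens)

    # One (made-up) template token per etree instance in the Treebank:
    tokens = ["a_nx0Vnx1", "a_nx0V", "b_vxPnx", "a_nx0Vnx1"]
    handcrafted = {"a_nx0Vnx1", "a_nx0V"}  # hand-crafted grammar's templates
    print(template_coverage(tokens, handcrafted))  # 0.75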

raw data → Treebanks (Chapter 6): We have addressed two major challenges (i.e., guideline preparation and quality control) that we encountered while developing a 100-thousand-word Treebank for Chinese. We highlighted the issues in three sets of guidelines (i.e., the segmentation, POS tagging and bracketing guidelines) and our responses to them.

7.2 Proposed Work

In the previous chapters, we applied LexOrg and LexTract to a number of NLP tasks for English and Chinese. In essence, these two tools are language-independent and can be applied to any language. It is important to continue validating this claim by demonstrating similar results for other languages. At the end of each previous chapter, we discussed our future work in that particular area; it is summarized as follows:

Immediate applications of LexTract:
(W1) Apply LexTract to Treebanks for other languages, compare the resulting Treebank grammars, and see how similar or different they are.
(W2) Use LexTract to detect annotation errors in the Chinese Treebank.

Extension of LexOrg:
(W3) Expand LexOrg so that it can generate trees for modification and coordination.

Experimentation with LexTract:
(W4) Cooperate with Anoop Sarkar in training a statistical LTAG parser (Sarkar, 2000). In the process, we will explore various treatments of coordination and punctuation, among others, to find out which kinds of Treebank LTAGs are most suitable for that parser.

LexOrg and LexTract:
(W5) Extend LexTract so that it can infer language specifications from Treebank grammars, and feed these specifications to LexOrg to generate a new grammar.
(W6) Investigate the possibility of using LexOrg and LexTract to generate grammars based on other formalisms.

After these proposed tasks are completed, the relations among raw data, Treebanks, grammars and NLP tools will look like the diagram in Figure 7.3. We believe that our approach bridges the gaps between raw data, grammars, Treebanks and NLP tools. It provides us with the capability of automatically transforming an annotated corpus in Treebank style into an explicit grammar with associated NLP tools such as parsers and Supertaggers.


References

Anne Abeille, Marie-Helene Candito, and Alexandra Kinyon. 2000. The current status of FTAG. In Proceedings of the 5th International Workshop on TAG and Related Frameworks (TAG+5).

Breckenridge Baldwin, Christine Doran, Jeffrey Reynar, Michael Niv, B. Srinivas, and Mark Wasson. 1997. EAGLE: An Extensible Architecture for General Linguistic Engineering. In Proceedings of RIAO97, Montreal.

Tilman Becker, Owen Rambow, and Michael Niv. 1992. The derivational generative power, or, scrambling is beyond LCFRS. Technical Report IRCS-92-38, University of Pennsylvania.

Tilman Becker. 1994. Patterns in metarules. In Proceedings of the 3rd International Workshop on TAG and Related Frameworks (TAG+3), Paris, France.

Ann Bies, Mark Ferguson, Karen Katz, and Robert MacIntyre. 1995. Bracketing Guidelines for Treebank II Style Penn Treebank Project.

E. Black, S. Abney, D. Flickinger, C. Gdaniec, et al. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Proceedings of the DARPA Speech and Natural Language Workshop.

Tonia Bleam. 1994. Clitic Climbing in TAG: A GB Perspective. In Proc. of TAG+3.

J. Bresnan, R. Kaplan, S. Peters, and A. Zaenen. 1982. Cross-serial dependencies in Dutch. Linguistic Inquiry.

Marie-Helene Candito. 1996. A principle-based hierarchical representation of LTAGs. In Proceedings of COLING-96, Copenhagen, Denmark.

R. Chandrasekar and B. Srinivas. 1997. Gleaning information from the web: Using syntax to filter out irrelevant information. In Proceedings of the AAAI 1997 Spring Symposium on NLP on the World Wide Web.

Eugene Charniak. 1996. Treebank grammars. In Proceedings of AAAI-96.

Eugene Charniak. 1997. Statistical parsing with a context-free grammar and word statistics. In Proceedings of AAAI-97.

John Chen and K. Vijay-Shanker. 2000. Automated extraction of TAGs from the Penn Treebank. In Proceedings of the 6th International Workshop on Parsing Technologies (IWPT 2000), Italy.

John Chen, Srinivas Bangalore, and K. Vijay-Shanker. 1999. New models for improving supertag disambiguation. In EACL-1999.

Chinese Knowledge Information Processing Group. 1996. Shouwen Jiezi - A Study of Chinese Word Boundaries and Segmentation Standard for Information Processing (in Chinese). Technical report, Taipei: Academia Sinica.

Mike Collins. 1997. Three generative, lexicalised models for statistical parsing. In Proceedings of the 35th ACL.

Mike Collins. 1999. Head-driven Statistical Models for Natural Language Parsing. Ph.D. thesis, University of Pennsylvania.

C. Doran, D. Egedi, B. A. Hockey, B. Srinivas, and M. Zaidel. 1994. XTAG System - A Wide Coverage Grammar for English. In Proc. of COLING'94, Kyoto, Japan, August.

Christine Doran. 1998. Incorporating Punctuation into the Sentence Grammar: A Lexicalized Tree Adjoining Grammar Perspective. Ph.D. thesis, University of Pennsylvania.

Christine Doran. 2000. Punctuation in a Lexicalized Grammar. In Proceedings of the 5th International Workshop on TAG and Related Frameworks (TAG+5).

Roger Evans and Gerald Gazdar. 1989. Inference in DATR. In EACL-89.

Roger Evans, Gerald Gazdar, and David Weir. 1995. Encoding Lexicalized Tree Adjoining Grammars with a Nonmonotonic Inheritance Hierarchy. In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL '95), Cambridge, MA.

Shengli Feng. 1998. Short Passives in Modern and Classical Chinese. In The 1998 Yearbook of the Linguistic Association of Finland (41-68).

Qianyan Gong. 1997. Zhongguo yufaxue shi (The History of Chinese Syntax). Yuwen Press.

J. Goodman. 1997. Probabilistic feature grammars. In Proceedings of the International Workshop on Parsing Technologies.

Caroline Heycock. 1987. The Structure of the Japanese Causative. University of Pennsylvania, June.

J. Higginbotham. 1984. English is not a context-free language. Linguistic Inquiry.

Aravind Joshi and Yves Schabes. 1997. Tree adjoining grammars. In A. Salomaa and G. Rozenberg, editors, Handbook of Formal Languages and Automata. Springer-Verlag, Heidelberg.

Aravind Joshi and B. Srinivas. 1994. Disambiguation of super parts of speech (or supertags): Almost parsing. In COLING 94.

Aravind Joshi and K. Vijay-Shanker. 1999. Compositional Semantics with LTAG: How Much Underspecification is Necessary? In Proc. of the 3rd International Workshop on Computational Semantics.

Aravind K. Joshi, L. Levy, and M. Takahashi. 1975. Tree Adjunct Grammars. Journal of Computer and System Sciences.

Aravind K. Joshi. 1985. Tree Adjoining Grammars: How much context sensitivity is required to provide a reasonable structural description. In D. Dowty, I. Karttunen, and A. Zwicky, editors, Natural Language Parsing, pages 206-250. Cambridge University Press, Cambridge, U.K.

Aravind K. Joshi. 1987. An Introduction to Tree Adjoining Grammars. In Alexis Manaster-Ramer, editor, Mathematics of Language. John Benjamins Publishing Co, Amsterdam/Philadelphia.

Laura Kallmeyer and Aravind Joshi. 1999. Underspecified Semantics with LTAG.

R. Kasper, B. Kiefer, K. Netter, and K. Vijay-Shanker. 1995. Compiling HPSG into TAG. In ACL-95.

Karin Kipper, Hoa Trang Dang, and Martha Palmer. 2000. Class-based construction of a verb lexicon. In AAAI-2000.

Anthony S. Kroch and Aravind K. Joshi. 1985. The Linguistic Relevance of Tree Adjoining Grammars. Technical Report MS-CIS-85-16, Department of Computer and Information Science, University of Pennsylvania.

Anthony S. Kroch. 1987. Unbounded dependencies and subjacency in a tree adjoining grammar. In Alexis Manaster-Ramer, editor, Mathematics of Language. John Benjamins Publishing Co, Amsterdam/Philadelphia.

Anthony S. Kroch. 1989. Asymmetries in long distance extraction in a TAG grammar. In M. Baltin and A. Kroch, editors, Alternative Conceptions of Phrase Structure. University of Chicago Press.

Alexander Krotov et al. 1998. Compacting the Penn Treebank grammar. In Proceedings of ACL-COLING.

Seth Kulick. 1998. TAG and Clitic Climbing in Romance. In Proc. of TAG+4.

Seth Kulick. 2000. Constraining Non-Local Dependencies in Tree Adjoining Grammar: Computational and Linguistic Perspectives. Ph.D. thesis, University of Pennsylvania.

Y. Liu, Q. Tan, and X. Shen. 1993. Segmentation Standard for Modern Chinese Information Processing and Automatic Segmentation Methodology.

David M. Magerman. 1995. Statistical decision-tree models for parsing. In Proceedings of the 33rd ACL.

Mitchell Marcus, Grace Kim, Mary Ann Marcinkiewicz, et al. 1994. The Penn Treebank: annotating predicate argument structure. In Proc. of the ARPA Speech and Natural Language Workshop.

K. F. McCoy, K. Vijay-Shanker, and G. Yang. 1992. A functional approach to generation with TAG. In Proceedings of the 30th ACL.

Gunter Neumann. 1998. Automatic Extraction of Stochastic Lexicalized Tree Grammars from Treebanks. In Proceedings of the 4th International Workshop on TAG and Related Frameworks (TAG+4).

Jerome L. Packard, editor. 1998. New Approaches to Chinese Word Formation. Mouton de Gruyter.

Martha Palmer, Owen Rambow, and Alexis Nasr. 1998. Rapid prototyping of domain-specific machine translation systems. In Proceedings of AMTA-98, Langhorne, PA, October.

Rashmi Prasad and Anoop Sarkar. 2000. Comparing test-suite based evaluation and corpus-based evaluation of a wide-coverage grammar for English. In this volume, Athens, Greece.

Carlos A. Prolo. 2000. An efficient LR parser generator for TAGs. In Proceedings of the 6th International Workshop on Parsing Technologies (IWPT 2000), Italy.

Adwait Ratnaparkhi. 1998. Maximum Entropy Models for Natural Language Ambiguity Resolution. Ph.D. thesis, University of Pennsylvania.

James Rogers and K. Vijay-Shanker. 1994. Obtaining Trees from their Descriptions: An Application to Tree Adjoining Grammars. Computational Intelligence, 10(4).

Beatrice Santorini. 1990. Part-of-speech tagging guidelines for the Penn Treebank project. Technical report, Dept. of Computer and Information Science, University of Pennsylvania.

Anoop Sarkar and Aravind Joshi. 1996. Coordination in Tree Adjoining Grammars: Formalization and Implementation. In Proceedings of the 18th COLING, Copenhagen, Denmark, August.

Anoop Sarkar. 2000. Practical experiments in parsing using tree adjoining grammars. In Proceedings of the 5th International Workshop on TAG and Related Frameworks (TAG+5).

The XTAG-Group. 1998. A Lexicalized Tree Adjoining Grammar for English. Technical Report IRCS 98-18, University of Pennsylvania.

Yves Schabes and Stuart Shieber. 1992. An Alternative Conception of Tree-Adjoining Derivation. In Proceedings of the 20th Meeting of the Association for Computational Linguistics.

Yves Schabes. 1990. Mathematical and Computational Aspects of Lexicalized Grammars. Ph.D. thesis, Computer Science Department, University of Pennsylvania.

Anna Maria Di Sciullo and Edwin Williams. 1987. On the definition of word. The MIT Press.

S. Shieber. 1984. Evidence against the context-freeness of natural language. SRI International Technical Note no. 330.

Kiyoaki Shirai, Takenobu Tokunaga, and Hozumi Tanaka. 1995. Automatic Extraction of Japanese Grammar from a Bracketed Corpus. In Proceedings of the Natural Language Processing Pacific Rim Symposium (NLPRS-95).

W. Skut, B. Krenn, T. Brants, and H. Uszkoreit. 1997. An annotation scheme for free word order languages. In Proceedings of the 5th Conference on Applied Natural Language Processing.

R. Sproat, W. Gale, C. Shih, and N. Chang. 1996. A Stochastic Finite-state Word Segmentation Algorithm for Chinese. Computational Linguistics.

B. Srinivas, A. Sarkar, C. Doran, and B. A. Hockey. 1998. Grammar and Parser Evaluation in the XTAG Project. In Workshop on Evaluation of Parsing Systems, Granada, Spain, 26 May.

Bangalore Srinivas. 1997. Complexity of Lexical Descriptions and Its Relevance to Partial Parsing. Ph.D. thesis, University of Pennsylvania.

Matthew Stone and Christine Doran. 1997. Sentence planning as description using tree adjoining grammar. In Proceedings of the 35th ACL.

K. Vijay-Shanker and Yves Schabes. 1992. Structure sharing in lexicalized tree adjoining grammar. In Proceedings of the 15th International Conference on Computational Linguistics (COLING '92), Nantes, France.

K. Vijay-Shanker. 1987. A Study of Tree Adjoining Grammars. Ph.D. thesis, Department of Computer and Information Science, University of Pennsylvania.

Bonnie Webber and Aravind Joshi. 1998. Anchoring a Lexicalized Tree Adjoining Grammar for Discourse. In Proceedings of the COLING/ACL Workshop on Discourse Relations and Discourse Markers.

Bonnie Webber, Alistair Knott, Matthew Stone, and Aravind Joshi. 1999. What are little trees made of: A structural and presuppositional account using lexicalized TAG. In Proceedings of the International Workshop on Levels of Representation in Discourse (LORID'99).

D. Weir. 1988. Characterizing mildly context-sensitive grammar formalisms. Ph.D. thesis, University of Pennsylvania.

Fei Xia and Tonia Bleam. 2000. A Corpus-based Evaluation of Syntactic Locality in TAGs. In Proceedings of the 5th International Workshop on TAG and Related Frameworks (TAG+5).

Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen Okurowski, John Kovarik, Shizhe Huang, Tony Kroch, and Mitch Marcus. 2000. Developing Guidelines and Ensuring Consistency for Chinese Text Annotation. In Proceedings of the Second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece.

Fei Xia. 1999. Extracting tree adjoining grammars from bracketed corpora. In Proc. of NLPRS-99, Beijing, China.
