EngCG tagger, Version 2

Atro Voutilainen
Research Unit for Multilingual Language Technology
Department of General Linguistics
FIN-00014 University of Helsinki, Finland
[email protected]

This paper[1] examines some problems of earlier versions of the EngCG morphological disambiguator and describes solutions to them. An informal evaluation of the new version of the EngCG tagger is reported.

[1] This paper was published in Tom Brøndsted and Inger Lytje (eds.), Sprog og Multimedier. Aalborg Universitetsforlag, Aalborg. Note that in the book the paper was misnamed due to an editorial mistake.

1 Introduction

The EngCG (English Constraint Grammar) morphological disambiguator is a reductionistic rule-based tagger based on the Constraint Grammar framework (Karlsson 1990; Karlsson et al. (eds.) 1995). It contains three main modules (the following figures concern the previously published, `early' versions of EngCG):

- a tokeniser (identification of words, punctuation marks and some 8,000 multiword expressions, i.e. idioms and modifier-head expressions)
- a morphological analyser (introduction of morphological ambiguity)
  - a two-level lexicon and morphology (over 90,000 entries)
  - a rule-based heuristic analyser of unknown words (the `guesser')
- a rule-based disambiguator: alternative analyses are removed on the basis of context-conditions expressed in some 1,150 constraint rules.
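As a rough illustration of how these three modules fit together, the following Python sketch mimics the data flow, with word-form cohorts carrying alternative readings. The toy lexicon, the function names and the single toy constraint are illustrative assumptions, not the actual EngCG implementation.

# Toy sketch of the EngCG pipeline (tokenise -> analyse -> disambiguate).
# The lexicon, the default "guess" and the single constraint are illustrative
# stand-ins for the real modules, not the actual EngCG code.

TOY_LEXICON = {
    "the": [{"base": "the", "tags": ["DET", "CENTRAL", "ART", "SG/PL"]}],
    "glaze": [{"base": "glaze", "tags": ["N", "NOM", "SG"]},
              {"base": "glaze", "tags": ["V", "INF"]}],
}

def tokenise(text):
    """Stand-in for the tokeniser (words, punctuation, multiword expressions)."""
    return text.lower().split()

def analyse(token):
    """Stand-in for the two-level lexicon/morphology plus the guesser:
    unknown words here simply receive a noun reading."""
    readings = TOY_LEXICON.get(token, [{"base": token, "tags": ["N", "NOM", "SG"]}])
    return {"form": "<" + token + ">", "readings": [dict(r) for r in readings]}

def disambiguate(cohorts):
    """Stand-in for the constraint rules: a single toy constraint in the spirit
    of REMOVE (V) (-1C DET), removing verb readings after an unambiguous
    determiner.  The last remaining reading is never removed."""
    for prev, this in zip(cohorts, cohorts[1:]):
        if all("DET" in r["tags"] for r in prev["readings"]):
            keep = [r for r in this["readings"] if "V" not in r["tags"]]
            if keep:
                this["readings"] = keep
    return cohorts

print(disambiguate([analyse(t) for t in tokenise("the glaze")]))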


The sentence Check the cylinder bores for score marks and remove glaze and carbon deposits looks like the following after tokenisation and morphological analysis:

"<check>"
    "check" V SUBJUNCTIVE VFIN
    "check" V IMP VFIN
    "check" V INF
    "check" V PRES -SG3 VFIN
    "check" N NOM SG
"<the>"
    "the" DET CENTRAL ART SG/PL
"<cylinder_bores>"
    "cylinder_bore" N NOM PL
"<for>"
    "for" PREP
    "for" CS
"<score>"
    "score" N NOM SG/PL
    "score" V SUBJUNCTIVE VFIN
    "score" V IMP VFIN
    "score" V INF
    "score" V PRES -SG3 VFIN
"<marks>"
    "mark" V PRES SG3 VFIN
    "mark" N NOM PL
"<and>"
    "and" CC
"<remove>"
    "remove" N NOM SG
    "remove" V SUBJUNCTIVE VFIN
    "remove" V IMP VFIN
    "remove" V INF
    "remove" V PRES -SG3 VFIN
"<glaze>"
    "glaze" N NOM SG
    "glaze" V SUBJUNCTIVE VFIN
    "glaze" V IMP VFIN
    "glaze" V INF
    "glaze" V PRES -SG3 VFIN
"<and>"
    "and" CC
"<carbon>"
    "carbon" N NOM SG
"<deposits>"
    "deposit" V PRES SG3 VFIN
    "deposit" N NOM PL

After morphological disambiguation, most ambiguities are resolved:

"<check>"
    "check" V IMP VFIN
"<the>"
    "the" DET CENTRAL ART SG/PL
"<cylinder_bores>"
    "cylinder_bore" N NOM PL
"<for>"
    "for" PREP
"<score>"
    "score" N NOM SG/PL
"<marks>"
    "mark" N NOM PL
"<and>"
    "and" CC @CC
"<remove>"
    "remove" N NOM SG
    "remove" V IMP VFIN
"<glaze>"
    "glaze" N NOM SG
    "glaze" V PRES -SG3 VFIN
"<and>"
    "and" CC @CC
"<carbon>"
    "carbon" N NOM SG
"<deposits>"
    "deposit" V PRES SG3 VFIN
    "deposit" N NOM PL

Some of the most difficult ambiguities may remain unresolved: in the example above, remove, glaze and deposits each retain both a noun and a verb reading.

Early versions of the EngCG tagger became generally known in the early 1990s for two main reasons. Firstly, the methodology differed from that of most other systems: unlike mainstream morphological disambiguators (or taggers), the EngCG tagger uses hand-coded rules rather than automatically generated corpus-based language models. Secondly, according to several empirical evaluations with previously unseen texts of a few thousand up to some thirty thousand words (Voutilainen et al. 1992; Voutilainen and Heikkilä 1994; Tapanainen and Voutilainen 1994; Voutilainen 1995c), EngCG turned out to be successful in terms of its ambiguity/correctness trade-off: when the average output word had some 1.06-1.1 alternative analyses (input: 1.7-2.2 analyses/word), about 99.7-99.8% of all output words contained the analysis marked as correct in the benchmark corpus, which had been hand-tagged before the evaluation using the double-blind method (for details, see Voutilainen and Järvinen 1995). As Voutilainen and Heikkilä (1994) show, certain state-of-the-art probabilistic taggers perform their slightly different task (i.e. they use different tag sets) with a considerably poorer ambiguity/correctness trade-off.

The success of the EngCG tagger seems to have renewed interest in the linguistic approach to tagging. The author is aware of recent or ongoing work on disambiguation grammars for several other languages, e.g. Finnish, Swedish, Danish, Norwegian, German, French (Chanod and Tapanainen 1995), Basque, Turkish (Oflazer and Kuruöz 1994) and Swahili (Hurskainen 1996).

2 Problems in early versions of EngCG

It seems that EngCG is an advance in morphological (or part-of-speech) tagging. The system is currently used in many academic and industrial institutions, and large constraint grammars for several other languages have been developed. However, the early versions of the EngCG tagger also suffer from certain shortcomings. Firstly, the system does not resolve all the ambiguities it introduces. Though 1.06-1.1 analyses per word does not seem much when compared to the initial ambiguity (1.7-2.2 analyses per word), it may seem problematic especially to those accustomed to the typically unambiguous output of a probabilistic tagger. Secondly, though the system's correctness rate is reasonably high, it is also obvious that some of the errors (cases where the output word does not contain a correct analysis) could be avoided. Most of the errors fall into two categories.

2.1 Lexical problems

One type of error is due to the lexical analyser: none of the analyses it proposes is acceptable in some particular context. Typical cases are listed below:

- The word was not represented in the lexicon, and the guesser, whose predictions are based on the properties of the word itself (but not its context), fails to give the correct analysis. For instance, on the basis of its ending "th", the noun "mega-month" is analysed as an adjective:

  "<a>"
      "a" DET CENTRAL ART SG
  "<mega-month>"
      "mega-month" A ABS
  "<at>"
      "at" PREP


- The word is represented in the lexicon, but the correct category is not given. Three examples are given:

  - The word-form cram is given only a verb analysis in the lexicon, but in a particular case it was also used as a noun:

    "<the>"
        "the" DET CENTRAL ART SG/PL
    "<cram>"
        "cram" V PRES -SG3 VFIN
    "<trouser_suit>"
        "trouser_suit" N NOM SG

    Note that cram retains the verb analysis only because the disambiguator never discards the last reading, even if there were a constraint for discarding it. The expression trouser suit is one of the multiword expressions listed in the EngCG description, hence the underscore.

  - Names are a frequent source of problems in this subcategory:

    "<clothing>"
        "clothe" PCP1
    "<sourced>"
        "source" PCP2
    "<by>"
        "by" PREP
    "<flip>"
        "flip" V PRES -SG3 VFIN

  - Misspellings also cause problems. In the following sample, it seems that head was miswritten as hear, which is recognised only as a verb in the lexicon:

    "<his>"
        "he" PRON PERS MASC GEN SG3
    "<hand>"
        "hand" N NOM SG
    "<on>"
        "on" PREP
    "<his>"
        "he" PRON PERS MASC GEN SG3
    "<hear>"
        "hear" V PRES -SG3 VFIN

For a parser trying to assign a sensible syntactic structure to its input, assigning a noun analysis to "hear" in the above sample might be preferable to a verb analysis.

2.2 Grammar problems

Some 50-80% of all errors of the early versions of the EngCG tagger were due to a mispredicting constraint. There are two basic types of grammar errors: anticipated errors and `surprises'. Anticipated errors occur in the analysis of constructions that were consciously ignored in the development of the early versions of the disambiguation grammar. Voutilainen (1995) and Savolainen and Voutilainen (1995) give several examples of these constructions (e.g. certain elliptical constructions, topicalisations and verbless utterances). Most of the disambiguation errors were not anticipated by the grammarian (the author of this article). Many flaws in the original grammar were probably due to the somewhat inefficient development method of the grammar: the constraints were tested and corrected by laboriously proofreading the tagger's output (rather than by automatically comparing the tagger's output to a correctly annotated large test corpus). Provided with more efficient facilities for testing and correcting the constraints, a better grammar could be produced with less effort.

3 Solutions

3.1 Context-sensitive feature replacement

A new implementation of the CG formalism, called CG-2 (Tapanainen 1996), enables context-sensitive replacement of readings with others, as part of the so-called mapping module. Using this new feature, a new module was built for introducing appropriate readings. This module is applied to the output of the lexical analyser. As an illustration, consider a typical input:

"<a>"
    "a" DET CENTRAL ART SG
"<wrestle>"
    "wrestle" V SUBJUNCTIVE VFIN
    "wrestle" V IMP VFIN
    "wrestle" V INF
    "wrestle" V PRES -SG3 VFIN
"<with>"
    "with" PREP
"<gravitation>"
    "gravitation" N NOM SG

The word "wrestle" is recognised only as a verb in the EngCG lexicon, but here it is used as a noun. Next consider this rule:

REPLACE (<CMH> N NOM SG)
    TARGET (INF)
    IF (-1C DET/GEN/PP OR CORE-TITLE)
       (NOT -1 (<Rel>) OR (INDEP))
       (NOT 0 ("let") OR OPEN-NOMINAL OR AUXW OR (PREP) OR (CC))
       (NOT 1 (ART) OR (ACC) OR (PRON GEN)) ;

The rule replaces all readings containing the INF tag with the tag sequence <CMH> N NOM SG if all four context-conditions are satisfied:

- the first word to the left is an unambiguous member of the set DET/GEN/PP (determiners, genitives and prepositions) or of the set CORE-TITLE (words like Mr, Mrs and Dr),
- the first word to the left does not contain the tag <Rel> (used for relative pronouns, e.g. whose, and relative determiners) or the tag INDEP (used for certain genitives acting as a noun phrase head, e.g. theirs),
- the word itself is not a form of let, nor does it contain tags from the set OPEN-NOMINAL (e.g. N, ABBR, A) or the set AUXW (e.g. AUXMOD, "do", "be") or the tag PREP or CC,
- the first word to the right does not contain the readings ART or ACC or PRON GEN.

If any condition is violated, no replacement is carried out by this rule. A linguistic interpretation of this rule should be fairly easy to determine: the left-hand context specifies the beginning of a noun phrase. A noun phrase beginning can be followed only by certain categories; by ruling out all but one of the legitimate categories, the correct category can be specified unambiguously. The conditions on the right-hand context are a kind of extra check: if the word precedes a typical context of a verb, no substitution is carried out. The rule produces the following output for the above sample:

"<a>"
    "a" DET CENTRAL ART SG
"<wrestle>"
    "wrestle" V SUBJUNCTIVE VFIN
    "wrestle" V IMP VFIN
    "wrestle" <CMH> N NOM SG
    "wrestle" V PRES -SG3 VFIN
"<with>"
    "with" PREP
"<gravitation>"
    "gravitation" N NOM SG


The tag <CMH> is an abbreviation for "Contextual Morphological Heuristics"; it is useful for identifying cases where this substitution was carried out. This output is passed on to the disambiguator, which removes the competing verb readings of wrestle. The present version of this module contains only nine rules, written as lexical errors were encountered. The rules were tested by applying them to large amounts of text and improved on the basis of observed mispredictions, using the CG-2 parser in a mode that marks the output with rule application information.
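For readers less familiar with the CG notation, the logic of the rule can be paraphrased roughly as follows in Python. The cohort dictionaries and the (much abbreviated) contents of the tag sets are assumptions made for illustration; this is a sketch of the rule's logic, not the CG-2 engine.

# Rough Python paraphrase of the REPLACE rule above.  A cohort is assumed to be
# a dict {"form": "<word>", "readings": [{"base": ..., "tags": [...]}]}; the
# set contents below are abbreviated illustrations of DET/GEN/PP, CORE-TITLE,
# OPEN-NOMINAL and AUXW, not the real set definitions.

DET_GEN_PP = {"DET", "GEN", "PREP"}
CORE_TITLE = {"mr", "mrs", "dr"}
OPEN_NOMINAL = {"N", "ABBR", "A"}
AUXW = {"AUXMOD"}
AUX_BASES = {"do", "be"}

def tags_of(cohort):
    return {t for r in cohort["readings"] for t in r["tags"]}

def unambiguously(cohort, tag_set):
    """The -1C condition: every reading of the cohort carries a tag from tag_set."""
    return all(set(r["tags"]) & tag_set for r in cohort["readings"])

def is_core_title(cohort):
    return cohort["form"].strip("<>").lower() in CORE_TITLE

def replace_inf_with_noun(left, this, right):
    """Replace the INF readings of `this` with <CMH> N NOM SG if all four
    context-conditions of the rule hold; otherwise do nothing."""
    if not (unambiguously(left, DET_GEN_PP) or is_core_title(left)):
        return
    if tags_of(left) & {"<Rel>", "INDEP"}:
        return
    if (any(r["base"] in {"let"} | AUX_BASES for r in this["readings"])
            or tags_of(this) & (OPEN_NOMINAL | AUXW | {"PREP", "CC"})):
        return
    if tags_of(right) & {"ART", "ACC"} or any(
            {"PRON", "GEN"} <= set(r["tags"]) for r in right["readings"]):
        return
    for r in this["readings"]:
        if "INF" in r["tags"]:
            r["tags"] = ["<CMH>", "N", "NOM", "SG"]

# The "a wrestle with gravitation" example:
a = {"form": "<a>", "readings": [{"base": "a", "tags": ["DET", "CENTRAL", "ART", "SG"]}]}
wrestle = {"form": "<wrestle>",
           "readings": [{"base": "wrestle", "tags": ["V", "SUBJUNCTIVE", "VFIN"]},
                        {"base": "wrestle", "tags": ["V", "IMP", "VFIN"]},
                        {"base": "wrestle", "tags": ["V", "INF"]},
                        {"base": "wrestle", "tags": ["V", "PRES", "-SG3", "VFIN"]}]}
with_ = {"form": "<with>", "readings": [{"base": "with", "tags": ["PREP"]}]}
replace_inf_with_noun(a, wrestle, with_)
print(wrestle["readings"])  # the INF reading is now <CMH> N NOM SG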

3.2 Grammar development

A central part of the redevelopment of the disambiguation grammar was the development of the necessary testing facilities. The linguistic resource is a manually annotated test corpus of over 600,000 words. About a half of this corpus consists of parts of the Brown corpus; the rest comes from various sources (magazines, journals, books, manuals). The texts were first analysed with the EngCG morphological analyser. The Brown sections were manually disambiguated by Ms. Leena Savolainen a few years ago. The other sections were disambiguated in two stages: during the first stage, some extremely reliable and effective rules were used, such as

REMOVE (V) (-1C DET) ;

which discards all non-participial verb readings if the first word to the left is an unambiguous determiner. The rest of the ambiguities were manually disambiguated, mostly by Ms. Pirkko Paljakka, as part of the annotation of a syntactic corpus. Here is a sample from these corpora:

"<the>"
    "the" DET CENTRAL ART SG/PL
"<jury>"
    "jury" N NOM SG/PL
"<further>"
    "further" V IMP VFIN
    "further" V INF
    "far" ADV CMP
    "further" DET POST SG/PL
"<said>"
    "say" PCP2
    "say" V PAST VFIN
"<in>"
    "in" PREP
    "in" ADV ADVL
"<term-end>"
    "term-end" N NOM SG
"<presentments>"
    "presentment" N NOM PL
"<that>"
    "that" CS
    "that" DET CENTRAL DEM SG
    "that" ADV AD-A>
    "that" PRON DEM SG
    "that" PRON SG/PL
"<the>"
    "the" DET CENTRAL ART SG/PL
"<city>"
    "city" N NOM SG
"<executive>"
    "executive" A ABS
"<committee>"
    "committee" N NOM SG/PL

All ambiguities are retained, but the correct reading is marked with a special tag. This kind of corpus can be used for automatically identifying cases where a rule discards a reading marked as correct. A useful tool for this job is available e.g. in the CG-2 implementation of the CG parser (Tapanainen 1996). The tool applies the disambiguation grammar to the test corpus and returns conflicts between the grammar and the corpus, one at a time:

-----------------------------------------------
SELECT (INF)  # 18691 35 2 0.0541
    (-1 DO)
    (NOT -1 PTCPL)
    (NOT 0 PROPER) ;
-----------------------------------------------
"<accent>" D:5258
    "accent" N NOM SG
"<that>" D:3794, 15790, 3861, 3945
    "that" PRON SG/PL
"<did>"
    "do" V PAST VFIN
"<credit>" D:6885, 12190, 8963, 18691!
    "credit" V INF
"<to>"
    "to" PREP
    "to" INFMARK>
"<miss>" D:9516
    "miss" V INF
    "miss" N NOM SG
"<sloan>"
    "sloan" N NOM SG
-----------------------------------------------

The tool can be used in the GNU Emacs editor. One window gives the rule, the other the sentence where the reading marked as correct was discarded. In this case, "<credit>" lost the correct reading, as indicated by the symbol `18691!': the number indicates the line number of the rule in the grammar, and the exclamation mark tells that the rule discarded the "correct" reading. Using this data, the appropriate measures can be taken: the rule or, in the case of a corpus annotation error, the analysis in the corpus can be corrected. In this case, the rule needs correcting. It seems to prefer infinitive analyses after a non-participial form of "do", but seems to be unaware of certain nouns that typically occur as the object of "do", e.g. "credit". The rule could be improved as follows:

SELECT (INF)
    (-1 DO)
    (NOT -1 PTCPL)
    (NOT 0 PROPER OR DO-OBJ-NOUN) ;

where the new context-condition DO-OBJ-NOUN is a set containing nouns like "credit". Using these resources, I proceeded by first testing and improving the original 1,150 rules (as well as the annotated corpora). This step resulted in a grammar which made considerably fewer mispredictions, but the effectiveness of the grammar also decreased somewhat: while the original grammar used to leave some 1.06-1.1 analyses per word in the output, the corrected grammar often left over 1.1 analyses per word. Obviously, the second step was writing new rules. The new rules were written by applying the disambiguator to new texts (i.e. texts other than those in the annotated test corpus) and proposing new rules for dealing with the remaining ambiguities. Different kinds of texts were used to increase the coverage of the grammar; if e.g. only the test corpora had been used, the resulting grammar might have reflected the properties of the corpus more than the properties of the object language in general. I typically spent a few hours writing some 50-80 new rules, which I then tested and corrected using the above-described facilities.
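The conflict check itself is conceptually simple. The following hedged sketch shows how one might list grammar-corpus conflicts, assuming a trace format in which the correct reading of each cohort carries a flag and each removed reading records the line number of the rule that removed it; both the format and the field names are assumptions, not the actual CG-2 tool.

# Sketch of the grammar-vs-corpus conflict check: report every case where a
# rule discarded the reading marked as correct in the annotated test corpus.
# The dict fields ("correct", "removed_by") describe an assumed trace format.

def find_conflicts(annotated_cohorts):
    """annotated_cohorts: iterable of cohorts such as
    {"form": "<credit>",
     "readings": [
         {"base": "credit", "tags": ["N", "NOM", "SG"], "correct": True,  "removed_by": 18691},
         {"base": "credit", "tags": ["V", "INF"],       "correct": False, "removed_by": None}]}
    Returns (word form, offending rule line number) pairs."""
    conflicts = []
    for cohort in annotated_cohorts:
        for reading in cohort["readings"]:
            if reading["correct"] and reading["removed_by"] is not None:
                conflicts.append((cohort["form"], reading["removed_by"]))
    return conflicts

# The <credit> cohort in the docstring would be reported as ("<credit>", 18691),
# pointing the grammarian to the rule on line 18691 of the grammar, as in the
# sample output shown earlier.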

At this stage I also started to organise the grammar into several (presently five) sequentially applied subgrammars. The first subgrammar contains the most reliable rules and is applied first. When no more disambiguation is done by this subgrammar, the second, slightly less reliable subgrammar is also applied, and so on, depending on how many errors and how much remaining ambiguity are tolerated. To assign a constraint to an appropriate subgrammar, two kinds of information are useful: how many predictions and how many mispredictions it makes after the corrections. A new tool was made that adds some application statistics to the rules. Here is a sample:

REMOVE (PRES)  # 273 6 0.0215
    (NOT -4 CHAPTER)
    (NOT -3 CHAPTER)
    (NOT -2 CHAPTER)
    (-1 NUM)
    (NOT -1 ONE OR ORD OR (...) OR (...))
    (0 (N NOM) OR (ABBR NOM))
    (NOT 0 (SUBJUNCTIVE) OR (IMP) OR (NOM SG))
    (NOT 1 ("until")) ;

REMOVE (INF)  # 19 1 0.0500
    (1 COMMA)
    (2C DET/GEN/PP)
    (NOT -1 CC)
    (NOT -2 CC) ;

"<...>" REMOVE (N NOM SG)  # 8
    (-1 CLB) ;

The parser interprets the hash sign "#" as the start of a comment line. The first figure indicates how many correct predictions the rule made; the second indicates the number of mispredictions, and the third the error rate of the constraint on the test corpus (the proportion of mispredictions among all predictions). None of the above rules should be admitted into the first subgrammar on the basis of these figures: the error rate of the first two is too high, while the third makes too few predictions overall to give a reliable picture of the performance of the constraint. Still, these constraints seem useful enough to be included in some other subgrammar.

From March to October 1996, I spent about 120 hours writing some 2,500 new disambiguation rules. The resulting grammar of about 3,600 rules is organised into five subgrammars:

subgrammar 1: 2967 rules
subgrammar 2: 158 rules
subgrammar 3: 374 rules
subgrammar 4: 71 rules
subgrammar 5: 44 rules
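The decision of which subgrammar a rule belongs to can thus be based on two figures per rule. The sketch below illustrates one way such statistics might be turned into a subgrammar assignment; the thresholds are purely illustrative assumptions, not the values actually used in EngCG.

# Sketch: turn per-rule application statistics into a subgrammar suggestion,
# following the criteria discussed above (a low error rate AND enough
# predictions for the estimate to be reliable).  Thresholds are assumptions.

def error_rate(correct, wrong):
    total = correct + wrong
    return wrong / total if total else 0.0

def suggest_subgrammar(correct, wrong,
                       min_predictions=50,       # assumed reliability threshold
                       strict_rate=0.002,        # assumed ceiling for subgrammar 1
                       loose_rates=(0.01, 0.05, 0.10, 0.30)):
    """Return a subgrammar index 1-5: the most reliable rules go first,
    really rough heuristics (10-30% error rate) last."""
    rate = error_rate(correct, wrong)
    if correct + wrong >= min_predictions and rate <= strict_rate:
        return 1
    for level, ceiling in enumerate(loose_rates, start=2):
        if rate <= ceiling:
            return level
    return 5

# The three rules shown above: (273, 6) has a 2.2% error rate, (19, 1) a 5.0%
# error rate, and (8, 0) too few predictions; none is placed in subgrammar 1.
print(suggest_subgrammar(273, 6), suggest_subgrammar(19, 1), suggest_subgrammar(8, 0))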

My purpose was to make the first three subgrammars efficient but also very reliable; only the last two subgrammars (actually written during the last few days of the half-year grammar development period) were allowed to contain really rough heuristics (constraints with an error rate of 10-30%). The whole grammar can be used in applications that tolerate a significant amount of errors, e.g. 0.5% or even more. I use only the first three subgrammars as a front end of a finite-state syntactic parser under development (Voutilainen 1997); using more subgrammars tends to compromise too heavily the parser's chances of finding the correct parse.

Almost 90% of the new rules are lexicalised in the sense that some part of the rule directly refers to a word-form or a base-form (as well as to morphological features). Considering that most of the purely feature-based generalisations were fairly comprehensively treated in the original grammar, this increased lexicalisation does not seem surprising. Further, it has often been argued that part-of-speech tagging is to a great extent a matter of lexical collocation rather than feature-based syntax alone (cf. e.g. Church 1992). Let us consider a case where lexical information is used along with feature-based information. The sample shown above happens to contain a nice example:

"<to>"
    "to" PREP
    "to" INFMARK>
"<miss>"
    "miss" V INF
    "miss" N NOM SG
"<sloan>"
    "sloan" N NOM SG

Grammatically, miss could be an infinitive or a noun here (and to an infinitive marker or a preposition, respectively). However, there is a lot of other evidence for preferring the noun analysis:

- miss is written in the upper case, which is untypical for verbs,
- the word is followed by a proper noun, an extremely typical context for the titular noun miss.

So let us assume that miss is a noun in this case. The following rule can be proposed:

SELECT ("miss" <*> N NOM SG)
    (1C (<*> NOM))
    (NOT 1 PRON) ;


This rule selects the nominative singular reading of the noun miss written in the upper case (<*>) if the following word is a non-pronoun nominative written in the upper case (i.e. abbreviations are also accepted). A run against the test corpus shows that the rule makes 80 correct predictions and no mispredictions. This suggests that the collocational hypothesis was a good one and that the rule should be included in the grammar.
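A run of this kind against the test corpus amounts to a simple corpus query: count how often the proposed rule's conditions hold and whether the reading it would select is the one marked as correct. The sketch below assumes the same cohort dictionaries as in the earlier sketches, with assumed "correct" and "upper" fields standing in for the annotation and the upper-case marking; it is not the actual testing tool.

# Sketch: test a collocational hypothesis ("miss" before an upper-case,
# non-pronoun nominative is a noun) against the annotated corpus.

def test_miss_rule(cohorts):
    hits = misses = 0
    for this, nxt in zip(cohorts, cohorts[1:]):
        if this["form"] != "<miss>":
            continue
        # Right context: every reading of the next word is an upper-case,
        # non-pronoun nominative (the 1C condition of the proposed rule).
        if not all("NOM" in r["tags"] and r.get("upper") and "PRON" not in r["tags"]
                   for r in nxt["readings"]):
            continue
        noun_is_correct = any(r["correct"] and "N" in r["tags"]
                              for r in this["readings"])
        hits += noun_is_correct
        misses += not noun_is_correct
    return hits, misses   # e.g. (80, 0) on the test corpus would support the rule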

4 Informal evaluation

This section reports an informal test of the EngCG disambiguator against previously unseen texts. The test is informal for at least two reasons: (i) the benchmark corpus is rather small, and (ii) the annotation correctness of the benchmark corpus may be below 100% because of the method used in its preparation (for a methodological discussion, see Voutilainen and Järvinen 1995). However, the data is publicly available, so hopefully this evaluation gives at least an approximate idea of the quality of the new EngCG tagger.

4.1 Benchmark corpus

Seven extracts from electronic texts in the Wiretap electronic text archive were taken:

- "Attila the Hun and the Battle of Chalons" by A. Ferrill
- "How to Build a Flying Saucer After So Many Amateurs Have Failed: An Essay in Speculative Engineering" by T. B. Pawlicki
- "An Introduction to the Coptic Art of Egypt" by Azer Bestavros
- "Elf and Faerie: The Development of Elves in Tolkien's Mythology" by Peter A. van Heusden
- "From Red Tape to Results: Creating a Government that Works Better & Costs Less. Report of the National Performance Review" by Vice President Al Gore
- "The Case against Gun Control" by David Botsford
- "The Text of Search Warrant from Waco, Texas"

Together, the samples contain 6631 words and 289 utterances (an average of 23 words per utterance). Before the texts were submitted to the EngCG disambiguator, a benchmark version was created: the morphological ambiguities were first introduced with the EngCG morphological analyser, and the lexical correction module was applied to this ambiguous data. The ambiguities were manually disambiguated by the author of this article, and the quality of the lexical replacement module was evaluated (of the five predictions made, all were correct). To increase the objectivity of the evaluation, the benchmark corpus and the disambiguator's outputs are available at the URL http://www.ling.helsinki.fi/~avoutila/aalborg.html.

4.2 Results

The corpus tagged by the new version of EngCG was automatically compared to this benchmark corpus, and the statistics in Figure 1 were generated.

Figure 1: Results from a disambiguation test with the new version of the EngCG morphological disambiguator (version from 11/96) against a benchmark corpus of 6631 words. "Levels" = number of subgrammars used; "Extra" = number of superfluous morphological analyses in the output; "R/W" = average number of morphological analyses per word in the output; "Errors" = number of words without the correct analysis in the output; "Correct %" = percentage of words retaining the correct analysis in the output.

Levels   Extra   R/W     Errors   Correct %
0        5095    1.770    3       99.95
1         423    1.064    8       99.88
2         363    1.055    9       99.86
3         284    1.043   10       99.85
4         217    1.033   19       99.71
5         176    1.026   30       99.55

Overall, the results suggest that the new version of the EngCG morphological disambiguator has reached a relatively mature status. The first three grammars seem to make extremely few mispredictions, while the last two grammars (4-5) are less successful, as expected on the basis of the minimal effort spent on writing them. Still, it is reassuring to see that even after the application of the fourth grammar, the system's correctness rate is over 99.7%, a typical rate of Subgrammar 1 alone in the older versions of EngCG. At the level of utterances, the system's success was also quite good. For instance, when the first three disambiguation grammars were used, 280 utterances out of the total 289 (about 97%) were such that every word contained the correct morphological analysis. This gives later analysis stages a reasonable chance to succeed.
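For concreteness, the column values can be computed by a straightforward comparison of the tagger's output with the benchmark corpus. The sketch below assumes each output cohort carries a per-reading flag telling whether that reading is the one marked correct in the benchmark; the data format is an assumption, not the actual evaluation script.

# Sketch of the evaluation metrics used in Figure 1, computed from tagger
# output aligned with the benchmark (assumed per-reading "correct" flags).

def evaluate(output_cohorts):
    words = len(output_cohorts)
    analyses = sum(len(c["readings"]) for c in output_cohorts)
    errors = sum(1 for c in output_cohorts
                 if not any(r["correct"] for r in c["readings"]))
    return {
        "Extra": analyses - words,                  # superfluous analyses
        "R/W": analyses / words,                    # analyses per word
        "Errors": errors,                           # words lacking the correct analysis
        "Correct %": 100.0 * (words - errors) / words,
    }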


4.3 Detailed analyses

4.3.1 Lexical replacement module

Let us look at the five predictions made by the lexical replacement module. All five cases were basically similar: the context-free morphological guesser had assigned an adjective analysis to words whose form suggests that category. In the following two samples, Theodoric contains the ending "ic", while in the last three samples antigravity begins with the prefix "anti". On the basis of the immediate context of the words, the replacement module correctly introduced a noun analysis instead.[2]

[2] Note that, like the guesser (Voutilainen 1995b), the lexical replacement module does not try to distinguish between common nouns, proper nouns and abbreviations; a noun reading serves for all.

(i)
"<by>"
    "by" PREP
    "by" ADV ADVL
"<theodoric>"
    "theodoric" N NOM SG
"<$,>"
"<king>"
    "king" N NOM SG
"<of>"
    "of" PREP
"<the>"
    "the" DET CENTRAL ART SG/PL
"<visigoths>"
    "visigoth" N NOM PL

(ii)
"<theodoric>"
    "theodoric" N NOM SG
"<had>"
    "have" V PAST VFIN
    "have" PCP2
"<lost>"
    "lose" PCP2
    "lose" V PAST VFIN

(iii)
"<kind>"
    "kind" N NOM SG/PL
    "kind" A ABS
"<of>"
    "of" PREP
"<antigravity>"
    "antigravity" N NOM SG

(iv)
"<but>"
    "but" PREP
    "but" ADV
    "but" CC
"<antigravity>"
    "antigravity" N NOM SG
"<depends>"
    "depend" V PRES SG3 VFIN
"<upon>"
    "upon" PREP

(v)
"<therefore>"
    "therefore" ADV
"<$,>"
"<antigravity>"
    "antigravity" N NOM SG
"<is>"
    "be" V PRES SG3 VFIN
"<an>"
    "an" DET CENTRAL ART SG
"<acceleration>"
    "acceleration" N NOM SG

4.3.2 Lexical and disambiguation errors

Let us look at the errors made by the lexical analyser and the first three grammars. To save space, only the text words are given, along with an expression of the form [X/Y], which means that the previous word was unambiguously analysed as X while it should have been analysed as Y. First the lexical errors:

- (i) As antigravity[A/N] is not known to exist in physical theory or experimental fact in popular science, the saucer is clearly alien and beyond human comprehension.


- (ii) As part of its 13 fiscal year 1995 appropriations bills, Congress should permanently allow agencies to roll over 50 percent of unobligated[N/A] year-end balances in all appropriations for operations.
- (iii) Carmel Center to tell them that UPS[N PL/ABBR SG] was coming with a COD package.

All three errors were made by the guesser, i.e. the words "antigravity", "unobligated" and "UPS" were not represented in the EngCG lexicon. Note that these errors do not seem particularly problematic for syntactic parsing, since nouns, adjectives and abbreviations have very similar syntactic functions anyway. The following misanalyses were made by the disambiguation grammars (subgrammars 1-3):


- (iv) Perhaps he wanted to leave Attila with his forces, though battered, still intact in order to keep the barbarians of Gaul united[PAST/PCP2] behind Rome.
- (v) The balloon and the submarine rise[N/PRES] by displacing a denser medium; they descend by displacing less than their weight.
- (vi-vii) In fact, spokes alone make a more efficient flywheel than the complete wheel; this is because momentum only goes up in proportion to[INFMARK/PREP] mass[INF/N] but with the square of speed.
- (viii) In fact, many relief slabs show both the "ankh" and the Christian "cross" together, frequently flanked by the first and last letters of the Greek alphabet, the Alpha (A) and the Omega (W), in an early form of what was to become the monogram of Jesus Christ the Lord for[PREP/CS], in Revelation 1:8, He said: "I am the Alpha and the Omega, the Beginning and the End."
- (ix) Since April, people all[PRON/ADV] across our government have been working full time to reinvent the federal bureaucracy.
- (x) As Tom Peters and Robert Waterman wrote in[ADV/PREP] In Search of Excellence, any organization that is not making mistakes is not trying hard enough.

Error (iv) was due to a constraint that required the occurrence of a finite verb if there is a clause boundary marker to the left and there are no other candidate finite verbs intervening or to the right. Error (v) was due to a constraint that required the occurrence of a finite verb if the immediate left context supports an NP analysis and to the right there is an unambiguous finite verb with no intervening clause boundary markers or conjunctions.

Errors (vi-vii) were due to a heuristic constraint that prefers infinitives over nouns in certain constructions. Error (viii) was due to a heuristic constraint that prefers the preposition reading of "for" if it does not occur sentence-initially or directly after a punctuation mark. Error (ix) was due to a constraint that regards the adverb "all" as an intensifier. The constraint required that the immediate right-hand context belong to a category that can be intensified by an adverb (adjectives, numerals, adverbs, quantifiers); it was not aware of certain prepositions, like "across", that can also take intensifiers. Error (x) was due to a constraint that discards a preposition reading if there is another preposition to the right and there are no intervening nominal heads or coordinating conjunctions.

5 Conclusion

The paper reported some problems of earlier versions of the EngCG tagger and described some solutions to them. A new version was evaluated; it seems to be significantly better in terms of its error rate and remaining ambiguity. Still further improvements can be made to the tagger. One obvious possibility for further disambiguation is extending the lexicon with (semi)automatically extracted multiword expressions (e.g. modifier-head sequences whose parts are part-of-speech ambiguous in isolation). The new version of EngCG can be used over the WWW; try the following URL: http://www.ling.helsinki.fi/~avoutila/engcg-2.html.

References

Jean-Pierre Chanod and Pasi Tapanainen 1995. Tagging French: comparing a statistical and a constraint-based method. In Proc. EACL'95. ACL, Dublin.

Kenneth W. Church 1992. Current Practice in Part of Speech Tagging and Suggestions for the Future. In Simmons (ed.), Sbornik praci: In Honor of Henry Kučera, Michigan Slavic Studies. Michigan. 13-48.

Arvi Hurskainen 1996. Disambiguation of Morphological Analysis in Bantu Languages. In Proc. COLING'96. ICCL, Copenhagen. 568-573.

Fred Karlsson 1990. Constraint Grammar as a Framework for Parsing Running Text. In Proc. COLING'90. ICCL, Helsinki.

Fred Karlsson, Atro Voutilainen, Juha Heikkilä and Arto Anttila (eds.) 1995. Constraint Grammar. A Language-Independent System for Parsing Unrestricted Text. Berlin and New York: Mouton de Gruyter.

Kemal Oflazer and Ilker Kuruöz 1994. Tagging and morphological disambiguation of Turkish text. In Proc. ANLP-94. ACL, Stuttgart.

Leena Savolainen and Atro Voutilainen 1995. Testing and modifying the disambiguation grammar. In Karlsson et al. (eds.).

Pasi Tapanainen 1996. The Constraint Grammar Parser CG-2. Department of General Linguistics, University of Helsinki.

Pasi Tapanainen and Atro Voutilainen 1994. Tagging accurately: Don't guess if you know. In Proc. ANLP-94. ACL, Stuttgart.

Atro Voutilainen 1995. Morphological disambiguation. In Karlsson et al. (eds.).

Atro Voutilainen 1995b. Experiments with heuristics. In Karlsson et al. (eds.).

Atro Voutilainen 1995c. A syntax-based part of speech analyser. In Proc. EACL'95. ACL, Dublin.

Atro Voutilainen 1997. The design of a (finite-state) parsing grammar. In Emmanuel Roche and Yves Schabes (eds.), Finite State Devices for Natural Language Processing. MIT Press, Bradford Book.

Atro Voutilainen, Juha Heikkilä and Arto Anttila 1992. Constraint Grammar of English. A Performance-Oriented Introduction. Publications 21, Department of General Linguistics, University of Helsinki.

Atro Voutilainen and Juha Heikkilä 1994. An English constraint grammar (ENGCG): a surface-syntactic parser of English. In Udo Fries, Gunnel Tottie and Peter Schneider (eds.), Creating and Using English Language Corpora. Rodopi: Amsterdam and Atlanta.

Atro Voutilainen and Timo Järvinen 1995. Specifying a shallow grammatical representation for parsing purposes. In Proc. EACL'95. ACL, Dublin.
