Schematic Finite-State Intersection Parsing

Anssi Yli-Jyrä
Research Unit for Multilingual Language Technology
Hallituskatu 11, Helsinki, Finland

Abstract

The framework of Finite-State Intersection Grammars employs a parsing technique according to which several finite-state automata are intersected to determine the output automaton. Implementation of the intersection parser has turned out to be a difficult task. Several problems in efficiency arise when disambiguation choices are based on long contexts with many don’t cares. We are concerned with the long-distance effects of the grammatical constraints. We outline a compacting method that rearranges the search space of the intersection process and makes it shallower. This method is based on the specialization of grammar rules relative to each input sentence.

1 Introduction

The present work is relevant to the framework of Finite-State Intersection Grammars (FSIG) proposed by Koskenniemi [Kos90]. The FSIG parsing process is currently based on the idea that several finite-state automata (FSA) are intersected to determine the output automaton. The input automata are considered partial restrictions that together completely define the parser’s output. Implementing a parser for FSIG has turned out to be a difficult task. Several efficiency problems arise when choices are based on long contexts with many don’t cares. In this paper, we focus on these far-reaching effects of the grammatical constraints. Some other aspects have been addressed by Tapanainen [Tap92] and Piitulainen [Pii95]. Our main purpose in this paper is to describe a compacting method that rearranges the search space of the intersection process and makes it shallower. Along the way, we define some auxiliary techniques that allow efficient calculation and use of the new data structures.
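To make the notion of intersecting automata concrete, the following is a minimal sketch of the textbook product construction for two deterministic automata. It is not the parser discussed in this paper; the tuple encoding `(start, finals, delta)`, the function names, and the toy rule that forbids the tag `@OBJ` are all assumptions made for illustration.

```python
from collections import deque

def outgoing(delta):
    """Index a transition map {(state, symbol): state} by source state."""
    arcs = {}
    for (state, symbol), target in delta.items():
        arcs.setdefault(state, {})[symbol] = target
    return arcs

def intersect(a, b):
    """Product construction; only state pairs reachable from the start are built."""
    (start_a, finals_a, delta_a), (start_b, finals_b, delta_b) = a, b
    out_a, out_b = outgoing(delta_a), outgoing(delta_b)
    start = (start_a, start_b)
    delta, finals = {}, set()
    queue, seen = deque([start]), {start}
    while queue:
        p, q = queue.popleft()
        if p in finals_a and q in finals_b:
            finals.add((p, q))
        arcs_p, arcs_q = out_a.get(p, {}), out_b.get(q, {})
        # A symbol survives only if both automata can read it here.
        for symbol in arcs_p.keys() & arcs_q.keys():
            nxt = (arcs_p[symbol], arcs_q[symbol])
            delta[((p, q), symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return start, finals, delta

def accepts(automaton, symbols):
    """Run a deterministic automaton; a missing transition rejects."""
    state, finals, delta = automaton
    for symbol in symbols:
        if (state, symbol) not in delta:
            return False
        state = delta[(state, symbol)]
    return state in finals

# Hypothetical one-token "sentence" automaton with three readings.
SENTENCE = (0, {3}, {(0, "N"): 1, (0, "V"): 2,
                     (1, "@SUBJ"): 3, (1, "@OBJ"): 3,
                     (2, "@+FMAINV"): 3})
# Hypothetical rule automaton: any tag sequence that never uses @OBJ.
RULE = (0, {0}, {(0, s): 0 for s in ["N", "V", "@SUBJ", "@+FMAINV"]})

RESULT = intersect(SENTENCE, RULE)
```

In this toy case the result keeps the readings `N @SUBJ` and `V @+FMAINV` and discards `N @OBJ`, which illustrates how each rule automaton acts as a partial restriction on the sentence automaton.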

1.1 Overall picture of the Finite-State Intersection Grammars

In the FSIG framework, parsing is a monotonic process: it starts with an input containing all the potential readings of the sentence and determines which readings remain after the grammatical constraints have been applied. The base set of sentence readings is constructed as follows. First, the lexicon supplies each token (a word or a punctuation mark) with a set of readings, each of which is a string of tags or atomic features, including lemmas. A base set of tags indicating bracketing or other kinds of structural distinctions is present in the lexicon as well, and it is invoked at each word boundary. Concatenating the sets of lexical readings produces the base set of sentence readings. According to Voutilainen [Vou94], the word likes could be given the following set of feature strings:

{ < "like" N NOM PL @SUBJ >,       < "like" N NOM PL @OBJ >,
  < "like" N NOM PL @I-OBJ >,      < "like" N NOM PL @ >,
  < "like" V PRES SG3 @+FMAINV > }
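The concatenation of lexical reading sets described above can be sketched as a Cartesian product. The toy lexicon, the choice of `@` and `@/` as the boundary tags invoked between words, and the function name `base_readings` are assumptions for illustration; the sketch does not attempt to reproduce the reading counts reported in the paper.

```python
from itertools import product

# Hypothetical toy lexicon: each token maps to a list of tag strings.
LEXICON = {
    "John": [('"John"', "N", "NOM", "SG", "@SUBJ"),
             ('"John"', "N", "NOM", "SG", "@OBJ")],
    "likes": [('"like"', "N", "NOM", "PL", "@SUBJ"),
              ('"like"', "N", "NOM", "PL", "@OBJ"),
              ('"like"', "V", "PRES", "SG3", "@+FMAINV")],
}

# Assumed boundary alternatives inserted between consecutive tokens.
BOUNDARIES = [("@",), ("@/",)]

def base_readings(tokens):
    """Concatenate one lexical reading per token, with one boundary
    tag between neighbouring tokens, in every possible combination."""
    slots = []
    for index, token in enumerate(tokens):
        if index > 0:
            slots.append(BOUNDARIES)
        slots.append(LEXICON[token])
    return [tuple(tag for part in choice for tag in part)
            for choice in product(*slots)]

READINGS = base_readings(["John", "likes"])
```

Here the number of sentence readings is the product of the slot sizes (2 × 2 × 3 for the two tokens and one boundary), which shows why the enumerated base set grows quickly with sentence length.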

On the other hand, the sentence John likes coffee would have as many as 120 different readings. Fortunately, both the lexical token readings and the sentence readings are actually represented as rather small finite-state automata, such as the one in Figure 1.

[Figure 1: finite-state automaton; arc labels recoverable from the original: @@, "John", N, NOM, SG, @SUBJ, @OBJ, @I-OBJ, @, @/]
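The compactness of the automaton representation can be illustrated by comparing the number of arcs with the number of strings accepted. The sketch below uses the arc labels that survive from Figure 1, but the topology, the state numbering, and the function name `count_strings` are assumptions rather than a reconstruction of the figure.

```python
from functools import lru_cache

# Assumed acyclic automaton: state -> list of (label, next_state).
ARCS = {
    0: [("@@", 1)],
    1: [('"John"', 2)],
    2: [("N", 3)],
    3: [("NOM", 4)],
    4: [("SG", 5)],
    5: [("@SUBJ", 6), ("@OBJ", 6), ("@I-OBJ", 6)],
    6: [("@", 7), ("@/", 7)],
    7: [],
}
FINALS = {7}

def count_strings(arcs, finals, start=0):
    """Count accepted strings of an acyclic deterministic automaton
    by dynamic programming over its states."""
    @lru_cache(maxsize=None)
    def count(state):
        total = 1 if state in finals else 0
        for _label, nxt in arcs[state]:
            total += count(nxt)
        return total
    return count(start)

NUM_ARCS = sum(len(out) for out in ARCS.values())
NUM_STRINGS = count_strings(ARCS, FINALS)
```

Alternatives that share a prefix or suffix are stored once, so the arc count grows additively (here 10 arcs) while the number of readings grows multiplicatively (here 3 × 2 = 6 strings).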