Implementation of Efficient and Portable Parser for Czech
Pavel Smrž and Aleš Horák
Faculty of Informatics, Masaryk University Brno, Botanická 68a, 602 00 Brno, Czech Republic
E-mail: {smrz,hales}@fi.muni.cz
Abstract. This paper presents our work on the implementation of an efficient system for the syntactic analysis of natural language texts. We show how the system is applied to the parsing of Czech by means of a special meta-grammar. We have also prepared a tool that builds the actual parser, which analyses input sentences according to the given grammar.
1 Introduction
Syntactic analysis of running text plays a key role in natural language processing. Many researchers have contributed to text parsing with systems that reach satisfactory or even excellent results for English [1]. Other languages raise many more obstacles to a systematic description by traditional kinds of grammars (the situation in German is discussed e.g. in [2]). Even more problems arise in languages with free word order or, more precisely, free constituent order. The sentence structure of such a language resists description by a maintainable number of rules. The order of sentence constituents is called free, but in fact it is driven more by human intuition than by firm rules laid down by linguists. Word order plays an important role in communicative dynamism: it expresses the sentence focus. This phenomenon has been intensively explored by the Prague Linguistic School [3] in the context of Functional Generative Description.
2 Czech Language Parsing System
The Czech language (together with other Slavonic languages) is a typical example of a free constituent order language. One of the first steps towards robust parsing of Czech is described in [4]. That system uses a kind of procedural grammar based on a formalism called RFODG (Robust Free-Order Dependency Grammar), developed in the LATESLAV project [6]. It encompasses complex rules for sentence syntax specification written in a Pascal-like form. In contrast to the procedural grammar approach, we build a grammar system that keeps the rules simple and thus serves as a showcase of declarativeness.
This keeps the maintenance of the grammar rule set within acceptable limits, so that modifications can be made even by users who do not have perfect knowledge of all the internals of the grammar. The parser builder itself is based on the public domain parser generator BtYacc developed by Chris Dodd and Vadim Maslov [5], an open source program written in C that is designed and carefully tuned for efficiency and portability. BtYacc processes a given context-free grammar and constructs a C program capable of analysing input text according to the grammar rules. Natural language processing involves grammars that allow more than one possible analysis of an input sentence. In ordinary LALR analysis, such ambiguous grammars cause shift-reduce or reduce-reduce conflicts, which deterministic systems resolve by choosing a single variant according to predefined precedences. BtYacc makes it possible to process ambiguous grammars. To work with them, we have implemented intelligent backtracking support for BtYacc, combined with routines that take care of the successive construction of the derivation tree.
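The idea of exploring every alternative of an ambiguous grammar, instead of committing to one by precedence, can be illustrated with a small sketch. The code below is not BtYacc and does not reproduce its LALR machinery or our backtracking routines; it is a plain top-down backtracking enumerator written in Python for illustration only. The toy grammar, the category names and the function names (GRAMMAR, parse, parse_seq, all_parses) are all assumptions introduced here. The sketch also assumes a grammar without left recursion.

```python
# Toy ambiguous grammar (hypothetical, for illustration only).
# Keys are nonterminals; any symbol that is not a key is a terminal.
GRAMMAR = {
    "S": [("NP", "VP")],
    "VP": [("V", "NP"), ("V", "NP", "PP")],
    "NP": [("N",), ("N", "PP")],
    "PP": [("P", "NP")],
}


def parse(grammar, symbol, tokens, start):
    """Yield (tree, end) for every way `symbol` derives tokens[start:end]."""
    if symbol not in grammar:  # terminal: must match the next token
        if start < len(tokens) and tokens[start] == symbol:
            yield symbol, start + 1
        return
    # Nonterminal: try every alternative; exhausting one alternative and
    # moving to the next is the backtracking step.
    for rhs in grammar[symbol]:
        for children, end in parse_seq(grammar, rhs, tokens, start):
            yield (symbol, children), end


def parse_seq(grammar, rhs, tokens, start):
    """Yield (subtrees, end) for every way the sequence `rhs` matches."""
    if not rhs:
        yield (), start
        return
    for tree, mid in parse(grammar, rhs[0], tokens, start):
        for rest, end in parse_seq(grammar, rhs[1:], tokens, mid):
            yield (tree,) + rest, end


def all_parses(grammar, start_symbol, tokens):
    """Return every derivation tree that covers the whole input."""
    return [tree
            for tree, end in parse(grammar, start_symbol, tokens, 0)
            if end == len(tokens)]


# A prepositional group can attach to the verb or to the preceding noun,
# so this tag sequence has more than one derivation tree.
trees = all_parses(GRAMMAR, "S", ["N", "V", "N", "P", "N"])
for tree in trees:
    print(tree)
```

A deterministic parser would keep only one of the attachments printed above. Keeping all of them is what later allows preferences to be assigned to competing analyses.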
3 Meta-grammar
In the previous version of our system [7] we wrote the grammar rules in expanded form, which served directly as input to the grammar parser. That grammar consisted of more than eight hundred CF rules, and maintaining it proved to be a nightmare for the linguists working with the tool. We have therefore decided to use a special kind of meta-grammar that removes mechanical constructs based on the same rule pattern occurring repeatedly in rule declarations. The number of rules in the meta-grammar has dropped to one fifth of the number of generated rules. We believe that further elaboration of the meta-grammar will improve the reduction ratio even more.

The meta-grammar consists of global order constraints that safeguard the succession of given terminals, special flags that impose particular restrictions on given nonterminals and terminals on the right-hand side, and constructs used to generate combinations of rule elements. The notation of the flags can be illustrated by the following examples:

    ss -> conj clause

The single arrow (->) denotes an ordinary CFG rule.

    /* budu muset cist */
    futmod --> VBU VOI VI

The extended arrow (-->) supplements the right-hand side with possible intersegments between each pair of listed elements.

    /* byl bych byval */
    cpredcondgr ==> VBL VBK VBLL

The extended double arrow (==>) adds, besides filling in the intersegments, a check of the correct order of enclitics. This flag is more useful in connection with the order or rhs constructs discussed below.

    /* musim se ptat */
    clause ===> VO R VRI

The three-character double arrow (===>) completes the right-hand side to form a full clause. It allows intersegments to be added at the beginning and end of the rule, tries to supply the clause with conjunctions, etc.

The global order constraints represent simple universal regulators that are used to inhibit some combinations of terminals in rules.

    /* jsem, bych, se */
    %enclitic = (VB12, VBK, R)
    /* byl, cetl, ptal, musel */
    %order VBL = {VL, VRL, VOL}
    /* byval, cetl, ptal, musel */
    %order VBLL = {VL, VRL, VOL}

In this example, %enclitic specifies which terminals should be regarded as enclitics and determines their order in the sentence. The %order constraints guarantee that the terminals VBL and VBLL always precede any of the terminals VL, VRL and VOL.

The main combinatoric constructs in the meta-grammar are order(), rhs() and first(), which are used for generating variants of arrangements of given terminals and nonterminals.

    /* budu se ptat */
    clause ===> order(VBU,R,VRI)
    /* ktery ... */
    relclause ===> first(relprongr) rhs(clause)

The order() construct generates all possible permutations of its components. The first() and rhs() constructs insert the content of all right-hand sides of the specified nonterminal, prefixed with the argument of first(). That argument is firmly tied to the beginning of the rule: it can be preceded neither by an intersegment nor by any other construct.
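How a single meta-grammar rule multiplies into several CF rules can be sketched in a few lines of Python. This is only an illustration of the expansion idea, not the code of our tool. The function names (expand_intersegments, expand_order), the nonterminal name "inter" for an intersegment and the reading of each intersegment as independently optional are assumptions made for this sketch. It also assumes that the symbols passed to order() are pairwise distinct.

```python
from itertools import permutations, product

INTERSEGMENT = "inter"  # hypothetical name of the intersegment nonterminal


def expand_intersegments(lhs, rhs, inter=INTERSEGMENT):
    """Expand `lhs --> rhs` into plain CF rules.

    Between each pair of adjacent symbols an intersegment is either
    present or absent, so n symbols give 2 ** (n - 1) rules.
    """
    if not rhs:
        return [(lhs, ())]
    rules = []
    for choice in product((False, True), repeat=len(rhs) - 1):
        out = [rhs[0]]
        for use_gap, symbol in zip(choice, rhs[1:]):
            if use_gap:
                out.append(inter)
            out.append(symbol)
        rules.append((lhs, tuple(out)))
    return rules


def expand_order(symbols, order_constraints=None):
    """Expand order(...) into all permutations allowed by %order.

    `order_constraints` maps a terminal to the set of terminals it must
    precede, e.g. {"VBL": {"VL", "VRL", "VOL"}}.
    """
    order_constraints = order_constraints or {}
    allowed = []
    for perm in permutations(symbols):
        position = {symbol: i for i, symbol in enumerate(perm)}
        ok = all(
            position[first] < position[later]
            for first, laters in order_constraints.items()
            if first in position
            for later in laters
            if later in position
        )
        if ok:
            allowed.append(perm)
    return allowed


# futmod --> VBU VOI VI
for rule in expand_intersegments("futmod", ["VBU", "VOI", "VI"]):
    print(rule)

# order(VBU,R,VRI) without constraints: all 3! = 6 permutations
print(len(expand_order(["VBU", "R", "VRI"])))

# %order VBL = {VL, VRL, VOL} removes permutations with VL before VBL
print(expand_order(["VL", "VBL", "R"], {"VBL": {"VL", "VRL", "VOL"}}))
```

Under these assumptions a rule with three right-hand-side symbols yields four CF rules, which shows on a small scale why the meta-grammar is so much shorter than the generated grammar.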
4 Lexico-semantic Constraints
The analysis is supported by a set of commonly used grammatical tests that were described in [7]. In addition to these tests, we have extended the valency test functions with lexico-semantic constraints. The constraints take advantage of an ontological hierarchy of the same type as in WordNet [8]. They enable us to require, for each valency expression, compatibility with a selected class or classes in the hierarchy. The current version uses a very limited subset of the complete hierarchy, and we plan to connect the system to the results of the Czech part of the concurrently running project EuroWordNet 2 [9]. The constraints can be demonstrated on the following phrase:

    Nájemce čepuje pivo.
    (Leaseholder draws beer.)

    Nájemce: k1gMnSc1245, k1gMnPc4
    pivo: k1gNnSc145

    čepovat = sb. & st.

The lexico-semantic constraints found in the valency list of the verb čepovat (draw) make it possible to identify the word pivo (beer) as the object and nájemce (leaseholder) as the subject. Because of metonymy and other forms of meaning shift, we do not apply this feature so strictly as to throw out a particular analysis. We use it rather as a tool for assigning preferences to different analyses.

The part of the system that exploits the information obtained from our list of verb valencies [10] is necessary in particular for solving the prepositional attachment problem. During the analysis of noun groups and prepositional noun groups in the role of verb valencies in a given input sentence, one needs to be able to distinguish free adjuncts or modifiers from obligatory valencies. We have implemented a set of heuristic rules that determine whether a found noun group typically serves as a free adjunct. The heuristics are also based on the lexico-semantic constraints described above.

    V lese chodil jenom v tričku.
    (In the forest he walked only in a T-shirt.)
In this example the expression v lese (in the forest) is denoted as a free adjunct by the valency list of the verb chodit (walk) and by the rule specifying that the preposition v (in) in combination with a class member forms a location expression.
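The use of lexico-semantic constraints for ranking analyses, rather than for rejecting them, can be sketched as follows. This is a minimal Python illustration and not the implementation used in our system. The tiny hierarchy, the class names ("person", "beverage", "substance", "entity"), the shape of the valency frame and the function names are all hypothetical and were chosen only to make the example self-contained.

```python
# Hypothetical fragment of an ontological hierarchy: child -> parent.
HYPERNYM = {
    "nájemce": "person",
    "person": "entity",
    "pivo": "beverage",
    "beverage": "substance",
    "substance": "entity",
}

# Hypothetical valency frame: semantic class required in each role.
VALENCY = {
    "čepovat": {"subject": "person", "object": "beverage"},
}


def is_a(word, cls):
    """Return True if `cls` is reachable from `word` by going up the hierarchy."""
    node = word
    while node is not None:
        if node == cls:
            return True
        node = HYPERNYM.get(node)
    return False


def score(verb, roles):
    """Count the roles whose filler is compatible with the valency frame."""
    frame = VALENCY[verb]
    return sum(1 for role, word in roles.items() if is_a(word, frame[role]))


def rank(verb, analyses):
    """Order analyses by preference; no analysis is thrown away."""
    return sorted(analyses, key=lambda roles: score(verb, roles), reverse=True)


# Both nouns are case-ambiguous, so morphology alone allows two readings.
analyses = [
    {"subject": "pivo", "object": "nájemce"},
    {"subject": "nájemce", "object": "pivo"},
]
for roles in rank("čepovat", analyses):
    print(score("čepovat", roles), roles)
```

Because the result is a sorted list and not a filter, a reading that violates the constraints (for example through metonymy) remains available with a lower preference.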
5 Conclusions
The presented system has the potential to become a launch pad for a robust natural language parser augmented by semantic constraints and case frames. Even the current version has proved to be a highly suitable and useful tool for various kinds of Czech text processing. Future research will aim at extending the meta-grammar formalism and enlarging the coverage of Czech grammatical phenomena.
References
1. Anoop Sarkar. Incremental Parser Generation for Tree Adjoining Grammar. In Proceedings of the 34th Meeting of the ACL, Student Session. Santa Cruz, June 1996.
2. Martin Volk, Gerold Schneider. Comparing a statistical and a rule-based tagger for German. September 1998. http://xxx.lanl.gov:80/ps/cs/9811002.
3. Hajičová, E., Sgall, P., Skoumalová, H. An Automatic Procedure for Topic-Focus Identification. Computational Linguistics 21, pp. 81–94, 1994.
4. Vladislav Kuboň. A Robust Parser for Czech. Technical Report TR-1999-06, UFAL, Charles University, Prague.
5. Chris Dodd, Vadim Maslov. BtYacc – BackTracking Yacc, version 2.1. http://www.siber.com/btyacc/
6. Plátek, M., Holan, T., Kuboň, B., Hryz, J. Grammar Development & Pivot Implementation. JRP PECO 2824 Language Technologies for Slavic Languages, Final Research Report, Prague, 1996.
7. Pavel Smrž, Aleš Horák. Determining Type of TIL Construction with Verb Valency Analyser. In Proceedings of SOFSEM'98, pp. 429-436, Springer-Verlag. Berlin, 1998.
8. G. Miller. Five papers on WordNet. Special Issue of International Journal of Lexicography 3(4). 1990.
9. EuroWordNet: Building a multilingual database with wordnets for several European languages. http://www.let.uva.nl/~ewn/
10. Karel Pala, Pavel Ševeček. Valence českých sloves (Valencies of Czech Verbs). In Proceedings of Works of Philosophical Faculty at the University of Brno, pp. 41-54. Brno, 1997.