Declension of Czech Noun Phrases

Zuzana Nevěřilová
Natural Language Processing Centre, Faculty of Informatics, Masaryk University
Botanická 68a, 602 00 Brno, Czech Republic

Abstract. This paper presents a practical application arising from linguistic studies and language engineering. The application takes as input a Czech noun phrase (NP) or prepositional phrase together with its case (the input case) and outputs the same NP in a requested output case. The declension algorithm is described and its usability is evaluated on a set of 286 (relatively short) corpus examples. The results are promising: 259 of the NPs were judged to be correctly declined. The NPs were extracted from the corpus using a syntactic parser, and the correctness of the NPs declined into a different case was evaluated manually.

1 Introduction

In many natural language processing (NLP) tasks we need analysis as well as synthesis of sentences in natural language. Sometimes the synthesised sentences use the same noun phrases (NPs) as the previous input but within a different syntactic context. For example, in recognising textual entailment (RTE) a program has to be able to reformulate a sentence into another one with (almost) the same meaning, e.g. by converting between active and passive sentences. In these cases NPs very often change their syntactic role (e.g. from object in active sentences to subject in passive sentences). In languages with rich nominal inflection such as Czech (and most other Slavic languages), NPs often have to change their case. Case changing concerns not only the head noun of the NP but also other components, due to the mandatory agreement of a noun and its adjective modifiers. The problem looks quite straightforward, but in this paper we show that in many cases it is not, and sometimes it cannot even be solved without additional (semantic) information.

There are not many papers concerning noun phrase declension from the NLP point of view. On the other hand, it is clear that natural languages have many features in common, such as the construction and attachment of NPs. According to [2], each phrase has a hierarchical phrase structure which is usually represented by a tree.

2 Czech Noun Phrases and Sentence Generation

Syntactic analysis is one of the most important parts of NLP analysis. From a broader perspective, syntactic analysis consists of three main steps: sentence boundary detection, sentence constituent detection and (dependency/phrasal/hybrid) parse tree construction. This work is closely related to the second step, sentence constituent detection. Thanks to syntactic parsers for Czech such as synt [3] or SET [4] we are able to determine sentence constituents such as:

– noun phrases (NP)
– prepositional phrases (PP)
– verb phrases (VP)
– adverbial phrases (AP)

For the purpose of sentence generation we have to determine the sentence constituents as well as generation rules that determine the syntactic roles of these constituents. Sentences are seen as tuples (VP, NP_1, ..., NP_m, PP_1, ..., PP_n, AP_1, ..., AP_p) and generation rules are tuples of functions (f_1, ..., f_r). Each function f_i describes how the ith sentence constituent changes in the newly generated sentence. For example, the sentence “Petr včera snědl bramborovou polévku.”¹ is the tuple (sníst²/ 3PERSON, SG, PAST; Petr³/ NOM; bramborová polévka⁴/ ACC; včera⁵/ ADV, TIME). A corresponding generation rule for passive sentence construction is the tuple (f_1(sníst/ 3PERSON, SG, PAST) ⇒ sníst/ PASSIVE, SG, PAST; f_2(Petr/ NOM) ⇒ ; f_3(bramborová polévka/ ACC) ⇒ bramborová polévka/ NOM; f_4(včera/ ADV, TIME) ⇒ včera/ ADV, TIME). In passive sentences the mandatory agreement concerns the gender and number of the subject (“bramborová polévka”⁶) and the verb phrase (“být snězen”⁷).

¹ Yesterday, Peter ate the potato soup.
² to eat
³ Peter
⁴ potato soup
⁵ yesterday
⁶ “Potato soup” is SG, FEM.
⁷ “to be eaten” has to be changed to “was eaten”/ SG, FEM

The resulting sentence is “Byla snězena bramborová polévka včera.”⁸, generated by the tuple (sníst/ PASSIVE, SG, PAST; ; bramborová polévka/ NOM; včera/ ADV, TIME). In the sentence generation described above the order of sentence constituents does not matter. Since Czech is a language with nearly free word order, these newly generated sentences may sound unusual but are (syntactically) correct. We extended the generation algorithm to generate additional correct sentences by permuting the sentence constituents, but this step goes beyond the scope of this paper. Moreover, in this paper we treat prepositional phrases as preposition + noun phrase. We therefore work with both prepositional phrases (PP) and noun phrases (NP) in nearly the same manner.
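The tuple view of sentences and generation rules can be illustrated with a minimal Python sketch. It is only an illustration of the passive example above, not the implementation used in the application; the names `Constituent`, `apply_rules`, `to_passive`, `drop`, `to_nominative` and `keep` are hypothetical, and a dropped constituent is represented by `None`.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class Constituent:
    lemma: str    # base form, e.g. "bramborová polévka"
    tags: tuple   # explicit tags, e.g. ("ACC",)


# A generation rule maps a constituent to a new one, or to None to drop it.
Rule = Callable[[Constituent], Optional[Constituent]]


def to_passive(c: Constituent) -> Constituent:
    # Replace the person tag by PASSIVE, keep number and tense.
    return Constituent(c.lemma, ("PASSIVE",) + c.tags[1:])


def drop(c: Constituent) -> None:
    # The active subject is not realised in the passive sentence.
    return None


def to_nominative(c: Constituent) -> Constituent:
    return Constituent(c.lemma, ("NOM",))


def keep(c: Constituent) -> Constituent:
    return c


def apply_rules(sentence: List[Constituent], rules: List[Rule]):
    """Apply the ith rule to the ith constituent of the sentence tuple."""
    if len(sentence) != len(rules):
        raise ValueError("one rule per constituent is required")
    return [rule(c) for c, rule in zip(sentence, rules)]


active = [
    Constituent("sníst", ("3PERSON", "SG", "PAST")),
    Constituent("Petr", ("NOM",)),
    Constituent("bramborová polévka", ("ACC",)),
    Constituent("včera", ("ADV", "TIME")),
]
passive_rule = [to_passive, drop, to_nominative, keep]
passive = apply_rules(active, passive_rule)
```

Turning the resulting tuple into a surface sentence still requires declining “bramborová polévka” into the nominative and making the verb phrase agree with it, which is the task addressed in the rest of the paper.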

3 Morphological Analyser

For the declension of Czech noun phrases we use the automatic morphological analyser/generator majka [9]. The authors of the software [5] keep in mind linguistic approaches such as distributed morphology analysis [10], but in the end the application uses a dictionary lookup: it lists “all combinations of recognized input and corresponding outputs” [9]. This dictionary lookup is much faster than a “real” analysis, and majka outputs are therefore usable in analysis as well as in generation. The software works with words, lemmata (base forms) and tags (syntactic categories described by abbreviations) and it operates in two modes:

– analyser: for a given word, all lemmata and possible tags are returned, e.g. for the word “vysokou” the possible (lemma, tag) pairs are (vysoká/ NOUN, ACC, SG, FEM) and (vysoký/ ADJ, ACC, SG, FEM)⁹
– generator: for a given lemma and tag it returns the word, e.g. for the lemma “pes”¹⁰ and the tag NOUN, DAT, PL it returns “psům”

It is clear that a smaller set of tags is needed for generation. For example, in the case of “vysoká” the generator does not need any information about gender. In syntactic parsing a larger set of tags is needed because of agreement checking between nouns, adjective modifiers and verbs. Note that majka uses a rich set of tags described in [8]. That tagset is better suited to automatic processing but more difficult for humans to read. In this paper we use a more explicit set of tags.

⁸ The potato soup was eaten.
⁹ The word “vysoká” has at least two meanings: high (adjective) and university (informal).
¹⁰ dog
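The two lookup modes can be pictured with a small Python sketch. This is not majka's actual interface or data format; the functions `analyse` and `generate` and the toy lexicon are assumptions made for illustration, and the lexicon contains only the readings quoted in the text.

```python
from collections import defaultdict

# Hypothetical toy lexicon of (word form, lemma, tag) triples.
# Only the readings mentioned in the text are listed.
ENTRIES = [
    ("vysokou", "vysoká", ("NOUN", "ACC", "SG", "FEM")),
    ("vysokou", "vysoký", ("ADJ", "ACC", "SG", "FEM")),
    ("psům", "pes", ("NOUN", "DAT", "PL")),
]

# Both directions are precomputed, so each query is a single dictionary lookup.
ANALYSE = defaultdict(list)
GENERATE = {}
for form, lemma, tag in ENTRIES:
    ANALYSE[form].append((lemma, tag))
    GENERATE[(lemma, tag)] = form


def analyse(word):
    """Analyser mode: return all (lemma, tag) pairs known for a word form."""
    return list(ANALYSE.get(word, []))


def generate(lemma, tag):
    """Generator mode: return the word form for a lemma and tag, or None."""
    return GENERATE.get((lemma, tag))
```

With this lexicon, `analyse("vysokou")` returns both the noun and the adjective reading, and `generate("pes", ("NOUN", "DAT", "PL"))` returns “psům”. Unknown inputs yield an empty list or `None`.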

4 Declension Algorithm

In general, the syntactic analyser finds the head noun in each NP. Afterwards, it marks the whole NP with the tag of the head noun, e.g. the accusative NP “bramborovou polévku” in “Petr snědl bramborovou polévku.” has the tags ACC, SG, FEM. The algorithm takes the original NP and its tag as input and outputs the same NP in a different case. In the case of prepositional phrases the preposition is omitted if the input case differs from the output case. After determining the head noun h at the jth position in the NP, the NP is processed sequentially and the following rules (based on [1, p. 175–209]) are applied to each word w_i in the NP:

1. determine all possible lemmata and tags for w_i
2. if w_i is not recognized and i < j, output w_i and do not stop declension for the following words
3. if there is no agreement with h, stop declension for the following words
4. if i > j and w_i is not an adjective, stop declension for the following words
5. if w_i is a coordinating conjunction (such as and, or, neither, respectively) and i < j, allow declension for the following words
6. if w_i is a numeral, proceed in a different mode (decline the following genitive if the numeral means more than 4 and the output case is neither nominative nor accusative)
7. if w_i agrees in gender, case and number, change its case to the output case

Changing the case is quite straightforward for NPs containing:

1. a single noun or pronoun (which is at the same time the head noun), which changes its case: “Petr snědl polévku (accusative).” → “Polévka (nominative) byla snězena.”¹¹
2. a single noun with preceding adjective modifiers: the noun changes its case and its modifiers change their cases as well: “Petr snědl bramborovou polévku (accusative).” → “Bramborová polévka (nominative) byla snězena.”¹²
3. a coordination of nouns: “Petr snědl polévku a rýži (accusative).” → “Polévka a rýže (nominative) byly snězeny.”¹³ In this case the last member of the coordination is considered to be the head noun. All preceding nouns, pronouns and adjectives change their case.

¹¹ Petr ate the soup. The soup was eaten.
¹² Petr ate the potato soup. The potato soup was eaten.
¹³ Petr ate the soup and the rice. The soup and the rice were eaten.
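The word-by-word rules can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the application's code: a hypothetical toy lexicon stands in for the morphological analyser/generator, only rules 1, 2, 3, 4 and 7 are covered (coordination and numerals are left out), and an unrecognized word at or after the head position is assumed to stop declension.

```python
from collections import namedtuple

Reading = namedtuple("Reading", "lemma pos case number gender")

# Hypothetical toy lexicon standing in for a morphological analyser/generator.
_TOY = [
    ("bramborová", Reading("bramborový", "ADJ", "NOM", "SG", "FEM")),
    ("bramborovou", Reading("bramborový", "ADJ", "ACC", "SG", "FEM")),
    ("polévka", Reading("polévka", "NOUN", "NOM", "SG", "FEM")),
    ("polévku", Reading("polévka", "NOUN", "ACC", "SG", "FEM")),
    ("plná", Reading("plný", "ADJ", "NOM", "SG", "FEM")),
    ("plnou", Reading("plný", "ADJ", "ACC", "SG", "FEM")),
    ("lesních", Reading("lesní", "ADJ", "GEN", "PL", "FEM")),
    ("hub", Reading("houba", "NOUN", "GEN", "PL", "FEM")),
]
LEXICON, GENERATOR = {}, {}
for form, reading in _TOY:
    LEXICON.setdefault(form, []).append(reading)
    GENERATOR[reading] = form


def decline(words, head_index, np_tag, out_case):
    """Decline an NP (list of words); np_tag is (case, number, gender) of the head."""
    result, declining = [], True
    for i, word in enumerate(words):
        if not declining:
            result.append(word)
            continue
        readings = LEXICON.get(word, [])                        # rule 1
        if not readings:                                        # rule 2
            # Assumption: an unknown word at or after the head stops declension.
            declining = i < head_index
            result.append(word)
            continue
        agreeing = [r for r in readings
                    if (r.case, r.number, r.gender) == np_tag]
        if not agreeing:                                        # rule 3
            declining = False
            result.append(word)
            continue
        if i > head_index and all(r.pos != "ADJ" for r in agreeing):  # rule 4
            declining = False
            result.append(word)
            continue
        # Prefer the noun reading for the head and the adjective reading elsewhere.
        preferred = "NOUN" if i == head_index else "ADJ"
        reading = next((r for r in agreeing if r.pos == preferred), agreeing[0])
        new_form = GENERATOR.get(reading._replace(case=out_case))     # rule 7
        result.append(new_form if new_form is not None else word)
    return result
```

For the input `["polévku", "plnou", "lesních", "hub"]` with head index 0, tag `("ACC", "SG", "FEM")` and output case `"NOM"`, the head and the agreeing adjective “plnou” are declined, and declension stops at “lesních” because it does not agree with the head.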

Declension becomes more complicated when NPs are more structured. These cases concern:

1. modifiers following the head noun. If they agree with the head noun, these adjectives are treated in the same manner as preceding adjectives, e.g. “Petr snědl polévku plnou lesních hub.”¹⁴ The declension stops after the adjective modifier “plnou”, resulting in “polévka plná lesních hub”.
2. several subsequent nouns starting with capital letters. This happens with proper names, e.g. “pan Pavel Novák”¹⁵. In this case the last noun is considered to be the head noun and all the preceding nouns have to change their case. Since most (human) names have variant forms in the locative and dative, a rhythmical rule is used. The forms “panu Pavlovi”¹⁶ and “panu Pavlu Novákovi”¹⁷ are correct NPs.
3. several subsequent words starting with capital letters. This case is similar to the previous one but concerns naming nominatives such as “v hotelu Praha”¹⁸. The proper name is not subject to declension; it always stays in the (naming or citation) nominative. We can distinguish the two cases using a database of persons' proper names.
4. genitive groups. Declension stops after the head noun, e.g. “fakulta” in “fakulta sociálních studií (nominative)”¹⁹. If the whole NP is in the genitive, the following adjective is ambiguous because it could be treated as in case 1, e.g. “fakulty sociálních studií (genitive)”.
5. numerals. In Czech (as in many other Slavic languages) the declension of NPs with numerals depends on the number expressed by the numeral. If it means one, two, three or four, declension proceeds as in case 2 of the straightforward cases (a noun with agreeing modifiers), e.g. “dva psi (nominative)”²⁰ and “dvou psů (genitive)”. If the numeral expresses a number bigger than four, the NP is treated as a genitive group (e.g. “pět psů (nominative)”²¹), but only in the nominative and accusative.
6. ambiguous (noun or adjective) words occurring in the NP. There is a specific class of Czech nouns derived from adjectives. Originally these nouns were parts of a noun phrase that was progressively reduced to its adjective part, e.g. “vysoká škola”²² is reduced to “vysoká”²³ (at present used informally). After this transformation the nouns have the same forms as the adjective. When ajka determines a word to be both an adjective and a noun, a significant ambiguity can occur if the whole NP is in the genitive, e.g. “vysoké stromy”²⁴. Since this ambiguity can be recognized by language users, we assume that they tend to avoid it. Therefore, if there are two subsequent nouns n_1 and n_2 in the NP and n_1 can also be an adjective, the algorithm prefers the adjective reading.
7. coordinations of genitive groups. For NPs in the genitive, such as in “Byli jsme tam i bez předsedy oblastní rady a prezidenta asociace malých výrobců.”²⁵, it is not feasible to determine the head noun correctly without additional (semantic) information, and therefore the algorithm has lower precision.

¹⁴ Petr ate the soup full of wild mushrooms.
¹⁵ Mr. Pavel Novak
¹⁶ Mr. Pavel (dative)
¹⁷ Mr. Pavel Novak (dative)
¹⁸ in the Praha hotel
¹⁹ faculty of social studies
²⁰ two dogs
²¹ five dogs
²² university, but literally “vysoká škola” means high school
²³ high
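The treatment of numerals can be reduced to a small decision function. The sketch below only encodes the rule as stated in the text (numerals up to four agree with the counted noun; larger numerals take a genitive group in the nominative and accusative only). The function name `counted_noun_case` is hypothetical, and the numeral is assumed to have already been mapped to a positive integer.

```python
def counted_noun_case(value, output_case):
    """Return the case of the counted noun after a numeral with the given value.

    Assumes value is a positive integer and output_case is a tag such as
    "NOM", "GEN", "DAT", "ACC", "LOC" or "INS".
    """
    if value <= 4:
        # One to four: the noun agrees with the numeral, like a modified noun.
        return output_case
    if output_case in ("NOM", "ACC"):
        # Five and more: genitive group, but only in nominative and accusative.
        return "GEN"
    # In the remaining cases the noun is declined into the output case.
    return output_case
```

For example, `counted_noun_case(2, "NOM")` gives `"NOM"` (“dva psi”), while `counted_noun_case(5, "NOM")` gives `"GEN"` (“pět psů”).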

5 Evaluation

We have evaluated the precision of the algorithm on 286 NPs from the corpus. We did not measure recall since NP detection depends on the syntactic analyser. Table 1 shows the number of correct declensions depending on the length of the input NP.

input NP length    # of input NPs    # of correct output NPs
6                  1                 0
5                  1                 1
4                  5                 4
3                  25                14
2                  37                31
1                  212               209
total              286               259

Table 1. Number of correct declensions depending on the length of the input NP (expressed in number of words).

The preliminary evaluation gives promising results: of the 286 NPs, 259 (about 90.6 %) were judged correct.

²⁴ high trees, but also trees of the university
²⁵ This sentence is ambiguous and most humans will translate it as “We were there without the local council chair and small manufacturers association president.” The other interpretation is roughly “We were there without the chair of the local council and small manufacturers association president.”
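The precision figures implied by Table 1 can be recomputed with a few lines of Python. The sketch below simply divides the number of correct outputs by the number of inputs for each NP length; the row values and the totals are copied from the table as reported, and the totals are not recomputed from the rows.

```python
# (NP length in words, number of input NPs, number of correct output NPs),
# copied from Table 1.
ROWS = [
    (6, 1, 0),
    (5, 1, 1),
    (4, 5, 4),
    (3, 25, 14),
    (2, 37, 31),
    (1, 212, 209),
]
# Totals as reported in the table.
TOTAL_INPUT, TOTAL_CORRECT = 286, 259


def precision(correct, total):
    """Share of output NPs judged correct."""
    return correct / total


per_length = {length: precision(ok, n) for length, n, ok in ROWS}
overall = precision(TOTAL_CORRECT, TOTAL_INPUT)

for length, value in sorted(per_length.items()):
    print(f"{length} word(s): {value:.1%}")
print(f"overall: {overall:.1%}")
```

The per-length values make the trend in the table explicit: single-word NPs are declined correctly in almost all cases, while precision drops noticeably for NPs of three or more words.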

6 Conclusion

This paper presents a practical application that takes a noun phrase (or a prepositional phrase) and its case (called the input case) and outputs the noun phrase in a new case (called the output case). The application is used wherever sentences are generated in Czech, namely in the game X-plain [6] and in the inferencing system using verb valency frames [7]. A broader discussion is needed on whether the algorithm is transferable to other languages. For most Slavic languages NP/PP generation is a complex issue. In future we should test the algorithm on closely related languages such as Slovak or Polish. The success rate depends heavily on appropriate morphological analysers/generators. We have evaluated the precision of the results on a small corpus and shown that the application is usable. It can be run from . In future we plan a more detailed evaluation on different corpora.

Acknowledgments This work has been partly supported by the Czech Science Foundation under the project P401/10/0792.

References

1. Grepl, M.; Karlík, P.: Skladba spisovné češtiny. Edice Učebnice pro vysoké školy, Státní naklad., 1986.
2. Hawkins, J. A.: The Typology of Noun Phrase Structure from a Processing Perspective. October 2008.
3. Horák, A.; Kadlec, V.: New Meta-grammar Constructs in Czech Language Parser synt. In Proceedings of the 8th International Conference on Text, Speech and Dialogue (TSD 2005), Springer-Verlag, 2005, pp. 85–92.
4. Kovář, V.; Horák, A.; Jakubíček, M.: Syntactic Analysis Using Finite Patterns: A New Parsing System for Czech. In Human Language Technology. Challenges for Computer Science and Linguistics: 4th Language and Technology Conference, LTC 2009, Poznań, Poland, November 6-8, 2009, Revised Selected Papers, Springer, 2011, p. 161.
5. Šmerk, P.: K počítačové morfologické analýze češtiny [online]. Dissertation, Masaryk University, Faculty of Informatics, 2010 [cited 2012-05-21].
6. Nevěřilová, Z.: X-plain – A Game That Collects Common Sense Propositions. In Proceedings of NLPCS, Funchal, Portugal: SciTePress, 2010, ISBN 978-989-8425-13-3, pp. 47–52.
7. Nevěřilová, Z.: Common Sense Inference using Verb Valency Frames. In Proceedings of the 15th International Conference on Text, Speech and Dialogue TSD 2012, Brno: Springer-Verlag, 2012, submitted.
8. Sedláček, R.: ajka tagset. online; accessed 2012-05-21 from , 2006.
9. Šmerk, P.: Fast Morphological Analysis of Czech. In Proceedings of the Raslan Workshop 2009, Masarykova univerzita, 2009, ISBN 978-80-210-5048-8.
10. Ziková, M.; Caha, P.: Princip synkretismu aneb Augiášův chlév české deklinace. Linguistica ONLINE, volume 1, 2006, ISSN 1801-5336.