Zygmunt Saloni, Włodzimierz Gruszczyński, Marcin Woliński, Robert Wołosz
Grammatical Dictionary of Polish Presentation by the Authors
Abstract: The dictionary provides a comprehensive grammatical description of Polish words. It covers about 180,000 lexical items (lexemes). The dictionary has been compiled in an electronic form and made accessible via a computer program. All lexemes are morphologically and syntactically characterized by a set of features, which are displayed on the monitor. Additionally, some regular derivatives are presented in entries. The inflectional description strives for completeness, while derivation and syntax are described as far as a clear formalized approach was feasible. Non-inflected lexemes are provided with their part-of-speech feature, and valence information is added where instructive (case government for prepositions, type of conjoined phrase for conjunctions). The compact disc containing the program is accompanied by a printed introduction, facilitating its use and presenting the authors’ theoretical principles. The dictionary may serve as a source of research in the domain of inflection and, to some extent, syntax of Polish. It can also be used in the automatic processing of Polish. It may also be useful for teaching Polish, especially to foreigners.

Key words: dictionary, Polish, grammar, electronic dictionary, inflection, formalized approach

Streszczenie: Słownik dostarcza obszernego opisu gramatycznego polskich słów. Zawiera ok. 180 000 jednostek leksykalnych (leksemów). Został on opracowany w postaci elektronicznej, a korzystanie z niego odbywa się poprzez program komputerowy. Wszystkie leksemy są scharakteryzowane za pomocą zestawu cech, które pojawiają się na monitorze. Dodatkowo, niektóre regularne derywaty pojawiają się w hasłach. Staraliśmy się podać możliwie wyczerpujący opis fleksyjny, natomiast słowotwórstwo i składnia były potraktowane na tyle dokładnie, na ile pozwalało na to podejście formalne. Dla leksemów nieodmiennych podajemy część mowy; informacja dotycząca walencji jest podana tam, gdzie jest to instruktywne (rząd przypadka dla przyimków, typ zdania złożonego dla spójników). CD zawierającemu program towarzyszy drukowany wstęp ułatwiający korzystanie, a także przedstawiający założenia teoretyczne autorów. Słownik może stanowić źródło dla badań nad polską fleksją, a także, w pewnym zakresie, składnią. Może też znaleźć zastosowanie w przetwarzaniu automatycznym języka polskiego. Będzie też przydatny w nauczaniu języka polskiego, zwłaszcza jako obcego.

Słowa klucze: słownik, język polski, gramatyka, słownik elektroniczny, fleksja, formalizacja gramatyki
1. Introduction

The content of this article is related to the talks given at the meeting of the Polish Academy of Sciences Committee of Linguistics in October 2007 and at the 7th European Conference on Formal Description of Slavic Languages (FDSL-7), held at the University of Leipzig, Germany, in December 2007. Its purpose is to present the Grammatical Dictionary of Polish, Słownik gramatyczny języka polskiego (Saloni et al. 2007, henceforth: SGJP), compiled by the authors and published by Wiedza Powszechna Publishers in December 2007.

SGJP gained its final shape as a result of the project 2 H01D 007 24 Słownik gramatyczny języka polskiego, sponsored by the Polish Ministry of Science and carried out at the University of Varmia and Masuria in Olsztyn in 2003–2006. The participants of this project, besides the authors of the dictionary and of this paper, were several colleagues who did some auxiliary but important work. They were: Monika Czerepowicka, Dorota Kopcińska, Małgorzata Sas, and Anna Śledź. We are also indebted to volunteers who analysed particular groups of lexemes: Joanna Bartycha, Alena Bielewicz, Patrycja Dąbrowska, Przemysław Lipski, Danuta Makowska, Teofil Mroczek, Laura Polkowska, and Joanna Szumigłowska. At various stages of our work we obtained essential comments and suggestions from our colleagues. We thank them all sincerely. We express special thanks to Janusz Bień, who supported our project from the very beginning.

The dictionary provides a comprehensive grammatical description of Polish words. It is compiled in electronic form and made accessible via a computer program. All lexemes are characterized by a set of features, which are displayed on the monitor. Dictionary entries can be selected by typing them into a dialog box or marking them on the list of entries. The CD containing the program is accompanied by a printed introduction, facilitating its use and presenting the authors’ theoretical principles.
2. The History of the Project

The idea of SGJP was conceived by Z. Saloni under the influence of A. Zaliznjak’s grammatical dictionary of Russian (Zaliznjak 1977, cf. Saloni 1979). The project to compile such a dictionary, formulated immediately after analyzing the Russian model, has been carried out since then slowly and with variable intensity. It is clear that the conception of the dictionary had to evolve during those 30 years. For example, in 1975 the only possible format in which to publish a dictionary was a traditional book; in 2007 it no longer makes much sense (we will comment further on this statement later). Nevertheless, some work was already carried out in the eighties by Z. Saloni and his students at the Białystok Branch of Warsaw University. The first task consisted in analyzing the grammatical information in the main Polish dictionary, usually referred to as Doroszewski’s dictionary (Doroszewski 1958–1969, 11 volumes, henceforth: SJPDor.). The results of those analyses were presented in a series of master’s theses, partially published, as well as in other articles, in the three volumes of Studies in
Polish Contemporary Lexicography (Studia z polskiej leksykografii współczesnej, vide Saloni, ed. 1977, 1978, 1979). At this stage the grammatical information contained in SJPDor. was transferred into a card index (almost 130 000 items, on the basis of ca. 125 000 source entries). The cards were useful in the next step, especially for a comprehensive analysis of the declension of Polish common nouns, conducted by W. Gruszczyński in his PhD thesis (Gruszczyński 1989).

Very useful material for our analytical work was the reverse index to SJPDor., Indeks a tergo do SJPDor. (Grzegorczykowa and Puzynina 1973), initially intended as an additional volume to accompany the main work, but finally published separately. We used it from the very beginning. The items in this index were given very general grammatical characteristics (a part-of-speech symbol and the main inflectional group, without details) taken from the dictionary. Those draft notes (the revised version of the corrected and annotated Indeks a tergo do SJPDor. was uploaded onto the Internet by R. Wołosz) turned out to be an essential starting point for further work.¹

The above source was first used during final work on Jan Tokarski’s schematic reverse index of Polish word forms (Tokarski 1993). This new index gave strings of typical closing letters of word forms connected with a given grammatical characteristic. The preliminary version was prepared by the author in the form of a rough draft during his work in the team compiling SJPDor. Later, although convinced of its usefulness, Tokarski did not see the possibility of developing and publishing it, so he bequeathed it to Z. Saloni to elaborate and publish in a well-ordered form. This index became the source material for several computer programs for morphological analysis of Polish, two of which were constructed by members of the SGJP team (vide Wołosz 2005, Woliński 2006).
The two reverse indexes mentioned above were an introduction to a much more important work: an electronic version of the list of headwords of SJPDor., with grammatical information prepared by Robert Wołosz. This list could be used as source data for a spell checker, but could only serve as a first approximation of a list of entries for a dictionary. We needed to check all the details to be included in a well-edited work. This task began with Polish verbs. Polish conjugation is complicated (the typical paradigm contains forms represented textually by at least 37 different words), but it has been quite precisely analyzed and characterized by J. Tokarski (Tokarski 1951). However, the level of exactness of his analysis was not sufficient for automated analysis. It was necessary to examine the whole material and, where necessary, change the qualification. The results were presented in the handbook of Polish conjugation (Saloni 2001) and included in their entirety in SGJP.
¹ A copy of the book with pencil annotations for every entry is preserved as documentation of this stage of our work.
3. The Grammatical Model

The work characterized in the previous section was carried out in the framework of a model of Polish grammar developed and enriched over ca. 35 years by a team of Polish linguists. For more than 10 years a computer specialist, Marcin Woliński, has also been cooperating with this group. The description used in SGJP refers to the tradition of Polish grammar and takes over its well-established achievements. However, it also contains some new ideas proposed in the second half of the 20th century.

The predecessor of the description given by the authors of SGJP was Jan Tokarski’s work. He developed a homogeneous classification of alternations in Polish verbal paradigms (organized in thematic groups). He was also the author of the conception of inflectional information in SJPDor. (vide Tokarski 1958).² Our attempts to present Polish grammar in a formalized rigorous form can be treated as a natural continuation of Tokarski’s work. For example, the general conception of the classification of Polish lexemes (Saloni 1974) referred to classes introduced in SJPDor. (and in later dictionaries based on it).

Many solutions applied in SGJP were worked out earlier by members of the group (above all, the entire conception of gender description, declension of pronouns and numerals, the two-stage presentation of the morphological description: deep and surface, according to the function of the forms and enumeration of the base forms; vide Bibliography). The essence of this description is presented in a university textbook of Polish syntax (Saloni–Świdziński 1981). Some new ideas for the description of conjunctions were taken from a later formalized description of Polish syntax, formulated by M. Świdziński (Świdziński 1992). The division of uninflected lexemes is based on the conceptions of R. Laskowski and M. Grochowski (Laskowski 1984, Grochowski 1997).
We will present here two crucial aspects of our model, applicable mainly to nouns, but having essential implications for the description of other classes of lexemes: the repertoire of genders/subgenders, and the introduction of a new inflectional category for nouns: depreciativity. The reader can see applications of our decisions in the examples given below. Moreover, their presentation can help to see the character of our grammatical model and its position in contemporary structural linguistics.
² Tokarski also explicitly suggested compiling a grammatical dictionary of Polish; cf. Tokarski 1969, p. 390.

3.1. Gender

The system of genders adopted in SGJP is based on Saloni (1976b), which is a continuation of Mańczak’s analysis (Mańczak 1956), also taking into consideration the analysis of gender differentiation in Russian by Zaliznjak (1967). According to a long and important tradition, we understand the grammatical gender of a noun as its syntactic property, consisting in requiring a particular form of the subordinate word. So the grammatical gender of the noun manifests itself in the form of words associated with it. (In linguistic literature grammatical genders are sometimes
called noun classes.) Theoretically, every noun must belong to one of the classes (in practice, we permit a very few exceptions). Traditional grammarians of Polish distinguished between masculine, feminine, and neuter genders, e.g., ten dobry chleb (m) ‘this good bread’, ta dobra woda (f) ‘this good water’, to dobre wino (n) ‘this good wine’. However, this classification is not exhaustive. Pluralia tantum nouns cannot be put into any diagnostic context for the nominative singular: ten dobry ___, ta dobra ___, to dobre ___. So we distinguish an additional gender, which we name co-plural and mark with the symbol p. Thus on the basis of nominative singular forms of adjectives and verbs associated with the noun (in the so-called agreement) we distinguish four genders of Polish nouns.

However, analogous distinctions occur in Polish also in other cases of adjectives governed by nouns, mainly in the accusative; noun forms also require different forms of numerals. If we take into account all of the above-mentioned features, we must distinguish nine noun classes in Polish. Ultimately, we have the following system of Polish genders:

Widzę jednego albo dwóch spośród tych ___, których lubię.        m1
Widzę jednego albo dwa spośród tych ___, które lubię.            m2
Widzę jeden albo dwa spośród tych ___, które lubię.              m3
Widzę jedno albo dwoje spośród tych ___, które lubię.            n1
Widzę jedno albo dwa spośród tych ___, które lubię.              n2
Widzę jedną albo dwie spośród tych ___, które lubię.             f
Widzę jedno albo dwoje spośród tych ___, których lubię.          p1
Widzę jedne albo dwoje spośród tych ___, które lubię.            p2
Widzę (jedną albo dwie pary) spośród tych ___, które lubię.      p3

(approximate English translation: I see one or two from among those whom/which I like.)

We distinguish three subgenders of the masculine (vide Mańczak 1956, Saloni 1976b), two subgenders of the neuter, and three subgenders of the co-plural (vide Saloni 1976b).
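The diagnostic frames above can be read as a small lookup table: each gender is identified by the forms of "one", "two", and the relative pronoun that the noun requires. The following is a minimal sketch under that reading, not the representation used in SGJP; the names `Frame`, `GENDER_FRAMES`, and `gender_of` are hypothetical, and the p3 frame is simplified by storing "dwie pary" as its "two" slot.

```python
from typing import NamedTuple, Optional


class Frame(NamedTuple):
    one: str       # form of "one" in "Widzę ... albo ..."
    two: str       # form of "two"
    relative: str  # relative pronoun in "..., których/które lubię"


# Forms copied from the nine diagnostic sentences in the text.
# Assumption: p3 is encoded with "dwie pary" in the `two` slot.
GENDER_FRAMES = {
    "m1": Frame("jednego", "dwóch", "których"),
    "m2": Frame("jednego", "dwa", "które"),
    "m3": Frame("jeden", "dwa", "które"),
    "n1": Frame("jedno", "dwoje", "które"),
    "n2": Frame("jedno", "dwa", "które"),
    "f": Frame("jedną", "dwie", "które"),
    "p1": Frame("jedno", "dwoje", "których"),
    "p2": Frame("jedne", "dwoje", "które"),
    "p3": Frame("jedną", "dwie pary", "które"),
}


def gender_of(one: str, two: str, relative: str) -> Optional[str]:
    """Return the gender whose diagnostic frame matches the observed forms."""
    observed = Frame(one, two, relative)
    for gender, frame in GENDER_FRAMES.items():
        if frame == observed:
            return gender
    return None


print(gender_of("jednego", "dwa", "które"))
```

In this encoding no two genders share a frame, which is the sense in which the nine sentences jointly distinguish nine noun classes.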
3.2. Depreciativity

One of the most disputable decisions in our description is the introduction of an additional inflectional category: depreciativity (vide Saloni 1988). In contemporary Polish many masculine nouns denoting humans have an additional form in the nominative plural (we call it depreciative). For example, the lexeme pirat ‘pirate’ has, beyond the regular nominative plural form piraci, a second one: piraty, which is significantly rarer, marked, and requires other forms of the dependent adjectives and verbs. The description of such forms in Polish grammar textbooks is hesitant. As a rule, their existence is mentioned in comments (sometimes they are called impersonal), but they are not shown in paradigms. Consequently, many non-professional but educated native speakers may have problems with how to qualify grammatically forms of the type piraty. The Polish version of the popular computer program Word marks them as incorrect (and, for piraty, suggests the
corrections: piaty, pirat, piryty, pirata), although they occur fairly frequently in Polish texts (many occurrences can be found via Google). These forms are in syntactic opposition to the regular ones, as is shown by the following examples:

Byli to dobrzy piraci/oficerowie.
Były to dobre piraty/oficery.
‘They were good pirates/officers.’

It sometimes happens that both forms coincide morphologically; however, we can identify them by their dependents:

Byli to tacy prości żołnierze.
Były to takie proste żołnierze.
‘They were such common soldiers.’
Therefore we introduced this opposition also into the adjective and verb paradigms. Interestingly enough, the “neutral” form has, in terms of superficial morphology, special features, while the depreciative one is totally “regular”, i.e., it is derived like the only nominative plural form of nouns of other genders representing the same inflectional pattern. For instance, for the noun pirat the depreciative form is piraty (like the nom. pl. of the nouns aparat (m3) ‘device’: aparaty, termit (m2) ‘termite’: termity, or chata (f): chaty), and the “neutral” form piraci has no analogy in the paradigms of nouns of other genders. There are nouns (mostly with some tint of expressiveness) for which the depreciative forms are much more frequent, e.g., przedszkolaczek ‘kindergarten pupil’: (te) przedszkolaczki (były), cham ‘boor’: (te) chamy (były), and the existence of the neutral, “personal” form is debatable. When we carry out an informal survey, Poles, as a rule, are not sure whether one can say (ci) przedszkolaczkowie (byli) or (ci) chami/chamowie (byli), and reply that these forms do not exist; however, it is possible to find instances of them in texts (especially on the Internet). Therefore we must treat them as possible, permitted both by the system and by usage.

We treat depreciativity as a grammatical category. The meaning of depreciative forms varies for individual nouns; however, as a rule they are marked. Most often they express disgust, aversion, contempt, disrespect. The typical contrast is the following: dobrzy kierowcy ‘good drivers’ may be said with approbation; dobre kierowce is unambiguously ironic. Nevertheless, it sometimes happens that the depreciative is used in archaization or introduces a special, informal mood, e.g. for chłop in a secondary meaning ‘man’: dobrzy chłopi ‘good men’ (formal, unusual) and dobre chłopy ‘good guys’.
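The agreement facts described above can be sketched as a toy model in which the depreciativity value selected for the noun also selects the forms of the verb and the adjective. This is only an illustration built from the example sentences in the text; the dictionaries `NOUNS` and `DEPENDENTS` and the function `predicative_clause` are hypothetical names and do not reflect how SGJP stores its data.

```python
# Nominative plural forms taken from the example sentences in the text.
NOUNS = {
    "pirat": {"neutral": "piraci", "depreciative": "piraty"},
    "oficer": {"neutral": "oficerowie", "depreciative": "oficery"},
    # Both values coincide morphologically for this noun.
    "żołnierz": {"neutral": "żołnierze", "depreciative": "żołnierze"},
}

# Forms of the dependents required by each value of depreciativity.
DEPENDENTS = {
    "neutral": {"być": "byli", "dobry": "dobrzy", "taki": "tacy", "prosty": "prości"},
    "depreciative": {"być": "były", "dobry": "dobre", "taki": "takie", "prosty": "proste"},
}


def predicative_clause(noun: str, adjective: str, depreciative: bool = False) -> str:
    """Build 'Byli/Były to ADJ NOUN.' with agreeing verb and adjective forms."""
    value = "depreciative" if depreciative else "neutral"
    forms = DEPENDENTS[value]
    verb = forms["być"].capitalize()
    return f"{verb} to {forms[adjective]} {NOUNS[noun][value]}."


print(predicative_clause("pirat", "dobry"))
print(predicative_clause("pirat", "dobry", depreciative=True))
```

For żołnierz the two calls return the same noun form, and only the verb and adjective reveal which value was chosen, which is the point made by the second pair of examples.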
4. The Scope of the Dictionary

The core of the SGJP vocabulary consists of words found in readily available sources: dictionaries and texts. Although the vocabulary is large, it does not include all Polish words. It is hoped, however, that the dictionary includes all possible inflectional patterns for all inflecting lexemes of Polish.
Below we present data concerning main classes and subclasses of entries, as well as the inflectional patterns in SGJP for particular classes. Among the entries, in addition to lexemes whose characterization is the task of SGJP, there are prefixes occurring in texts as initial parts of words. They are very productive and can be used spontaneously to derive new lexical units, which can be found in texts rarely; such new units are inflected like the basic lexemes without prefixes.

                                     Entries    Patterns
total                                244 669       1 095
  prefixes                                81           2
  lexemes                            244 588       1 095
    nouns                            135 529         762
      pronouns                             6           6
      gerunds                         29 590           2
      -ość                            28 980           1
      others                          68 171
        proper                         8 782
        common                        59 389
    adjectives                        65 671          71
      participles                     34 301           1
        active                        13 931
        passive                       20 370
      “regular”                       31 370          71
        comparative                      950           1
        positive                      30 420          71
    deadjectival adverbs              11 146           1
      comparative                      1 106           1
      positive                        10 040           1
    numerals                              98          45
    verbs                             29 532         215
      predicatives                        35           1
      conjugated                      29 497         214
    others                             2 612           2
      other adverbs                      491           1
      particles                          193           1
      prepositions                       113           2
      conjunctions                       121           1
      interjections and the like         458           1
      abbreviations                    1 117           1
      others                             119           1
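As a reading aid for the entry counts above, the following sketch checks that the subclass figures add up to the class totals they belong to. It uses only numbers quoted in the table; the structure `ENTRY_TOTALS` and the function `mismatches` are hypothetical names introduced for illustration, and the breakdown of nouns is deliberately left out of the check.

```python
# Each key maps to (reported total, {reported components}).
# All numbers are entry counts copied from the table in the text.
ENTRY_TOTALS = {
    "total": (244_669, {"prefixes": 81, "lexemes": 244_588}),
    "lexemes": (244_588, {
        "nouns": 135_529, "adjectives": 65_671, "deadjectival adverbs": 11_146,
        "numerals": 98, "verbs": 29_532, "others": 2_612,
    }),
    "other nouns": (68_171, {"proper": 8_782, "common": 59_389}),
    "adjectives": (65_671, {"participles": 34_301, "regular": 31_370}),
    "participles": (34_301, {"active": 13_931, "passive": 20_370}),
    "regular adjectives": (31_370, {"comparative": 950, "positive": 30_420}),
    "deadjectival adverbs": (11_146, {"comparative": 1_106, "positive": 10_040}),
    "verbs": (29_532, {"predicatives": 35, "conjugated": 29_497}),
    "others": (2_612, {
        "other adverbs": 491, "particles": 193, "prepositions": 113,
        "conjunctions": 121, "interjections and the like": 458,
        "abbreviations": 1_117, "others": 119,
    }),
}


def mismatches(table: dict) -> list:
    """Return the names of classes whose components do not sum to the total."""
    return [name for name, (total, parts) in table.items()
            if sum(parts.values()) != total]


print(mismatches(ENTRY_TOTALS))
```

An empty list means every checked total equals the sum of its listed components.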
In the above list we put before the core of a given class of lexemes the subclasses that have properties atypical or less typical for that class. For nouns, this category includes groups of deverbal and deadjectival nouns as well as a group of several substantive pronouns (ja, ty, my, wy, on, się) whose relation to the category of gender is complicated (Polish nouns have stable gender, whereas forms of, e.g., the lexeme ja occur with verbal and adjectival forms of various gender values: ja byłem ‘I was’ (m), ja byłam ‘I was’ (f)), and whose paradigm is idiosyncratic from the point of view of Polish inflection as a whole. Contrary to the tradition of Polish grammarians, we treat differentiated groups of adjectival and adverbial forms of positive and comparative degrees as separate lexemes (too few adjectives participate in this opposition to accept it as having an inflectional character); the comparatives are listed in the table before the positives. In the class of verbs we distinguish the so-called predicatives, verbs derived from words of another type and having no superficial features of conjugation (no specific verbal endings), like potrzeba (derived from the nominative singular of the noun potrzeba ‘need’) or niepodobna ‘it is impossible’ (derived from an adjective form), in contrast to regular verbs, which are inflected by means of typical conjugational morphemes.

We have set apart subclasses of lexemes that have not been entered in SGJP independently: deverbal nouns (gerunds) and adjectives (participles) derived automatically from verbal entries, and names of attributes (properties) and regular deadjectival adverbs derived from adjective entries. Such entries, although potentially existing, might not have been verified in texts and corpora and have been less carefully characterized (elaboration of lexical items varies depending on their stylistic status and their frequency in contemporary Polish texts).
The consequence of this decision is the number of entries in SGJP (shown in the lower part of the screen): in the published version there are 244 669 entries (lexemes) and 4 223 981 word forms (counting syncretic forms of the same lexeme as one unit). However, we treat this number as overestimated. The four classes of derived lexemes number ca. 100 000 units; however, about half of them are quite regular, neutral lexemes of contemporary Polish. Therefore in informative and advertising materials we define the size of SGJP as ca. 180 000 lexemes. It is worth adding that there are groups of lexemes that are characterized in SGJP although they are not on the list of its entries. This problem will be discussed below.

SGJP, unlike the most popular general monolingual Polish dictionaries, includes the most frequent and most useful proper names, geographical and personal (first names used by Poles and two categories of surnames: popular or belonging to famous persons).

We present the number of inflectional patterns for the classes and subclasses here unsystematically, for general illustration of our principles. In any case, the subclasses of atypical lexemes as well as the classes of lexemes derived automatically are associated with few patterns; it is the main class of lexemes that produces the diversity of patterns in each part of speech. Parenthetically, we feel obliged to explain that the second pattern for prepositions and prefixes is connected with units which in texts can occur with or without e, e.g. pode mną vs. pod tobą ‘under me/you’, pode+przeć vs. pod+parł ‘support’ (two forms).
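The figure of ca. 100 000 derived lexemes can be related to the entry counts given earlier. The short sketch below simply adds the four automatically derived classes as reported in the table of entries; the variable names are hypothetical, and identifying these four rows with "the four classes" is our reading of the text.

```python
# Entry counts of the automatically derived classes, as given in the table.
DERIVED_CLASSES = {
    "gerunds": 29_590,               # deverbal nouns
    "participles": 34_301,           # deverbal adjectives
    "-ość nouns": 28_980,            # names of attributes
    "deadjectival adverbs": 11_146,
}

derived_total = sum(DERIVED_CLASSES.values())
print(derived_total)
```

The sum is slightly above one hundred thousand, in line with the rounded figure quoted in the text.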
5. Information Provided by the Dictionary

The Grammatical Dictionary of Polish has as its goal a complete grammatical characterization of contemporary Polish vocabulary. However, such a rigorous description is possible in different respects and to different degrees.

First of all, in order to describe textual units it is necessary to classify and group them into lexical units, or lexemes, organized in a strict and fixed manner. This grouping and organizing is the subject of the inflectional description. Therefore we strive for completeness of the inflectional description (complete information on form variation for virtually all Polish inflecting lexemes). This means that each form of any lexeme is included with all values of all morphological categories (categories for which a given lexeme inflects). However, this does not mean that all values are visible at any particular moment. Some word forms (identified as strings of letters) are syncretic (connected with several combinations of specific values) and sometimes their shape is more important than their function. Therefore we decided to introduce lexeme forms in two stages. In the first stage (surface) we are interested in the textual exponent of the given form, in the second (deep) in its function (vide Mel’čuk 1974). The two stages of inflectional description will be illustrated below.

Consequently, the most important syntactic functions of word forms have been defined in the paradigms. They can be generalized to a basic syntactic characterization of lexemes. However, it is only a general, limited characterization. It can be said that the syntax in SGJP is described to the extent to which a clear formalized approach was feasible.
For nouns the dictionary defines their gender (on a high level of precision, with masculine, neuter, and pluralia tantum nouns split into subclasses; vide above); for numerals it defines the type of their syntactic relation with nouns; for verbs, besides perfective/imperfective aspect, it provides information about transitivity and reflexiveness (co-occurrence with się), obligatory or optional. Case government (except for the nominative-subject) of verbs is rarely defined, for reasons of organization of the whole work; however, it is systematically introduced for prepositions. Generally speaking, non-inflected lexemes are provided with their part-of-speech feature, and valence information is added where instructive (case government for prepositions, type of conjoined phrase for conjunctions).

Moreover, the entries of SGJP also contain some information about the derivational features of lexical units. This information is given by links between lexemes, e.g., between elements of aspectual pairs for verbs, between a verb and its nominal derivatives (gerunds and participles), between adjectives and their regular derivatives: adverbs and nouns (substantival names of qualities), between positive and comparative adjectives.
5.1. What is a Lexeme in SGJP?

SGJP contains no definitions. Short glosses suggesting the meaning are included for homonymous entries. Homonymy is treated purely formally. The main reason to consider lexemes as different is the existence of differing inflectional paradigms.
For example, SGJP contains only one lexeme para although the word can have two clearly distinct meanings: ‘couple’ and ‘vapor’. Both meanings lead to exactly the same paradigm. In this case we do not need glosses. When the difference in meaning is accompanied by some grammatical features included in the dictionary, the lexemes are differentiated. Thus three different lexemes bokser are distinguished, because three different meanings are associated with three sets of forms, each having its own syntactic features. This is reflected in their gender: it is m1 for bokser ‘athlete’, m2 for bokser ‘breed of dogs’, and m3 for bokser ‘kind of car engine’. Some exceptions are possible for lexical units with a clearly defined unique meaning but vague gender, e.g. człowieczysko ‘great good chap’ m1 / n2 or cabernet ‘Cabernet’ m2 / m3 / n2.

There are also lexical units that are treated in SGJP as lexemes and characterized grammatically (i.e., inflectionally) although they are not included explicitly in the list of entries. One example is the superlative: a search for any superlative form refers the user to the corresponding comparative (superlatives in Polish are derived from comparatives with the prefix naj+). Similarly, negated adjectives with the prefix nie+ are not included in the list of entries, but searching for one refers the user to its non-negative counterpart.
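The two ideas of this section, homonyms distinguished only by grammatical features and searches redirected from superlatives and negated adjectives, can be sketched as follows. This is a toy model under stated assumptions: the names `HOMONYMS`, `ADJECTIVE_ENTRIES`, and `resolve` are hypothetical, the sample adjectives dobry, lepszy, and niebieski are our own illustrations rather than data quoted from SGJP, and the lookup works on base forms only.

```python
from typing import Optional

# One headword maps to as many lexemes as it has distinct grammatical
# characterizations; genders for bokser are those given in the text.
HOMONYMS = {
    "bokser": {"m1": "athlete", "m2": "breed of dogs", "m3": "kind of car engine"},
    "para": {"f": "couple; vapor"},  # one paradigm, hence one lexeme
}

# Illustrative adjective headwords (assumed sample, not taken from SGJP).
ADJECTIVE_ENTRIES = {"dobry", "lepszy", "niebieski"}


def resolve(query: str) -> Optional[str]:
    """Return the entry a search should land on, or None if nothing matches."""
    if query in ADJECTIVE_ENTRIES:
        return query
    # naj+ marks a superlative (redirect to the comparative),
    # nie+ marks a negated adjective (redirect to the non-negated one).
    for prefix in ("naj", "nie"):
        rest = query[len(prefix):]
        if query.startswith(prefix) and rest in ADJECTIVE_ENTRIES:
            return rest
    return None


print(resolve("najlepszy"), resolve("niedobry"), resolve("niebieski"))
```

Checking the list of entries before stripping a prefix keeps words that merely begin with nie, such as niebieski, from being treated as negations.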
5.2. Examples

How we present the dictionary information will be illustrated with examples of typical entries belonging to three main classes of lexemes. We begin with adjectives because they best illustrate our method. Nouns and verbs involve specific problems and will be shown in the next sections.

5.2.1. Adjectives

The paradigm of krakowski ‘Cracovian’ shown in Figure 1 is organized like paradigms traditionally given in textbooks of Polish grammar, but it contains some original solutions proposed by the authors (the system of genders, depreciativity). The paradigm itself is contained in the upper part of the table below the header (containing four elements: the headword, the grammatical qualification przymiotnik ‘adjective’, the note about its presence in SJPDor., and the symbol of its inflectional pattern, P08).

Figure 1. An adjective entry with all forms (deep paradigm).

In the last two lines of this part of the paradigm, following the lines labeled with the names of cases, we place forms that are not normally included in the paradigm: krakowsko, marked Złoż. (Pol. złożenie ‘composition’), and krakowsku, marked C.(po). The first is used as the non-final component of compounds (hundreds of thousands of examples on the Internet, especially Jura Krakowsko-Częstochowska in various cases, but with the same shape of the component krakowsko); the second is used after the preposition po (tens of thousands of examples on the Internet). It can be systematically derived from adjectives ending with -sk(i), -ck(i), -dzk(i), and also from denominal adjectives produced from proper names (e.g. Putin → putinowski → (po) putinowsku). Parallel constructions (having the same meaning) derived from other adjectives contain the regular dative form (of masculine or neuter), e.g. traktowała go nie po macierzyńsku, ale po macoszemu ‘she treated him not like a mother, but like a stepmother’ (macierzyńsku is the C.(po) of macierzyński, whose masculine dative is macierzyńskiemu; macoszemu is the regular masculine dative of macoszy). Therefore the form under discussion is marked as a special form of the dative (C.: celownik in Polish). The forms Złoż. and C.(po) are inflectionally regular; it is strange that they were not treated this way in earlier works.

The lower part of the table is devoted to regular derivatives. Both lexemes given in the sample entry, the adverb krakowsko and the nominal quality name krakowskość, have visible drawbacks. The noun is rare and the adverb is only potential, on the border of acceptability (the construction po krakowsku is normally used in the adverbial function parallel to krakowski). Such derivatives are included automatically in the list of entries in SGJP, although only some of them can be easily found on the Internet in many instances (e.g. rosyjskość, bohaterskość or mistrzowsko, barbarzyńsko; cf. rosyjski ‘Russian’, bohaterski ‘heroic’, mistrzowski ‘masterly’, barbarzyński ‘barbarous’). Possible elimination of such units should be discussed before the next release of our dictionary.

The screen presented in Figure 1 is shown when the option Wszystkie formy ‘All forms’ in the menu Odmiana ‘Inflection’ is chosen; each form is then displayed with its grammatical function assigned. This way of presentation can be called deep morphology. On the other hand, it is possible to choose Formy bazowe ‘Basic forms’ (we can call this presentation surface), when the function is neglected and only the shape is of concern.
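The derivation of the C.(po) form described above is regular enough to sketch in a few lines. The function below is a hypothetical illustration, not SGJP code: it returns the special form for adjectives ending in -ski, -cki, or -dzki and returns None for other adjectives, for which, as the text notes, the regular dative is used after po.

```python
from typing import Optional

# Endings for which the text says the special po-form is derived.
PO_FORM_ENDINGS = ("ski", "cki", "dzki")


def po_form(adjective: str) -> Optional[str]:
    """Return the C.(po) form (e.g. krakowski -> krakowsku), or None if the
    adjective does not belong to the -ski/-cki/-dzki group."""
    if adjective.endswith(PO_FORM_ENDINGS):
        return adjective[:-1] + "u"  # replace the final -i with -u
    return None


for adj in ("krakowski", "putinowski", "macierzyński", "macoszy"):
    print(adj, "->", po_form(adj))
```

For macoszy the function returns None, matching the example po macoszemu, where the ordinary masculine dative appears instead.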
Figure 2. An adjective entry with basic forms only (surface paradigm).
In the array in Figure 2, the forms are presented in one column and numbered: 1–12 and 3+. It is sufficient to have 11 differentiated shapes to express all distinctions of case, number, and gender (12 and 3+ are designed for the forms Złoż. and C.(po)). This is universal for almost all Polish adjectives (except for several so-called adjectival pronouns, which have two additional differentiations).
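The relation between the deep and the surface paradigm can be illustrated by generating every (case, number, gender) cell for krakowski and then collecting the distinct shapes. This is a simplified sketch under our own assumptions: the gender inventory is reduced to five values (m1, m2, m3, n, f), depreciativity and the forms Złoż. and C.(po) are left out, and the endings come from standard Polish adjective inflection rather than from the figures.

```python
from itertools import product

STEM = "krakowsk"
CASES = ("nom", "gen", "dat", "acc", "inst", "loc", "voc")
GENDERS = ("m1", "m2", "m3", "n", "f")  # simplified gender inventory

SINGULAR = {
    "m": {"nom": "i", "gen": "iego", "dat": "iemu", "inst": "im", "loc": "im", "voc": "i"},
    "n": {"nom": "ie", "gen": "iego", "dat": "iemu", "acc": "ie",
          "inst": "im", "loc": "im", "voc": "ie"},
    "f": {"nom": "a", "gen": "iej", "dat": "iej", "acc": "ą",
          "inst": "ą", "loc": "iej", "voc": "a"},
}
PLURAL = {"gen": "ich", "dat": "im", "inst": "imi", "loc": "ich"}


def form(case: str, number: str, gender: str) -> str:
    """Return the form of krakowski for one cell of the deep paradigm."""
    if number == "sg":
        if gender.startswith("m"):
            if case == "acc":  # accusative follows genitive for m1/m2, nominative for m3
                return STEM + ("i" if gender == "m3" else "iego")
            return STEM + SINGULAR["m"][case]
        return STEM + SINGULAR[gender][case]
    if case in ("nom", "voc"):
        return "krakowscy" if gender == "m1" else STEM + "ie"
    if case == "acc":
        return STEM + ("ich" if gender == "m1" else "ie")
    return STEM + PLURAL[case]


# Deep paradigm: one entry per combination of grammatical values.
deep = {(c, n, g): form(c, n, g) for c, n, g in product(CASES, ("sg", "pl"), GENDERS)}
# Surface paradigm: only the distinct shapes.
surface = sorted(set(deep.values()))
print(len(deep), len(surface))
```

Under these simplifications the many cells of the deep paradigm collapse into the 11 differentiated shapes mentioned in the text.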
5.2.2. Nouns Nominal entries are much simpler, because they are organized according to two main inflectional categories: case and number. As an example for Figure 3 we chose a feminine noun: kopalnia ‘mine’. It does not have the category of the depreciativity, which applies only to virile nouns (m1). However, there is a specific grammatical opposition in it. In the genitive plural two forms occur: the first is syncretic with some forms of the singular, the second one is specific, used only for this combination of grammatical values. This contrast is well known to Polish grammarians. We have introduced it as an inflectional category. It is also possible to present textual manifestations of these forms, basic forms, as in Figure 4. The basic forms are chosen on the basis of the “type of inflection”, i.e., the syncretisms occurring in the given type of the nominal pattern. We distinguish the
Figure 3. A noun entry with all forms (deep paradigm).
following types of inflection: masculine, feminine, neuter, and a special one for noninflecting nouns. Among the basic forms presented in Figure 4 there is no locative singular, because its form in the feminine pattern is always syncretic with the dative. The accusative plural is omitted in all patterns, because it is always syncretic with the nominative or the genitive, depending on the gender. Let us note that in the surface paradigm, as illustrated in Figure 4, some parts of forms are distinguished by colors: a word is divided into a stable part (the letters which occur in every form presented in the paradigm) and a changeable part specific for the form. Quite commonly these parts are not what could be called, from the linguistic point of view, a stem or an ending.
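The division into a stable and a changeable part can be sketched as follows. This is only an illustration under two assumptions: that the stable part is the longest common prefix of the forms, and that the listed forms of kopalnia are the ones shown in the paradigm. The function name `split_forms` is hypothetical, and the program itself may compute the division differently.

```python
import os

# Assumed selection of forms of kopalnia 'mine' (both genitive plural variants included).
FORMS = [
    "kopalnia", "kopalni", "kopalnię", "kopalnią", "kopalnio",
    "kopalnie", "kopalń", "kopalniom", "kopalniami", "kopalniach",
]


def split_forms(forms):
    """Split each form into the shared initial part and its form-specific rest."""
    stable = os.path.commonprefix(forms)  # character-wise common prefix
    return stable, [form[len(stable):] for form in forms]


stable, changeable = split_forms(FORMS)
print(stable)      # the part shared by all listed forms
print(changeable)  # the parts that vary from form to form
```

For this list the shared part is kopal, which is not a stem in the linguistic sense: the form kopalń forces the cut before the consonant n/ń.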
5.2.3. Verbs
Conjugation is the most complicated part of Polish inflection. The full (deep) paradigm with the explicit description of the functions of all forms, in a shape we can present on paper, would be awkward and non-illustrative. Therefore we will show several simpler illustrations. In order to have an overall look at the main verbal categories let us consider the example of the forms of the secondary predicative można ‘it is possible to’ (derived from the feminine nominative singular of the adjective możny ‘mighty’, obsolete,
Figure 4. A noun entry with basic forms only (surface paradigm).
but used also in contemporary Polish). It has only one synthetic (and basic) form; however, it can be inflected analytically for mode and tense (see Figure 5). The same scheme is used also for other verbs (we call them niewłaściwe ‘improper’) that are used in constructions without any subject-nominative. As a result they do not inflect for person, number, and gender (verb forms agree in that respect with the subject-nominative). However, some verbs of this type, such as brakować ‘not suffice’, are constructed regularly in their surface paradigms. Such a paradigm is shown in two variants in Figures 6 and 7. The full (deep) paradigm of a typical verbal lexeme defines, in a given mode and tense, all distinctions of person, number, and gender with variants for the position of movable morphemes (like (e)m or byśmy). The conjugational tables for SGJP were worked out according to methods used in the reference book on Polish conjugation (Saloni 2001). In Figure 8 we quote from this book the inflectional table for the verbs bóść and ubóść ‘hit with the horns’ (in fact, for the pattern represented by these two verbs, which are an aspectual pair). In SGJP the paradigms are derived separately for each lexeme, and hence for each verb (on the basis of its pattern and other grammatical features); as a result the
Figure 5. The verb entry showing the deep paradigm of można.
Figure 6. The verb entry showing the deep paradigm of brakować.
tables for the imperfective bóść and the perfective ubóść are created separately. An interested reader can look at the paradigms on the computer. In order to show the complexity of the conjugation presented in SGJP we present in Figure 9 only the surface variant (basic forms) of the paradigm of bóść.
Figure 7. The verb entry showing the basic forms (surface paradigm) of brakować.
Figure 8. Inflectional tables for the verbs bóść and ubóść (Table 28 from Saloni 2001, p. 80).
Figure 9. A verb entry from SGJP: surface paradigm of bóść.
The headword is given with the optional reflexive pronoun się in brackets. Immediately below that, the header contains information about the aspect of the lexeme bóść, its “właściwość” (occurring with a subject), transitivity, and its presence in SJPDor., as well as its conjugational pattern (the classification of patterns is based on Tokarski’s systematization). At the bottom are references to its aspectual counterpart and regular derivatives: nominal (odsłownik ‘gerund’) and adjectival (imiesłów przymiotnikowy czynny i bierny ‘active and passive participles’). The set of 12 basic forms is the minimal one needed to derive all forms of all verbal patterns, including derivatives. For the pattern given for bóść each basic form has two variants, and both serve to derive non-basic forms; as a result, the broad paradigm (i.e. including forms of both participles and the gerund) of the verb bóść contains 85 different synthetic forms: 8 nominal, 31 adjectival (including variants), and 46 purely verbal (including non-finite forms: bóść and bodąc). All are introduced in the full (deep) version of the paradigms, either of the verb or its derivatives.
6. The Organization of Data in SGJP
Due to the large amount of data involved, SGJP was developed using relational database tools. This approach proved useful in an earlier work, Czasownik polski (Saloni 2001; cf. Saloni and Woliński 2003, 2004). In that project information on
the inflection of over 29,000 Polish verbs was entered into a database and developed over the several years of the project. Finally, inflectional tables for verbs and the dictionary part of the work were generated from the database and typeset automatically. Within the present project the entire material of the dictionary was organized in a similar way. At an early stage Woliński developed a relational model of Polish inflection, so we were able to describe linguistic phenomena within the database framework (Woliński 2007). (A different organization of work, in our opinion less convenient, would have been to use the database merely as a means of storage and to resort to other facilities, e.g., to generate all inflected forms from the dictionary data.) As a result, our database describes all subtleties of Polish inflection within a uniform and relatively compact relational model. It is important to stress that the published version of the dictionary is only one of the possible uses of the underlying database. The data could easily be used in various systems for natural language processing. From the technical point of view, data for each grammatical class was kept in a separate MS Access file that was operated by one of the authors. The form of data used in the user interface was generated on a Linux system with Perl scripts and the SQLite tool. In the next stages of SGJP’s development we intend to build a web-based application to enable authors to cooperate more closely.
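The idea of describing inflection inside the database, rather than using it as mere storage, can be sketched as follows. The schema below is hypothetical and far simpler than the model of Woliński (2007): the table names `lexeme` and `ending`, the pattern label `F-a`, the tag strings, and the function `forms` are all assumptions made for this illustration. Inflected forms are not stored but obtained by joining a lexeme with the endings of its pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lexeme (id INTEGER PRIMARY KEY, headword TEXT, stem TEXT, pattern TEXT);
    CREATE TABLE ending (pattern TEXT, tag TEXT, suffix TEXT);
""")

# Toy data: one lexeme and three endings of an invented feminine pattern.
conn.execute("INSERT INTO lexeme VALUES (1, 'mama', 'mam', 'F-a')")
conn.executemany(
    "INSERT INTO ending VALUES (?, ?, ?)",
    [("F-a", "sg:nom", "a"), ("F-a", "sg:gen", "y"), ("F-a", "pl:gen", "")],
)


def forms(headword):
    """Derive the forms of a lexeme by joining its stem with its pattern's endings."""
    rows = conn.execute(
        """SELECT e.tag, l.stem || e.suffix
           FROM lexeme AS l JOIN ending AS e ON e.pattern = l.pattern
           WHERE l.headword = ?
           ORDER BY e.tag""",
        (headword,),
    )
    return dict(rows.fetchall())


print(forms("mama"))
```

In such a design adding a lexeme requires one row only, and a correction to a pattern immediately affects every lexeme assigned to it.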
7. The Program
We attempted to harmonize solutions intended for various users: professional linguists and laymen (having only a basic educational background) seeking immediate grammatical help. We wanted to make our dictionary user-friendly and introduced many graphic solutions (the organization of inflectional tables, distinctions, colors, etc.). The structure of typical entries was described above. Below we will consider the structure of the list of entries and the search methods.
7.1. The List of Entries
The great advantage of a computerized dictionary over a traditional one is its flexibility. In SGJP entries can be organized in various ways. In particular, the list of entries (in the left part of the window) can be displayed in several ways. First, it can be put into two orders: ordinary (a fronte) and reverse (a tergo). In both cases the headword is provided with a simplified and abbreviated qualification of the lexeme (repeated in the full form in the header of the entry). In addition, the content of the displayed list is changeable. There are three possibilities: the user may choose either the full list of entries (more exactly, their headwords), or reduce it to one of five classes of lexemes (nouns, adjectives, numerals, verbs, other). It is also possible to list all the wordforms occurring in SGJP. In any of these displays the number of units presented on the list is shown in the lower part of the screen (those numbers are given in the table above). The main classes of lexemes (together with derivatives) are displayed on backgrounds of different colors.
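The two orders of the list can be sketched in a few lines. This is a simplified illustration, not the program's code: the function names are hypothetical, and plain code-point comparison is used instead of proper Polish collation, so only headwords without diacritics are sorted here.

```python
HEADWORDS = ["oda", "mama", "kopalnia", "krakowski"]


def a_fronte(headwords):
    """Ordinary order: compare words from their first letter onwards."""
    return sorted(headwords)


def a_tergo(headwords):
    """Reverse order: compare words from their last letter backwards."""
    return sorted(headwords, key=lambda word: word[::-1])


print(a_fronte(HEADWORDS))
print(a_tergo(HEADWORDS))
```

In the a tergo order, words with the same ending stand next to each other, which is what makes such a list useful for studying inflection and derivation.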
7.2. Search
Of course, it is possible to search in the dictionary for any headword of any lexeme (found on the list of entries or typed from the keyboard). Moreover, it is possible to find a lexeme through any of its forms. For example, if we type ód into the query window (when the list of entries includes nouns), we will obtain the information on the lexeme oda ‘ode’ (its genitive plural has the shape ód). If the word typed in is homonymic, i.e., it can be interpreted as a form of several lexemes, only one of them is shown. However, we can easily find all possible interpretations. When we press the key Enter or the button Szukaj ‘search’, in the upper part of the panel of the list of entries, we will get an additional small window containing a “sublist of suggested entries”, which include the given homonymic word as one or several of their forms. For example, such a sublist for the word mam (see Figure 10) contains the lexemes mama (noun) ‘mom’, mamić (verb) ‘beguile’, mieć (verb) ‘have’; and for the word żółci, the lexemes żółć (noun) ‘bile’, żółcić (verb) ‘make yellow’, żółty (adjective) ‘yellow’. When we click on one of them we will see the chosen entry. Additionally, it is possible to reconstruct, with the help of the dictionary, the paradigm of a lexeme that does not occur on the list of entries. Methods for such an advanced search are discussed (in Polish) in the instructions to SGJP, contained in the program’s help file and in the printed booklet.
Figure 10. The result of typing mam into the search box, followed by Enter or Szukaj ‘search’. Note the sublist immediately below the search box.
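The search through arbitrary forms, including the sublist for homonymic words, can be sketched with a simple index from wordforms to lexemes. The data structure, the function name `lookup`, and the class labels are assumptions made for this illustration; the toy data repeat only the examples given above.

```python
from collections import defaultdict

# Toy data: (wordform, headword of the lexeme, class of the lexeme).
TOY_FORMS = [
    ("mam", "mama", "noun"),
    ("mam", "mamić", "verb"),
    ("mam", "mieć", "verb"),
    ("ód", "oda", "noun"),
]

# Index every wordform by all lexemes it may belong to.
index = defaultdict(list)
for wordform, headword, word_class in TOY_FORMS:
    index[wordform].append((headword, word_class))


def lookup(wordform, word_class=None):
    """Return all lexemes with the given form, optionally limited to one class."""
    hits = index.get(wordform, [])
    if word_class is not None:
        hits = [hit for hit in hits if hit[1] == word_class]
    return hits


print(lookup("mam"))          # all interpretations of a homonymic word
print(lookup("mam", "noun"))  # the list of entries reduced to nouns
print(lookup("ód"))
```

The optional `word_class` argument mirrors the reduction of the list of entries to one class of lexemes; without it the function returns the whole sublist of suggested entries.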
8. Perspectives
8.1. Planned Improvements
The dictionary may serve as a source of research in the domain of inflection and, to some extent, syntax of Polish. It may also be useful for teaching Polish, especially to foreigners. In our work on SGJP we chose the extensive method, including a great number of entries. However, it seems that extending it further in the future will be desirable, mainly with proper names. Unfortunately, the breadth contributed in some instances to a lack of depth of description. So we plan the following improvements to the data:
- enrichment of the entries with more labels, glosses, notes, etc.;
- more in-depth study of depreciative forms, non-obvious genders for nouns, and other debatable phenomena;
- systematic classification of inflectional patterns;
- introduction of information on corpus frequency of lexemes (e.g., a view of the 1000, 10,000 or 50,000 most frequent lexemes of Polish).
It is also possible to make some improvements in the interface, mainly to add:
- possibility of filtering by inflectional patterns;
- possibility of user-defined views of the list of entries (by grammatical classes, inflectional patterns, frequency, conditions on endings, arbitrary forms, etc.).
8.2. Conclusion
The first edition of SGJP has just been published. It provides an extensive grammatical description of Polish words. We believe that it is the first time such a rigorous description was applied with sufficient precision. However, we see a real possibility of many improvements, so we hope the first edition of SGJP will not be the last.
Bibliography
DOROSZEWSKI, Witold, ed. (1958–1969): Słownik języka polskiego PAN. 1–11. Warszawa: Wiedza Powszechna – PWN (abbr. SJPDor.).
GROCHOWSKI, Maciej (1997): Wyrażenia funkcyjne. Studium leksykograficzne. Kraków: IJP PAN.
GRUSZCZYŃSKI, Włodzimierz (1989): Fleksja rzeczowników pospolitych we współczesnej polszczyźnie pisanej. Wrocław: Ossolineum.
GRUSZCZYŃSKI, Włodzimierz, SALONI, Zygmunt (1978): Składnia grup liczebnikowych we współczesnym języku polskim. [In:] Roman LASKOWSKI and Zuzanna TOPOLIŃSKA (eds.): Studia gramatyczne II, Wrocław: Ossolineum, 17–42.
GRZEGORCZYKOWA, Renata, PUZYNINA, Jadwiga, eds. (1973): Indeks a tergo do Słownika języka polskiego pod redakcją Witolda Doroszewskiego. Warszawa: PWN.
LASKOWSKI, Roman (1984): Wyraz – Funkcjonalna klasyfikacja leksemów – Podstawowe pojęcia fleksji – Kategorie morfologiczne języka polskiego – fleksja funkcjonalna. [In:] Renata GRZEGORCZYKOWA, Roman LASKOWSKI, Henryk WRÓBEL (eds.): Gramatyka współczesnego języka polskiego. Morfologia, Warszawa: PWN (2nd ed. 1998), 33–65 and 125–224.
MAŃCZAK, Witold (1956): Ile rodzajów jest w polskim? Język Polski, XXXVI, 116–121.
MEL’ČUK, Igor A. (1974): Опыт теории лингвистических моделей «Смысл ⇔ Текст». Семантика, синтаксис. Москва: Наука.
SALONI, Zygmunt (1974): Klasyfikacja gramatyczna leksemów polskich. Język Polski, LIV, 3–13 and 93–101.
SALONI, Zygmunt (1976a): Cechy składniowe polskiego czasownika. Wrocław: Ossolineum.
SALONI, Zygmunt (1976b): Kategoria rodzaju we współczesnym języku polskim. [In:] Roman LASKOWSKI (ed.): Kategorie gramatyczne grup imiennych. Materiały konferencji, Wrocław: Ossolineum, 43–78 and 96–106.
SALONI, Zygmunt (1977): Kategorie gramatyczne liczebników we współczesnym języku polskim. [In:] Roman LASKOWSKI and Zuzanna TOPOLIŃSKA (eds.): Studia gramatyczne [I], Wrocław: Ossolineum, 145–173.
SALONI, Zygmunt (1979): [Rev. of:] ZALIZNJAK (1977). International Review of Slavic Linguistics 4, 241–250.
SALONI, Zygmunt (1981): Uwagi o opisie fleksyjnym tzw. zaimków rzeczownych. [In:] Acta Universitatis Lodziensis. Folia Linguistica II, Łódź: Wydawnictwo Uniwersytetu Łódzkiego, 143–153.
SALONI, Zygmunt (1988): O tzw. formach nieosobowych rzeczowników męskoosobowych we współczesnej polszczyźnie. Bulletin de la Société polonaise de linguistique XLI, 155–166.
SALONI, Zygmunt (1992b): Rygorystyczny opis polskiej deklinacji przymiotnikowej. Zeszyty Naukowe Wydziału Humanistycznego Uniwersytetu Gdańskiego. Prace Językoznawcze 16, 215–228.
SALONI, Zygmunt (2001): Czasownik polski. Warszawa: Wiedza Powszechna; 3rd rev. ed. 2007.
SALONI, Zygmunt, ed. (1987): Studia z polskiej leksykografii współczesnej, Tom II. Białystok: Dział Wydawnictw Filii Uniwersytetu Warszawskiego.
SALONI, Zygmunt, ed. (1988): Studia z polskiej leksykografii współczesnej. Wrocław: Ossolineum.
SALONI, Zygmunt, ed. (1989): Studia z polskiej leksykografii współczesnej, Tom III. Białystok: Dział Wydawnictw Filii Uniwersytetu Warszawskiego.
SALONI, Zygmunt, GRUSZCZYŃSKI, Włodzimierz, WOLIŃSKI, Marcin, WOŁOSZ, Robert (2007): Słownik gramatyczny języka polskiego. Warszawa: Wiedza Powszechna (abbr. SGJP).
SALONI, Zygmunt, ŚWIDZIŃSKI, Marek (1981): Składnia współczesnego języka polskiego. Warszawa: Wydawnictwa Uniwersytetu Warszawskiego; 5th ed. 2007, Warszawa: PWN.
SALONI, Zygmunt, WOLIŃSKI, Marcin (2003): A Computerized Description of Polish Conjugation. [In:] Peter KOSTA et al. (eds.): Investigations into Formal Slavic Linguistics. Proceedings of the 4th European Conference on Formal Description of Slavic Languages in Potsdam, Part I, Frankfurt am Main: Peter Lang, 373–384.
SALONI, Zygmunt, WOLIŃSKI, Marcin (2004): […] „Czasownik polski”. Bulletin de la Société polonaise de linguistique LX, 145–156.
SGJP, [see:] SALONI, Zygmunt, GRUSZCZYŃSKI, Włodzimierz, WOLIŃSKI, Marcin, WOŁOSZ, Robert (2007): Słownik gramatyczny języka polskiego.
SJPDor., [see:] Witold DOROSZEWSKI (1958–1969): Słownik języka polskiego PAN.
ŚWIDZIŃSKI, Marek (1992): Gramatyka formalna języka polskiego. Warszawa: Wydawnictwa UW.
TOKARSKI, Jan (1951): Czasowniki polskie. Warszawa: Wydawnictwo S. Arcta.
TOKARSKI, Jan (1958): Formy fleksyjne. [In:] DOROSZEWSKI (1958–1969), vol. 1, XLIX–LXXIV.
TOKARSKI, Jan (1969): Perspektywy Słownika. Poradnik Językowy 7, 385–394.
TOKARSKI, Jan (1973): Fleksja polska. Warszawa: PWN.
TOKARSKI, Jan (1993): Schematyczny indeks a tergo polskich form wyrazowych, ed. Z. SALONI. Warszawa: PWN; 2nd ed. 2002.
WOLIŃSKI, Marcin (2006): Morfeusz: A Practical Tool for the Morphological Analysis of Polish. [In:] Mieczysław A. KŁOPOTEK, Sławomir T. WIERZCHOŃ, and Krzysztof TROJANOWSKI (eds.): Intelligent Information Processing and Web Mining, IIS:IIPWM’06 Proceedings, Stuttgart: Springer, 503–512.
WOLIŃSKI, Marcin (2007): A Relational Model of Polish Inflection. [In:] Zygmunt VETULANI (ed.): Proceedings of the 3rd Language & Technology Conference, Poznań, 59–63.
WOŁOSZ, Robert (2005): Efektywna metoda analizy i syntezy morfologicznej w języku polskim. Warszawa: EXIT.
ZALIZNJAK, Andrej A. (1967): Русское именное словоизменение. Москва: Наука.
ZALIZNJAK, Andrej A. (1977): Грамматический словарь русского языка. Москва: Русский язык; 4th rev. ed. 2003, Москва: Русские словари.