A finite state approach to Abkhaz morphology and stress (Draft, 2010-03-04)
Paul Meurer Uni Digital, Bergen
Abstract. The West Caucasian language Abkhaz is characterized by a rich but rather regular agglutinative morphology. Word stress, however, is free and dynamic and difficult to predict. A theory of stress in Abkhaz has been developed by V. Dybo, A. Spruit and L. Trigo which predicts word stress correctly in the majority of cases. Although stress is not orthographically marked, its position determines the surface representation of Schwa. Thus, in a morphological analyser for the language, stress rules have to be incorporated in order to be able to properly parse and generate orthographic forms. I show how a finite state morphological analyser for Abkhaz can be built that uses the rules developed by Trigo et al.

Key words: Abkhaz, finite state morphology, word stress.
1 The Abkhaz language
Abkhaz is a West Caucasian language genetically related to Abaza, Ubych, Adygean and Circassian (Kabardinian). The number of speakers is estimated at roughly 100,000, although this figure is disputed. Abkhaz has a Cyrillic-based alphabet with many additional letters unique to Abkhaz. Phonetically, Abkhaz is characterized by a large number of consonant phonemes (between 56 and 65, depending on the dialect), whereas there are few vowel phonemes. Only a and ә (Schwa) are phonemic, whereas e, i, o and u can appear as phonetic realizations of a and ә in certain contexts and in loan words.

The nominal morphology is rather simple; there are no cases, but the language exhibits noun-noun and noun-adjective compounding. The verbal morphology can be characterized as agglutinative and polysynthetic, with a huge number of verbal prefixes and suffixes. There is, however, little suppletion, and there are few irregularities and few phonological processes. Abkhaz has free and dynamic word stress which is not coded in the orthography.
2 A morphological analyzer for Abkhaz
The aim of this project is to build a morphological analyzer for Abkhaz, as a first step towards a computational syntax for the language in the Lexical Functional Grammar (LFG) framework.
The morphological analyser is being developed as a finite state transducer using the Xerox finite state tool (fst, [1]). Input to the analyser is an orthographic form, with or without stress, whereas the output is a dictionary entry form (the masdar, or the determinate noun or adjective form, respectively) plus morphosyntactic features. For ambiguous word forms more than one output is generated. The analyser is reversible: when a dictionary entry form together with a set of morphological features is input, the transducer generates all possible word forms that correspond to the given analysis. An example analysis is given in (1).¹

(1) sәpoit. ‘I am jumping’
    сыҧоит ↔ +Subj1Sg-á-ҧа-ра+V+Dynamic+Present+Finite

One of the main challenges in constructing a morphological analyser for Abkhaz is the proper determination of word stress and Schwa realization. I will return to this topic in Section 3, but first I will give a short, simplified outline of the overall architecture of the analyser.

Architecture of the analyser

It is conceptually easiest to imagine the transducer as a set of sub-transducers that operate on a phonemic representation of a given word form, where each syllable is of the form CV (see (6)) and the syllables are marked for accent. There is one transducer, the up-transducer, which transforms this representation into the dictionary entry form including the features, and another transducer, the down-transducer, which applies stress rules and morphophonemic changes to the phonemic representation in the middle to generate the orthographic form. Since an ASCII representation is used internally to represent Abkhaz phonemes, there is optionally a ‘cyrillifying’ transducer at both ends that transforms this representation into the Cyrillic Abkhaz script. If orthographic forms are to be analysed, diacritic stress marks are dropped by an additional transducer.
Thus, the full analyser can be envisaged as the composition (with ‘.o.’ as the composition operator) of a set of sub-transducers:

(2) define abkhazAnalyser
            cyrillifyUp
        .o. stressRulesUp
        .o. phonemicForms
        .o. stressRules
        .o. morphophonemicChanges
        .o. cyrillify
        .o. dropDiacritics ;
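The cascade in (2) can be pictured as relational composition: every output of one stage is fed to the next. The following Python sketch only illustrates that idea. The two toy stages and their names are hypothetical stand-ins, not the analyser's actual sub-transducers, and Schwa is written as y as in the fst code.

```python
from typing import Callable, Set

# A stage maps one string to a set of strings, mimicking a transducer
# that may yield zero, one or several outputs for a given input.
Stage = Callable[[str], Set[str]]


def compose(*stages: Stage) -> Stage:
    """Chain stages left to right, feeding every output of one stage to the next."""
    def run(form: str) -> Set[str]:
        current = {form}
        for stage in stages:
            current = {out for item in current for out in stage(item)}
        return current
    return run


# Hypothetical toy stages (assumptions made for this illustration only).
def drop_unaccented_schwa(form: str) -> Set[str]:
    # Unaccented Schwa is the plain letter y; accented Schwa is U+00FD.
    return {form.replace("y", "")}


def drop_diacritics(form: str) -> Set[str]:
    # Remove the stress marks: U+00E1 -> a, U+00FD -> y.
    return {form.replace("\u00e1", "a").replace("\u00fd", "y")}


toy_generator = compose(drop_unaccented_schwa, drop_diacritics)
print(toy_generator("at\u00fdcyra"))
```

Modelling a stage as a function from a string to a set of strings keeps the sketch close to the behaviour described above, where ambiguous inputs yield more than one output.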
The middle transducer labeled ‘phonemicForms’ itself is a disjunction of transducers for different parts of speech:

¹ Note that the order of the features reflects roughly the order of the corresponding affixes in the verb; some occur before the stem and some after. It would seem more reasonable to list all features after the stem; this, however, would blow up the transducer tremendously.
(3) define phonemicForms
        Verb | Noun | Adj ;

The verb transducer ‘Verb’ is built up as a sequence of transducers for the morphology slots that make up an Abkhaz verb form.

(4) define Verb
        ColumnI
        ( ConjunctionalInterrogativePrefix )
        ( Markers )
        ( CauseePrefix )
        verbStem
        {-rá}:0
        ( Extension )
        SuffixedMarker
        [ DynamicSuffixGroup | StativeSuffixGroupFinite ] ;
Transducers in parentheses are optional. Note specifically the line ‘{-rá}:0’, which introduces the masdar marker -ra on the upper side, whereas the lower side is empty (‘0’). The transducer for the column I prefixes looks like this:

(5) define ColumnI [
        0:{sy} "@U.IPers.1@" "@U.INum.Sg@"
          [ "@U.ColumnIII.+@" "+Obj1Sg":0 | "@U.ColumnIII.-@" "+Subj1Sg":0 ]
        ...
      | 0:{dy} "@U.IPers.3@" "@U.INum.Sg@" "@U.IGen.Hum@"
          [ "@U.ColumnIII.+@" "+Obj3SgHum":0 | "@U.ColumnIII.-@" "+Subj3SgHum":0 ]
        ...
      ] ;
At this level, concrete affixes are introduced for the first time. The column I transducer is a disjunction of transducers for the different column I affixes. The first one, for example, represents the 1st singular affix s-, with its canonical CV representation as ‘sy’ on the lower side and an empty upper side. In addition, flag diacritics are emitted, in a peculiar format (e.g. ‘"@U.IPers.1@"’). These flag diacritics are a very important device in fst: they can be set at one position in the transducer and tested for in another position, which makes the implementation of long-distance dependencies easy and convenient. For instance, the flag diacritic ‘"@U.ColumnIII.+@"’ tests for the existence of a column III marker in the verb, and labels the column I affix as Object or Subject accordingly. Moreover, the flag diacritics can be translated into grammatical features on the upper side. They are removed later on, so they leave no trace in the final purely finite state transducer.

The ‘verbStem’ transducer is at the heart of the analyser. It represents the verb lexicon and is a disjunction of transducers for each verb lexicon entry, coded as a combination of prefix, root, and a set of markers that code transitivity and argument structure of the verb.

The transducers for nouns and adjectives are built up similarly. The analyser as it is presently implemented encodes the Abkhaz verbal and nominal morphology rather completely, but there are only a few dozen encoded verb and noun entries.
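The set-and-test behaviour of unification flags such as ‘"@U.ColumnIII.+@"’ can be sketched in a few lines of Python. This is a simplified model of U-type flags only: a path is accepted when every feature is given a single value along the path. The function name and the example paths are hypothetical; the real filtering is done by fst itself.

```python
import re

# Matches U-type flag diacritics of the form @U.Feature.Value@.
FLAG = re.compile(r"@U\.([^.@]+)\.([^.@]+)@")


def flags_consistent(path):
    """Return True if all U-type flags on the path unify.

    The first occurrence of a feature sets its value; any later
    occurrence must carry the same value, otherwise the path is rejected.
    """
    seen = {}
    for symbol in path:
        match = FLAG.fullmatch(symbol)
        if match is None:
            continue  # ordinary symbol, not a flag diacritic
        feature, value = match.groups()
        if seen.setdefault(feature, value) != value:
            return False
    return True


# Hypothetical paths: the column I prefix announces "no column III marker",
# and a later slot either agrees or disagrees with that announcement.
good = ["sy", "@U.IPers.1@", "@U.ColumnIII.-@", "pa", "@U.ColumnIII.-@"]
bad = ["sy", "@U.IPers.1@", "@U.ColumnIII.-@", "pa", "@U.ColumnIII.+@"]
print(flags_consistent(good), flags_consistent(bad))
```

The point of the device is visible in the second path: the conflict between the two ColumnIII values is detected although the two flags are separated by other material.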
3 Word stress: Rules and implementation
Although word stress is not marked in modern Abkhaz orthography,² it is important for parsing or generating orthographic forms, because main and secondary stress (accent) determine the phonetic and orthographic surface realization of Schwa.

At first glance, no regularity is apparent in stress positioning. This is also the state of knowledge in much of the literature on Abkhaz. The World Atlas of Language Structures [2], for example, states that Abkhaz has “no predictability in stress location whatsoever”. K. Lomtatidze, the most distinguished Georgian linguist working on Abkhaz, makes similar comments. T. Gvanceladze states in his recent Abkhaz grammar [3] that stress position is free and notes that “in one group of words, the addition of certain affixes changes the stress position, whereas the addition of the same affixes to another group of words does not alter stress position. In such cases it is difficult to establish rules”.

3.1 Dybo and Spruit: The basic rule
The Russian scholar Valerij Dybo was the first to discover the fundamental rule governing stress positioning in Abkhaz ([4], 1977). His findings are now known as Dybo’s rule. Dybo formulated his stress rule in terms of morphemes. In his doctoral thesis ([6], 1986), Arie Spruit builds on Dybo’s work and formulates Dybo’s rule properly in terms of syllables rather than morphemes. He states first that:

(6)
– At an underlying level, a syllable has the structure CV, where V = a or ә, and C can be any consonant.
– Syllables of Abkhaz roots and affixes can be characterized as either dominant (∗) or recessive (-).³

On this basis he can formulate Dybo’s rule:

(7) Basic rule for stress assignment (Dybo’s rule)
– The stress falls on the first accented syllable that is not immediately followed by an accented syllable.
– Other such accented syllables bear secondary stress.

In addition, he formulates rules that govern the surface realization of Schwa:

(8) Schwa deletion rule
Unstressed Schwa is dropped, with the following exceptions:
– Schwa with secondary stress is kept
– clusters consisting of three or more consonants are broken up by Schwa insertion

² It was, however, marked in the first Cyrillic-based orthography in use from 1862 to 1926 and in the Latin-based orthography that was used in the years 1926–28.
³ In the following, I will use the more common terms accented (∗) and unaccented (-).
The examples in (9) and (10) illustrate the application of Dybo’s rule and Schwa deletion. In these examples, a- is the generic article, -k. is the indefinite marker, -ra is the masdar ending, and all other syllables are stem elements. The first row shows the accent status of the syllables in the second row, the third row displays stress after application of Dybo’s rule (7). The last row shows the surface form after Schwa deletion (8).

(9)
                    ‘dog’        ‘eye’        ‘a cat’
    accent          ∗   ∗        ∗   -        ∗    ∗
    underlying      a-  la       a-  la       cә   g◦ә   -k.
    Dybo’s rule     a-  lá       á-  la       cә   g◦ә́   -k.
    Schwa deletion                            c    g◦ә́   -k.

(10)
                    ‘to keep in’          ‘to go out of’        ‘to go next to’
    accent          ∗   ∗   ∗    ∗        ∗   ∗    -    ∗       ∗   -   -   ∗
    underlying      a-  ta  k.ә  -ra      a-  tә   c.ә  -ra     a-  va  la  -ra
    Dybo’s rule     a-  ta  k.ә  -rá      a-  tә́   c.ә  -ra     á-  va  la  -ra
    Schwa deletion  a-  ta  k.   -rá      a-  tә́   c.   -ra
In the regular expression language of fst, rules (7) and (8) can be formulated as follows:

(11) define simpleDybo
             [ á -> a, ý -> y || _ \[a|y]* [á|ý] ]    # 1.
         .o. [ y -> 0 ]                               # 2.
         .o. [ á -> a, ý -> y || [á|ý] ?* _ ] ;       # 3.
I will not give an introduction to fst’s regular expression language here (see [1] for a thorough introduction), but rather explain the code in (11) in informal terms. (11) defines a transducer named ‘simpleDybo’ that accepts an underlying form in which accented syllables are marked with an accent on their vowels.⁴ The first line in (11) is a context rule which replaces an accented vowel by its unaccented correspondent if it is immediately followed by an accented syllable. The part to the left of ‘||’ (i.e., ‘á -> a, ý -> y’) defines the replacements, whereas the right part defines the context in which the replacements can take place. The underscore ‘_’ marks the position of the character to be replaced (i.e., ‘á’ or ‘ý’), which may be followed by an arbitrary number of occurrences of characters other than ‘a’ or ‘y’ (i.e., ‘\[a|y]*’, where ‘\’ denotes complement and ‘*’ is the Kleene star), followed by an accented vowel (i.e., ‘[á|ý]’). If the context is not appropriate, no replacement is done, and the output of the rule is equal to its input.

⁴ For technical reasons, Schwa is coded as y.
The output of the first line is fed to the second line, which deletes all unaccented Schwas (‘y -> 0’). The context rule in the last line deletes accents on syllables that are preceded by an accented character. Example (12) illustrates the application of the successive transformations in (11) in the derivation of the surface form ‘аҭыҵра’ a-tә́c.-ra ‘to go out of’.

(12)
                                                ∗    ∗    -    ∗
    0. construct accented input string          á-   tә́   c.ә  -rá
    1. drop accents followed by an accent       a-   tә́   c.ә  -rá
    2. drop unaccented Schwa                    a-   tә́   c.   -rá
    3. keep first accent only                   a-   tә́   c.   -ra
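The three steps of (11) can also be traced outside fst. The following Python sketch mirrors them on plain strings, using y for Schwa and á/ý for accented vowels as in the fst code. It is an illustration under simplifying assumptions (single characters stand in for fst symbols, and the consonant c. is written as c), not a replacement for the transducer.

```python
import re
import unicodedata

A_ACC, Y_ACC = "\u00e1", "\u00fd"      # accented a and accented Schwa (y)
PLAIN = {A_ACC: "a", Y_ACC: "y"}       # accented vowel -> unaccented vowel
ACCENTED = A_ACC + Y_ACC


def simple_dybo(form: str) -> str:
    """Apply the three steps of rule (11) to an accent-marked underlying form."""
    form = unicodedata.normalize("NFC", form)
    # 1. Drop an accent if the next vowel is accented as well. The lookahead
    #    inspects the input string, so all replacements happen in parallel.
    step1 = re.sub(
        f"[{ACCENTED}](?=[^ay]*[{ACCENTED}])",
        lambda m: PLAIN[m.group()],
        form,
    )
    # 2. Delete every unaccented Schwa.
    step2 = step1.replace("y", "")
    # 3. Keep only the first remaining accent.
    out, seen_accent = [], False
    for ch in step2:
        if ch in PLAIN and seen_accent:
            out.append(PLAIN[ch])
        else:
            seen_accent = seen_accent or ch in PLAIN
            out.append(ch)
    return "".join(out)


# Derivation (12): accents on a-, the first stem syllable and -ra.
print(simple_dybo("\u00e1t\u00fdcyr\u00e1"))
```

Like (11), the sketch ignores the exceptions listed in (8): it neither keeps Schwa with secondary stress nor inserts Schwa into consonant clusters.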
Determining the accent status of roots and affixes

The accent status of a given root or affix syllable can be determined by a recursive application of the following diagnostic procedure, applied to word forms with known stress position:

(13)
– Identify the status of basic grammatical affixes (e.g. those used in dictionary entry forms).
– Identify the status of monosyllabic roots.
– Deduce the status of syllables in more complex words.
– The status of some syllables cannot be determined in principle; they can be assigned an arbitrary status.
Often, it is sufficient to examine dictionary forms, as in (14) and (16). Most dictionaries of Abkhaz, e.g. the large Abkhaz-Russian Dictionary [5], have stressed lemma forms. In other cases, inflected forms whose stress pattern is less readily available – most dictionaries do not mark stress in example phrases – have to be analysed.

(14) ‘to see’    абара  a-ba-rá
     ‘to write’  аҩра   á-y◦-ra

The two word forms in (14) differ only in their root syllable. Since the word forms are stressed differently, the accent status of the two roots must be different. It is easy to see that the only possible accentuations for the syllables in (14) are:

(15) á- generic article, -bá- ‘see’, -rá infinitive suffix (all accented), -y◦ә- ‘write’ (unaccented)

Knowing that the generic article á- is accented, we readily get for example:

(16) ‘horse’       аҽы     a-čә́      ⇒ -čә́ ‘horse’ (accented)
(17) ‘grey horse’  аҽыхәа  a-čә́-x◦a  ⇒ -x◦a ‘grey’ (unaccented)
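The deductions in (16) and (17) can be checked mechanically. The sketch below states Dybo’s rule (7) over a list of accent values and then enumerates the values of the unknown syllables that reproduce an observed stress position. Function names and the encoding (True for accented, False for unaccented, None for unknown) are assumptions made for this illustration.

```python
from itertools import product


def dybo_stress(accents):
    """Index of the first accented syllable not immediately followed by an
    accented one, or None if no syllable is accented."""
    for i, accented in enumerate(accents):
        next_accented = accents[i + 1] if i + 1 < len(accents) else False
        if accented and not next_accented:
            return i
    return None


def consistent_statuses(known, observed_stress):
    """All ways of filling the unknown (None) statuses such that Dybo's rule
    predicts the observed stress position."""
    unknown = [i for i, status in enumerate(known) if status is None]
    solutions = []
    for values in product([True, False], repeat=len(unknown)):
        trial = list(known)
        for i, value in zip(unknown, values):
            trial[i] = value
        if dybo_stress(trial) == observed_stress:
            solutions.append(tuple(trial))
    return solutions


# (16): article accented, root unknown, stress observed on the root.
print(consistent_statuses([True, None], 1))
# (17): article and 'horse' accented, 'grey' unknown, stress on 'horse'.
print(consistent_statuses([True, True, None], 1))
```

For a form in which no syllable is accented, `dybo_stress` returns None, which corresponds to the case where the rule is simply not applicable.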
Problems with Spruit’s analysis

Even though Dybo’s rule in many cases correctly predicts the stress position in an inflected word form, it has two main problems: Dybo’s rule is not applicable if all syllables are unaccented, and it often makes false predictions, as in (18).

(18) сыҧама sә́pama ‘did I jump?’
     - -
     sә́ pa ma

     дысмыхәеи dәsmә́x◦ei ‘didn’t he help me?’
     - ∗ -
     dә s mә́ x◦e i
Spruit does not offer rules or (correct) explanations for these cases; he simply notes the deviating stress patterns and roughly states in which contexts they occur. This is unfortunate both from a theoretical standpoint and when trying to implement stress assignment. The problem cases where Spruit’s version of Dybo’s rule makes wrong predictions are numerous; they include one-syllable stems, noun-noun compounds, the negation prefix, the causative, the relative/reciprocal marker, and directional preverbs.

3.2 Trigo’s new insights
In 1992, Loren Trigo published an article ([7]) where she accounts for most of the problem cases that Dybo’s rule is not able to handle correctly. She works in the theoretical framework of Lexical phonology (Kiparsky) and Metrical phonology (Halle, Vergnaud). Here, I am not concerned with the theoretical motivations for her results; rather, I am interested in their practical application. Therefore, I will not refer specifically to those theories in my presentation of her new insights, which include:

– Reformulation of Dybo’s rule as three steps: Default accentuation, Accent deletion and Word stress marking
– A Default accentuation rule to deal with unaccented-only forms
– Recognition of a Strong morpheme boundary between the root and suffixes, and between compound elements, that affects Default accentuation
– Recognition of Extrametrical stems invisible to Default accentuation
– Interpretation of negation and causative as Infixation

Strong morpheme boundary and Default accentuation

Trigo states that there exists a Strong morpheme boundary (marked by ‘=’) between the verbal root and following suffixes, and between compound elements in compounds of type N=N and N=Adj. This boundary is motivated by similar observations in Abaza, where a monophthongization rule applies across the Strong morpheme boundary, but not across regular affix boundaries or morpheme-internally (see Allen, [8]). Accent rules and other phonological processes are sensitive to ‘=’, among them the Default accentuation rule.
(19) Default accentuation
If no syllable before ‘=’ is accented, the accent should be placed on the last syllable.

Default accentuation is implemented by the following fst context rule:

(20) define defaultAccentuation
         [ a -> á, y -> ý || .#. \[á|ý|"="]* _ \[Vowel|"="]* [.#.|"="] ] ;

The input for the rule is an underlying form with Strong morpheme boundary marked by ‘=’. The rule adds an accent to an unaccented vowel if there is no accented vowel to the left (‘.#.’ marks the word boundary), and if there is no vowel to the right preceding ‘=’ or the right word boundary (‘.#.’). After application of the Default accentuation rule, Dybo’s rule is applied to yield the surface form, as exemplified in (21). The accent status of syllables that received their accent from Default accentuation is marked by ‘(∗)’.

(21) цҳажәк chá-ž◦-k. ‘an old bridge’

                                  -    =
    Default accentuation         (∗)   =
    Dybo’s rule (void)      cә   há    =   ž◦ә   k.ә
    Schwa deletion          c    há    =   ž◦    k.
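The effect of rule (20) can be sketched in Python as follows. The sketch works on plain strings with y for Schwa, treats everything before the first ‘=’ (or the whole word if there is no ‘=’) as the domain of the rule, and accents the last vowel of that domain when the domain contains no accent. The ASCII spelling of the example is a hypothetical simplification of the form in (21).

```python
import unicodedata

ACCENTED = {"a": "\u00e1", "y": "\u00fd"}   # a -> á, y -> ý (Schwa coded as y)


def default_accentuation(form: str) -> str:
    """Accent the last vowel before the first '=' if nothing before it is accented."""
    form = unicodedata.normalize("NFC", form)
    domain, boundary, rest = form.partition("=")
    if any(ch in ACCENTED.values() for ch in domain):
        return form                          # an accent is already present
    for i in range(len(domain) - 1, -1, -1):
        if domain[i] in ACCENTED:            # last unaccented vowel in the domain
            return domain[:i] + ACCENTED[domain[i]] + domain[i + 1:] + boundary + rest
    return form                              # no vowel in the domain


# Simplified spelling of the underlying form in (21): cy-ha = zy-ky.
print(default_accentuation("cyha=zyky"))
```

Using `str.partition` restricts the rule to the material before the first Strong boundary, which corresponds to the left context of (20) excluding ‘=’.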
Extrametricality

A verbal stem in Abkhaz consists of an optional preverb, an optional column III affix, and the verbal root. Verbal stems will be marked as ‘⟨...⟩’. Trigo states that unaccented monosyllabic stems are extrametrical, which makes them invisible to Default accentuation. Thus, a modified version of the Default accentuation rule can be stated as:

(22) Default accentuation with Extrametricality
If no syllable before ‘=’ is accented, the accent should be placed on the last metrical syllable.

Before stating the fst rule for Extrametrical default accentuation, I first define Extrametrical stems:⁵

(23) define extrametricalStem
         ("[" ?* "]") \[Vowel]* [a|y] \[Vowel]* ;

The context rule for Extrametrical default accentuation makes use of the previous definition:

(24) define extrametricalDefaultAccentuation
         [ a -> á, y -> ý || .#. \[á|ý|"="]* _ \[Vowel]* "<" extrametricalStem ">" "=" ] ;

⁵ The optional part ‘("[" ?* "]")’ denotes infixed affixes that are disregarded in Extrametricality (see the section on infixation).
This rule differs from (19) in that the existence of an extrametrical stem before the strong boundary is required. Applicability of Extrametricality is exemplified in (25). All syllables of both verb forms are unaccented. The stem of the first verbal root is extrametrical: it consists of the root only, thus it is monosyllabic, and it is unaccented. Hence, Extrametricality can be applied, and Extrametrical default accentuation places an accent on the syllable preceding the stem. The stem of the second root, on the other hand, contains a preverb. Thus, Extrametricality cannot be applied, and Default accentuation places the accent on the syllable next to the Strong boundary.⁶

(25) сыҧама sә́pama ‘did I jump?’

     1sg.I    √          Q
     (∗)
     sә́     ⟨ {pa} ⟩  =  ma

     дҟалама dqaláma ‘did he come into being?’

     3sgh.I    PV    √         Q
               -     (∗)
     d       ⟨ qa    lá ⟩   =  ma
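The contrast in (25) can be reproduced with a small Python sketch of the Extrametrical default accentuation rule. It assumes that stems are delimited by ‘<’ and ‘>’ in the ASCII representation, that Vowel covers a, y and their accented counterparts, and it leaves out the optional infix part of the stem definition. The function name and the simplified spellings are hypothetical.

```python
import re
import unicodedata

ACC = "\u00e1\u00fd"                        # á, ý
VOWEL = "ay" + ACC                          # assumed definition of Vowel
ACCENTED = {"a": "\u00e1", "y": "\u00fd"}

# Accent-free prefix, the vowel to be accented, then an unaccented
# monosyllabic stem in <...> directly before the Strong boundary '='.
PATTERN = re.compile(
    rf"^([^{ACC}=]*)"
    rf"([ay])"
    rf"([^{VOWEL}]*<[^{VOWEL}]*[ay][^{VOWEL}]*>=)"
)


def extrametrical_default_accentuation(form: str) -> str:
    """Accent the syllable before an extrametrical stem; otherwise return the form unchanged."""
    form = unicodedata.normalize("NFC", form)
    match = PATTERN.match(form)
    if match is None:
        return form
    left, vowel, right = match.groups()
    return left + ACCENTED[vowel] + right + form[match.end():]


print(extrametrical_default_accentuation("sy<pa>=ma"))   # monosyllabic stem
print(extrametrical_default_accentuation("d<qala>=ma"))  # stem with preverb
```

In the second form the stem has two syllables, so the pattern does not match and the form is left for ordinary Default accentuation.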
Infixation

Trigo argues that the negative marker -m-, the causative marker -r- and the causer prefix are infixes in Abkhaz (marked as ‘[ ... ]’). Infixation happens after default accentuation and triggers reapplication of default accentuation starting at the infix. This enables her to account for the stress pattern in the negated verb forms in (26) and the negated causative in (27).

(26) смыҧеи smә́pej ‘didn’t I jump?’

                     1sg.I     Neg      √          Q
                     (∗)
     Default acc.    s       ⟨          {pa} ⟩  =  i
     Infixation      s       ⟨ [mә]     {pa} ⟩  =  i
                     (∗)       (∗)
     Default acc.    s       ⟨ [mә́]     {pe} ⟩  =  i

     имыҧада imә́pada ‘who didn’t jump?’

                     rel.I     Neg      √          Q
                     ∗
     Default acc.    i       ⟨          {pa} ⟩  =  da
     Infixation      i       ⟨ [mә]     {pa} ⟩  =  da
                     ∗         (∗)
     Default acc.    i       ⟨ [mә́]     {pa} ⟩  =  da

⁶ The abbreviations in this and the following examples mean: 1sg.I: column I 1st person singular prefix; 3sgh.I: column I 3rd person singular human prefix; rel.I: column I relative prefix; PV: preverb; 1sg.C: 1st person singular causer prefix; Caus: causative marker; Neg: negative marker; √: verbal root; Q: interrogative suffix.
(27) дкасмырҧама dk.asmә́rpama ‘didn’t I cause him to jump down?’

                     3sgh.I    PV     1sg.C    Neg     Caus      √         Q
                               ∗
     Default acc.    d       ⟨ k.a                               pa ⟩   =  ma
                                      (∗)
     Default acc.                   [ sә               {rә} ]
                                      (∗)
     Infixation                     [ sә      [mә]     {rә} ]
                                      (∗)      (∗)
     Default acc.                   [ sә      [mә]     {rә} ]
                               ∗      (∗)      (∗)
     Infixation      d       ⟨ k.a  [ s        mә́      r ]       pa ⟩   =  ma
In the implementation of Default accentuation in presence of a negative infix, a simple Negative extrametrical stem is defined as the stem part following the negative infix:

(28) define negExtrametricalStem
         \[ Vowel ]* [ a | y ] \[ Vowel ]* ;

In the main rule, we can then specifically refer to the negative infix, since other infixes are not available in this situation:

(29) define negInfixDefaultAccentuation
         [ y -> ý || "[" m _ "]" negExtrametricalStem ">" "=" ] ;
In Default accentuation with causative infix, there are two possibilities for infixation. In the first, only the combination causer prefix + causative marker is infixed. In this case the causer prefix is accented by Default accentuation, since the causative marker is unaccented and extrametrical in the context of the infix. In the second, the negative marker is inserted in addition, and application of Default accentuation puts an additional accent on the negative marker. In both cases, the effect is that all vowels in the infix preceding the causative marker are accented. If we insert the causative marker -r- without vowel, the implementation of the rule for Default accentuation with causative infix is surprisingly simple: it unconditionally makes all vowels in infix position accented.⁷

(30) define causInfixDefaultAccentuation
         [ a -> á, y -> ý || "[" ?* _ ?+ "]" ] ;
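A character-level Python sketch of rule (30) shows why inserting -r- without a vowel makes the rule so simple. A vowel is accented if some ‘[’ occurs to its left and some ‘]’ occurs at least two positions to its right, which is what the contexts ‘"[" ?*’ and ‘?+ "]"’ require. Treating single characters as symbols and the simplified spelling of the form in (27) are assumptions of this illustration.

```python
import unicodedata

ACCENTED = {"a": "\u00e1", "y": "\u00fd"}   # a -> á, y -> ý (Schwa coded as y)


def caus_infix_default_accentuation(form: str) -> str:
    """Accent every unaccented vowel that has '[' somewhere to its left and
    ']' to its right with at least one symbol in between."""
    form = unicodedata.normalize("NFC", form)
    out = []
    for i, ch in enumerate(form):
        opened = "[" in form[:i]             # left context: "[" ?*
        closed_later = "]" in form[i + 2:]   # right context: ?+ "]"
        if ch in ACCENTED and opened and closed_later:
            out.append(ACCENTED[ch])
        else:
            out.append(ch)
    return "".join(out)


# Simplified spelling of the last row of (27): preverb accented,
# causer prefix and negative marker inside the infix, causative as bare r.
print(caus_infix_default_accentuation("d<k\u00e1[sy[my]r]pa>=ma"))
```

Because the causative marker contributes no vowel, every vowel inside the infix is followed by at least one further symbol before the closing bracket, so all of them are accented, while the root vowel after the infix is left alone.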
(31) define inducedRightAccent [ a -> á, y -> ý || Rel OblLoc "