Machine Translation (2003) 18:173–189 DOI 10.1007/s10590-004-2480-9

© Springer 2005

Design, Implementation and Evaluation of an Inflectional Morphology Finite State Transducer for Irish

E. UÍ DHONNCHADHA1, CAOILFHIONN NIC PHÁIDÍN2 and JOSEF VAN GENABITH3

1 Institiúid Teangeolaíochta Éireann, 31 Fitzwilliam Place, Dublin 2, Ireland, E-mail: [email protected];
2 Fiontar, Dublin City University, Dublin 9, Ireland, E-mail: [email protected];
3 National Centre for Language Technology, School of Computing, Dublin City University, Dublin 9, Ireland, E-mail: [email protected]

Abstract. Minority languages must endeavour to keep up with and avail of language technology advances if they are to prosper in the modern world. Finite state technology is mature, stable and robust. It is scalable and has been applied successfully in many areas of linguistic processing, notably in phonology, morphology and syntax. In this paper, the design, implementation and evaluation of a morphological analyser and generator for Irish using finite state transducers is described. In order to produce a high-quality linguistic resource for NLP applications, a complete set of inflectional morphological rules for Irish is handcrafted, as is the initial test lexicon. The lexicon is then further populated semi-automatically using both electronic and printed lexical resources. Currently we achieve coverage of 89% on unrestricted text. Finally we discuss a number of methodological issues in the design of NLP resources for minority languages.

Key words: computational morphology, finite state transducer, Irish, Celtic languages, minority languages

1. Introduction

In this article, we describe the design, implementation and evaluation of a morphological transducer for Irish. Initially we survey the existing available resources. This is followed by a description of the morphological phenomena particular to Irish. We then present the basic design of the system and describe the iterative process of implementing the phenomena and testing the system. We look at methods of populating the lexicon semi-automatically in order to achieve a full-scale model of the language. Currently we achieve coverage of 89% on unrestricted text. Finally, we discuss a number of methodological principles for the development of Natural Language Processing (NLP) resources for minority languages.


UI´ DHONNCHADHA ET AL.

2. Survey of Existing Linguistic and Electronic Resources for Irish

In building a morphological transducer for any language, both computational and linguistic resources are necessary. We require software for implementing the system, as well as knowledge of the linguistics of word formation (morphology) of the language. It makes sense to use existing software resources. A toolkit saves a great deal of development time and allows the developer to concentrate on the system design and linguistic issues. Finite state technology is a mature technology with a number of sophisticated, language-independent toolkits currently available: Xerox Finite State Technology,1 HTK,2 FSM Library (Mohri et al., 2003), PC-KIMMO (SIL, 2004), UNITEX,3 Finite State Utilities (Daciuk, n.d.), and Grail+.4 Xerox Finite State Tools were used in developing this finite state morphology for Irish. Full details of the tools may be found in Beesley and Karttunen (2003) or on the XRCE website.5

The linguistic resources available to the project include standard Irish reference grammars (Christian Brothers, 1988; Rannóg an Aistriúcháin, 1988; Bráithre Críostaí, 1999), a corpus of Irish texts (ITÉ, n.d.) and a pocket dictionary (An Roinn Oideachais, 1986) in machine-readable format.

3. Issues in Irish Morphology

Irish belongs to the Celtic branch of the Indo-European family of languages. In common with most Indo-European languages, Irish is an inflectional language, i.e. it displays grammatical relationships morphologically. The Celtic languages are the only Indo-European languages with verb-subject-object word order (Fife, 1993: 23). Irish morphology uses both prefixing and suffixing. Prefixes are mainly used derivationally while suffixes are mainly used inflectionally (Stenson, 1981: 17). Inflections also frequently include modification to the stem.
These modifications may be divided into two categories: initial mutations, which affect the initial letter (phoneme) of the word, and final mutations, which affect the final syllable. Phonological accommodation at the juncture of two adjacent words is a common feature of many languages. Usually the end of the first word is affected. In Irish, however, the initial syllable of the second word is affected. The phenomenon of initial mutation is a typifying feature of Irish, and of the Celtic languages in general (Stenson, 1981: 18). As the language changed over time, many of the phonological conditions causing the mutations disappeared, but the mutations remain and have become grammaticalised (Ó Cuív, 1987: 395–400; Russell, 1995: 237; Campbell, 2000: 324).

INFLECTIONAL MORPHOLOGY FOR IRISH


These mutations are now used to mark inflectional characteristics such as tense, number, case, gender and definiteness.

3.1. INITIAL MUTATIONS

Irish has a number of types of initial mutation, known as lenition, eclipsis and vowel prefixing. For example, note the initial mutation of athair ‘father’ which becomes (na) haithreacha ‘(the) fathers’ after the plural definite article na. Initial mutation makes the task of using a conventional dictionary quite challenging for a learner of Irish. Examples of lenition (h), eclipsis (m in example below) and vowel prefixing (h in example below) are given for base forms with consonant-initial stem (1) and vowel-initial stem (2).

(1) bróg ‘shoe’
    Lenition: an bhróg ‘the shoe’; mo bhróg ‘my shoe’
    Eclipsis: seacht mbróg ‘seven shoes’; i mbróg ‘in a shoe’

(2) athair ‘father’
    Vowel Mutation: na haithreacha ‘the fathers’

3.2. FINAL MUTATIONS

Consonant harmony is central to Irish morphology. Stems and suffixes must agree regarding broadness or slenderness. Each consonant has a broad and slender counterpart. The broad and slender version of a consonant is denoted orthographically by its adjacent vowel. The broad vowels (back and centre vowels) are a á o ó u ú, and the slender vowels (front vowels) are e é i í. To maintain harmony either the stem or the suffix may be broadened or slenderised as the case may be, depending on the lexical item in question. Broadening and slenderising are also frequently employed, without affixation, to mark grammatical functions such as number and case. A stem is judged to be broad or slender based on the vowels in the final syllable.

Additionally, some stems undergo a process known as syncopation. Some polysyllabic stems, whose final syllable is unstressed, lose the vowels of the last syllable when a suffix is added. Collectively, broadening, slenderising and syncopation are known as final mutations, examples of which are illustrated for a broad-stemmed base form in (3), a slender-stemmed base form in (4), and a polysyllabic slender stem in (5).

(3) bróg ‘shoe’
    Broad suffix -a appended to mark nominative case plural: na bróga ‘the shoes’
    Stem slenderised and slender suffix -e appended to mark genitive case singular: na bróige ‘of the shoe’

(4) súil ‘eye’
    Slender suffix -e appended to mark genitive case singular: na súile ‘of the eye’
    Stem broadened to mark genitive case plural: na súl ‘of the eyes’

(5) cathair ‘city’
    Syncopation of final syllable, resulting in a broad stem cathr-, to which broad suffix -ach is appended to mark genitive case singular: na cathrach ‘of the city’
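The broad/slender judgement used in examples (3) to (5) can be illustrated with a few lines of Python. This is only a sketch under a simplifying assumption that is not stated in the paper: the quality of a stem is read off the last vowel letter it contains. The function name `stem_quality` is hypothetical.

```python
# Vowel letters as listed in the text: broad (back/centre) and slender (front).
BROAD_VOWELS = set("aáoóuú")
SLENDER_VOWELS = set("eéií")


def stem_quality(stem: str) -> str:
    """Return 'broad' or 'slender' based on the last vowel letter of the stem.

    Simplifying assumption: the last written vowel decides the quality
    of the stem-final consonant(s).
    """
    for letter in reversed(stem.lower()):
        if letter in BROAD_VOWELS:
            return "broad"
        if letter in SLENDER_VOWELS:
            return "slender"
    raise ValueError(f"no vowel found in {stem!r}")


# Stems taken from examples (3) to (5).
for stem in ["bróg", "bróig", "súil", "súl", "cathair", "cathr"]:
    print(stem, stem_quality(stem))
```

Under this assumption the syncopated stem cathr- comes out as broad, which agrees with the choice of the broad suffix -ach in (5).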

4. Architecture of a Finite State Morphology

A finite state morphology consists of two main components: (a) a lexicon of stems, in which each stem is associated with the appropriate class(es) of affixes, and (b) replace rules encoded as either regular expressions, two-level rules or graphs, which implement spelling alternations (e.g. mutations). Xerox Finite State Tools were used to implement these components. The lexc tool was used to create lexicon transducers, and xfst to create the rule transducers. Irish morphology is implemented as a cascade of finite state transducers. Code examples below follow Xerox encoding formats.

The morphological phenomena of Irish can be divided into two main types: (a) stem affixation, consisting of prefixing and suffixing of stems, and (b) stem modification, consisting of initial and final mutations. Affixation (concatenative morphology) is implemented in the lexicon by listing stems, and by defining the various classes of affixes which may attach to each stem, for example type-1 nominative plurals, type-2 nominative plurals, type-1 future tense, type-2 future tense, as appropriate for the language in question. These classes of affixes are referred to as continuation classes or (sub)lexicons. For convenience, stems are categorised according to part-of-speech (word-class), i.e. there is a separate verb lexicon, noun lexicon, adjective lexicon, and so on. Each lexicon is compiled as a finite state transducer.

Stem modifications (i.e. initial and final mutations) of the productive inflected parts-of-speech (nouns, verbs and adjectives) are modelled using replace rules. Each replace rule is also compiled as a finite state transducer. Lexicon transducers are composed with the appropriate replace rule transducers. The resultant transducers are then merged (unioned) to form one overall morphological transducer, as shown in Figure 1.

5.
The Irish Finite State Lexicon

The lexicon contains stems, each of which has a two-level representation, i.e. an upper-level and a lower-level representation, which correspond to a lexical description and a surface form of the word. In the case of inflected parts-of-speech the lexicon will contain a ‘pre-surface’ form which requires internal modifications to be carried out using replace rules. These pre-surface forms contain lexically provided replace rule triggers of the form ^trigger. The actual surface word form is arrived at only when all the necessary replace rules have taken effect and the triggers, having served their purpose, are discarded.

Figure 1. Architecture of morphological transducer.

Two-level representations are specified in a text file in the form upper:lower ContinuationClass;, for example +Pl:eacha Pl-Initial;. When a mapping (e.g. upper:lower) is not explicitly stated, the string is assumed to map onto itself. Therefore athair Nm5-2; in Figure 2 is interpreted by the compiler as athair:athair Nm5-2;. (Any text following an exclamation mark is a comment.) Each stem is associated with zero or more continuation classes (implemented in lexc as (sub)lexicons), which in turn may point to further continuation classes. Each continuation class concatenates lexical and/or surface material to the end of the current string as illustrated in Figure 3, where the path for athair in the lexicon is traced. Each string is finally terminated by a # symbol.

The initial test lexicons for Irish contained approximately 500 stems and 90 continuation classes. New stems are added to the system by listing the stem and the appropriate “initial continuation class”, e.g. bráthair Nm5-2;.

Figure 2. Part of noun lexicon in lexc format.

6. Regular Expression Replace Rules for Irish Internal Modifications

The lexicon handles only concatenative morphology. Stem-internal modifications are implemented using replace rules which are encoded as regular expressions of the form:

String -> Replacement String || Left-context _ Right-context;
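To make the purely concatenative role of the lexicon concrete, the following Python sketch walks a chain of continuation classes and accumulates the upper (lexical) and lower (pre-surface) strings. It is not lexc. The stem entry `athair ... Nm5-2`, the entry `+Pl:eacha Pl-Initial` and the two resulting strings are taken from the paper, but the class name `PlSuffix` and the way tags and triggers are distributed over the classes are assumptions made for illustration, since the content of Figures 2 and 3 is not reproduced in the text.

```python
# Each (sub)lexicon maps to a list of entries: (upper, lower, next_class).
# "#" marks the end of a string, as in lexc.
LEXICON = {
    "Root": [("athair", "athair", "Nm5-2")],
    # Hypothetical split of tags and triggers across classes:
    "Nm5-2": [("+Noun+Masc+Com", "^C^Coim^Caol", "PlSuffix")],
    "PlSuffix": [("+Pl", "eacha", "Pl-Initial")],  # +Pl:eacha Pl-Initial;
    "Pl-Initial": [("+Def", "^Sé^Urú^hv", "#")],
}


def expand(cls="Root", upper="", lower=""):
    """Yield every (lexical, pre-surface) pair reachable from class `cls`."""
    if cls == "#":
        yield upper, lower
        return
    for up, low, next_cls in LEXICON[cls]:
        # A continuation class only appends material; it never edits the stem.
        yield from expand(next_cls, upper + up, lower + low)


for lexical, pre_surface in expand():
    print(lexical, "->", pre_surface)
```

The pre-surface string still contains the triggers and the unmodified stem; turning it into the surface form is left entirely to the replace rules.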

Figure 3. Example of one path through the lexc lexicon.

An example of an initial mutation rule, insert h before a vowel, is shown in (6), where the empty string ([..]) is replaced by h in the context that the empty string is at the start of a word (.#.) and is followed by a vowel which is followed by one or more other characters (?+). (A variable Vowel has been defined as [a|e|i|o|u|á|é|í|ó|ú] for use in the replace rules.)

(6) [..] -> h || .#. _ Vowel ?+
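For readers more familiar with string regular expressions than with xfst, rule (6) can be approximated in Python as below. This is an analogy, not a transducer: it works in one direction only, and the function name `prefix_h` is hypothetical.

```python
import re

# The Vowel variable from the text, written as a character class.
VOWELS = "aeiouáéíóú"

# Zero-width match at the start of the word (.#.), followed by a vowel
# and at least one further character (Vowel ?+).
H_BEFORE_VOWEL = re.compile(rf"^(?=[{VOWELS}].)")


def prefix_h(word: str) -> str:
    """Insert h at the start of any vowel-initial word of two or more letters."""
    return H_BEFORE_VOWEL.sub("h", word, count=1)


print(prefix_h("aithreacha"))  # vowel-initial: h is prefixed
print(prefix_h("bróga"))       # consonant-initial: unchanged
print(prefix_h("athair"))      # also prefixed: the rule has no grammatical condition
```

As written, the rule fires on every vowel-initial string, including forms such as the bare singular athair where no h should appear.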

7. Combining the Lexicon and Replace Rules

A replace rule is applied to all pre-surface strings defined in the lexicon. If the conditions specified in the rule are present in a particular string, e.g. in (6) a string beginning with a vowel followed by one or more characters, then the replacement takes place. The conditions are, however, often too liberal. We do not wish to prefix h- to every vowel-initial word form. The replacement should only take place in specific grammatical contexts, i.e. in common plural forms and feminine genitive singular forms. In order to supply extra context information for the rule we can append additional (temporary) information in the lexicon, in the form of replace rule triggers. Conventionally these triggers are prefixed with a ^ symbol. In the final pre-surface form (as given in Figure 3), there are a total of six triggers (bold text): athair^C^Coim^Caoleacha^Sé^Urú^hv, two of which, ^C and ^hv, are relevant to the initial mutation replace rule as restated in (7).

(7) [..] -> h || .#. _ Vowel ?+ [[%^F %^G] | [%^C] | [%^Verb]] ?* %^hv
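The effect of the right context in (7) can be sketched in Python by first splitting the pre-surface string into symbols, so that a multicharacter trigger such as ^Coim is never mistaken for ^C followed by letters. The trigger inventory below contains only the triggers named in the text; the real system presumably defines more. The helper names `tokenize` and `rule_7` are hypothetical, and the code is an approximation of the xfst rule rather than an equivalent transducer.

```python
import re

VOWELS = set("aeiouáéíóú")

# Triggers named in the text, longest first so that alternation picks
# "^Coim" or "^Caol" before the shorter "^C".
TRIGGERS = ["^Coim", "^Caol", "^Verb", "^Urú", "^Sé", "^hv", "^C", "^F", "^G"]
SYMBOL = re.compile("|".join(re.escape(t) for t in TRIGGERS) + "|.")


def tokenize(pre_surface: str) -> list:
    """Split a pre-surface string into letters and trigger symbols."""
    return SYMBOL.findall(pre_surface)


def rule_7(pre_surface: str) -> str:
    """Prefix h only if the licensing triggers of rule (7) follow the initial vowel."""
    syms = tokenize(pre_surface)
    if not syms or syms[0] not in VOWELS:
        return pre_surface
    # Index 1 onwards is '?+', so a licensing trigger can start at index 2 or later.
    for i in range(2, len(syms)):
        if syms[i] in ("^C", "^Verb"):
            end = i + 1
        elif syms[i:i + 2] == ["^F", "^G"]:
            end = i + 2
        else:
            continue
        if "^hv" in syms[end:]:  # '?* %^hv'
            return "h" + pre_surface
    return pre_surface


print(rule_7("athair^C^Coim^Caoleacha^Sé^Urú^hv"))
print(rule_7("athair"))
```

Only the initial mutation is applied here; the final mutations that turn the stem into aithreacha, and the removal of the triggers, belong to other rules in the cascade.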

This rule is constrained to fire only where the “h before vowel” trigger is present (^hv), and only in the case of feminine genitive singular nouns (^F^G), common plural nouns (^C) or verbs (^Verb). A cascade of transducers is composed, passing through many intermediate lexical:(pre)surface stages, before the final lexical:surface pair athair+Noun+Masc+Com+Pl+Def:haithreacha is achieved. The sequence of lexicon and replace rule transducers (regular expressions and triggers shown in bold text) is given in Figure 4.

Triggers and replace rules are used extensively to deal with the heavily grammaticalised internal stem modifications of Irish. In general, triggers are introduced only in the lexicon, and replace rules systematically consume these trigger tags. This policy results in a clear division of labour between the lexicon and replace rules, and allows more flexibility in the ordering of replace rules during the development phase. At the end of the design stage there were approximately 90 inflectional morphology replace rules.

7.1. EXCEPTION HANDLING

Exceptions to the morphological rules can be dealt with in one of two ways depending on the degree of irregularity. If a stem’s inflected forms are completely irregular (e.g. involving suppletion) then it is most efficient to handcode all of the inflected forms in a separate lexicon. Replace rules may be applied to these forms if necessary (e.g. in Irish, initial mutation rules apply in the same way to regular and irregular word forms). If the stem inflection is only mildly irregular then it may be included in the lexical category with which it shares the most characteristics. Separate transducers may be defined which filter out the incorrectly generated forms in the final transducer, and which add the desired irregular forms (Beesley and Karttunen, 2003). In this way, irregular forms can be handled in a systematic and practical manner.

8. Testing

Testing is an integral part of the development process. There are two main areas of concern during the development of the system: (a) linguistic accuracy, and (b) well-formedness of the lexical, surface and intermediate representations.

8.1. REGRESSION TESTING FOR ACCURACY

As the system grows in size and complexity it is important to have means of monitoring progress. As with all software systems which are developed incrementally, regression testing is of vital importance. During development, there is a constant danger that parts of the system which were tested and working will be disrupted by the addition of new rules or when fixing a particular problem. Confidence in the integrity and accuracy of the system is maintained by carrying out regression testing consistently from the start.

Figure 4. Composition of lexicon and ordered replace rule transducers.


The Xerox xfst tool has several very useful features, as documented by Beesley and Karttunen (2003), which enable rigorous and consistent testing. Using finite state operations such as projection and subtraction, the network of old surface forms can be subtracted from the network of new surface forms giving the list of new word forms which have just been added. By examining this list one can check that only correct word forms have been generated. It is equally important to identify forms that have been lost. By subtracting the new network from the old network we can list the word forms which have been lost. This may be as intended, or it could signify a problem. If these checks are performed after each change to the system, any unintentional effects can be quickly spotted and the problem can be rectified before continuing.
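The two subtractions described above can be mimicked with ordinary Python sets when the surface language is finite and small enough to enumerate. This is a sketch of the idea only: xfst performs the subtraction on networks, not on word lists, and the word lists and the function name `regression_diff` below are illustrative assumptions.

```python
def regression_diff(old_forms, new_forms):
    """Return (added, lost) surface forms between two builds of the system."""
    old, new = set(old_forms), set(new_forms)
    added = sorted(new - old)  # new network minus old network
    lost = sorted(old - new)   # old network minus new network
    return added, lost


# Hypothetical surface forms from two successive builds.
old_build = ["bróg", "bróga", "bróige", "súil"]
new_build = ["bróg", "bróga", "súil", "súile", "súl"]

added, lost = regression_diff(old_build, new_build)
print("added:", added)  # check that only correct forms were introduced
print("lost:", lost)    # intended removal, or a regression?
```

In this made-up run the loss of bróige would be the item to investigate, since the change was meant to add forms of súil only.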

8.2. WELL-FORMED STRINGS

It is important for both analysis and generation that the lexical tags are consistent in naming and in order of concatenation. Although the same information is provided, it would be undesirable for example for some nouns to be specified as nounstem+Noun+Com+Sg and others as nounstem+Noun+Sg+Com. The tags must appear in a specified order to provide a consistent interface to other systems. In the case of nouns, there are mandatory tags to describe lexical class, gender, case, number and definiteness as well as a number of optional tags. Some tags are mutually exclusive, for example gender must be either +Fem or +Masc. The well-formedness of lexical strings can be checked using a lexical tag grammar written in the form of a regular expression (Beesley and Karttunen, 2003), which defines the sequence and optionality characteristics of all well-formed lexical representations. By subtracting the lexical grammar network from the upper level of the noun transducer, we are left with any strings that do not conform to the grammar, thus signalling a problem which requires attention. Either the grammar or the strings will need to be modified as appropriate. It is also necessary that the triggers in the pre-surface forms conform to a specified standard in order to ensure that replace rules will fire as intended. A tag grammar for the lexicon pre-surface strings is used to maintain accuracy and consistency in the interface between lexicons and replace rules.
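A lexical tag grammar of this kind can be sketched as a Python regular expression, with the list of non-matching strings playing the role of the subtraction result. The tags +Noun, +Masc, +Fem, +Com, +Sg, +Pl and +Def appear in the paper; +Gen and +Idf are hypothetical placeholders for the remaining case and definiteness values, and the optional tags mentioned in the text are left out.

```python
import re

# Mandatory noun tags in a fixed order: class, gender, case, number, definiteness.
NOUN_TAG_GRAMMAR = re.compile(
    r"[^+]+"           # the stem itself
    r"\+Noun"          # lexical class
    r"\+(?:Masc|Fem)"  # gender: mutually exclusive
    r"\+(?:Com|Gen)"   # case (+Gen is a hypothetical tag name)
    r"\+(?:Sg|Pl)"     # number
    r"\+(?:Def|Idf)"   # definiteness (+Idf is a hypothetical tag name)
)


def ill_formed(lexical_strings):
    """Return the lexical strings that the tag grammar does not accept."""
    return [s for s in lexical_strings if not NOUN_TAG_GRAMMAR.fullmatch(s)]


upper_side = [
    "athair+Noun+Masc+Com+Pl+Def",       # well formed
    "athair+Noun+Masc+Pl+Com+Def",       # case and number swapped
    "athair+Noun+Masc+Fem+Com+Pl+Def",   # two genders
]
print(ill_formed(upper_side))
```

An empty result means every lexical string conforms; anything listed points either to a faulty lexicon entry or to a grammar that needs extending.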


9. Language Coverage and Semi-Automatic Population of the Lexicon

After the design phase, the inflectional morphology rules were implemented and the test lexicons contained approximately 500 manually entered items. A list of the most frequently used word forms in a corpus of Irish texts (ITÉ, n.d.) was used to select a further 600 items, which were also manually added (Uí Dhonnchadha, 2003). As well as ensuring a basic level of language coverage, the manual addition of these extra items served as a test of the completeness and usability of the system. All morphological processes required by the new stems were found to be in place, and only one new suffix needed to be added. It also ensured that the documentation was clear and comprehensive, as this was essential in order to select the correct initial continuation class from more than 90 available classes (60 relating to nouns and the remainder covering all other parts of speech).

If the finite state transducer is to achieve (near) full coverage on unrestricted text, the lexicons must contain many thousands of stems. It is impractical to expect to achieve this by manual means alone, especially in a minority-language context where financial resources are scarce. The process must be automated as far as possible. Data in electronic format may be converted to the format required by lexc. Data that only exists in printed format can be scanned and processed using optical character recognition (OCR) software. There are two steps involved in data conversion: data preparation and data processing.

9.1. DATA PREPARATION

In the current project, a bilingual dictionary (An Roinn Oideachais, 1986), converted from WordPerfect to plain text, was used to populate the lexicons. The text relating to the Irish headwords was converted using Perl to a formatted text file based on the relatively structured layout of the dictionary, as illustrated in Figure 5.

athair1 ahər m, gs -ar pl aithreacha father; ancestor, ...
athair2 ahər f, gs athrach creeper
bróg bro:g f2 boot, shoe, ∼ adhmaid clog
cathair kahər f, gs -thrach pl -thracha city; circular stone fort, ...
súil su:l f2, gs & npl ∼e gpl súl eye; expectation, hope ...

Figure 5. Examples of dictionary entries.

The reformatted text file, with record layout as in (8), was imported into a database (Table I) and cleaned up as necessary. Subsets (e.g. verbs only) were then exported for further processing (i.e. determination of continuation class).

(8) headword|superscript|pronunciation|part-of-speech|definition
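Reading records in layout (8) is straightforward in any scripting language. The paper used Perl; the sketch below uses Python instead, and the type `Entry` and function `parse_record` are hypothetical names. The sample record is assembled from the súil entry in Figure 5, with an empty superscript field.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    headword: str
    superscript: str
    pronunciation: str
    pos: str
    definition: str


def parse_record(line: str) -> Entry:
    """Split one pipe-delimited record into its five fields."""
    # maxsplit=4 keeps any further '|' characters inside the definition field.
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        raise ValueError(f"expected 5 fields, got {len(parts)}: {line!r}")
    return Entry(*(part.strip() for part in parts))


entry = parse_record("súil||su:l|f2|gs & npl ∼e gpl súl eye; expectation, hope")
print(entry.headword, entry.pos, entry.definition)
```

Rejecting records with the wrong number of fields makes conversion errors visible at the clean-up stage rather than later, when continuation classes are assigned.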

Table I. Structured dictionary text

Head      Sup   Pron    Pos   Definition
athair    1     ahər    m     gs -ar pl aithreacha father; ancestor, ...
athair    2     ahər    f     gs athrach creeper
bróg            bro:g   f2    boot, shoe, ∼ adhmaid clog
cathair         kahər   f     gs -thrach pl -thracha city; circular stone fort, ...
súil            su:l    f2    gs & npl ∼e gpl súl eye; expectation, hope ...

9.2. DATA PROCESSING

The orthography of the headword, the part-of-speech category, and inflectional information in the definition field were processed using Perl, in order to output a list of stems and initial continuation classes in lexc format. The definition field contains valuable inflectional information, which can be exploited using the regular expression features of Perl. For example, “gs -ar pl” indicates that the genitive singular of athair is formed by replacing the final -air of the headword with -ar, while “gs & npl ∼e” indicates that the genitive singular and nominative plural of súil are formed by appending -e to the headword.

It is not possible in all cases to assign an initial continuation class with complete certainty. For example in the case of bróg one can tell from the part-of-speech information “f2” that it is a feminine noun of the second declension, but since no inflectional information is given in the definition we cannot say for certain which of the twelve initial continuation classes in the finite state lexicon, relating to the second declension, it should be assigned to. In these cases the most likely class is assigned, and the output is flagged as requiring manual checking. Automatic class assignment for verbs and adjectives proved very successful; however almost 2,500 nouns required manual checking. Nevertheless this is a very significant improvement over checking all of the nouns manually. The overall results of this process are given in Table II.

Table II. Semi-automatic processing of dictionary

                                            Verbs    Adjectives   Nouns
Number of headwords                         1,600    3,000        10,700
Automatically assigned continuation class   95%      98%          77%
Items requiring manual attention            5%       2%           23%

The morphological transducer was evaluated, both before and after incorporating the automatically assigned items, against a selection of five texts, each containing approximately 1,000 words, from Corpas Náisiúnta na Gaeilge (ITÉ). The results obtained are given in Table III.

Table III. Average recognition rates by finite state transducer

Before semi-automatic population of lexicon   70.0%
After semi-automatic population of lexicon    89.0%
After inclusion of named entities             91.5%
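The pattern matching on the definition field described in Section 9.2 can be sketched as follows. The paper used Perl and does not give its expressions, so the Python pattern below is an assumption: it recognises only the abbreviations gs, pl, npl and gpl that occur in Figure 5, optionally joined by &, and pairs each with the form that follows. The function name `inflection_hints` is hypothetical.

```python
import re

# Abbreviations seen in Figure 5; the real dictionary may use more.
LABEL = r"(?:gs|npl|gpl|pl)"
INFLECTION = re.compile(rf"\b({LABEL}(?:\s*&\s*{LABEL})*)\s+(\S+)")


def inflection_hints(definition: str) -> dict:
    """Map each grammatical label in a definition field to its inflected form."""
    hints = {}
    for labels, form in INFLECTION.findall(definition):
        for label in re.split(r"\s*&\s*", labels):
            hints[label] = form
    return hints


print(inflection_hints("gs -ar pl aithreacha father; ancestor"))
print(inflection_hints("gs -thrach pl -thracha city; circular stone fort"))
print(inflection_hints("boot, shoe"))  # no hints: flag for manual checking
```

An empty result corresponds to the bróg case in the text, where only the part-of-speech code is available and the assigned class has to be flagged for manual checking.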

Most proper nouns (i.e. named entities) were not recognised by the morphological transducer. To supplement the lexicons with this type of data, a number of lists of names and places were scanned (An Roinn Oideachais, 1986: 509–514; Ó Droighneáin, 1991: 92–101; Ó Siochfhradha, 1998: 259–265), resulting in approximately 1,500 named entities. OCR software specifically for Irish was not available to us at the time, but a Portuguese version (which contains all the relevant diacritics) was used, which proved to be adequate.

10. Methodological Principles

During the development phase, a number of design decisions have to be made, which have far-reaching consequences for the maintainability of the system. We emphasise the re-use of existing standards and linguistic resources and discuss issues such as modularity and maintainability.

10.1. USE OF EXISTING STANDARDS

In the interests of compatibility and information exchange, existing standards were used wherever possible. Three areas of standardisation are discussed here: grammatical categories, naming of lexical tags and standard versus dialectal word forms.

The use of traditional grammatical categories makes the manual entry of lexical items more straightforward for the lexicographer/linguist. It also facilitates the re-use of existing lexical resources to populate the finite state morphology lexicon. As already stated, in order to achieve the greatest possible coverage of the language it is desirable that machine-readable dictionaries, or printed sources which can be transformed with a limited amount of effort into electronic form, be utilised. Most dictionaries contain some grammatical information and if this is available in electronic format, semi-automatic population of a lexicon in a finite state morphological resource is possible (cf. Section 9.1).

The noun and verb finite state lexicons are organised using the traditional declensional and conjugational paradigms detailed in standard reference works (Christian Brothers, 1988; Rannóg an Aistriúcháin, 1988; Bráithre Críostaí, 1999). In the case of adjectives new categories were created, as there did not appear to be a common standard in use in the various reference grammars.

The lexical tags follow closely the Parole common morphosyntactic tagset.6 This tagset was used in POS tagging of the Irish National Corpus, Corpas Náisiúnta na Gaeilge (ITÉ, n.d.), as well as 13 other European corpora.

As regards standard and dialectal word forms, both can be accommodated easily in the lexicons. Currently the tags shown in Table IV are defined, corresponding to the major dialects. Those stems not marked for dialect are assumed to be standard forms.

Table IV. Tags for dialectal word forms

Tag    Meaning
+CC    Canúint Chonnacht, ‘Connaught Dialect’
+CU    Canúint Uladh, ‘Ulster Dialect’
+CM    Canúint na Mumhan, ‘Munster Dialect’

Using the appropriate tags, it is possible to extract various subsets of the language from the final transducer, e.g. standard forms only, or standard forms plus a particular dialect. The same principle could also be applied in order to include historical forms if desired, e.g. sgéal+Hist+Noun ‘story’ representing an older form of the contemporary scéal+Noun ‘story’.

10.2. MODULARITY

The importance of a modular design cannot be overemphasised. It allows the sharing of common code and avoids the problems associated with repetition. It also facilitates easier reorganisation and extension of the system, a regular occurrence during the development phase. The lexicon and replace rules are designed in a strictly modular fashion. In the lexicon, commonly used items such as suffixes and replace rule triggers are placed in their own (sub)lexicon (continuation class) and thus can be accessed and shared by different lexicons. This avoidance of repetition ensures that updates to an item need only be carried out in one place rather than at multiple locations. For example, the rules for initial mutations are common to all parts of speech. Therefore verbs, nouns and adjectives share those rules. Final mutations are more specific to the part of speech; therefore two sets of final mutation transducers are used: one for verbs and one for nouns and adjectives.

A modular approach was also used in the organisation of replace rules. Replace rules dealing with a similar type of linguistic phenomenon are placed together in the same text file and are composed together as one transducer. For example, the various replace rules dealing with broadening are held in one text file and composed as one transducer. This makes the addition of new rules more straightforward. Rules that relate to similar functions are grouped together and therefore the addition of a new rule is less likely to have detrimental or unforeseen effects on other parts of the system. During development, a modular system means that the order of composition of the transducers can be varied until the optimum sequence is arrived at.

Modularity also makes it easier to extend the system. Extensions to handle historical, dialectal and slang word forms can be accommodated within this framework as described by Beesley and Karttunen (2003). It also enables a number of developers to work on different parts of the system at the same time.
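Why the order of composition matters can be seen in a toy Python cascade in which each stage is an ordinary string function. The two stages are simplified stand-ins, not the project's rules: `h_prefix` checks only for the ^C and ^hv triggers, `strip_triggers` removes only the triggers named in the paper, and the input aithreacha^C^Sé^Urú^hv is a hypothetical intermediate form in which the final mutations are assumed to have applied already.

```python
import re
from functools import reduce


def h_prefix(s: str) -> str:
    """Simplified initial-mutation stage: needs ^C ... ^hv after an initial vowel."""
    return re.sub(r"^(?=[aeiouáéíóú].+\^C(?![a-z]).*\^hv)", "h", s)


def strip_triggers(s: str) -> str:
    """Clean-up stage: discard triggers once they have served their purpose."""
    return re.sub(r"\^(?:Coim|Caol|Verb|Urú|Sé|hv|C|F|G)", "", s)


def cascade(*stages):
    """Compose stages left to right into a single function."""
    return lambda s: reduce(lambda acc, stage: stage(acc), stages, s)


intermediate = "aithreacha^C^Sé^Urú^hv"  # hypothetical intermediate form
print(cascade(h_prefix, strip_triggers)(intermediate))  # mutation, then clean-up
print(cascade(strip_triggers, h_prefix)(intermediate))  # clean-up too early
```

Because each stage is a separate unit, trying a different sequence is a one-line change, which is the practical benefit of modularity during development.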

10.3. MAINTAINABILITY

A primary objective in the design of the morphological transducer is to make it as intuitive as possible to add new items to the lexicon, either manually or automatically. Once the classes of words in the finite state lexicon are defined and documented, the task of adding new words can be carried out by a lexicographer who is not required to have any knowledge of lexc or programming in general (Beesley and Karttunen, 2003). In order to generate the inflectional forms for a new stem, the stem must be added to the relevant stem lexicon and assigned the correct initial continuation class(es). The stem then acquires all the lexical descriptions and inflected forms that have been defined for this class.

In order to keep the choice of initial continuation classes to a minimum, complexity is handled, whenever possible, by replace rules rather than the lexicon. In Irish there are often two versions of a suffix, one for broad stems and one for slender stems. Rather than having two initial continuation classes in the lexicon, we create a broad suffix class only, and make adjustments for slender stems using replace rules. This results in fewer classes in the lexicon, but a greater number of rule triggers. However, since rules are hidden from the user, this is not a cause for concern.

11. Conclusion

Minority languages must endeavour to keep up with the language technology advances available to the more dominant languages if they are to survive and prosper in the modern world. In order to do so, a sound policy of NLP development for the language must be implemented. A key requirement in the development of NLP technology for minority languages or lesser-studied languages is that the development process uses a solid foundation upon which to build in the future. The combined use of finite state techniques, which are proven to be successful in implementing language resources for many major and typologically diverse languages, together with the judicious reuse of existing resources for the language, can play a positive role in the context of NLP for lesser-studied languages.

Notes

1 Xerox Finite State Technology, Xerox Research Centre Europe, www.xrce.xerox.com/competencies/content-analysis/fst/ (consulted 1/10/2003).
2 Hidden Markov Model Toolkit (HTK), Cambridge University Engineering Department, htk.eng.cam.ac.uk/ (consulted 1/10/2003).
3 University of Marne-la-Vallée, www-igm.univ-mlv.fr/~unitex/ (consulted 1/10/2003).
4 Department of Computer Science, University of Western Ontario, Canada, www.csd.uwo.ca/research/grail/ (consulted 1/10/2003).
5 See note 1.
6 www.ite.ie/pos.htm


References

An Roinn Oideachais: 1986, Foclóir Póca English–Irish/Irish–English Dictionary [Pocket dictionary], Baile Átha Cliath: An Gúm.
Beesley, K. and L. Karttunen: 2003, Finite State Morphology: Xerox Tools and Techniques, Stanford, CA: CSLI.
Bráithre Críostaí: 1999, Graiméar Gaeilge na mBráithre Críostaí [The Christian Brothers’ Irish Grammar], 2nd ed., Baile Átha Cliath: An Gúm.
Campbell, G. L.: 2000, Compendium of the World’s Languages, 2nd ed., London: Routledge.
Christian Brothers: 1988, New Irish Grammar, Dublin: Fallons.
Daciuk, J.: n.d., Finite state utilities. juggernaut.eti.pg.gda.pl/~jandac/fsa.html (consulted 1/10/2003).
Fife, J.: 1993, ‘Introduction’, in M. J. Ball and J. Fife (eds), The Celtic Languages, London: Routledge, pp. 3–25.
ITÉ (Institiúid Teangeolaíochta Éireann): n.d., Corpas Náisiúnta na Gaeilge [National Corpus of Irish]. www.ite.ie/corpus (consulted 1/10/2003).
Mohri, M., F. C. N. Pereira and M. D. Riley: 2003, AT&T FSM Library™ – Finite-State Machine Library, www.research.att.com/sw/tools/fsm/ (consulted 1/10/2003).
Ó Cuív, B.: 1987, ‘Sandhi phenomena in Irish’, in H. Andersen (ed.), Sandhi Phenomena in the Languages of Europe, Berlin: Mouton de Gruyter, pp. 395–414.
Ó Droighneáin, M.: 1991, An Sloinnteoir Gaeilge agus an tAinmneoir [Irish Surnames and Names]. Baile Átha Cliath: Coiscéim.
Ó Siochfhradha, N.: 1998, Foclóir Gaeilge/Béarla – Béarla/Gaeilge [Irish/English – English/Irish Dictionary], Baile Átha Cliath: An Comhlacht Oideachais, Cló Thalbóid.
Rannóg an Aistriúcháin: 1958, Gramadach na Gaeilge agus Litriú na Gaeilge: An Caighdeán Oifigiúil [Irish Grammar and Irish Spelling: The Official Standard]. Baile Átha Cliath: Oifig an tSoláthair.
Russell, P.: 1995, An Introduction to the Celtic Languages. London: Longman.
SIL International: 2004, PC-KIMMO, A morphological parser. www.sil.org/pckimmo/ (consulted 1/10/2003).
Stenson, N.: 1981, Studies in Irish Syntax, Tübingen: Gunter Narr.
Uí Dhonnchadha, E.: 2003, ‘Finite-State Morphology and Irish’, in Proceedings of the Workshop on Finite-State Methods in Natural Language Processing, 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 43–49.
