W. Teubert, R. Markincevicene (eds.) Proceedings of the European Seminar on Language Resources, Kaunas, 1997
A GENERIC PLATFORM FOR DEVELOPING LANGUAGE RESOURCES AND APPLICATIONS
Dan Tufiş
Romanian Academy Center for Artificial Intelligence
“Casa Academiei”, 13 “13 Septembrie”, Ro-74311, Bucharest, Romania
[email protected]
Abstract
The paper describes a unification-based language engineering platform intended for the development of reversible language resources and linguistic applications. The platform, called EGLU (Environnement Générique Linguistique d’Unification), is an enhanced, generalized port of ISSCO’s original ELU from SunOS Allegro Common Lisp to Macintosh Common Lisp and CMU Common Lisp (under Solaris). Several language resources have been developed, ranging from small demo MT systems (French-English) and medium-sized grammars (French) to large dictionaries (Romanian).
Introduction
For a long time, NLP practitioners treated parsing and generation as distinct and independent processes, with different tools, techniques and methodologies. Consequently, the linguistic knowledge needed for the two directions of natural language processing has mostly been designed and implemented unidirectionally, incorporating procedural knowledge sensitive to the processing direction (analysis or generation). In the mid-1980s, successful research was carried out on the unitary treatment of the two processes and on unifying the knowledge representation for both [1, 2, 3]. Research into automatic translation proved very productive for the linguistic resource technology of the 1990s, which could be called reversibility technology. The reversibility of linguistic descriptions became possible through the adoption of declarative formalisms, mainly those based on the unification of attribute-value structures. Reusability of linguistic knowledge is a natural requirement, given the rapid technological and conceptual advances of recent years, which are likely to continue. If one considers the effort needed to implement a processing environment against the effort required for a wide-coverage language description (according to some estimates this ratio is at least 1:100), the reusability criterion in encoding language descriptions should normally prevail over efficiency. A linguistic description intimately tied to a particular theory, however performant it might be, is
at risk of partial or even complete reformulation if the grammar theory or formalism has to be changed. ELU and its descendants MacELU and EGLU have been designed to answer some of these questions and to offer the computational linguist an easy-to-use platform for experimenting with several linguistic theories and for language resource development and evaluation. Although primarily a development tool, the platform may be used quite satisfactorily for building medium-sized applications (NL front-ends, MT systems, etc.).
1. From ELU via MacELU to EGLU
Based on the UD implementation [4], the group at ISSCO developed ELU in Allegro Common Lisp under SunOS 4. The initial UD environment, containing the basic unification machinery, the continuation-based lexicon, the parser and the generator, was extended with a transfer module and a new lexical component based on inheritance hierarchies [5]. As a unification-based system, ELU can be characterized as weakly typed, with negation (~val) and disjunction (val1/val2/.../valn) over the atomic constants, and allowing recursive and parametric relational abstractions. During 1993-94, within a bilateral cooperation between ISSCO and RACAI [6], the ELU environment was ported to Macintosh computers under MCL. The port preserved full compatibility with the SUN version, but large parts of the code were rewritten for efficiency and portability. The new code was developed so that the same files would compile both under MCL on MacOS and under Allegro CL on SunOS. Since the environment was being intensively used at RACAI for the development of the Romanian morphology and of a large public-domain lexicon (see section 3), we decided to make another port, this time to a public-domain Lisp. Our choice was CMU-LISP, running under Solaris, probably the best freeware implementation of Common Lisp. The decision was supported not only by the efficiency of this CL implementation, but also by the API facilities offered by CMU-CL, allowing almost zero-cost future ports to other platforms (HP, SGI, Linux, etc.). The full access to the X graphics functions that CMU-CL ensures was another very attractive feature of this environment (we plan to add a graphical interface and a feature-structure browser).
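To illustrate the kind of constraints this allows, the following Python sketch (our own illustration, not ELU code) unifies two atomic-value constraints, where a constraint is a plain constant, a disjunction val1/val2/.../valn (modelled here as a set) or a negation ~val (modelled as a ('not', constant) pair):

```python
def unify_atomic(a, b):
    """Unify two atomic constraints; return the result or None on failure.
    A constraint is a string constant, a frozenset (disjunction), or
    ('not', c) for a negation."""
    def as_set(x):
        if isinstance(x, frozenset):
            return x
        if isinstance(x, str):
            return frozenset([x])
        return None                      # a negation

    sa, sb = as_set(a), as_set(b)
    if sa is not None and sb is not None:
        common = sa & sb                 # constant/disjunction vs same
        if not common:
            return None                  # no value satisfies both
        return next(iter(common)) if len(common) == 1 else common
    if sa is None and sb is None:        # two negations
        return a if a == b else None
    neg = a if sa is None else b
    pos = sb if sa is None else sa
    rest = pos - {neg[1]}                # negation filters the disjunction
    if not rest:
        return None
    return next(iter(rest)) if len(rest) == 1 else rest
```

For example, unifying the disjunction sg/pl with the negation ~sg leaves the single value pl, while unifying two distinct constants fails.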
This new port, which we baptized EGLU (Environnement Générique Linguistique d’Unification), preserves full compatibility with the initial ELU, but thanks to conditional compilation special forms its code compiles and runs without any modification under MCL (MacOS), Allegro CL (SunOS) and CMU-CL (Solaris). It takes better advantage of the operating system: for instance, any command not recognized by the interpreter is sent to an operating system shell, and only if the return code signals an error is a complaint addressed to the user. Building applications and patching have been optimized (with appropriate conditional compilation forms) for CMU-CL. For instance, a dumped application initially needed 21 MB of hard disk; after filtering out the code unnecessary for executing EGLU applications, the hard-disk requirement dropped to 6 MB. Besides the ISO Latin-1 character set present in the initial ELU implementation, we incorporated ISO Latin-2, which covers the Romanian diacritics. The main processing functions of EGLU inherited from ELU are language description compiling, lookup, parsing, generation, transfer, tracing and debugging. They are briefly described, based on [7] and [8], in the next sub-sections.
1.1 Compiling language descriptions
The information about a language supplied to EGLU is generically referred to as a description; this term covers grammars, lexicons, various types of declaration, etc. A description consists of a number of sections, each introduced by '# keyword', where keyword tells the compiler what information the section contains (structural declarations, restrictors, lexical entries, grammar rules, etc.). A description may be spread over a number of files, the compiler directive '# Include filename' causing the compiler to read the contents of filename. All the descriptions specific to a given language can be collectively referred to as a setup. Several setups can be defined (with the compiler directive '# Language language'), which is extremely useful in the development of multilingual language descriptions and applications. Any linguistic processing takes place in a specific setup, but data communication among setups is possible via the transfer process. Translating a sentence S from language L1 into language L2 is achieved by parsing S in the setup of L1, transferring the parsing result (a feature structure) into the setup of L2 and, there, generating from the transferred feature structure. As a descendant of PATR-II, EGLU has a description language whose syntax is very close to that of PATR-II. The building elements in each section of a description are constants, variables, path specifications and restrictions. For documenting descriptions, the compiler allows the user to write comments (lines starting with '##'), which are ignored during compilation. Constants begin with a character in the range a...z. When needed, any other character string may be 'escaped' by enclosing it in single quotes ('&@**!', 'EGLU', '+', '1984', etc.). The quotes do not appear when EGLU prints the constant on the screen. Variables begin with a character in the range A...Z, or with the underscore character '_'.
Like in Prolog, a single underscore is interpreted as an 'anonymous variable' which never receives a binding but matches anything. There is a distinguished variable '*' (without the quotes), which stands for the feature structure (FS) currently being interpreted. Within a lexical entry, '*' refers to the FS associated with the word in question, while in a grammar rule it refers to the 'left-hand side' FS. A path specification has the form <s1 s2 ... sn> (with the angle brackets part of the specification), where each si is a constant or a variable. In most cases the first segment, s1, is a variable (such as *, S, NP, VP, Subj, etc.) naming a FS. There are three basic restriction types: equations, type coercions and preferences. An equation has the form LHS = RHS, where LHS and RHS are path specifications, variables or constants, and '=' implies unification of the two terms. In case LHS is a list-typed FS, RHS may be further specified as an append (++) or difference (--) between a list-typed feature structure and any type of FS. A type coercion has the form LHS == RHS, where LHS is a feature structure and RHS is a declared type; '==' implies checking LHS's type against the declared type RHS. A preference has the form Better > Worse constraint_info, where Better and Worse are variables (identifying generic feature structures) and constraint_info identifies the feature structures Better and Worse. Preference restrictions are used to define a partial order over the multiple solutions resulting from parsing a potentially ambiguous input. With these syntax basics in place, we can now examine language descriptions proper. A language description consists of several sections, which may physically reside in one file or in several files.
Declarations. Among the attributes of a FS, some have a privileged status from the parser's point of view: it needs to know which feature in a FS represents its category and, for lexical items, which feature represents its form.
This information is given in a section headed by
Declare, and containing the keywords Form and Category. The Declare section may contain other, optional declarations, but these are mainly intended as a means for automatic precompilation of path specifications.
Restrictors. Unifying complex FSs is an expensive operation, requiring many individual unifications of atomic FSs. If the two FSs concerned are not unifiable, the system may not discover this until a large amount of useless work has been done. For this reason, it is possible to declare, in a section headed by the Restrict keyword, one or more paths as 'restrictors'. The unifier will attempt to unify the values of these paths first, and will only go on to unify the entire FSs if these initial unifications succeed. The choice of restrictors (e.g. the path encoding category distinctions) has a great effect on the efficiency of parsing and generation.
Type definitions. Besides the built-in list data type (whose syntax is the same as in Edinburgh Prolog, except that the comma delimiter is not needed), the user may define and use his/her own tree-like data types. Such a definition section is headed by the Typedef keyword. The unification of typed feature structures first checks for type congruency and then proceeds much as Prolog unifies two terms. By declaring types for certain feature structures, the efficiency of the whole system can be significantly improved.
Relational abstractions. Grammar and lexicon descriptions often contain blocks of information that must be stated repeatedly, perhaps with some variations, in a number of rules or entries. The relational abstraction facility allows users to remove this information and state it just once. Relational abstractions are a means of abstracting away from particular instances of FSs. In a simple-minded interpretation, relational abstractions can be assimilated to PATR-II macros.
A more thorough analysis of the expressive power of relational abstractions is given in section 2, but here we should mention that they accept arguments, may be defined recursively, and may have multiple definitions with contextual activation. Relational abstractions are defined in one or more sections headed by the Define keyword, and are explicitly invoked where needed by the '!' call operator prefixing their names.
Morpho-lexical definitions. Morpho-lexical definitions are given in one or more sections containing several entries, each of which consists of a headform followed by some information. The entries may specify lexical entries or morphological generalizations. There are two ways to encode morpho-lexical descriptions: in a finite-state manner (a continuation-classes lexicon) or in an inheritance-based way (an inheritance hierarchy lexicon). A continuation-classes lexicon is headed by the keyword Lexicon and each paradigm by the keyword Paradigm, the keyword in both cases being followed by a name for that lexicon or paradigm. An inheritance lexicon is introduced by the keyword Nlexicon (New Lexicon); its lexical entries are introduced by the keyword Word and its morpho-lexical generalizations by the keyword Class. If it is necessary (for whatever reason) to separate entries for nouns, verbs, etc., or to separate types of lexicon information (phonological, morphological, lexical), more than one lexicon section may be used. Intimately related to the morpho-lexical definitions are the Lookup (obligatory), Unknown (optional) and Separators (optional) sections. The section headed by the Lookup keyword tells the parser and the generator which lexicon to search in order to find the feature structures satisfying some discriminating criterion (for instance, the statement “nouns = n” says that if feature structures with cat(egory) n are needed, they are to be found in the lexicon named nouns).
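The restrictor mechanism described earlier can be conveyed by a small Python sketch (a simplified illustration of the idea, not EGLU's implementation): feature structures are modelled as nested dicts, a restrictor is a path of feature names, and a pair of FSs is rejected outright when the atomic values found at a restrictor path clash:

```python
def at_path(fs, path):
    """Follow a path of feature names into a nested dict; None means
    the path is underspecified, hence compatible with anything."""
    for feat in path:
        if not isinstance(fs, dict) or feat not in fs:
            return None
        fs = fs[feat]
    return fs

def quick_check(fs1, fs2, restrictors):
    """Return False only when some restrictor path provably clashes;
    only then would a full (expensive) unification be skipped."""
    for path in restrictors:
        v1, v2 = at_path(fs1, path), at_path(fs2, path)
        # Compare only atomic values: a clash there guarantees failure.
        if isinstance(v1, str) and isinstance(v2, str) and v1 != v2:
            return False
    return True

# With the category path as restrictor, an NP and a VP are rejected
# without inspecting agreement, case, etc.
np = {'cat': 'np', 'agr': {'num': 'sg'}}
vp = {'cat': 'vp', 'agr': {'num': 'sg'}}
restrictors = [('cat',)]
```

This mirrors why the choice of restrictors matters: a cheap comparison on a few well-chosen paths filters out most doomed unifications early.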
The Unknown section contains statements allowing the parser to assign a default lexical interpretation to unknown words. The Separators section defines the punctuation characters significant in the lexical segmentation of the input.
Grammar. The grammar rules and associated restrictions, described in the Grammar section(s), are not needed for morphological analysis or generation (collectively referred to as lexical lookup), but only for processing sequences of two or more lexical items. The information here is pertinent to both parsing and generation. As usual in unification grammar
systems, the body of a rule (which, for tracing and debugging purposes, can be named using the keyword Rule followed by the given name) consists of a context-free rewriting part (including empty rules) and an information part consisting of feature-structure equations. The user may specify the top-level categories (i.e. those that can label the root of a valid parse tree) either by using the special symbol $ as the LHS of a rule or by listing, in the Initial keyword-headed sub-section of the grammar, the names of the rules containing the desired categories as LHSs.
1.2. Language Processing
Lexical Lookup. Based on the morpho-lexical descriptions, EGLU can analyze word-forms or generate the word-forms making up the paradigmatic or lexical family of a given lemma. The direction (analysis or generation) is controlled by a toggle variable (Analyze on/off), and the process itself is launched by the Lookup command, sent to EGLU either by choosing it from a pull-down menu (when using the graphical interface) or at a UNIX-like command line. The result of analyzing a word-form is a set of feature structures, one for each analysis variant. The result of lexical generation from a given lemma is a set of word-forms (with morpheme boundaries marked), each associated with the corresponding FS. If the morpho-lexical description of the language under consideration is an inflectional description, the set of generated word-forms is the paradigmatic family. For a derivational morpho-lexical description, generating from a lemma results in the paradigmatic families of all members of the lemma's lexical family.
Parsing. EGLU's parser is a unification-based chart parser which works in a monotonic and breadth-first manner on feature-structure grammars.
The parser works in two phases: in the first, it builds all possible phrase-structure trees; in the second, each parse tree is checked in a depth-first manner for satisfaction of the constraints annotating the rules that generated the tree (each node in a parse tree is uniquely associated with the rule that created it). Several facilities of chart parsing (local parses, partial parses, active rules, etc.) have been exploited and offered to the user as Tracing or Debugging options. The user can control the number of parses to be returned by the parser, and if filtering is needed, ordering criteria may be described in the Prefer section. When several preferences are given, their order in the Prefer section corresponds to decreasing priority. The parses are ordered according to the specified preferences and the first n parses (as specified in the set accept command) are returned; if no set accept command was issued before the parser was invoked, all the parses are returned.
Generation. Starting from a representation of the sort created by parsing, EGLU will attempt to apply the rules of its current grammar so as to discover the string(s) which the grammar relates to that representation. Put simply, the generator works by taking a FS and searching for grammar rules and lexical entries which can relate subparts of the FS to words. The generator is based on the Head-Driven Generation Algorithm described in [9]. The generator partitions the grammar rules into two sets: the 'chaining rules' (applied bottom-up), in which the semantics of the left-hand side and the semantics of the head daughter are the same, and the 'non-chaining rules' (applied top-down), in which the semantics of these two objects differ. In order to do so, the generator needs to know where the semantic information is located in a FS, which paths to consider and which FS corresponds to the semantic head of a rule.
This information is given in the Sempaths and Genpaths sections of the language description, and by prefixing the head daughter constituent of each grammar rule with an H. At certain stages of the process, only the information present in the values of the paths declared in the Sempaths section is taken into consideration. It is important for the generator to be able to distinguish efficiently between different words in the lexicon. It employs restrictors to select possible rules and words
during its operation, so the choice of restrictors has an important effect on its behavior. Words can be distinguished, for example, by ensuring that the semantics of each word does not unify with the semantics of any other word, and this information can be supplied to the generator most directly by making the distinctive value a restrictor. If partitioning the grammar rules is not possible (which happens when the Sempaths section is missing), generation proceeds in a pure top-down manner, degrading (in most cases) the response time. A special command (gstatus) allows the grammar developer to display some of the information about the current grammar that is used by the generator.
Transfer. ELU (and its successor EGLU) were designed to permit research and experimentation in machine translation. Transfer is defined as the process of mapping FSs into FSs. The process is governed by a transfer grammar, which is a collection of two types of rules: a) lexical (atomic) transfer rules, which state the relationship between the FS corresponding to a lexical item in language L1 and its translation equivalent in L2; b) structural transfer rules, which state the relationship between FSs corresponding to phrasal structures in L1 and their appropriate equivalents in L2; at this level the grammar writer specifies mappings of syntactic relations from one language to the other. A classic example of a transfer rule is given in [7] for a head-modifier relation in German which in French is rendered by an infinitival complementation (the German modifier gern becomes the French main verb aimer, and the German main verb, the head, becomes the infinitive complement of the French head aimer). While in a monolingual environment grammar reversibility is both conceptually and technically a sound and desirable concept, transfer grammar reversibility is still debatable.
Bidirectionality of the transfer would assume only one transfer grammar for each pair of languages, which is precisely what EGLU offers. However, reversibility can be overruled by explicitly working with two transfer grammars, one for each transfer direction. Yet even when unidirectional transfer grammars are used, the dictionaries and the grammars for analysis and generation in the two languages remain reversible.
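The flavor of transfer as FS-to-FS mapping can be conveyed by a toy Python sketch (the rule format and the lexical pairs below are invented for illustration; EGLU's transfer grammars use the feature-structure notation described earlier). A lexical rule rewrites an atomic predicate, while the structural rule mirrors the gern/aimer example: a node modified by gern becomes an aimer node taking the transferred head as its complement.

```python
# Illustrative lexical transfer pairs (German -> French), not a real lexicon.
LEXICAL_RULES = {'singen': 'chanter'}

def structural_gern(fs):
    """Toy mirror of the classic gern/aimer rule: a German head modified
    by 'gern' becomes French 'aimer' taking the head as its complement."""
    if isinstance(fs, dict) and fs.get('mod') == 'gern':
        head = {k: v for k, v in fs.items() if k != 'mod'}
        return {'pred': 'aimer', 'comp': transfer(head)}
    return None

def transfer(fs):
    """Map a source feature structure into a target feature structure."""
    if not isinstance(fs, dict):
        return LEXICAL_RULES.get(fs, fs)   # atomic: lexical transfer
    out = structural_gern(fs)              # structural rules take priority
    if out is not None:
        return out
    return {feat: transfer(val) for feat, val in fs.items()}

# "Hans singt gern" (schematically): the modifier gern disappears and
# aimer takes the transferred verb as its complement.
german = {'pred': 'singen', 'mod': 'gern', 'subj': {'pred': 'hans'}}
```

Here `transfer(german)` yields an aimer structure whose complement is the chanter structure, roughly what the bilingual example in [7] describes.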
1.3. Tracing and Debugging
In order to control and/or examine the operation of the different running processes, the user is offered a set of powerful facilities.
Debugger. The command set debug n, where n is a number from 0 to 9, causes the system to print out messages describing its operation. When the debugging level is set to 0, no messages are printed; higher numbers show progressively more detailed information (see [7] for the significance of each debugging level).
Tracer. The debugger displays a lot of information, even at low levels. This may be useful in an unfocused search, when no hints or intuitions are available. When the grammar developer wants to trace specific rules or relational abstractions, he/she may filter the debugging information by specifying the names of those rules or relational abstractions in the trace command.
Chart. As shown before, the parser uses a "chart" to store the objects it creates during the analysis of a sentence. The chart can be thought of as a series of partial analyses of substrings of the sentence. If the interword positions of a sentence containing N words are numbered sequentially from 0 (in front of the first word) to N (after the last word), then the parser can be
asked to display the partial analyses of any contiguous substring of the initial sentence by typing the command chart start stop. Here, start (defaulting to 0) and stop (defaulting to N) are numbers (between 0 and N) identifying the beginning and the end of the substring. Thus, after analyzing the sentence 'word1 word2 word3 word4', there may be up to ten chart entries (N*(N+1)/2 in general). In debugging a grammar, inspecting partial analyses is extremely useful for localizing potential errors in language descriptions. Unlike the trace command, the chart command is independent of the debugging level.
Rules. The rules command displays the feature structures into which the rules have been compiled. This is very useful (in connection with the other debugging and tracing facilities) for checking, for instance, that the intended re-entrancies have been correctly created.
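The interword-position arithmetic can be checked with a few lines of Python (an illustration, not part of EGLU): for a sentence of N words, positions run from 0 to N and every contiguous substring corresponds to a pair start < stop, giving N*(N+1)/2 possible spans.

```python
def chart_spans(n_words):
    """Enumerate all (start, stop) spans of an n_words-long sentence,
    with interword positions numbered 0 .. n_words."""
    return [(start, stop)
            for start in range(n_words)
            for stop in range(start + 1, n_words + 1)]

spans = chart_spans(4)            # 'word1 word2 word3 word4'
assert len(spans) == 4 * 5 // 2   # the ten possible chart entries
```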
2. Relational abstractions
This section, based on [10] and [11], outlines some of the merits of relational abstractions as initially encoded in UD and later imported into ELU and EGLU. The purpose of abstraction mechanisms is essentially to permit various kinds of linguistic information to be encoded in a uniform manner, abstracting over this information in such a way that large descriptions remain tractable and can be developed by groups of linguists working in collaboration. All existing formalisms use such abstraction mechanisms: in PATR-II, for instance, abstraction is achieved by means of templates, in HPSG by types, in CUF by parametric sorts, etc.; in ELU/EGLU it is done by means of relational abstractions. The simplest form of abstraction is the PATR-II type of template. The example below provides such a template, abstracting the general properties of a transitive verb:

Let Transitive be
  Verb
  <subj cat> = NP
  <obj cat> = NP
  <subcat rest rest> = end
  <subcat first> = <obj>
  <subcat rest first> = <subj>
Let Verb be
  <cat> = v

The template Transitive includes information from another template, Verb, revealing a simple form of inheritance hierarchy, where Transitive inherits from the more general Verb. This kind of inheritance is directly related to the subsumption ordering over FSs. In order to permit more complex forms of inheritance, it would be necessary to define some kind of precedence relation over the templates (potentially introducing non-monotonicity). It should be noted that such a template can be seen as a unary macro whose hidden argument is the root of the FS containing it. The main problem with the expressive power of templates (even with a possible extension to parametric templates) is that they provide information only about the object in whose description they occur. The same kind of abstraction, if it could be evaluated in relation to a local object, would ensure much more descriptive power. In the basic formulation of HPSG [12], this is exactly what types take care of.
The HPSG types, besides being applicable to any local FS of a given FS, can be combined by logical operators (conjunction, disjunction and implication), which allows definitions of new types to be stated. The types (and the operators) form a lattice, with the top type being sign and the bottom being ⊥ (the absurd type). Negation is simply
expressed as the implication of the bottom type. Implications are mainly used to express the principles governing the composition of phrasal signs; usually the antecedent is simply specified by the typing information (implicitly universally quantified), while the consequent contains feature structures with potential variables interpreted existentially. For instance, the following definition from [12], describing the Head Feature Principle, should be read as: for any sign whose DTRS feature has a value of type headed-structure, there exists a sign which is the value of the HEAD feature of the mother sign and is shared with the HEAD feature of the head daughter sign.

[DTRS headed-structure] ⇒ [ SYNSEM|LOC|CAT|HEAD [1]
                            DTRS|HEAD-DTR|SYNSEM|LOC|CAT|HEAD [1] ]

Figure 1: Head Feature Principle

Since a type locally refers to a FS (contrary to templates), it is obvious that types in HPSG can be used recursively. However, from an implementation point of view, a recursive definition has to be somehow disjunctive in nature, otherwise its evaluation would never bottom out. In HPSG as a linguistic formalism the recursive use of types is kept implicit, but any implementation has to take it into account, which may be a source of serious efficiency problems [13]. The association of parameters with an abstraction is an extension of the notion of template or type. Parameters permit more abstract generalizations, leaving elements of internal structure to be supplied by arguments. In a fully extended form, with disjunctive and recursive definitions, they greatly add to the expressive power of the formalism, but they also raise certain implementation problems. Considering abstractions as relations makes it apparent that all kinds of abstraction can be thought of as taking an implicit argument; the difference between types and templates lies in the role that argument plays. In a template the implicit argument is the root of the current feature structure; in a type it is the node to which the type is applied. The initial version of relational abstractions, probably still one of the most general, was developed by Rod Johnson and M. Rosner [4] as part of the UD environment for experimentation with constraint-based grammars in translation research. The basic machinery of the UD environment was imported with minor modifications into ELU and later into EGLU. On the way from ELU to EGLU most of the original encoding was rewritten for better performance and portability (EGLU runs at least three times faster than ELU). The equations and calls of relational abstractions within the body of a relational abstraction are interpreted conjunctively.
Several definitions of relational abstractions with the same name are interpreted disjunctively. It is easy to demonstrate that relational abstractions are an extension of both macros and types: PATR-II macros can be encoded as zero-argument relations, and types can be encoded as single-argument relations. The previous PATR-II formulation of the transitivity abstraction can be written more simply in EGLU as follows:

Transitive
  !Verb
  <* subcat> = [Subj, Obj]
  <* subj> = Subj
  <Subj cat> = np
  <* obj> = Obj
  <Obj cat> = <Subj cat>
Verb
  <* cat> = v
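The conjunctive/disjunctive semantics can be sketched in Python (a toy model, not EGLU's machinery): a relational abstraction is a list of alternative definitions, each definition a list of constraints; within a definition every constraint must succeed, while the definitions themselves are tried as alternatives, as Prolog tries clauses.

```python
def call(clauses, fs):
    """Yield every feature structure produced by some definition (clause)
    all of whose constraints succeed on a copy of fs."""
    for clause in clauses:             # disjunction over definitions
        result = dict(fs)
        ok = True
        for constraint in clause:      # conjunction within one definition
            result = constraint(result)
            if result is None:         # a constraint failed: try next clause
                ok = False
                break
        if ok:
            yield result

def set_feat(feat, val):
    """A constraint unifying an atomic feature with a value."""
    def c(fs):
        if fs.get(feat, val) != val:
            return None                # clash with an existing value
        fs[feat] = val
        return fs
    return c

# Two definitions of a toy 'number' abstraction: invoked on an
# unspecified FS it enumerates both solutions; invoked on a plural FS
# only the second definition survives.
number = [[set_feat('num', 'sg')], [set_feat('num', 'pl')]]
```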
[ PHON likes
  SYNSEM|LOC|CAT|HEAD verb [TENSE present]
  SYNSEM|LOC|CAT|SUBCAT <[1] [2]>
  SYNSEM|LOC|CONT [RELN LIKE, LIKER [1], LIKED [2]] ]
main ∧ thrd-sgn ∧ strict-trans

Figure 2: HPSG sign for “likes”

Encoding as a relational abstraction the information expressed by the sign in Figure 2 is extremely simple:

likes
  !main
  !thrdsng
  !strict-trans
  <* phon> = 'likes'
  <* synsem loc cat head tense> = present
  <* synsem loc cat subcat> = [Obj, Subj]
  <* synsem loc cont reln> = 'LIKE'
  <* synsem loc cont liker> = Subj
  <* synsem loc cont liked> = Obj
General information is given in terms of types (main, thrdsng, strict-trans), expressed as zero-argument relations, and information specific to this FS is given in path equations, instead of a feature term. Since the types refer to the entire entry, they could be defined as zero-argument relations (templates). The type restrictions would have had the same effect if, instead of zero-argument relations, one used unary relations with the distinguished variable '*' as argument (remember that this refers globally to the FS containing it). Locality, when needed, can be ensured by unary relations, where the argument is just a variable identifying the FS referred to. The extension of HPSG's unary types as relations with a single argument demonstrates that a type can be defined solely in relation to one of its arguments, without explicit reference to the global context. This means that while types have more expressive power than templates, their relational extensions permit at least the same abstractions to be defined. The most striking extension of expressive power provided by the addition of parameters is the ability to make generalizations over syntactic rules. The feature structures associated with the daughters of a rule can be passed to an abstraction either as individual arguments or as a list. Dörre & Eisele [13] demonstrate that this permits a re-encoding of the HPSG principles without the need for a DTRS feature, because a generalization can be made about a sequence of feature structures without embedding them in the mother's syntactic representation. A similar observation was made in a more intuitive manner in [11], where HPSG syntactic representations were rejected on the basis of the size of the feature structures they generate. Consider the description of the Head Feature Principle in Figure 1. A simple-minded direct reformulation in terms of relational abstractions would be:

Head_Feature_Principle
  !headed-structure
  <* synsem loc cat head> = <* dtrs head-dtr synsem loc cat head>
This very simple definition is computationally inefficient, since it requires that information about the daughters be incorporated into the mother FS. This direct representation of the daughters within the mother's FS leads to very large representations with a high degree of redundancy. A better solution is to define a unary relation taking the head daughter FS as its argument:

Head_Feature_Principle (H)
   !head-daughter(H)
   <head> = <H head>

This way, the entire description of the daughters (and therefore the DTRS feature) is no longer needed within the mother FS. The next example further substantiates this point, describing a relation governing the combination of the head constituent with its complement in a binary rule (the subcategorization principle). The relation may easily be generalized (using list recursion) to cover n-ary rules:

Subcategorization-Principle (H, C)
   <subcat> = <H subcat> -- C

The first argument is the head daughter, the second the other daughter. The built-in operation over lists (--) removes an element from any position in a list and returns the remainder of the list. In a similar manner it can be shown how all HPSG principles could be encoded in terms of relational abstractions. In [11] several other very interesting examples are given of reformulating HPSG principles as relational abstractions (the semantic principle, a generalization of the subcategorization principle to deal with both complements and adjuncts, etc.), as well as further comparisons with LFG (the functional-uncertainty treatment of quantifier storage) and with parametric sorts as implemented in CUF [13]. It also discusses the implementation, in terms of relational abstraction, of several language engineering techniques: "gap threading", abbreviatory definitions of paths, rule lexicalization and many others.
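The relational encodings above can be sketched in Python over feature structures represented as dicts. This is a minimal illustration of the idea, not EGLU's actual syntax or unification engine; all helper and attribute names here are hypothetical:

```python
def unify(a, b):
    """Unify two feature structures (flat dicts of attribute-value pairs);
    returns None on a value clash - a minimal sketch of FS unification."""
    for k in a.keys() & b.keys():
        if a[k] != b[k]:
            return None
    return {**a, **b}

def list_minus(lst, elem):
    """A sketch of EGLU's built-in '--': remove one occurrence of elem
    from any position in a list and return the remainder."""
    i = lst.index(elem)
    return lst[:i] + lst[i + 1:]

def head_feature_principle(mother, head_dtr):
    """Share the HEAD value between mother and head daughter, without
    embedding the daughter itself in the mother FS."""
    return unify(mother, {"head": head_dtr["head"]})

def subcat_principle(head_dtr, complement):
    """Mother's subcat list is the head daughter's minus the realized
    complement (binary-rule version of the subcategorization principle)."""
    return {"subcat": list_minus(head_dtr["subcat"], complement)}

head = {"head": {"vform": "fin"}, "subcat": ["np_subj", "np_obj"]}
mother = head_feature_principle({"cat": "vp"}, head)
mother.update(subcat_principle(head, "np_obj"))
# mother now shares the daughter's HEAD and subcategorizes only for the subject
```

Note that the daughter FS is passed as an argument and only the shared information lands in the mother, which is exactly why the DTRS feature becomes unnecessary.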
3. Encoding linguistic knowledge in EGLU

Several language descriptions have been developed within the EGLU framework, embedded in applications ranging from toy reversible machine translation systems to wide-coverage lexical processors. The largest language description developed in EGLU so far is the feature-structure paradigmatic morpho-lexical description of Romanian [14]. This description completely covers the inflectional morphology of Romanian and deals with the most productive lexical affixes. The lexicon contains almost 35,000 lemma entries, from which, in accordance with the morphological description, more than 1 million word-forms can be interpreted or generated¹. Starting from the set of inflectional paradigms resulting from several learning sessions with our PARADIGM learning system [15, 16], in 1992 we began an exhaustive description of Romanian morphology in a neutral and reversible language (FAVR - Flat Attribute-Value Representation [17]), which does not commit itself to a particular formalism. The rationale for this representation was to ensure not only the formal accuracy required by computer applications, but also readability and easy migration to any computational representation.

¹ In estimating this number, we counted homographs and homonyms as individual word-forms; for instance, the string "vin" counts as three word-forms (wine, (I) come and (they) come).
As a consequence of the systematic tests applied over the last two years, the implementation of the morphological model has stabilized. Some errors and omissions were detected (mostly idiosyncratic paradigms with few lexical representatives, often regional or bookish forms). The corrections and additions made to the EGLU implementation have been carried over into the underlying FAVR description as well, which, to the best of our knowledge, constitutes the most complete computational linguistic resource for Romanian morphology. The aim of the FAVR modeling was to cover the inflectional (grammatical) morphology completely. However, once we decided to migrate the FAVR descriptions into the EGLU formalism, we went a step further and considered the derivational (lexical) processes too. The lexical suffixes (about 600 in Romanian) and the prefixes are very productive; our implementation deals only with the most common of them. According to our description, a word is an entity composed of three types of fundamental units: the heading, the stem and the ending. While the heading and the ending of a word may be empty, the stem is always present. The heading of a word is a sequence of zero or more prefixes which change the lexical meaning of the stem. The ending of a word is made up of zero or more lexical suffixes and a grammatical suffix which may be empty. The stem together with the first adjacent lexical suffix makes up a lexical theme. This theme, if followed by a new lexical suffix, makes up a new theme, and so on. When there is no lexical suffix in the structure of a word, the stem stands for what we call the implicit lexical theme. The generic structure of a word is described by the following regular expression, using the usual conventions (e.g. the Kleene star) and considering the empty suffix a proper grammatical suffix:

(prefix)* stem (lexical_suffix)* grammatical_suffix
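The word structure (prefix)* stem (lexical_suffix)* grammatical_suffix can be illustrated with a toy segmenter over a small, purely hypothetical English-like affix inventory. The real EGLU description works over Romanian affix lexicons with unification-based restrictions, not a bare regular expression:

```python
import re

# A toy affix inventory - illustrative only, not the EGLU lexicon.
# (prefix)* stem (lexical_suffix)* grammatical_suffix, with the empty
# grammatical suffix treated as a proper suffix ('s?' below).
PATTERN = re.compile(
    r"^(?P<heading>(?:re|un)*)"      # zero or more prefixes
    r"(?P<stem>do|load)"             # the stem is always present
    r"(?P<lexsufs>(?:able)*)"        # zero or more lexical suffixes
    r"(?P<gramsuf>s?)$"              # grammatical suffix, possibly empty
)

m = PATTERN.match("reloadables")
# heading='re', stem='load', lexsufs='able', gramsuf='s'
```

Each named group corresponds to one of the fundamental units: the heading, the stem and the ending (lexical suffixes plus grammatical suffix).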
From the morpho-lexical point of view, a prefix description is relatively simple: for every prefix, one specifies the grammatical category or categories of the stems it may attach to, as well as a series of restrictions meant to filter out illegal combinations. The suffixes are described through lexical or grammatical global paradigms. The global grammatical paradigms are composed of partial paradigms, which encode all the grammatical suffixes along with the appropriate information and restrictions. The partial paradigms stand for partial regularities in the declension or conjugation of the inflecting parts of speech. Finally, each stem is associated with one or more feature structures providing context-free grammatical and lexical information: the grammatical category or categories, the lemma form (associated with the implicit theme), and restrictions with respect to affixation (lexical or grammatical). In addition, the implicit themes provide syntactic information (for instance, subcategorization lists) and semantic information (theta-role structures, selectional restrictions). The feature structure for a word occurrence, that is, the output provided by the morpho-lexical analyzer or the input expected by the morpho-lexical generator, contains information congruently provided by the heading, the stem and the ending. The congruency of this information is ensured through feature-structure unification. Having analyzed the specific attributes of each part of speech, we used 20 grammatical categories for the morpho-syntactic description of Romanian. The classification took into account not only the requirements of morpho-lexical processing, but also the granularity necessary for syntactic parsing and generation. The grammatical and derivational morphology of Romanian is specified by means of several global paradigms, each of them a combination of partial paradigms.
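The congruency check via feature-structure unification can be sketched as follows; the attribute names and values are hypothetical illustrations, not the actual Romanian description:

```python
def unify(a, b):
    """Recursively unify two feature structures (nested dicts); raise on
    a value clash - a minimal sketch, not EGLU's graph unification."""
    if not isinstance(a, dict) or not isinstance(b, dict):
        if a != b:
            raise ValueError(f"clash: {a!r} vs {b!r}")
        return a
    out = dict(a)
    for k, v in b.items():
        out[k] = unify(out[k], v) if k in out else v
    return out

# Information contributed by the stem and by the ending of a word occurrence.
stem_fs   = {"cat": "n", "lemma": "copil", "gender": "masc"}
ending_fs = {"cat": "n", "number": "pl"}        # a nominal grammatical suffix

word_fs = unify(stem_fs, ending_fs)             # congruent: 'cat' agrees

# An incongruent combination (a verbal ending on a nominal stem) is rejected:
try:
    unify(stem_fs, {"cat": "v"})
    congruent = True
except ValueError:
    congruent = False
```

The same mechanism runs in both directions: in analysis the suffix constrains the stem's interpretation, in generation the requested word-form FS selects the congruent stem and ending.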
The verb morphology is encoded by means of 48 global grammatical paradigms, three of them specific to the auxiliaries. A global paradigm is made up of several partial paradigms, each corresponding to a simple mood and tense (there are 107 such partial paradigms). Besides the grammatical verbal paradigms (exhaustively encoded), we considered the 27 most frequent and productive lexical paradigms attaching lexical suffixes to verbal stems. These suffixes change the grammatical category of the verb, yielding nouns, adjectives and adverbs. The nominal declension has been covered by a set of paradigms applying to nouns, adjectives and numerals. The common noun declension has been modeled by means of 52 global inflectional paradigms: 14 for masculine nouns, 22 for feminine nouns and 16 for neuter nouns. With appropriate restrictions in the lexical entries, these paradigms may also be used for proper nouns; additionally, there are 7 paradigms specific to proper nouns. These global paradigms result from combinations of 78 partial paradigms corresponding to grammatical suffixes (providing information on gender, number, case and definiteness). There are also partial semi-lexical paradigms describing the formation of diminutives (47) and augmentatives (7), plus gender-changing suffixes (11). For the most productive proper lexical suffixes applying to nominal stems, we encoded 46 lexical paradigms. The declension of adjectives and of numerals which contextually become nouns or adjectives obeys the patterns of the inflectional paradigms for common nouns. The other inflecting parts of speech (the article, the indefinite pronouns and adjectives, the demonstrative pronouns and adjectives) have been modeled in terms of a smaller number of inflectional paradigms (19). Unlike the inflecting parts of speech, the non-inflecting ones (adverbs, prepositions, conjunctions, interjections) have flat feature structures with few attributes. That is why all the information specific to a given word is written in the appropriate entry of the lemma lexicon (where the lemma coincides with the occurrence form of the word).
Obviously, for such words the field of paradigmatic information is empty. Special cases are the personal pronoun and the possessive pronoun and adjective. Although these categories are inflecting (they have various forms depending on gender, number, person, and the strong-weak form distinction), they have been encoded as non-inflecting items. This decision was simply a matter of efficiency: they form a closed set, and isolating their stems turned out to be too expensive an operation. The lemmas are collected in a separate lexicon (in our implementation the lemma lexicon is split into several files). The lemma lexicon contains the basic information pertaining to any inflectional or derivational word-form related to a given lemma (this is the place for lexical semantics). A different lexicon, containing the stems, describes the morphology of each lemma. The coreferentiality between a lemma and a stem is given explicitly in the stems lexicon.
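The two-lexicon organization, lemma-level information in one lexicon and stems with explicit coreference to their lemma in another, can be sketched as linked tables; the entries and the paradigm name below are illustrative, not actual EGLU data:

```python
# Lemma lexicon: information valid for every word-form of the lemma
# (the place for syntactic and lexical-semantic information).
LEMMAS = {
    "face": {"cat": "v", "subcat": ["np_subj", "np_obj"]},
}

# Stems lexicon: morphology per lemma; the 'lemma' field makes the
# coreferentiality between stem and lemma explicit.
STEMS = [
    {"stem": "fac", "lemma": "face", "paradigm": "v-conj3"},
    {"stem": "făc", "lemma": "face", "paradigm": "v-conj3"},
]

def stems_of(lemma):
    """Collect all stems coreferential with a lemma, merging in the
    lemma-level feature structure."""
    return [{**s, **LEMMAS[s["lemma"]]} for s in STEMS if s["lemma"] == lemma]
```

Splitting the two lexicons keeps lemma-level information stated once, however many stem alternants a lemma has.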
4. Applications of the EGLU Lexicon

Given EGLU's morpho-lexical generation facility, we were able to expand the 35,000-lemma dictionary into more than 1,000,000 word-form feature structures. By means of a small set of filters we converted the graph representations of the word-forms provided by EGLU into pairs of the form <word-form, MSD>, where the MSD (Morpho-Syntactic Description) represents a flat linear encoding of the corresponding feature structure. The MSD specification was a concerted effort within MULTEXT-EAST to extend the EAGLES specifications to cover several Central and Eastern European languages (Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene; for a detailed specification see http://nl.ijs.si/ME). This word-form lexicon was developed as a deliverable for the MULTEXT-EAST Copernicus project; however, its use went beyond the initial purpose.
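Flattening a feature structure into a positional MSD string can be sketched as below, in the MULTEXT-EAST style (the first character encodes the category, each following position one attribute, '-' marking an unspecified value). The attribute table here is a small, assumed subset for illustration, not the actual specification:

```python
# Positional attribute order per category (illustrative subset only).
MSD_POSITIONS = {"n": ["type", "gender", "number", "case"]}

def fs_to_msd(fs):
    """Flatten a word-form feature structure into a positional MSD string,
    marking unspecified attributes with '-'."""
    cat = fs["cat"]
    codes = [(fs.get(attr) or "-")[0] for attr in MSD_POSITIONS[cat]]
    return cat.upper() + "".join(codes)

fs = {"cat": "n", "type": "common", "gender": "masculine", "number": "plural"}
msd = fs_to_msd(fs)    # 'Ncmp-': common masculine plural, case unspecified
```

The conversion is lossy by design: the flat code is compact enough for corpus annotation, while the full feature structure remains available in the EGLU lexicon.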
Within its Research Working Group, the TELRI Copernicus Concerted Action constructed a multilingual corpus consisting of translations of Plato's "Republic". The text has been segmented, lemmatized and manually morpho-syntactically annotated. The option for manual annotation was motivated by our intent to use this text (together with other texts from different registers, all of them amounting to almost half a million words of hand-tagged text) as a test-bed for training and evaluating a probabilistic tagger. In the manual annotation of the "Republica" text (more than 130,000 words), the EGLU morphology and lexicon again proved extremely useful. The words missing from the MULTEXT-EAST word-form lexicon but appearing in "Republica" (about 1,350) were fed into the EGLU lexicon, and for these new words we generated, as described above, a number of new entries for the word-form lexicon. The new lexicon was then used as the main language resource for tokenizing the text of "Republica" and MSD-tagging each lexical token. From the lemmatized and morpho-syntactically annotated text (not only "Republica", but the whole corpus) we extracted several interesting statistics on word frequencies, MSD ambiguity classes, etc., which hopefully will give hints for designing a proper tagset for automatic tagging of Romanian. The full word-form lexicon was also the basis for building a Romanian spell-checker in the context of Ispell for UNIX. This spell-checker works with several encodings of the Romanian diacritics: SGML entities, ISO-LAT2 8-bit codes and a TeX-like notation.
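The MSD ambiguity classes mentioned above can be computed by grouping the word-form lexicon by form and collecting the MSDs each form can bear. The triples below are a toy sample with merely indicative MSD codes, not entries from the real lexicon:

```python
from collections import defaultdict

# Toy <word-form, lemma, MSD> entries; 'vin' really is three-ways
# ambiguous (wine / (I) come / (they) come), as noted earlier.
LEXICON = [
    ("vin",  "vin",  "Ncms-n"),
    ("vin",  "veni", "Vmip1s"),
    ("vin",  "veni", "Vmip3p"),
    ("casa", "casă", "Ncfsry"),
]

def ambiguity_classes(lexicon):
    """Map each word-form to the sorted list of MSDs it can bear."""
    classes = defaultdict(set)
    for form, _lemma, msd in lexicon:
        classes[form].add(msd)
    return {form: sorted(msds) for form, msds in classes.items()}

classes = ambiguity_classes(LEXICON)
```

The distribution of such classes over the corpus is exactly the kind of statistic that guides the collapsing of MSDs into a smaller tagset for probabilistic tagging.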
References

[1] Shieber, S. M. - "An Introduction to Unification-Based Approaches to Grammar", CSLI Lecture Notes No. 4, 1986.
[2] Shieber, S. M. - "A Uniform Architecture for Parsing and Generation", in Proceedings of the 12th International Conference on Computational Linguistics, Budapest, Hungary, 1988.
[3] Dymetman, M., Isabelle, P. - "Reversible Logic Grammars for Machine Translation", in Proceedings of the Second International Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages, Pittsburgh, 1988.
[4] Johnson, R., Rosner, M. - "A Rich Environment for Experimentation with Unification Grammars", in Proceedings of the Fourth Conference of the European Chapter of the Association for Computational Linguistics, Manchester, 1989.
[5] Russell, G., Ballim, A., Carroll, J., Warwick-Armstrong, S. - "A Practical Approach to Multiple Default Inheritance for Unification-Based Lexicons", Computational Linguistics, Special Issue on Inheritance II, vol. 18, no. 3, September 1992.
[6] Estival, D., Tufiş, D., Popescu, O. - "Développement d'outils et de données linguistiques pour le traitement du langage naturel", Rapport Final - Projet EST, ISSCO, Geneva, 1994.
[7] Estival, D. - "ELU User Manual", ISSCO, Geneva, October 1990.
[8] Russell, G. - "ELU User Notes", Parts I-V, ISSCO, Geneva, December 1991.
[9] Shieber, S., van Noord, G., Moore, R. C., Pereira, F. C. N. - "A Semantic-Head-Driven Generation Algorithm for Unification-Based Formalisms", in Proceedings of the 27th Annual Meeting of the Association for Computational Linguistics, Vancouver, 1989.
[10] Johnson, R., Rupp, C. J. - "Evaluating Complex Constraints in Linguistic Formalisms", Technical Report, IDSIA, December 1991.
[11] Rupp, C. J. - "Abstraction Mechanisms in Constraint-Based Formalisms", Technical Report No. 6, IDSIA, February 1993.
[12] Pollard, C., Sag, I. A. - "Information-Based Syntax and Semantics: Volume I, Fundamentals", CSLI Lecture Notes No. 12, 1987.
[13] Dörre, J., Eisele, A. - "A Comprehensive Unification-Based Grammar Formalism", DYANA Report No. R3.1.B, Centre for Cognitive Science, Edinburgh, 1991.
[14] Tufiş, D., Diaconu, L., Barbu, A. M., Diaconu, C. - "Morfologia limbii române, o resursă lingvistică reversibilă şi reutilizabilă", in D. Tufiş (ed.) Limbaj şi Tehnologie, Editura Academiei Române, Bucureşti, 1996.
[15] Tufiş, D. - "It Would Be Much Easier if WENT Were GOED", in Proceedings of the 4th Conference of the European Chapter of the Association for Computational Linguistics, Manchester, U.K., 1989.
[16] Tufiş, D. - "Paradigmatic Morphology Learning", Computers and Artificial Intelligence, Vol. 9, No. 3, Bratislava, 1990.
[17] Tufiş, D. - "The FAVR Description of Romanian Morphology", Version 2, Research Report No. 7, Center for Research in Machine Learning, Natural Language Processing and Conceptual Modeling, Bucharest, 1995.