A normalized syntactic lexicon for Arabic verbs ...

A normalized syntactic lexicon for Arabic verbs Evaluation within the LKB platform Abstract Lexicon reusability in several formalisms and frameworks is known to be among the most complicated issues in natural language processing. In this paper, we present a syntactic lexicon of Arabic verbs based on the Lexical Mark-up Framework. This framework is a new ISO standard created to enable versatile lexicon description and NLP system interoperability. We describe the syntactic information associated to verbs and the model we adopt to structure and represent the lexicon. To prove the usability of the resource, we implemented a rule-based system that translates it into Type Description Language. The generated lexicon is tested with a previously written HPSG grammar for Arabic built within the Language Knowledge Builder platform. Finally, we discuss experimental results and further improvements.

Keywords: Syntactic lexicon for Arabic, Lexical Mark-up Framework, Transformation rules, TDL, HPSG, LKB

1. Introduction Natural Language Processing (NLP) depends on computational lexicons to retrieve linguistic knowledge. It is largely accepted that the quality and coverage of the lexicon affect the output quality of NLP applications. A lexicon can include morphological, syntactic or semantic information, or a combination of two levels like morpho-syntactic lexicons. Computational lexicons that encode syntactic information are known to be difficult to construct. Such a resource provides specifications of syntactic behaviour like surface properties, subcategorization frames, argument realizations and morphology syntax interaction. This kind of information is very useful especially for grammar parsers. The construction of such resources with the required quality and coverage is time and effort consuming. This is due to the absence of dictionaries or existing lexical databases from which we can extract the syntactic knowledge. This kind of resources is not yet available for less studied languages like Arabic. On the other side, the NLP community is using a plethora of grammatical theories and representation formats making it difficult to share the valuable resources. In effect, diversity in lexicon representations does not encourage diffusion and exchange between different communities and tend to be an obstacle to build large coverage lexicons. In such situation, the need for a standardization effort is evident and the adoption of a standard will enable NLP application interoperability and ease of maintenance of large lexicons. Evaluation and sharing of results provided by NLP applications is not an easy task because the majority of works generally use proprietary formats and descriptions rather than normalized ones. Facing those difficulties, the NLP community, interested in all kinds of normalization initiatives, defended the idea that it was necessary to have a generic model for monolingual and multilingual lexical data. ISO-24613: 2008 Lexical Markup Framework, a recent normative standard for lexical resources description, was an answer to this situation. In this paper, we identify and discuss possible syntactic information that can be embedded in the syntactic lexicon of verbs in Arabic, based on the syntactic extension of LMF. Then, we present a system of rules enabling the transformation of the LMF compliant lexicon into TDL (Type description Language), a typed feature-based language and inference system, which is specifically designed to support highly, lexicalized grammar theories like HPSG. Finally we discuss the main results and issues that we faced in this work.

2. Related work Although the fact that syntactic lexicons are valuable resources for NLP systems, the available lexicons are very limited in number. For English, COMLEX Syntax (C. Macleod & al., 1994) includes detailed subcategorization frames of 6 000 verbs. VERBNET provides 4 000 verbal senses from 191 semantic classes and 52 syntactic frames. For French, many electronic lexicons are available but the majority includes morphological information rather than syntactic information. The LEFFF lexicon (Benoît Sagot 1

& al. , 2006) includes 6798 verb lemmas manually constructed and associated information is partially syntactic. The majority of syntactic resources for French are based on the Maurice Gross Lexicon-grammar (Maurice Gross, 1975; Maurice Gross, 1994), which is a detailed description of syntactic properties of verbs, predicative nouns and adjectives. It contains pools of tables. Each pool is associated with a specific syntactic category (verbs, auxiliary verbs, nouns, etc.). Within a group, a table describes some syntactic constructions and includes all words involved in those constructions. The Lexical Mark-up Framework or LMF provides a common, standardized framework for the construction of computational lexicons. As well, LMF ensures the encoding of linguistic information in a way that enables reusability in different applications and for different tasks (Gil Francopoulo, 2005). It is composed of a core and eight extensions. According to ISO 24613 (Data Category Registry DCR), LMF is decorated by data categories selected from a register regrouping the whole data categories (DCR) or added by NLP community. Several works addressed the task of modelling and building a standard LMF compliant reusable lexicon for Arabic. In (Faten Bakkar & al. , 2008), the authors attempted to model and instantiate the morphology of different Arabic POS based on LMF morphological extension. Many contributions were proposed to cope with Arabic, especially the inherent notion of "root". Syntax modelling was explored in (Noureddine Loukil & al. , 2007) and (Noureddine Loukil & al. , 2008) where the authors investigated the possible content and structure for the syntactic lexicon of Arabic verbs. Those efforts were consolidated by applicative research to enable the test and evaluation of proposed models. (Héla Fehri & al., 2006) studied the theoretical equivalence between the LMF lexicon and the HPSG lexicon and proposed a rule based system to automate the conversion from LMF to HPSG attribute value matrixes.

3. Arabic LMF syntactic lexicon LMF provides an extensible architecture that is relevant for modelling both Machine Readable Dictionaries and NLP lexical resources. LMF models consists of a core model and zero, one or more lexical extensions, along with a set of data categories used in specifying the model. The actual representation of an LMF compatible lexicon is based on XML. As presented in figure 1, the core model is a hierarchical structure consisting of multiple components. The “Lexical Entry” component represents an actual elementary lexical entry. The “Syntactic Behaviour” component provides a representation of all the possible syntactic constructions of a lexical entry. Finally, the “Sense” component provides a semantic description, which can be divided into multiple senses.

3.1 Syntax modelling The Syntax extension specification is given in figure 2 and is built around the “Syntactic Behaviour” component. A “syntactic behaviour” is a syntactic formation pattern, which may be adopted by several lexical entries to capture syntactic redundancy in the lexicon. A syntactic behaviour is described by the set of permitted syntactic formations eventually grouped in semantically disjoint subsets. A “Subcategorization frame” represents the set of possible syntactic constructions associated to a predicate and actually realized by the combination of several complements or positions. Within it, possible instances of the positions can lead to syntactically correct phrases. In other words, a “Subcategorization frame” can be seen as valence pattern providing a specification about the order and the nature of permitted positions instances. A lexical entry may have several frames providing each several mandatory or optional positions. Each position proposes possible realizations and their morphological-syntactic descriptions given within the “Syntactic Argument” component. The “Lexeme Property” component describe syntactic features special to the lexical entry like tense and mood. In our model, we assume simply that an acceptable syntactic formation for a given verb is embedded in a subcategorization frame (SF). An SF consists of an ordered list of the arguments required by the verb, and a set of constraints on those arguments like information about complement introducers. Thus, we describe, firstly, the type and the number of arguments, and secondly, the types of constraints associated with arguments. Arabic is a Semitic language characterized by a rich morphology and a relatively free word order. The extensional description of the lexicon is a very practical solution to deal with the complexity of morphology. Thus, we can avoid the compilation process common to intensional lexicons in order to 2

obtain the extensional form used by parsers. While using the extensional model, the redundancy inherent to the lexicon can be captured by capturing common descriptions and behaviours

Figure 1: LMF core model

Figure 2: LMF syntax extension

A verb has two voices: active and passive with some exceptions of verbs with no passive form. Arabic verbs have five moods: indicative, subjunctive, jussive, imperative and energetic. The syntactic constructions accepted by a certain verb depend on its meaning, voice and mood. The verb in Arabic subcategorize for one, two or three required complements, no more. The complement may be direct, i.e. a simple noun, or introduced by a preposition particle from the finite set {min, ila, bi, ala, an,fi, lem}. In the case that a verb is transitive by particle, it selects which particle can be used. For instance the verb “taharraka” (to move) accepts only the particles {ala, bi, min, lem}. This constraint is described by the “restriction” property that states selected particles for a particular verb. Arabic shows two different kinds of sentence formations: Verbal and nominal. Nominal sentences adopt the SVO word order and are less frequent. Verbal sentences uses the VSO word order that is more practiced. The fact that each verb can be invoked in SVO or VSO construction varieties is modelled in LMF by using two subcategorization frame sets. The first states the possible SFs in verbal sentences. The second states the possible SFs in nominal sentences. Our lexicon is built semi automatically using the editor Lexus, a generic tool for the creation of LMF compliant lexicons. Lexus offers a web interface and takes as input the structure of the lexicon and the information for entries. It generates a lexicon compatible with LMF in XML format. The lexicon structure is based on the verbal type hierarchy described in (Loukil, 2008) and (Boukedi, 2009). We considered also proposals on the structure for Arabic LMF lexicon (Loukil, 2007). The followed method has 3 steps: firstly, we specify the subcategorization frames accepted by verbs in Arabic. This is done manually and offline. Then we edit those SFs (17 in our lexicon), as those of figure 3 and figure 4, with the Lexus editor, which perform a compatibility check of the proposed structure with LMF. Finally, we edit Arabic verb lemmas and we affect one or many SFs to each entered verb. The lexicon contains 2500 verb lemmas with an average of 2 SF per verb. ”إتّ في حاههتحرّك‬

”صوبه تحرّك‬

Figure 3: SF example 1

Figure 4: SF example 2 3

3.2 Type Description Language TDL (Krieger & al., 1994) is a specification language based on type inheritance. The general syntax to define a TDL type is: Type:=body {, option}* where Type is a symbol representing the type name to define. Subtypes are defined by: Subtype:=Supertype & constraints, where Subtype is a type that inherits from the ancestor type Supertype. In a TDL lexicon, entries are encoded as types. Each type is represented by an attribute value matrix (AVM) in which, value can be an atom, a feature structure, an index, a list or even a functional constraint. In case where no value has been specified for a certain attribute, it will have an empty feature structure as a value. Figure 5 shows the TDL specification for the verb “jalasa” (to sit). Figure 6 introduce a new TDL description for the same verb with syntactic information inherited from the type intransitiveVerb. JALASA_verb:= Lex-verb & IntransitiveVerb [PHON , SS [LOC [CAT [HEAD [VFORM complet, RADICAL trilitère, MOJARRAD +, ROOT ‫س ل ج‬,

JALASA_verb:= Lex-verbe & [PHON , SS [LOC [CAT [HEAD [VFORM complet, RADICAL trilitère, MOJARRAD +, ROOT ‫س ل ج‬,

PATTERN ‫فعل‬,

PATTERN ‫فعل‬,

ASPECT accompli, PERSON 3]]]].

ASPECT accompli, PERSON 3]]]].

Figure 6: TDL specification with Syntax inherited from Intransitive verb

Figure 5: TDL specification of the verb “jalasa” (to sit)

With TDL, it is intuitive to bring together general information in super types and let more particular types inherit from them. Following this approach, we can have a lattice of types that is mandatory to release lexicons from redundancy. Consequently, the first stage in building a new TDL lexicon is the establishment of a type hierarchy. Figure 7 depicts a portion of the type hierarchy developed for the lexicon. In order to use and test LMF compliant prototype lexicons, we have to transform it to TDL. This description can be used in the Language Knowledge Builder (Anne Copestake, 2002) system, which is used to test and evaluate previously written HPSG grammar for Arabic. Before dealing with the identification of transformation rules, it is mandatory to study TDL language specificities. Indeed, TDL allows a user to define (possibly recursive) hierarchically ordered types, consisting of types and feature constraints. Thus, constraints on types are propagating from super types to subtypes. Verb Transitive Transitive1

Intransitive

Transitive2

Transitive3 Transitive3 gen

Transitive1 gen

Transitive2 gen

Transitive1 acc

Transitive 3 acc

Transitive2 gen acc Ttransitive2 acc

Figure 7: portion of the verb type hierarchy 4

3.3. Lexicon transformation rules In this section, we discuss a set of rules allowing the transformation from an LMF compliant representation to a TDL equivalent description. We notice that most of the information present in LMF possesses its own correspondent in TDL. The identified rule system is based on the correspondence between LMF and TDL. Every LMF lexical entry is described in TDL using features structures and every LMF data category will be translated into a TDL attribute. In order for the system to function correctly, there is an offline step that has to be fulfilled. In fact, we have to construct TDL types for each syntactic behavior along with the general TDL type of the verb. So, it is generally a one to one correspondence. Each rule possesses a determined structure as illustrated in figure 8. Rule

IN [class1 [… [Class n [attribute value]]]] OUT [TDLType value || SS [… [Feature value]]]]]

Figure 8: General structure of a rule A transformation rule is composed of two parts: The first one titled IN contains the path to reach the LMF attribute value. The second part titled OUT contains two parts joined by the symbol “||”. The first is a simple feature structure represents a TDL type and denotes the corresponding type in TDL. The second is a complex feature structure. This structure has the corresponding path of the feature whose value will be affected by the transformation. To illustrate that, we give the following rule: R1

IN [Lexicon [LexicalEntry [PartOfSpeech 1 ]]]] OUT [SS [LOC [CAT [TETE [MAJ 1 ]]]]]

In rule R1, the data category PartOfSpeech corresponds to MAJ in TDL. Indeed, PartOfSpeech gives the grammatical category of the lexical entry, in the same way for the feature MAJ in TDL. PartOfSpeech and MAJ represent the same morphological information. Indeed, if PartOfSpeech possesses the value verb, so MAJ takes it also. We define also specific rules for the syntactic features transformation. The following rule illustrates this type of rules. R2

IN [Lexicon [LexicalEntry [SyntacticBehavior [Id 2 ]]]] OUT [TDLTYPE 2 ]]]]]

Rule 2 states that for a given lexical entry, the id (identifier or name) of a syntactic behaviour is translated in TDL by forcing the TDL AVM of the lexical entry to inherit from a TDL type of the same name. This TDL type is just the offline constructed equivalent of LMF “SyntacticBehavior” Component. The overall transformation system is based upon the specifications of 15 rules.

4. Evaluation and discussion In order to verify the correctness and the usefulness of the transformation tool, the semi-automatically created lexicon of Arabic verbs is transformed to a TDL lexicon. The generated lexicon is evaluated qualitatively by the LKB system: A parsing system according to the HPSG theory and TDL as a representation language for lexical input. (Sirine Boukedi, 2009) proposed a type hierarchy and a minimal HPSG grammar for Arabic running on the LKB system. We use this grammar to compare a manually sample edited HPSG lexicon with its equivalent automatically generated one. To test our HPSG grammar, we used a corpus of 205 transliterated sentences. This corpus was created from a lexicon of 781 words. It covers various structures of nominal and verbal sentences. Results are presented in table 1. In the original test, 85% of sentences were analyzed correctly. The failure cases (0 analyzes) are due to the absence of rules treating some particular syntactic phenomena (i.e., relative phenomenon, coordination phenomenon). The ambiguous cases (2 analyze) are due to a no precise specification of the constraints specification of some syntactic rules. The same experiment with the new 5

generated lexicon shows a slight progress in parsing results due to the introduction of syntactic information. The quality of the generated lexicon is thus approved to be as useful as manually created lexicons. # derivation trees

Analyzed Sentences (handwritten lexicon)

Analyzed Sentences (generated lexicon)

0

25

17

1

175

190

2

5

8

205

205

Table 1: Parsing results The assessment in terms of acceptability and parsing results has shown a comparable quality in TDL entries. The results are very encouraging. In fact with a system of 32 rules, we can transform the entire lexical information in the LMF lexicon to its TDL correspondent. Tests with the LKB system do not seem to show parsing improvements. But we think this is because the HPSG grammar does not contain specifications that encompass sufficient syntactic phenomena. Changing the rules without touching the source code can easily modify the transformation system. Such flexibility is obtained thanks to the modular design of the system.

5. Conclusion In this paper, we presented a transformation system enabling the generation of a TDL lexicon from an LMF representation. A system of rules has been proposed following a study depended of the two formalisms. The system will be a very useful tool in settling different applications as the Arabic syntactic analysis constitutes an essential stage or even in the linguistic processing, the translation, the grammatical correction, the extraction of information or the applications of questions-answers type. Many improvements are planned like extending the rule system to cope with semantic information and enhancing the lexicon with more syntactic information like diathesis and morphology-syntax interaction.

6. References Faten Bakkar and Aida Khemakhem and Bilel Gargouri and Kais Haddar and Abdelmajid Benhamadou. 2008. Modélisation normalisée LMF des dictionnaires électroniques éditoriaux de l’arabe. In Proceedings of the Traitement Automatique des Langues Naturelles (TALN), Avignon (France). Yahia Bahou and Lamia Belguith and Abdelmajid Benhamadou. 2006. Adaptation et implémentation des grammaires HPSG pour l’analyse de textes arabes non-voyellés. In the 15th Convention French-speaking AFRIF-AFIA Recognition of the shapes and Intelligence Artificial RFIA 2006, January 25-27, Tours/France. Benoît Sagôt and Leonel Clément and Eric De La Clergerie and Pierre Boullier. 2006. The Lefff 2 syntactic lexicon for French: architecture, acquisition, use. In Proceedings of the Language Resources and Evaluation Conference (LREC), Genua (Italy). Sirine Boukedi and Noureddine Loukil and Kais Haddar and Abdelmajid benhamadou. 2009. The experimentation of a HPSG grammar for the Arabic language on the LKB system. In proceedings of the 3rd International Conference on Arabic Language Processing (CITALA’09) sponsored by IEEE Morocco Section, 4-5 May 2009, Rabat, Morocco, pp 43-49. Gil Francopoulo. 2005. Lexical Markup Framework (LMF = ISO 24-613). INRIA-Loria. Krieger H-U., and Schäfer U. (1994). TDL: A Type Description Language for HPSG. Part 2: User guide. Technical Reports, German's research center for artificial intelligence, Saarbrücken, Germany. Noureddine Loukil. 2006. A proposition of representation normalized of the lexicons of the unification grammars. In Rencontre des Étudiants Chercheurs en Traitement Automatique des Langues (RECITAL), Louvain. Belgique. Noureddine Loukil and Kais Haddar and Abdelmajid Benhamadou. 2007. Normalisation de la représentation des lexiques syntaxiques arabes pour les formalismes d’unification. In the proceedings of the Lexicon Grammar Conference. Pages 89--95. Corsica (France). Maurice Gross. 1975. Méthodes en syntaxe. Hermann, Paris Maurice Gross. 1994. Constructing Lexicon-Grammars. In Computational Approaches to the Lexicon, pages 213-263, Oxford. Oxford University Press. Carl Pollard and Ivan Sag. 1994. Head Driven Phrase Structure Grammars, CSLI Notes. University Press. Chicago. Laurent Romary. 2001. Un modèle abstrait pour la représentation de terminologies multilingues informatisées TMF-Terminological Markup Framework. In the Acts of the convention Gutenberg, May 14-17, (pp81--88), University of Metz.

6

C. Macleod and R. Grishman, and A. Meyers. 1994. Comlex syntax : Building a computational lexicon. In the Proceedings of COLING 1994, pages 268–272. Noureddine Loukil and Kais Haddar and Abdelmajid Benhamadou. 2008. Towards a syntactic lexicon of Arabic verbs. Arabic & Local Languages workshop- 6th Language Resources and Evaluation Conference, Marrakech, Morocco. Héla Fehri and Noureddine Loukil and Kais Haddar and Laurent Romary and Abdelmajid Benhamadou. 2006. Un système de projection du HPSG arabisé vers la plate-forme LMF. In Journées d’Études pour le Traitement Automatique de la Langue Arabe (JETALA 2006). Morocco. Anne Copestake. 2002. Implementing Typed Feature Structure Grammars, CSLI Publications, Stanford University.

7