A MATRIX REPRESENTATION OF THE INFLECTIONAL FORMS OF ARABIC WORDS: A STUDY OF CO-OCCURRENCE PATTERNS

H.E. Mahgoub, M.A. Hashish
IBM Cairo Scientific Centre
56 Gameaat Al Dowal Al Arabeya Street
Mohandessen, Giza, Egypt

A.T. Hassanein
Arabic Department, American University in Cairo

ABSTRACT

A proposed "Matrix" method for the representation of the inflectional paradigms of Arabic words is presented. This representation results in a classification of Arabic words into a tree structure (Fig(1)) whose leaves represent unique conjugational or derivational paradigms, each represented in the proposed "Matrix" form.

For example, if we consider the root ك ت ب (K T B), we can form words such as أكتب (?aKTuB - I write) and كتاب (KiTa:B - book), by subjecting the root to various "forms" or "moulds" and by undergoing certain morpho-phonemic (and possibly also morpho-graphemic) changes. For a full discussion of traditional Arabic morphology, see the references. In this paper, we shall define such an inflected form to be a "STEM".

A study of about 2,500 stems from a high frequency Arabic wordlist due to Landau has revealed a systematic set of co-occurrence patterns for the enclitic pronouns of Arabic verbs and for the possessive pronouns attached to Arabic nouns. Each co-occurrence pattern represents a subcategorization frame that reflects the underlying semantic relationship.

Thus a stem may contain infixes and certain prefixes which are part of the "mould" but may not contain any suffixes. Suffixes for verbs are subject and object pronouns, while for nouns they are possessive pronouns.

One further definition which is used in the proposed representation is the "Core"; this is simply the inflected form with all prefixes and suffixes stripped off. The core may or may not be a valid word.

The key feature that distinguishes these semantic patterns has been observed to be whether the attached suffixes relate to the animate or inanimate. In some cases for verbs, the number of the subject is also a significant feature. These semantic features also extend to non-attached subjects and objects (for verbs) and to possessive noun complements (for nouns). Therefore the semantic classes presented in this paper also assist in syntactic/semantic analysis. The first application that was developed, based upon the proposed representation, is a stem-based Arabic morphological analyser, from which a spell checker (on a PS/2 microcomputer) emerged as a by-product. Currently, the system is being used to interact with an Arabic syntactic parser and there are plans to use it in a machine assisted translation system.

In comparison with other work in the area of traditional Arabic morphology, where the concern is with the rules which cause the inflected form to be derived from the ROOT, we have studied the rules governing the derivation of all possible inflected forms from the STEM, as defined above.

2. THE MATRIX REPRESENTATION

Sample "MATRIX PARADIGMS" are shown in Fig(2) for verbs and Fig(3) for nouns. Table(1) gives the keys in English to the columns of the Matrix Paradigms. The inflected form for a given Person/Number/Gender/Mode combination for verbs (obtained from the relevant "row" of the Matrix Paradigm) is constructed by concatenating the prefix, core and both subject and object pronoun column entries. The inflected forms for nouns are similarly constructed for a particular Number/Gender/Case combination.

1. INTRODUCTION

Over the past few years there has been a marked increase in the use of computers in the Arabic speaking countries. Many applications programs in Arabic have been developed, but the field of computational linguistics is relatively new in Arabic and presents a unique challenge, due to the highly inflected nature of the Arabic language.

The various "cells" of the object pronoun columns indicate whether a particular entry is valid (indicated by "١", an Arabic numeral one). Invalid entries are indicated by "٠", an Arabic zero. It is due to this matrix of ones and zeros that the representation was named the "Matrix Paradigm".
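The construction of inflected forms from one row of a Matrix Paradigm can be sketched roughly as follows. This is only an illustration under stated assumptions: the transliterated morphemes, the segmentation into prefix/core/subject pronoun, the object-pronoun column inventory and the ones and zeros are all hypothetical and are not taken from Fig(2) or Table(1). The sketch also assumes that the form without any object pronoun is always available.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical object-pronoun columns (transliterated). The real column
# inventory is defined by the paper's Table(1), which is not reproduced here.
OBJECT_PRONOUNS = ["ni:", "ka", "hu", "ha:"]


@dataclass
class Row:
    """One row of a Matrix Paradigm (one Person/Number/Gender/Mode combination)."""
    prefix: str
    core: str
    subject_pronoun: str
    object_cells: List[int]  # 1 = valid cell, 0 = invalid cell, one per column


def inflected_forms(row: Row) -> List[str]:
    """Concatenate prefix + core + subject pronoun, then each valid object pronoun."""
    base = row.prefix + row.core + row.subject_pronoun
    forms = [base]  # assumption: the form with no object pronoun is always valid
    for pronoun, valid in zip(OBJECT_PRONOUNS, row.object_cells):
        if valid:
            forms.append(base + pronoun)
    return forms


# Illustrative row only; the segmentation and the cell values are made up.
row = Row(prefix="ya", core="ktub", subject_pronoun="u", object_cells=[0, 0, 1, 1])
print(inflected_forms(row))
```

Storing the validity cells per row keeps the co-occurrence information separate from the core, which matches the two independent kinds of variation the paper attributes to each Matrix Paradigm.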

In the present work, we have attempted to represent the morphological rules governing the inflections of Arabic words in a compact form which can simplify the processing of Arabic words by computers and which is independent of any particular application. There have been other attempts to show the conjugations of Arabic verbs, but the treatment does not go into sufficient depth and not all enclitics, which are an essential part of Arabic verbs, are considered. Moreover, that treatment does not extend to nouns.

3. TAXONOMY OF ARABIC WORDS

Fig(1) shows a tree diagram representing the taxonomical classification of Arabic verbs and nouns. The different "levels" in the tree correspond to different types of variation of the inflected form from one class to another. The first type of variation coincides more or less with the traditional classification and is represented at levels 2 and 3 for verbs and at level 2 for nouns.

By studying some 2,500 stems out of a high frequency Arabic wordlist due to Landau, certain systematic co-occurrence patterns governing verb enclitics and noun possessive pronouns have been observed. These patterns are what we call "Matrices" in this paper; each unique "Matrix" reflects a different semantic behaviour.

Each Matrix Paradigm also reflects two further types of variation, which can be considered separately from one another. The first is the variation in the core with the different rows; this dimension corresponds, for example, to the traditional study of verb conjugations (see the references).

To summarize Arabic morphology in a nutshell, about 80% of Arabic words can be derived from a sequence of three letters, called a triliteral root.

The other type of variation is that in the distribution of the Matrix of ones and zeros, which is essentially a variation in the co-occurrence of object pronouns (for transitive verbs) and possessive pronouns (for nouns). This variation is reflected at level 4 of the taxonomy. In the following sections 3.1 and 3.2, we will discuss the study of these co-occurrence patterns in more detail for verbs and nouns separately.

(A) No possessive pronouns can be attached.
(B) All possessive pronouns can be attached.
(C) Only possessive pronouns related to inanimate (set -H) can be attached.
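The three noun classes (A), (B) and (C) can be read as a simple lookup from class code to the set of possessive pronouns that may attach. The following is a minimal sketch of that reading; the transliterated pronoun inventory and the membership of the -H set are placeholders chosen for illustration, not the paper's actual sets.

```python
# Hypothetical, transliterated pronoun inventories (not taken from the paper).
ALL_POSSESSIVE = frozenset({"i:", "ka", "ki", "hu", "ha:", "na:"})
NON_HUMAN = frozenset({"hu", "ha:"})  # assumed stand-in for the -H set


def allowed_possessives(noun_class: str) -> frozenset:
    """Return the possessive pronouns permitted for a noun class code."""
    if noun_class == "A":   # no possessive pronouns can be attached
        return frozenset()
    if noun_class == "B":   # all possessive pronouns can be attached
        return ALL_POSSESSIVE
    if noun_class == "C":   # only the -H (inanimate) pronouns can be attached
        return NON_HUMAN
    raise ValueError(f"unknown noun class: {noun_class!r}")


def can_attach(noun_class: str, pronoun: str) -> bool:
    """True if the given possessive pronoun may attach to a noun of this class."""
    return pronoun in allowed_possessives(noun_class)
```

Because -H is a subset of the full pronoun set, class (C) always permits a subset of what class (B) permits.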

An additional study was made to determine what Number/Gender (NG) combinations are valid for a particular noun stem. These have been found to be an important feature of Arabic nouns, as not all NG combinations are valid for a stem. Each stem needs to be examined separately and this information is put into the lexicon of stems. The NG combinations are represented at level 3 of the taxonomy, for nouns (see Fig(1)).

3.1 CO-OCCURRENCE PATTERNS FOR VERBS

On examination of the Landau high frequency wordlist, the following features seemed to distinguish classes of verbs from one another:

Although there is no systematic, theoretical method for determining all the different NG combinations needed for comprehensive coverage of nouns, examining more and more nouns from Landau's wordlist led to some form of convergence. For the 2,500-stem shortlist, there were only 17 NG combinations.
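One way to picture how NG information might sit in a stem lexicon is sketched below. It assumes that an "NG combination" is the set of Number/Gender cells that are valid for a stem, so that many stems share the same pattern; this reading, the stem names and the patterns themselves are assumptions made for illustration only.

```python
from itertools import product

NUMBERS = ("singular", "dual", "plural")
GENDERS = ("masculine", "feminine")
ALL_NG_CELLS = frozenset(product(NUMBERS, GENDERS))  # the 6 possible cells

# Hypothetical lexicon: each stem maps to the NG cells that are valid for it.
LEXICON = {
    "stem_1": ALL_NG_CELLS,
    "stem_2": frozenset({("singular", "masculine"), ("plural", "masculine")}),
    "stem_3": ALL_NG_CELLS,
}


def is_valid_ng(stem: str, number: str, gender: str) -> bool:
    """Check whether a Number/Gender cell is valid for a stem in the lexicon."""
    return (number, gender) in LEXICON[stem]


def distinct_ng_patterns(lexicon: dict) -> set:
    """Collect the distinct NG patterns that occur across all stems."""
    return set(lexicon.values())


print(len(distinct_ng_patterns(LEXICON)))
```

Counting distinct patterns as more stems are added is one way to observe the kind of convergence described above.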

1- Whether the subject is human or non-human (for both transitive and intransitive verbs).
2- Whether the object is human or non-human (for transitive verbs only).
3- The number of the subject (for intransitive verbs only).

This curious feature of Arabic nouns can be mainly attributed to the presence of words of foreign origin and to the pragmatics of the noun in question.

In Arabic, there is a set of object pronouns which refers to a non-human object; this set will be denoted by -H. It is a subset of the complete set of pronouns +H, which denotes human and non-human. Below, we will discuss the features for transitive and intransitive verbs separately:

4. APPLICATIONS DEVELOPED

As a first application, an Arabic stem-based morphological analyser has been developed on an IBM PS/2 microcomputer. Morphological features of the word analysed are computed.

(a) Transitive Verbs:

As shown in the table below, there can only be 4 combinations of the features +H and -H. Each of the feature sets in the table has been designated a class code. Only verbs with features corresponding to the feature sets B, C and D have been found in the Landau shortlist examined.

Feature Set Code    B     C     D     ?
                    +H    +H    *H    -H
                    +H    -H    -H    -H

As a by-product of the analyser, an Arabic spelling verifier has been developed, by including unification of the morphological and co-occurrence features of the morphemes.

The system is currently being developed for use in the interaction with an Arabic syntactic parser.
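The unification of morpheme features mentioned for the spelling verifier can be illustrated with a very small sketch. It treats each morpheme's features as a flat dictionary and rejects a word when two morphemes disagree on a shared feature. The feature names and values are hypothetical, and the actual system may well use richer feature structures than this.

```python
from typing import Dict, Optional

Features = Dict[str, str]


def unify(a: Features, b: Features) -> Optional[Features]:
    """Merge two flat feature sets; return None if any shared feature clashes."""
    result = dict(a)
    for name, value in b.items():
        if name in result and result[name] != value:
            return None  # conflicting values: the morphemes cannot combine
        result[name] = value
    return result


# Hypothetical feature sets for a stem and two candidate suffixes.
stem = {"category": "verb", "transitivity": "transitive"}
object_suffix = {"category": "verb", "transitivity": "transitive", "object": "-H"}
possessive_suffix = {"category": "noun", "possessor": "-H"}

print(unify(stem, object_suffix))      # compatible: merged feature set
print(unify(stem, possessive_suffix))  # clash on "category": None
```

A word would be accepted only if the features of all its morphemes unify without a clash.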

ACKNOWLEDGEMENT

The authors sincerely wish to thank Dr. John Sowa for reviewing this paper and for his invaluable comments and suggestions.

REFERENCES

Jacob Landau, "A Word Count of Modern Arabic Prose", American Council of Learned Societies, New York, 1959.
Peter F. Abboud, Ernest N. McCarus (Eds.), "Elementary Modern Standard Arabic", Parts 1 & 2 (2nd Edition), Cambridge University Press, 1986.
T. El-Sadany & M. Hashish, "Arabic Morphological System", IBM Natural Language Processing Conference, Thornwood, New York, October 1989.
T. El-Sadany & M. Hashish, "Arabic Morphological System", IBM Systems Journal 25th Anniversary for Scientific Centres issue, Vol. 28, No. 4, 1989.

(b) Intransitive Verbs:

It was found that the subject number is an additional distinguishing feature for intransitive verbs. Moreover, the subject number is significant only in the case of human subjects. For non-human subjects, this feature is not significant.

Based upon the above observations, we will define the distinguishing features for intransitive verbs to be +H(s), +H(dp) and -H, where s denotes singular and dp denotes dual/plural. +H(s) and +H(dp) denote the sets of singular and dual/plural subjects, respectively. By definition, +H(s) U +H(dp) ⊃ -H, where U denotes the union of the two feature sets. The table below shows the possible combinations of these features; only the feature sets designated A, E and F were found in Landau's shortlist.

Feature Set Code    Subject Features
A                   +H(s) U +H(dp)
E                   -H
F                   +H(dp)
?                   +H(s)
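The intransitive feature sets can be modelled as plain sets of subject categories. The sketch below is one possible reading, not the authors' formulation: it assumes that a subject is a (humanness, number) pair, that -H contains the non-human subjects of either number, and that +H(s) and +H(dp) each contain -H plus the human subjects of the given number. The assignment of codes A, E and F follows the column order of the flattened table and is likewise an assumption.

```python
# Subject categories as (humanness, number) pairs; "s" = singular, "dp" = dual/plural.
NON_HUMAN = frozenset({("non-human", "s"), ("non-human", "dp")})  # -H
H_S = NON_HUMAN | {("human", "s")}     # assumed reading of +H(s)
H_DP = NON_HUMAN | {("human", "dp")}   # assumed reading of +H(dp)

# Class codes observed in the shortlist, read off the table above.
INTRANSITIVE_CLASSES = {
    "A": H_S | H_DP,  # +H(s) U +H(dp)
    "E": NON_HUMAN,   # -H
    "F": H_DP,        # +H(dp)
}


def subject_allowed(code: str, humanness: str, number: str) -> bool:
    """True if a subject of this kind is compatible with the verb class."""
    return (humanness, number) in INTRANSITIVE_CLASSES[code]


print(subject_allowed("F", "human", "s"))   # human singular subject with class F
print(NON_HUMAN < (H_S | H_DP))             # -H is a proper subset of the union
```

Under this reading, number only ever changes the answer for human subjects, which is in line with the observation that subject number is not significant for non-human subjects.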

3.2 CO-OCCURRENCE PATTERNS FOR NOUNS

The same set of object pronouns for verbs denotes the possessive pronouns for nouns, with the exception of a slight difference in form of the first person singular. The -H set is exactly the same. Three distinct classes of Matrix patterns (see level 3 of Fig(1)) have been observed for nouns:




FIG(1): Tree Structure Classification of the Arabic Language
[Tree diagram not reproduced. Recoverable labels: LEVEL 1; LEVEL 2 ("Last Letter Glottalized", "Last Letter Sound"); LEVEL 3 (NG 1, NG 2, NG 3, NG 4, ...); leaf classes (C) and (D).]