
Procedia Computer Science 3 (2011) 977–981

www.elsevier.com/locate/procedia

WCIT 2010

A Hidden Markov Model for Persian Part-of-Speech Tagging

Morteza Okhovvat a,*, Behrouz Minaei Bidgoli b

a Iran University of Science and Technology, School of Computer Engineering, Distributed Systems Laboratory, Tehran 16846-13114, Iran
b Iran University of Science and Technology, School of Computer Engineering, Tehran 16846-13114, Iran

Abstract

Part-of-speech tagging is one of the important tasks in the processing of languages. Despite this importance, and although numerous models have been presented for other languages, little work has been done for the Persian language. In this paper, a part-of-speech tagging system for Persian corpora based on a hidden Markov model is proposed. To this end, the main aspects of Persian morphology are introduced. To evaluate its accuracy, the proposed approach is applied in experiments on both homogeneous and heterogeneous Persian corpora. An accuracy of 98.1% in the experiments demonstrates the suitability of the proposed approach for Persian corpora.

© 2010 Published by Elsevier Ltd. Selection and/or peer-review under responsibility of the Guest Editor.

Keywords: Part-Of-Speech Tagging; Natural Speech Processing; Hidden Markov Model; Morphology; Viterbi Algorithm

1. Introduction

Part-of-speech (POS) tagging is a necessary step in many Natural Language Processing (NLP) systems, such as machine translation, text-to-speech (TTS) applications and prosody synthesis systems [1]. POS tagging assigns grammatical tags to the words and symbols that make up a text; the tags carry a large amount of lexical information such as the gender, person and number of the words [2]. The input is a text, and the output is the words of that text together with their tags. In TTS applications, POS tagging serves purposes such as homograph disambiguation and morphological analysis [3]. In the Persian language, POS tagging is used to detect Ezafe, an unstressed vowel -e, or -ye after vowels [1]. Ezafe links two words and is sometimes not written. In prosody synthesis, POS tags are used as features in components such as the duration and intonation model and pitch contour estimation [4]. POS tagging is also used to produce annotated corpora, which can serve linguistic research, the construction of better taggers, or as statistical evidence for other language processing goals [5].

Generally, taggers fall into three groups: statistical-based, rule-based and transformation-based. In statistical tagging, a labelled corpus is used for training, and a model is generated as a result. Using this model and statistical methods, the most probable tag for each word is selected. Rule-based tagging relies on a rich database of lexical rules; using this database and the generated hypotheses, the tagger tries to select the best tag for each word. Finally, the transformation technique is a hybrid method in which both rule-based and statistical methods are used to assign the best tag to each word of the input text.

* Morteza Okhovvat. Tel.: + 98-21-7391-3307; Fax: + 98-21-7322-3307. E-mail address: [email protected].

1877-0509 © 2010 Published by Elsevier Ltd. doi:10.1016/j.procs.2010.12.160


However, although various tagging methods have been applied to many languages, there are few works on the Persian language, and these have appeared only in recent years. In this paper, we propose a part-of-speech tagging system for Persian corpora based on a hidden Markov model, which is applied to both homogeneous and heterogeneous Persian corpora. The results demonstrate the accuracy of the presented approach. The rest of the paper is organized as follows. Section 2 presents notable related work. Section 3 describes the challenges of POS tagging in the Persian language. Our approach is proposed in Section 4. Section 5 presents experimental results and Section 6 concludes the paper.

2. Related Work

So far, numerous POS tagging methods, mostly rule-based or statistical, have been presented for languages such as English. For Persian, however, there has been little activity in this field, most of it in the last few years [6]. One of the most recent works on POS tagging is by Jabbari and Allison [7]. Their approach is transformation-based and was previously used for English by Brill and Hepple [8,9]. The tagger is built around a trained learner that holds approximate rules; in fact, they applied an implementation of Error-Driven Transformation-Based Learning. They report an accuracy of 93% for their approach.

Assi and Abdolhossini have presented a POS tagging method based on the Schuetze hypothesis [10]. This hypothesis states that grammatical behaviour is reflected in co-occurrence patterns. They assumed that, for a given window size, the POS tags can be approximated. To achieve this, both the left and the right context vectors of each word are sorted and all similar vectors are clustered. Each cluster is then annotated manually [10]. This approach was used for the annotation of the FLDB corpus [11]. The reported accuracy for different classes of verbs and nouns is 69 to 83%, but the reported overall accuracy of the automatic part is 57.5%.
However, their approach is not suitable, since some of the Persian tags refer to ambiguous words.

The Orumchian tagger for Persian POS tagging follows the TNT POS tagger [12], which is based on the theory of hidden Markov models. This system uses 2.5 million tagged words as training data, and the size of the tag set is 38. The reported accuracy of this approach is 96.64%. Another study on Persian POS tagging is by Megerdoomian [13]. It explains some of the linguistic challenges in developing a Persian POS tagger but reports no experimental results.

3. POS tagging challenges

Persian belongs to the Indo-European languages, and its basic word order is Subject-Object-Verb [6]. Because of the different structure of the Persian language, there are some challenges that are not seen in other languages such as English. Hence, we first explain some aspects of this language and then present the challenges of POS tagging in Persian.

Persian has fewer tenses than English. It has extensive derivational and inflectional morphology. Verbs are inflected for person, and the syntax is not affected by gender. As in English, derived Persian words are formed by adding prefixes and suffixes to their stems [6]. Considering these features of the Persian language, the most important POS tagging challenges can be categorized as follows:

A) Persian has numerous categories of verbs, with various inflections for person, which lead to a variety of word forms.
B) The same form can represent different morphemes. For example, the suffix "ی" can be the second-person singular ending, e.g. "رفتی" (rafti, "you went"), or the indefinite marker of a word, e.g. "کتابی" (ketabi, "a book"). This challenge is known as ambiguity in Persian morphology.
C) Word boundary detection is the third challenge. In Persian texts, spaces cause serious problems in the process of POS tagging. For instance, the plural morpheme "ها" can appear in various forms for nouns, e.g. the plural of the word "ماشین" can appear in three forms: "ماشینها", "ماشین‌ها" and "ماشین ها".
D) The fourth challenge arises from the morphology of the Persian language. When different affixes such as possessive, indefinite and plural markers occur in a single word, they all attach to one another [6], e.g. "ماشینهایم", which means "my cars".
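Challenge C can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions, not the authors' implementation: it assumes the half-space is the Unicode zero-width non-joiner (U+200C), and `UNAMBIGUOUS_SUFFIXES` and `normalize_suffix_spacing` are hypothetical names for a list of suffixes assumed never to be homographs of stems.

```python
import re

ZWNJ = "\u200c"  # zero-width non-joiner, assumed here to be the Persian "half-space"

# Hypothetical list: suffixes assumed NOT to be homographs of any stem.
UNAMBIGUOUS_SUFFIXES = ("ها", "تان")


def normalize_suffix_spacing(text: str) -> str:
    """Rewrite '<stem> <suffix>' and '<stem><ZWNJ><suffix>' as the connected form."""
    for suffix in UNAMBIGUOUS_SUFFIXES:
        # The suffix must follow a non-space character, be separated from it by
        # one space or ZWNJ, and be followed by whitespace or the end of text.
        pattern = r"(?<=\S)[ " + ZWNJ + r"]" + re.escape(suffix) + r"(?!\S)"
        text = re.sub(pattern, suffix, text)
    return text


STEM, PLURAL = "ماشین", "ها"
separate_form = STEM + " " + PLURAL
half_space_form = STEM + ZWNJ + PLURAL
connected_form = STEM + PLURAL
```

All three spellings map to the connected form, so a whitespace tokenizer sees one token per word. Ambiguous affixes need context and are deliberately left out of this sketch.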


Taking these challenges into account is indispensable when proposing a POS tagging approach, and it makes such approaches complicated. As mentioned before, not handling the third challenge made the method of Assi and Abdolhossini inefficient.

4. Proposed Approach

The Persian POS tagging method of this paper is based on Equation (1), which is the first-order Markov model [13], and on the Viterbi algorithm.

$$\hat{t}_{1,n} = \arg\max_{t_{1,n}} p(t_{1,n} \mid w_{1,n}) \approx \arg\max_{t_{1,n}} \prod_{i=1}^{n} \left[ p(w_i \mid t_i) \cdot p(t_i \mid t_{i-1}) \right] \qquad (1)$$
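The paper states only the final form of the tagging equation. One standard route to it, given here as an assumption about the intended derivation, applies Bayes' rule, drops the denominator because it does not depend on the tags, and then assumes that each word depends only on its own tag and each tag only on the previous tag (with $t_0$ a hypothetical sentence-start marker):

```latex
\begin{align*}
\hat{t}_{1,n} &= \arg\max_{t_{1,n}} p(t_{1,n} \mid w_{1,n}) \\
              &= \arg\max_{t_{1,n}} \frac{p(w_{1,n} \mid t_{1,n})\, p(t_{1,n})}{p(w_{1,n})} \\
              &= \arg\max_{t_{1,n}} p(w_{1,n} \mid t_{1,n})\, p(t_{1,n}) \\
              &\approx \arg\max_{t_{1,n}} \prod_{i=1}^{n} p(w_i \mid t_i)\, p(t_i \mid t_{i-1})
\end{align*}
```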

Equation (1) includes two kinds of probabilities: word probabilities and tag transition probabilities. The word probability, $p(w_i \mid t_i)$, is the probability that a given tag is associated with a given word. The tag transition probability, $p(t_i \mid t_{i-1})$, is the probability of a tag given the previous tag. In Equation (1), $w_{1,n}$ is the sequence of words for which the most likely sequence of tags, $t_{1,n}$, is sought from the set of tags. In addition to this model and the Viterbi algorithm, some further processing is performed for POS tagging in the Persian language, as described in the following subsections.
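The two probability tables and the Viterbi search can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: probabilities are relative frequencies with a small additive constant `eps` (the paper does not describe its smoothing), `START` is a hypothetical start-of-sentence tag standing in for $t_0$, and `TOY_CORPUS` is invented English-like toy data.

```python
import math
from collections import Counter, defaultdict

START = "<s>"  # hypothetical start-of-sentence tag (plays the role of t_0)


def train(tagged_sentences, eps=1e-6):
    """Estimate log p(w|t) and log p(t|t_prev) from (word, tag) sentences."""
    emit = defaultdict(Counter)   # emit[tag][word] = count
    trans = defaultdict(Counter)  # trans[prev_tag][tag] = count
    vocab = set()
    for sentence in tagged_sentences:
        prev = START
        for word, tag in sentence:
            emit[tag][word] += 1
            trans[prev][tag] += 1
            vocab.add(word)
            prev = tag
    tags = sorted(emit)

    def log_emit(word, tag):
        total = sum(emit[tag].values())
        return math.log((emit[tag][word] + eps) / (total + eps * (len(vocab) + 1)))

    def log_trans(tag, prev):
        total = sum(trans[prev].values())
        return math.log((trans[prev][tag] + eps) / (total + eps * len(tags)))

    return tags, log_emit, log_trans


def viterbi(words, tags, log_emit, log_trans):
    """Return the tag sequence maximizing the product of word and transition probabilities."""
    if not words:
        return []
    # best[i][t]: best log-probability of any tag path ending in tag t at position i
    best = [{t: log_trans(t, START) + log_emit(words[0], t) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_trans(t, p))
            best[i][t] = best[i - 1][prev] + log_trans(t, prev) + log_emit(words[i], t)
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


# Invented toy data, for illustration only.
TOY_CORPUS = [
    [("the", "DET"), ("dog", "N"), ("barks", "V")],
    [("the", "DET"), ("cat", "N"), ("sleeps", "V")],
    [("dogs", "N"), ("bark", "V")],
]
```

Working in log space avoids numerical underflow on long sentences; the search costs O(n·|T|²) for n words and |T| tags instead of enumerating all |T|ⁿ tag sequences.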

4.1. Text normalization

As stated in Section 3, one of the challenges of POS tagging in the Persian language is the detection of word boundaries. For example, affixes can appear in three forms: separate, attached with a half-space, and connected. As a result, the word "ماشینها" (mashinha), which means "cars", can be written in three forms: "ماشین ها" as the separate form, "ماشین‌ها" as the half-space form and "ماشینها" as the connected form. This causes many problems in token-to-word transformation, because tokenizers commonly identify words by the spaces between them. Hence, affixes that do not appear in the connected form are treated as separate words and cause errors in POS tagging. To overcome this problem, the affixes could be attached to their stems everywhere. But this does not work in all cases, since some affixes are homographs of words. For instance, "می", read "mi", is a verbal affix, but it can also be pronounced "mey" and mean "wine" [1], which causes errors. Therefore, considering the context of words is essential to distinguish these cases.

In this paper, similar to [1], the Persian letters are first converted to English letters, and the text is then split into tokens at spaces and punctuation characters. In the next step, the affixes are reconnected to their stems. There are two types of affixes that should be reconnected:

1. Affixes that are not similar to stems, like "تان", read "tan", meaning "yours".
2. Affixes that are homographs of stems, like "تر", read "tar", which can be the comparative suffix or a word meaning "dampness".

A decision tree is built to reconnect the words and affixes. The first set is trivial: these affixes are reconnected to the preceding word. For the second set, an additional feature is used to capture the context [1]. For example, the training record in this tree for the word "می" is: ((Boolean attach_sign) ("کتاب" (book)) ("می") ("خوانم" (I study))).

Since the hidden Markov model tagger works on sentences and sentence boundaries are not marked in the corpus, some rules are used to determine the sentence boundaries. Relation (2) [14] shows the rules, which were extracted by examining the structure of Persian text.

Verb + (conjunction, preposition, '.', '?', ',', ':' or ',')    (2)
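The sentence-boundary relation above can be read as a simple scan over tagged tokens. The sketch below is one possible reading, with several assumptions that are not in the paper: the tag names `V`, `CONJ` and `P` are hypothetical, a boundary punctuation mark is kept with the sentence it closes, and a conjunction or preposition starts the next sentence. The transliterated tokens are toy data.

```python
BOUNDARY_PUNCT = {".", "?", ",", ":"}  # marks listed in the relation
BOUNDARY_TAGS = {"CONJ", "P"}          # hypothetical tag names
VERB_TAG = "V"                         # hypothetical tag name


def split_sentences(tagged_tokens):
    """Split a flat list of (token, tag) pairs after each verb that is
    followed by a conjunction, a preposition or a boundary punctuation mark."""
    sentences, current = [], []
    i, n = 0, len(tagged_tokens)
    while i < n:
        token, tag = tagged_tokens[i]
        current.append((token, tag))
        if tag == VERB_TAG and i + 1 < n:
            next_token, next_tag = tagged_tokens[i + 1]
            if next_token in BOUNDARY_PUNCT:
                # Assumption: the punctuation mark stays with the closed sentence.
                current.append((next_token, next_tag))
                i += 1
                sentences.append(current)
                current = []
            elif next_tag in BOUNDARY_TAGS:
                # Assumption: the conjunction/preposition opens the next sentence.
                sentences.append(current)
                current = []
        i += 1
    if current:
        sentences.append(current)
    return sentences


# Toy input with transliterated tokens, for illustration only.
TOY_TOKENS = [
    ("ketab", "N"), ("mikhanam", "V"), (".", "PUNC"),
    ("u", "PRO"), ("raft", "V"), ("va", "CONJ"),
    ("man", "PRO"), ("mandam", "V"),
]
```

On the toy input this yields three sentences: one closed by the full stop, one closed before the conjunction, and the trailing remainder.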

5. Experimental Results

In this section we report the results of applying the proposed approach, which is evaluated in terms of accuracy on two separate parts of the corpus developed by RCISP [15]. This corpus contains several texts on different subjects such as economics, social issues, art, culture, sport and religion. The annotated part of the corpus is used in this paper; it has nearly 7.5 million annotated tokens including 10 million words. We used 25 tags in the experiments, which are the main tags among the 168 single tags of the corpus. In other words, the remaining 143 single tags are used for Persian morphology. As mentioned in Section 3, unknown words cause problems in POS tagging, as in many NLP systems. To handle this challenge, the probabilities of the various POS tags for unknown words are approximated by applying 15-fold cross-validation. These approximated probabilities, shown in Figure 1, are used to predict the possible POS tags of unknown words.

Fig. 1. Probabilities of POS tags for unknown words.
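The paper does not spell out how cross-validation yields the unknown-word tag probabilities. One plausible reading, sketched here purely as an assumption, is to hold out each fold in turn, treat every held-out word that never occurs in the remaining folds as unknown, and count the tags of those words. The function name and the stride-based fold split are hypothetical choices.

```python
from collections import Counter


def unknown_tag_distribution(tagged_tokens, k=15):
    """Approximate p(tag | word is unknown) with k-fold cross-validation.

    tagged_tokens is a flat list of (word, tag) pairs. Folds are taken by
    stride, which is a simplification; splitting by document or sentence
    would be closer to a real setup.
    """
    folds = [tagged_tokens[i::k] for i in range(k)]
    counts = Counter()
    for held_out_index, held_out in enumerate(folds):
        # Vocabulary of all other folds: the "training" words for this round.
        known = {
            word
            for j, fold in enumerate(folds)
            if j != held_out_index
            for word, _ in fold
        }
        for word, tag in held_out:
            if word not in known:
                counts[tag] += 1
    total = sum(counts.values())
    return {tag: c / total for tag, c in counts.items()} if total else {}


# Invented toy data, for illustration only.
TOY_TOKENS = [("a", "N"), ("b", "N"), ("c", "V"), ("a", "N")]
```

The resulting distribution could then stand in for the word probability of any token that was not seen during training.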

To evaluate the presented approach, two different experiments were performed. First, we applied the proposed approach to homogeneous text, for which the Economic part of the corpus was selected. This part of the corpus has 760780 words, of which 746198 are known words. In this experiment, the accuracy of our approach is 97.9%. The results of this experiment are shown in Table 1.

Table 1. Results of the first experiment

                 Number of words    Accuracy
Known words      746198             98.5%
Unknown words    14582              68.7%
Total            760780             97.9%

In the second experiment, we applied our approach to heterogeneous text. To do so, a part drawn from five different types of the corpus was selected, which contains 1073924 words. In this experiment, an overall accuracy of 98.3% is obtained. Table 2 presents the results of the second experiment.

Table 2. Results of the second experiment

                 Number of words    Accuracy
Known words      1052362            98.99%
Unknown words    21562              63.7%
Total            1073924            98.3%

The results of the experiments show that, although the training data in the second experiment is heterogeneous, the accuracy for known words in the second experiment is slightly higher than in the first. This is because the training data in the second experiment is larger than in the first one. However, since the training data in the first experiment is homogeneous and drawn from a single type of text, its accuracy for unknown words is higher than in the second one. Overall, these experiments, evaluated in terms of accuracy, demonstrate the efficiency of the proposed approach for Persian POS tagging.
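As a consistency check on the two tables, the overall accuracy of each experiment should equal the token-weighted average of its known-word and unknown-word accuracies. The short sketch below recomputes those totals from the table entries; the function name is a hypothetical helper, and only figures printed in Tables 1 and 2 are used.

```python
def overall_accuracy(known_n, known_acc, unknown_n, unknown_acc):
    """Token-weighted average of known-word and unknown-word accuracy."""
    correct = known_n * known_acc + unknown_n * unknown_acc
    return correct / (known_n + unknown_n)


# Word counts and accuracies taken from Table 1 and Table 2.
exp1 = overall_accuracy(746198, 0.985, 14582, 0.687)
exp2 = overall_accuracy(1052362, 0.9899, 21562, 0.637)
mean_of_both = (exp1 + exp2) / 2

print(f"experiment 1: {exp1:.1%}")
print(f"experiment 2: {exp2:.1%}")
print(f"mean of both: {mean_of_both:.1%}")
```

Rounded to one decimal place, the two weighted averages match the totals in the tables (97.9% and 98.3%), and their mean matches the 98.1% quoted in the abstract and conclusion.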


6. Conclusion

In this paper, Persian morphology is briefly described and some of the main challenges of Persian POS tagging systems are introduced. An approach for Persian POS tagging based on a hidden Markov model is then proposed. Experimental results show that the accuracy of the proposed tagger differs little between homogeneous and heterogeneous texts, so it can be applied as a common model. The high accuracy (98.1%) across our two experiments, on a homogeneous and a heterogeneous part of the corpus, demonstrates that hidden Markov models are suitable for POS tagging in the Persian language and are quite comparable with the best results reported for other models.

Acknowledgements

The authors are grateful to Mr. Hassan Motallebi for his technical assistance.

References

1. A. Azimizadeh, M. M. Arab, S. R. Quchani, Persian Part of Speech Tagger Based on Hidden Markov Model, 9th International Conference on the Statistical Analysis of Textual Data (JADT), USA (2008).
2. D. Jurafsky, J. H. Martin, Speech and language Processing. Prentice Hall, 2nd Artificial Intelligence, Oxford University Press (1999), 283-443.
3. A. Azimizadeh, M. M. Arab, The Persian Morphological parser by Using POS Tagger, In the Proceedings of 2nd Workshop on Computational Approaches to Arabic Script-Based Languages (CAASL-2), USA (2007).
4. A. W. Black, P. A. Taylor, R. Caley, The Festival speech synthesis system, (2002), 73-75, available at: http://www.cstr.ed.ac.uk/projects/festival/manual.
5. M. Padró, L. Padró, Developing Competitive HMM POS Taggers Using Small Training Corpora. Estal, (2004).
6. M. Mohseni, H. Motalebi, B. Minaei-bidgoli, M. Shokrollahi-far, A farsi part-of-speech tagger based on markov, In the proceedings of ACM symposium on Applied computing, Brazil (2008).
7. S. Jabbari, B. Allison, Persian Part of Speech Tagging, In the Proceedings of Workshop on Computational Approaches to Arabic Script-Based Languages (CAASL-2), USA (2007).
8. E. Brill, Transformation-Based Error-Driven Learning and Natural Language Processing: A case Study in Part of Speech Tagging, Computational Linguistics, USA (1995).
9. M. Hepple, Independence and Commitment: Assumptions for Rapid Training and Execution of Rule-based Part-of-Speech Taggers, In Proceedings of the 38th Annual Meeting of the Association for Computational Linguistics (ACL). Hong Kong (2000).
10. S. M. Assi, Farsi Linguistic Database (FLDB), International Journal of Lexicography, 10(1997), 5-5.
11. H. Schuetze, Distributional Part-of-Speech Tagging From Texts to Tags: Issues in Multilingual Language Analysis, In the Proceedings of the ACL SIGDAT Workshop, (1995), available at: http://xxx.lanl.gov/find/cmp-lg, 1995.
12. T. Brants, TnT – a Statistical Part-of-Speech Tagger, In the Proceedings of 6th conference on applied natural language processing (ANLP), USA (2000).
13. K. Megerdoomian, Developing a Persian part-of-speech tagger, In the Proceedings of first Workshop on Persian Language and computer, Iran (2004).
14. E. Charniak, C. Hendrickson, N. Jacobson, M. Perkowitz, Equations for part-of-speech tagging. Proceedings of the Eleventh National Conference on Artificial Intelligence, USA (1993).
15. Research Center of Intelligent Signal Processing, available at: http://www.rcisp.com.
