
A Statistical Method for Adding Case Ending Diacritics for Arabic Text

Khaled Shaalan
The Institute of Informatics, The British University in Dubai

Hitham M. Abo Bakr
Computer & System Dept., Zagazig University

Ibrahim Ziedan
Computer & System Dept., Zagazig University

Abstract

In this paper, we address the problem of adding case-ending diacritics to undiacritized Arabic text using statistical methods. The approach requires a large corpus of fully diacritized text from which the case endings are extracted. We train the system to detect the case-ending diacritic of each token based on its Part Of Speech (POS), its BP-chunk position, and the position of the token in the statement. The case-ending diacritics are then obtained efficiently using the SVM technique. We present an evaluation of the proposed diacritization algorithm and discuss various modifications for improving the performance of this approach.

Keywords: Arabic NLP, MSA, Case-Ending diacritization, Statistical approach, SVM, YamCha

1. Introduction

Arabic script consists of two classes of symbols: letters¹ and diacritics. Letters are always written, whereas diacritics are optional. Diacritics play a key role in disambiguating Arabic text. Most written Modern Standard Arabic (MSA) is not diacritized. The reader is expected to infer or predict vowels from the context of the sentence. Written Arabic can be fully diacritized (as in the Qur'an and some heritage literature books), partially diacritized (when we want to disambiguate certain words, such as عمّان (Amman, the capital of Jordan) and عُمان (Oman, the country)), or entirely undiacritized (unvoweled). There are three types of diacritics: vowels (Fat-ha "ـَ", Dama "ـُ", Kasra "ـِ"), nunation (Fathatan "ـً", Damatan "ـٌ", Kasratan "ـٍ"), and shadda ("ـّ") [2]. Case-ending diacritics play an important role in understanding the meaning of an Arabic statement: the case ending gives the correct understanding of the statement.

¹ The Arabic writing system consists of 36 letter forms which represent the Arabic consonants. These are: ا, آ, أ, إ, ئ, ؤ, ء, ى, ة, ب, ت, ث, ج, ح, خ, د, ذ, ر, ز, س, ش, ص, ض, ط, ظ, ع, غ, ف, ق, ك, ل, م, ن, ه, و, and ي.

Many related works deal with the problem of Arabic diacritization in general [2-5]. All of them handle the problem with statistical approaches, but they treat the case-ending (last letter) diacritics with the same technique used for the internal (any letter but the last) diacritics. In the literature, the detection of case-ending diacritics is treated as a syntactic problem, whereas the detection of internal diacritics is treated as a morphological problem. In this paper, we claim that this difference requires case-ending diacritization to be handled separately from internal diacritization. This is the main reason why the performance of current Arabic diacritizers decreases when case-ending diacritics are included in their evaluation [2, 5].

The problem of automatically restoring the diacritic signs of Arabic text can be solved by two approaches. The first is a rule-based approach that involves a complex integration of Arabic morphological, syntactic, and semantic tools, with significant effort needed to acquire the respective linguistic rules. A morphological analyzer breaks down the undiacritized word according to known patterns or templates and recognizes its prefixes and suffixes. A syntax analyzer applies specific syntactic rules to determine the case-ending diacritics, usually using finite-state automata techniques. Semantic handling helps to resolve ambiguous cases and to filter out hypotheses. As shown, rule-based diacritization is a complicated process and takes a long time to process an Arabic sentence, which is usually long. The second is the statistical approach, where a large tagged corpus (in particular a treebank) is used to extract language statistics for estimating the missing diacritical marks. This approach is practical and fully automated. Results are usually improved by increasing the size of the corpus.

In this paper we demonstrate a statistical method for detecting the case-ending diacritics. The system is first trained using the Penn Arabic Treebank. The proposed method is efficient and can run in parallel with the detection of the internal diacritics, which has already achieved acceptable results.

The paper is organized as follows. The next section gives an overview of the proposed Arabic diacritizer. This is followed by the experiment conducted to evaluate our diacritizer. Finally, we conclude the paper and give directions for future research.
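The distinction drawn in the introduction between the case-ending (last letter) diacritic and the internal diacritics can be illustrated with a minimal Python sketch. It is not the method of this paper: it only shows, at the character level, which marks belong to which group. The function names `strip_diacritics` and `split_case_ending` are hypothetical, and the sketch assumes the case ending is simply the vowel or nunation mark that follows the last letter. That is a simplification: it ignores, for example, a Fathatan written before a final alif and words that end in attached pronouns.

```python
# Unicode code points for the three diacritic types named in the text.
VOWELS = {"\u064E": "Fat-ha", "\u064F": "Dama", "\u0650": "Kasra"}
NUNATION = {"\u064B": "Fathatan", "\u064C": "Damatan", "\u064D": "Kasratan"}
SHADDA = "\u0651"
DIACRITICS = set(VOWELS) | set(NUNATION) | {SHADDA}


def strip_diacritics(word: str) -> str:
    """Return the undiacritized form of a word (letters only)."""
    return "".join(ch for ch in word if ch not in DIACRITICS)


def split_case_ending(word: str) -> tuple[str, str]:
    """Split a diacritized word into (stem, case_ending).

    Assumption (simplified): the case ending is the vowel or nunation
    mark attached to the last letter. A shadda on the last letter is
    kept with the stem, since it marks gemination rather than case.
    """
    end = len(word)
    # Walk back over the run of diacritics that follows the last letter.
    while end > 0 and word[end - 1] in DIACRITICS:
        end -= 1
    trailing = word[end:]
    case_ending = "".join(ch for ch in trailing if ch != SHADDA)
    stem = word[:end] + (SHADDA if SHADDA in trailing else "")
    return stem, case_ending


if __name__ == "__main__":
    # The word for "students" (TlAb), fully diacritized with a final Damatan.
    tullaabun = "\u0637\u064F\u0644\u0651\u0627\u0628\u064C"
    stem, ending = split_case_ending(tullaabun)
    print("undiacritized:", strip_diacritics(tullaabun))
    print("stem with internal diacritics:", stem)
    print("case ending:", NUNATION.get(ending) or VOWELS.get(ending) or "none")
```

In a statistical setting, the stem side of this split would be the concern of a morphological (internal) diacritizer, while the `case_ending` value is the label a syntax-aware classifier would have to predict.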

2. The Proposed Arabic Diacritizer

2.1 Overview of the Treebank

Treebanks are language resources that provide annotations of natural languages at various levels of structure: word level, phrase level, and sentence level. Treebanks have become crucially important for the development of data-driven approaches to natural language processing. The Arabic Treebank was created on top of a corpus that had already been annotated with POS tags. The Penn Arabic Treebank (ATB) began in the fall of 2001 [1] and has now completed four full releases of morphologically and syntactically annotated data. Version 1 of the ATB has three parts with different releases; some versions, such as Part 1 V3.0 and Part 2 V2.0, are fully diacritized trees. For example, the following undiacritized statement:

"لليوم الثاني على التوالي تظاهر طلاب ينتمون الى جماعة ...."
"llywm AlvAny ElY AltwAly tZAhr TlAb yntmwn
