A Rule Based Annotation system to extract Tajweed Rules from Quran

Auhood Alfaries, Manal AlBahlal, Manal Almazrua, Amal Almazrua
IWAN Research Group, IT Department, CCIS
King Saud University, Riyadh, Saudi Arabia
aalfaries/albahlal @{ksu.edu.sa}, manal.almazrou/almazrou.amal @ {gmail.com}

Abstract—Quran recitation relies on identifying and applying different Tajweed rules [قواعد التجويد], such as Muddud [مدود] and Tanween [تنوين], in the Quran text. This research aims to provide a tool that automatically finds and annotates the letters that embody Tajweed rules in Quran text. This remains an open research area due to the lack of open source NLP tools that support the Arabic language. Applying Natural Language Processing (NLP) techniques to Quran text to extract Tajweed letters is an important Information Extraction (IE) step, and this research explores the application of IE techniques to Quran text. Rule based IE techniques are well known for achieving high precision. The work uses GATE, an open source, flexible NLP environment, to build an application that processes an un-annotated Quranic text corpus. The application is evaluated with the well-known IE evaluation metrics precision and recall, by comparing the system's automatically annotated text with a gold standard (i.e. annotated Quran text). The system achieved 100% precision and recall on the implemented Tajweed rules.

Keywords: Arabic NLP, Arabic text analysis, Information extraction, GATE, Tajweed rules, Quran.

I. INTRODUCTION

Tajweed (تجويد) is an Arabic word that refers to the rules that dictate accurate pronunciation during recitation of the holy Qur'an. The root comes from the word 'ja-wa-da', which means to make well, make better or improve [1,2]. Muslim scholars wrote down the rules of recitation of the earlier generations by closely observing the perfect readers, who read as they were taught by the Prophet Mohammed, peace be upon him [1]. According to the Tajweed scholars, it is obligatory (fard al-ayn) to recite the Quran by applying Tajweed rules, but it is a communal obligation (fard al-kifayah) to have an individual who comprehends the whole Tajweed science with all its rules and terms [3]. Generally, the most prevalent of the ten (tawatur) schools of recitation is the recitation of Imam 'Asim as transmitted by Imam Hafs [2]. According to [3], the science of Tajweed is classified into four parts: Madd (Stretching), Makharij al-Huroof (Points of Articulation of Letters), Sifaat Aaridhah (Temporary Qualities, e.g. the rules of Noon and Meem sakenah and Tanween) and Sifaat Laazimah (Qualities of Letters, e.g. heavy vs. light letters).

This paper focuses on four rules: Noon sakenah, Tanween, Meem sakenah and Madd. Noon sakenah or Tanween [2] applies whenever there is a Tanween or a sukun sign on a Noon. It is pronounced in one of four ways, depending on the letter that immediately follows the Noon sound, as illustrated in Table 1. Meem sakenah [2] applies whenever there is a sukun sign on a Meem. It is pronounced in one of three ways, depending on the letter that immediately follows the Meem sound, as shown in Table 1. The rules of Madd [2,4] refer to the number of beats that are pronounced when a voweled letter is followed by a madd letter. The madd letters are Alif preceded by a fatha, Yaa preceded by a kasrah and Waw preceded by a dammah. The number of beats is ordinarily two counts, four or five counts when there is a Hamzah (ء), and a maximum of six counts when the madd letter is followed by a shaddah. Madd Muttasil and Madd Munfasil are illustrated in Table 1. Tajweed of the Holy Qur'an is the knowledge and application of the rules of recitation, so that the Qur'an is read as the Prophet Mohammed, peace and blessings be upon him, recited it.

Automatic Information Extraction (IE) and natural language processing have been an active research area due to their importance in knowledge extraction from textual sources [5,6,7]. Automatically extracting information from Arabic text requires suitable Arabic natural language processing (NLP) tools, and therefore further investigation and customization of current IE methods and tools for unstructured Arabic textual sources. Lexico-syntactic pattern identification has been widely used for information extraction [5,6,7].
Rule-based techniques are widely applied in information extraction and provide accurate and promising results, leading to increased precision [5,6,7]. These pattern-based techniques are classified as knowledge engineering approaches: they require language engineers to analyze the textual sources, identify patterns and engineer transformation rules. The difficulty lies in identifying the right rule, one that finds and annotates exactly the matching pattern. Tajweed extraction requires finding and annotating the Tajweed pattern letters within Quran text, which can be accomplished accurately by identifying and writing clear and concise rule based Tajweed extraction patterns.

General Architecture for Text Engineering (GATE) [12] is open source language processing software that was developed at the University of Sheffield in 1996 and is widely used in NLP tasks, including information extraction in many languages. It has a simple, user friendly interface for both linguists and IT personnel. It offers a set of customizable NLP and linguistic tools together with the Java Annotation Patterns Engine (JAPE), which provides finite state transduction. GATE also supports Arabic natural language processing, which can be used to further explore the applicability of rule based information extraction to Quran text.

Automating the extraction and annotation of Tajweed rules from a Quran corpus is therefore an important IE step that enables further analysis of Quranic text and automated knowledge extraction. Automating the syntactic and semantic analysis of Quran text is an important and still open research area, and the automatic identification and extraction of Tajweed rules from the holy Quran still lacks the necessary NLP tools. Based on the evaluation of the available Quranic software in [8,9], it is noted that most current and past research on Quran text relies on manual annotation of Tajweed rules [10,11].

II. RELATED WORK

ArOntoLearn [13] is a framework for building Arabic ontologies from textual resources. The proposed framework applies a set of NLP tools as a GATE pipeline application.
The application consists of an Arabic tokenizer, a sentence splitter and an Arabic morphological analyzer. The framework also applies the Stanford Arabic POS tagger and syntactic analyzer to build syntactic trees for each sentence. It uses rule based information extraction with GATE JAPE rules to annotate all known concepts. The GATE JAPE transducer is applied as a pattern extraction engine to tag instances, concepts and sub-concepts based on 'has-a' and 'is-a' relationships. The system was evaluated twice: first as it is, without the Stanford Arabic syntactic parser, on 125 documents extracted from Arabic Wikipedia, where it achieved a precision of 50%; and second with the Stanford Arabic syntactic parser on well formed sentences, where it achieved a precision of 83%.

The system in [14] extracts collocations, such as Noun-Noun, Adjective-Noun, Verb-Noun and Noun-Preposition-Noun, from a POS and morphology annotated Quranic corpus. It uses GATE to write JAPE rules that label collocation patterns. The system was evaluated on the Noun-Adjective annotation using GATE's Annotation Diff tool and achieved a precision of 66%.

The system developed in [15] extracts Arabic person names using a rule based approach. It feeds the GATE gazetteers with various keyword lists, such as a list of verbs that introduce person names (IVL), a list of words that could be linked to person names (IWL), a stop word list, and lists of places, towns, countries, organizations and Arabic person names that start with (AL). Heuristic algorithms are applied to the text to extract all possible proper names using the defined lists. The Buckwalter Arabic Morphological Analyzer (BAMA) then returns all known related words and their classes. The system was tested on 700 news articles taken from the Aljazeera television website and achieved a precision of 93%, a recall of 86% and an F-measure of 89%. In [16] the same work was repeated on 500 news articles extracted from the Aljazeera television website and achieved a precision of 88%, a recall of 90% and an F-measure of 89%.

A real time named entity recognition system is developed in [17], based on reducing the impact of affixes to improve the recognition results. The system runs its classification algorithms when a UTF-8 Arabic document is processed. To define a word in a document, appropriate patterns are retrieved from a system dictionary that currently contains 94,000 patterns associated with their attributes, i.e. morphological categories and named entities collected from many integrated resources such as DBpedia/Wikipedia, the GATE Arabic gazetteer lists and ANERGazet. The result is stored in an XML file that contains the original word, the distances between the word and the selected pattern, the attributes and the root. The system was evaluated on ANERcorp, which contains 150,000 terms, in two experiments, one with Prefixes–Suffixes Verification and one without it, to measure the improvement. With Prefixes–Suffixes Verification, the precision for person, location, organization, noun and verb is 77.63%, 81.49%, 65.54%, 77.39% and 85.79% respectively.

a. From http://tanzil.net/

III. THE QURTAJ SYSTEM

QurTaj is developed using the GATE framework [12]. GATE was chosen for this research because of: (1) its availability as open source software for research; (2) its flexible environment, which enables building a customizable pipeline of diverse natural language processing tools; and (3) its JAPE transducers, based on regular expression matching, which can be used to annotate text automatically by applying rule based IE techniques.

Finding and annotating Tajweed rules in Quran text requires customizing GATE to process diacritized Quran letters. Hence, the first challenge in using GATE is loading and processing diacritized Arabic text. This was accomplished by changing the Arabic tokenizer file so that it recognizes diacritics, by adding the needed character type (NON_SPACING_MARK) to the file. As summarized in Table 1, Tajweed has 9 annotation types. The application applies a set of natural language processing techniques to a plain Arabic corpus according to the selected Tajweed rules. It starts with tokenization, the basic preprocessing step for the rule based information extraction applied to the Quran text.
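The role of the non-spacing mark class in tokenization can be illustrated with a minimal Python sketch. It is not the GATE tokenizer and not the authors' modified rule file; it only assumes that the Harakat belong to the Unicode general category Mn (the category that Java names NON_SPACING_MARK). The function names and the sample phrase are hypothetical and chosen for illustration.

```python
import unicodedata
from typing import List, Tuple


def is_haraka(ch: str) -> bool:
    """Return True for Unicode non-spacing marks (category Mn), which cover the Arabic Harakat."""
    return unicodedata.category(ch) == "Mn"


def word_tokens(text: str) -> List[str]:
    """Word level: split on whitespace so each word keeps its Harakat."""
    return text.split()


def letter_tokens(word: str) -> List[Tuple[str, str]]:
    """Letter level: pair each base letter with the marks that directly follow it."""
    tokens: List[Tuple[str, str]] = []
    for ch in word:
        if is_haraka(ch) and tokens:
            base, marks = tokens[-1]
            tokens[-1] = (base, marks + ch)
        else:
            tokens.append((ch, ""))
    return tokens


# Illustrative phrase written with escapes:
# Meem+kasrah, Noon+sukun, space, Qaf+fathah, Ba+sukun, Lam+dammah.
sample = "\u0645\u0650\u0646\u0652 \u0642\u064E\u0628\u0652\u0644\u064F"
for word in word_tokens(sample):
    print(letter_tokens(word))
```

Keeping the marks attached to their base letter is what allows a later rule to ask whether a Noon or Meem carries a sukun without re-scanning the whole word.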

TABLE 1. RULES OF NOON SAKENAH, TANWEEN, MEEM SAKENAH AND MADD

| Tajweed Rule | Type | Literal Meaning | Meaning | Letters |
|---|---|---|---|---|
| Noon sakenah and Tanween | Idhhar (اظهار) | Clarity | The Noon sound is pronounced very crisply and clearly, without any ghunnah, when followed by these letters. | ء ه ع ح غ خ |
| Noon sakenah and Tanween | Idhgam (ادغام) | Merging | Idhgam applies only between two words, not in the middle of a word. It is divided into two types. Idhgam without ghunnah: the Noon sound is dropped when followed by the letters ل ر. Idhgam with ghunnah: the Noon sound is dropped and also has a ghunnah when followed by the letters و م ن ي. | ل ر or و م ن ي |
| Noon sakenah and Tanween | Iqlab (اقلاب) | Conversion | The Noon sound is converted to a Meem sound, with a ghunnah. | ب |
| Noon sakenah and Tanween | Ikhfa (اخفاء) | Hidden | The Noon sound is suppressed (the tongue does not make full contact with the roof of the mouth) and has a ghunnah. | ص ذ ث ك ج ش ق س د ط ز ف ت ض ظ |
| Meem sakenah | Idhgam Shafawi (ادغام شفوي) | Merging for the lips | The Meem sound is merged with the following Meem and has a ghunnah. | م |
| Meem sakenah | Ikhfa Shafawi (اخفاء شفوي) | Hidden for the lips | The Meem is suppressed (the lips are not fully closed) and has a ghunnah. | ب |
| Meem sakenah | Idhhar Shafawi (اظهار شفوي) | Clarity for the lips | The Meem is pronounced clearly with no special rules. | Any letter other than م and ب |
| Madd | Madd Muttasil (مد متصل) | Connected stretch | The letter of madd is stretched for four to six counts. It occurs within one word, when a letter of al-madd is followed by hamzah. | hamzah (ء) |
| Madd | Madd Munfasil (مد منفصل) | Detached stretch | The letter of madd is stretched for three to five counts. It occurs when a word ends with one of the three letters of al-madd and the next word begins with hamzah. | hamzah (ء) |

a. The example Ayat are written using the program “Mushaf AlMadinah Annabawiya Publishing Software”, available for free on http://nashr.qurancomplex.gov.sa/site/
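The letter groups in Table 1 can be read as a lookup keyed on the letter that follows the Noon or Meem. The following Python sketch illustrates that reading only; it is not the QurTaj JAPE grammar. The function names and the `across_words` flag are hypothetical, and the sketch assumes bare letters, with no normalization of hamzah carrier forms.

```python
from typing import Optional

# Letter groups as listed in Table 1.
IDHHAR = set("ء ه ع ح غ خ".split())
IDHGAM_WITHOUT_GHUNNAH = set("ل ر".split())
IDHGAM_WITH_GHUNNAH = set("و م ن ي".split())
IQLAB = {"ب"}
IKHFA = set("ص ذ ث ك ج ش ق س د ط ز ف ت ض ظ".split())


def noon_rule(next_letter: str, across_words: bool = True) -> Optional[str]:
    """Classify Noon sakenah or Tanween by the following letter.

    Idhgam is only reported across a word boundary, as stated in Table 1.
    Returns None when no rule from the table applies.
    """
    if next_letter in IDHHAR:
        return "Idhhar"
    if next_letter in IQLAB:
        return "Iqlab"
    if next_letter in IKHFA:
        return "Ikhfa"
    if across_words and next_letter in IDHGAM_WITHOUT_GHUNNAH:
        return "Idhgam without ghunnah"
    if across_words and next_letter in IDHGAM_WITH_GHUNNAH:
        return "Idhgam with ghunnah"
    return None


def meem_rule(next_letter: str) -> str:
    """Classify Meem sakenah by the following letter."""
    if next_letter == "م":
        return "Idhgam Shafawi"
    if next_letter == "ب":
        return "Ikhfa Shafawi"
    return "Idhhar Shafawi"


print(noon_rule("ق"), "|", meem_rule("ت"))
```

The five Noon groups are disjoint and together cover 28 letters, so every following letter maps to exactly one group; the only case left open by this sketch is an Idhgam letter inside the same word.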

A. QurTaj Approach

As illustrated in Fig. 1, the QurTaj architecture consists of two tokenizers and a JAPE processing resource, as described below.

Figure 1. QurTaj Application Pipeline.

Two levels of tokenization

The Word and Letter tokenizers, explained below, split text into simple tokens such as diacritics (Harakat), words and letters. The default tokenizers had to be modified to recognize text that contains many symbols the original Arabic tokenizer did not recognize. Our proposed tokenizers are:

o Word tokenizer

This level of tokenization differs from the original tokenizer in that it includes the Harakat, so that a whole word together with its Harakat is identified as one token. It also recognizes many types of words and symbols, such as (ۗ,ۗ,ۗ ), which helps to skip a word or symbol that falls between the parts of a Tajweed rule.

o Letter tokenizer

The second level identifies individual letters and Harakat, so that a rule can search for a certain letter or Haraka and highlight just the matching pattern rather than the whole word or sentence. The main reason for adding this level of tokenization is to make clear which part of the word satisfies the pattern to which the Tajweed rule applies. Both the Word and Letter tokenizers are used in the JAPE rules.

B. JAPE Tajweed Rules

The Quran Tajweed JAPE grammar was implemented by creating a new JAPE transducer (Quran Rules), the module that runs JAPE grammars. JAPE can perform tasks such as pattern matching, annotating parts of words and named entity recognition. By default, GATE supports specific unified rules for recognition. The Quran JAPE rules that we implemented perform Tajweed entity recognition for the Arabic Quran text for the four types of Tajweed (Noon sakenah, Meem sakenah, Tanween and Madd), using different patterns to annotate the different rules. Below are sample patterns that annotate the Meem sakenah and Noon sakenah rules.

Pattern 1: Between two words.

    Rule: MeemPattern
    Priority: 100
    (
      (
        ( {letter.string == "ۗ"} | {letter.string != "ۗ"} |
          {letter.string != "ۗ"} | {letter.string != "ۗ"} )
        ( {letter.string == "م"} )
        ( {letter.string == "ۗ"} )
      ):one
      ( {SpaceToken.kind == space} )
      ( {Token.kind == word, Token.string == "ۗ"} |
        {Token.kind == word, Token.string == "ۗ"} |
        {Token.kind == word, Token.string == "ۗ"} |
        {Token.kind == word, Token.string == "ۗ"} )?
      ( {SpaceToken.kind == space} )?
      ( {letter} ):two
    ):A7kam-Almeem
    -->
    {
      // the annotation of the specific Tajweed rule goes here
    }

This pattern is intended to find Meem sakenah between two words separated by a space. A match is annotated as Meem sakenah and, more specifically, as Idhhar shafawi.

Pattern 2: Within the same word.

    Rule: NoonPattern
    Priority: 100
    (
      ( {letter.string == "ن"} )
      ( {letter.string == "ۗ"} )?
      ( {letter.string == "ف"} | {letter.string == "ث"} |
        {letter.string == "ج"} | {letter.string == "د"} |
        {letter.string == "ت"} | {letter.string == "ذ"} |
        {letter.string == "ز"} | {letter.string == "س"} |
        {letter.string == "ش"} | {letter.string == "ص"} |
        {letter.string == "ض"} | {letter.string == "ط"} |
        {letter.string == "ظ"} | {letter.string == "ق"} |
        {letter.string == "ك"} )
    ):A7kam-Alnoon
    -->
    {
      // the annotation of the specific Tajweed rule goes here
    }

This pattern is intended to find Noon sakenah within the same word. A match is annotated as Noon sakenah, specifically “Ikhfa”.

Each Quran Tajweed rule has a different pattern, and some are simpler than others. Tajweed letters occur in different places in Quran words: at the beginning, in the middle or at the end of a word. In addition, some letters are separated from each other by a space or by special characters. Fig. 2 shows the annotated words of Alshura Surah (سورة الشورى); each color represents one of the four Tajweed types, along with a detailed description of each annotated pattern.
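The two sample patterns can be approximated outside GATE with regular expressions. The Python sketch below is an analogue under stated assumptions, not a translation of the authors' grammar: it assumes the diacritic tested in the rules is the sukun (U+0652), since the marks are not legible in the printed rules, and it leaves out the optional symbol token between the two words. The names `annotate`, `NOON_IKHFA` and `MEEM_SAKENAH` are hypothetical.

```python
import re
from typing import List, Tuple

NOON, MEEM, BA = "\u0646", "\u0645", "\u0628"
SUKUN = "\u0652"  # assumed diacritic; the marks in the printed rules are not legible
IKHFA_LETTERS = "".join("ف ث ج د ت ذ ز س ش ص ض ط ظ ق ك".split())

# Analogue of Pattern 2: Noon, optional sukun, then an Ikhfa letter in the same word.
NOON_IKHFA = re.compile(f"{NOON}{SUKUN}?(?=[{IKHFA_LETTERS}])")

# Analogue of Pattern 1: Meem with sukun at the end of a word, one space,
# then the first letter of the next word.
MEEM_SAKENAH = re.compile(f"{MEEM}{SUKUN} (?P<next>\\S)")


def annotate(text: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, rule) spans covering the Noon or Meem and its sukun."""
    spans = []
    for m in NOON_IKHFA.finditer(text):
        spans.append((m.start(), m.end(), "Ikhfa"))
    for m in MEEM_SAKENAH.finditer(text):
        nxt = m.group("next")
        if nxt == MEEM:
            rule = "Idhgam Shafawi"
        elif nxt == BA:
            rule = "Ikhfa Shafawi"
        else:
            rule = "Idhhar Shafawi"
        spans.append((m.start(), m.start() + 2, rule))
    return sorted(spans)


# Illustrative inputs written with escapes.
same_word = "\u0623\u064E\u0646\u0652\u062A\u064E"  # Noon+sukun followed by Ta
two_words = "\u0644\u064E\u0647\u064F\u0645\u0652 \u0641\u0650\u064A\u0647\u064E\u0627"  # Meem+sukun, space, Fa
print(annotate(same_word))
print(annotate(two_words))
```

As in the JAPE rules, only the letters that carry the rule are annotated rather than the whole word, which mirrors the purpose of the letter level tokenizer.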

IV. RESULTS AND EVALUATION

To evaluate QurTaj, a gold standard [18] based evaluation method is used to calculate precision and recall, the well-known information extraction evaluation metrics [19], which measure the accuracy and coverage of the Tajweed letters automatically extracted from Quran text. These metrics are typically applied to evaluate automatically extracted information against manual extraction [11]. Recall measures the proportion of the Tajweed rules in the corpus that the system identifies correctly; for example, if 10 rules are identified manually in the corpus and the system automatically identifies 7 of these 10, then recall is 70%. An ideal benchmark for calculating recall uses either a gold standard Quran corpus or a domain expert who identifies the Tajweed rules manually in the input sources upfront (a pre-created annotated Quran text corpus). The QurTaj system extracts the specified Tajweed rules from the Quranic corpus. To evaluate the QurTaj JAPE rules, we ran an experiment that compares the performance of the QurTaj system with a gold standard annotated version taken from a project by King Saud University [18].
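The two metrics can be computed directly from the sets of annotations. The Python sketch below is a minimal illustration that assumes each annotation is represented as a hypothetical (start, end, rule) tuple and that an annotation counts as correct only when it matches a gold standard annotation exactly; it reproduces the 7 out of 10 example above.

```python
from typing import Set, Tuple

Annotation = Tuple[int, int, str]  # hypothetical (start offset, end offset, rule name)


def precision_recall(gold: Set[Annotation], retrieved: Set[Annotation]) -> Tuple[float, float]:
    """Precision = correct / retrieved, recall = correct / gold (exact-match annotations)."""
    correct = len(gold & retrieved)
    precision = correct / len(retrieved) if retrieved else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Ten manually identified rules, of which the system finds seven and nothing else.
gold = {(i, i + 1, "Ikhfa") for i in range(10)}
retrieved = {(i, i + 1, "Ikhfa") for i in range(7)}
precision, recall = precision_recall(gold, retrieved)
print(f"precision={precision:.0%} recall={recall:.0%}")
```

In this example every retrieved annotation is correct, so precision is 100% while recall is 70%; both reach 100% only when the retrieved set equals the gold standard set.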

We used Alshura Surah to test the performance with several types of query pattern: within the same word and between two words. Table 2 gives the precision and recall results. Precision is the percentage of the annotations retrieved by the system that are correct. Recall is the percentage of the rules in the gold standard that the system detects correctly. The results in Table 2 show that the system achieves 100% accuracy and coverage. Due to differences in Quran text formatting, the corpus used for evaluating the QurTaj system was in Imla'ei script format; text written in the Uthmani script format was therefore ignored.

TABLE 2. COMPARISON BETWEEN GOLD STANDARD AND QURTAJ

| Tajweed Rule | Type | Gold standard | QurTaj retrieved | Correct | Precision (%) | Recall (%) |
|---|---|---|---|---|---|---|
| Tanween | Idhhar | 15 | 15 | 15 | 100 | 100 |
| Tanween | Ikhfa | 16 | 16 | 16 | 100 | 100 |
| Tanween | Iqlab | 7 | 7 | 7 | 100 | 100 |
| Tanween | Idhgam | 32 | 32 | 32 | 100 | 100 |
| Meem sakenah | Idhhar Shafawi | 53 | 53 | 53 | 100 | 100 |
| Meem sakenah | Ikhfa Shafawi | 2 | 2 | 2 | 100 | 100 |
| Meem sakenah | Idhgam Shafawi | 16 | 16 | 16 | 100 | 100 |
| Noon sakenah | Idhhar | 19 | 19 | 19 | 100 | 100 |
| Noon sakenah | Ikhfa | 45 | 45 | 45 | 100 | 100 |
| Noon sakenah | Iqlab | 6 | 6 | 6 | 100 | 100 |
| Noon sakenah | Idhgam | 28 | 28 | 28 | 100 | 100 |
| Madd | Madd Muttasil | 33 | 33 | 33 | 100 | 100 |
| Madd | Madd Munfasil | 28 | 28 | 28 | 100 | 100 |

Figure 2. Snapshot of the QurTaj result

V. CONCLUSION AND FUTURE WORK

This research presents QurTaj, a Quran based information extraction system that automatically extracts and annotates Tajweed rules from Quran text. The novelty of this work lies in: (i) developing fine-grained Tajweed extraction rules that identify Tajweed letters in un-annotated Quran text; (ii) using a rule based information extraction technique to achieve high extraction accuracy; (iii) employing GATE, an open source, flexible natural language processing environment; and (iv) customizing an existing NLP environment to process diacritized Arabic text. The developed Tajweed extraction system implements the main Tajweed rules, which are Idhhar, Idhgam, Iqlab, Ikhfa, Idhhar shafawi, Idhgam shafawi, Ikhfa shafawi, Madd Muttasil and Madd Munfasil. The application applies a set of natural language processing techniques and evaluates the extraction by comparing the automatically extracted Tajweed rules with a gold standard annotated Quran text; both precision and recall reached 100% on the evaluated text. Because Quran text is the main reference for Muslim scholars and scientists, IE from Quran text is considered an important research area. This research is a step that gives direction to further exploration of rule based IE techniques on Quran text, to assist in developing applications for both Muslim scholars and learners.

REFERENCES

[1] “Tajweed.” [Online]. Available: http://www.readwithtajweed.com/tajweed_Intro.htm. [Accessed: 14-Jul-2013].
[2] Qari Saleem Gaibie, "A Guide For the Reciter", Academy of Arabic and Islamic Sciences, 2006. Available: http://duai.co.za/Site/wpcontent/uploads/2012/05/murshid1.pdf
[3] “Learn Tajweed Online.” [Online]. Available: http://www.essentialilm.com/tajweed.html. [Accessed: 14-Jul-2013].
[4] “Tajweed Study.” [Online]. Available: http://tajweedstudy.com/. [Accessed: 14-Jul-2013].
[5] P. Buitelaar and P. Cimiano (eds), Ontology Learning and Population: Bridging the Gap between Text and Knowledge, Amsterdam, The Netherlands, IOS Press, 2008.
[6] P. Cimiano, Ontology Learning and Population from Text: Algorithms, Evaluation and Applications, New York, Springer, 2007.
[7] E. Giovannetti, S. Marchi and S. Montemagni, "Combining Statistical Techniques and Lexico-Syntactic Patterns for Semantic Relations Extraction from Text", Proceedings of the 5th Workshop on Semantic Web Applications and Perspectives (SWAP), Rome, Italy, 15-17 December 2008, pp. 10.
[8] I.A. Alsughayeir and Y.O.M. Elhadj, "Computerized Quran Products: State-Of-Art" (in Arabic), Proceedings of STCEX'06, Riyadh, Saudi Arabia, December 2-6, 2006.
[9] Y.O.M. Elhadj, M. Alghamdi, M. Elkanhal and A. Alansary, "Toward an Automatic Corrector of Quranic Recitation Integrated in an Environment for Self Learning of the Holy Quran", 2012.
[10] S. Zaidi, M. Laskri and A. Abdelali, "Arabic collocations extraction using gate", IEEE ICMWI'10, Algiers, Algeria, 2010.
[11] K. Dukes, E. Atwell and A.B. Sharaf, "Guidelines for the Syntactic Annotation Quranic Arabic Treebank", Seventh International Conference on Language Resources and Evaluation (LREC 2010), Valletta, Malta, 2010.
[12] H. Cunningham, D. Maynard, K. Bontcheva and V. Tablan, "GATE: A Framework and Graphical Development Environment for Robust NLP Tools and Applications", Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), Philadelphia, July 2002.
[13] N. Ghneim, W. Safi and M. Al Said Ali, "Building a Framework for Arabic Ontology Learning", Knowledge Management and Innovation in Advancing Economies: Analyses 1730 & Solutions, 200
[14] S. Zaidi, M. Laskri and A. Abdelali, "Arabic collocations extraction using Gate", Machine and Web Intelligence (ICMWI), 2010 International Conference on, pp. 473-475, 3-5 Oct. 2010.
[15] A. Elsebai, F. Meziane and Belkredim, "A Rule Based Persons Names Arabic Extraction System", Proceedings of the IBIMA, 4-6 January 2009, Cairo, Egypt.
[16] A. Elsebai and F. Meziane, "Extracting person names from Arabic newspapers", International Conference on Innovations in Information Technology, Abu Dhabi, UAE, 2011.
[17] H. Al-Jumaily, P. Martínez, J. Martínez-Fernández and E. Goot, "A real time Named Entity Recognition system for Arabic text mining", Language Resources and Evaluation Journal, Volume 46, Issue 4, pp. 543-563, 2012.
[18] "القرآن الكريم - مشروع المصحف الإلكتروني بجامعة الملك سعود" [Online]. Available: http://quran.ksu.edu.sa [Accessed: 15-Jul-2013].
[19] C.J. Van Rijsbergen, Information Retrieval (2nd edn.), London, Butterworth, 1979.