Rules and Natural Language Pattern in Extracting Quranic Knowledge

Saidah Saad, S Azman Mohd Noah, Naomie Salim, Hakim Zainal

Fac of Info.Sc. and Tech, UKM Bangi, Selangor saidah, [email protected]

Fac.of Computing, UTM, Skudai, Johor [email protected],

Fac. of Islamic Study, UKM Bangi, Selangor [email protected]

Abstract— The extraction of Quranic knowledge is a challenging task because the Quran differs from human literature. The unique arrangement of its text, which is linguistically rich and carries multilayered meanings, makes direct access to its meaning difficult without other resources such as the Hadith. This paper therefore describes the extraction of the first layer, using natural language patterns and natural language processing techniques to extract knowledge from English translations of the Quran. Evaluation uses true positive, false positive and false negative counts, and encouraging results have been obtained.

Keywords-component; ontology learning, pattern extraction, concept, relation, rules

I. INTRODUCTION

Automatic information extraction from unstructured sources makes it possible to query, organize and analyze data in order to generate a clean, formal semantic database. The field of information extraction has its origins in natural language processing, where the main impetus came from attempts to extract named entities such as the names of people, places and organizations, and expressions of time. As society finds it easier to access both structured and unstructured data, new applications of structured extraction have emerged. At present, there is interest in converting the knowledge in scientific publications into structured, semantic databases, and in using the Internet for semantic and structured fact-finding queries. As a result, different research communities bring techniques from machine learning, databases, information retrieval and computational linguistics to various aspects of the information extraction problem. Since computational techniques for understanding and extracting Quranic knowledge are limited, we propose using ontology learning techniques to extract this knowledge.

In the past fifteen years, scholars have authored books highlighting linguistic, stylistic, scientific and rhetorical findings, and many other hidden discoveries from the Quran in various fields. During this period these scholars relied on their personal knowledge of and familiarity with the Quran, since no computational tools were available. Even though computational analysis of the Quran would greatly help Muslims to understand it, little such analysis has been performed. The Quran needs to be compiled in special ways, involving linguistic structures that can reveal different meanings across the ages [1], since the Quran is also known as the "living knowledge" (it contains knowledge of the past, the present and the future).

Techniques from the semantic technology field now offer another approach to organizing the knowledge of the Quran into more meaningful information. An ontology is a good candidate for representing the complex knowledge of the Quran and for supporting the expandability of that knowledge. Building an ontology for Quranic content provides a shared conceptualization of a domain that can be communicated between people and systems. It constitutes a specific vocabulary used to describe a particular model of the world, together with a set of explicit assumptions about the intended meaning of the words in the vocabulary. Both the vocabulary and the assumptions help people and systems to reach common conclusions when communicating.

The aim of this paper is to discover and propose the rules and patterns that can be used to extract Quranic knowledge from English translation texts. We hope this research will allow other researchers interested in the Quran to get as close as possible to its intended meanings through the creation of an ontological hierarchy and relations. Since the users of the Quranic ontology will mostly be students and researchers of the Quran, this research is concerned with the creation of the ontology as a whole, but concentrates on one subject, the Salat-related verses.

II. STATE OF THE ART

The Quran consists of 77784 word tokens and 19287 word types (distinct words) [1], which are sequenced in chapters and verses. Muslims believe that the original was spoken in Classical Arabic and then captured faithfully in a sophisticated transcription system. Access to the Quran has traditionally been through the text, and many Muslims learn to memorize and recite it verbatim. Access to the underlying knowledge, wisdom and law requires interpretation and inference. Much knowledge is encoded via the subtle use of words, grammar, allusions, links and cross-references. For over a thousand years, scholars have sought to extract knowledge and laws from the text, and have built a much larger tafseer, or corpus of analyses, interpretations and inference chains [2], [3], [4]. Over the last fifteen years scholars have authored many books, in many languages, that highlight stylistic, scientific and rhetorical findings and many other hidden discoveries from the Holy Quran in various fields. All of this knowledge rests on their background knowledge of the Quran and their understanding of its content. The Quran is characterized by its ability to hold vast information in unstructured and scattered, yet conceptually related, verses.

Computer Science and Artificial Intelligence present the opportunity to re-analyze the text data, to extract and capture the underlying knowledge through proper knowledge representation and reasoning, and to enable automated and objective inference and querying [5]. Most computer systems use explicit, symbolic representations of knowledge about a particular domain, which in principle can be reused throughout the system. Computers are basically symbol-manipulating machines, and they need clear instructions on how to manipulate symbols in a meaningful way. When knowledge is represented symbolically so that computers can process it, the question arises of which symbols should be used and what they stand for. An ontology is an appropriate knowledge representation for this purpose: it states what is important in the domain in question and defines the relationships between those things. In the context of knowledge-based systems, the underlying ontology tells us which symbols are required and how they should be interpreted. At the logical level, the interpretation can then be constrained on the basis of the ontology by appropriately axiomatizing the symbols [6]. The most promising technique in ontology creation is the combination of a rule-based approach using NLP with a machine learning approach [7].

Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences December 22 – 25, 2013, Madinah, Saudi Arabia
An ontology ideally consists of classes (concepts), relations (subclasses of classes, either taxonomic or non-taxonomic), slots (features, attributes, roles or properties), value restrictions (facet, cardinality, type, scope) and instances (individuals, objects or entities) [8], [6]. Ontologies can also be classified by their structure and by the subject of the conceptualization, i.e. their content. Sowa [9] divides them into two types. Type 1 is a terminological (lexical) ontology: an ontology whose concepts and relations are not fully specified by axioms and definitions that determine the necessary and sufficient conditions of their use. The concepts may be partially specified by relations such as subtype/supertype or part/whole, which determine the relative positions of the concepts with respect to one another but do not completely define them. Type 2 is an axiomatized (formal) ontology: a terminological ontology whose concepts and relations have associated axioms and definitions that are stated in logic, or in some computer-oriented language that could be automatically translated into logic.

Guarino's [10] classification, on the other hand, is:
(i) top-level ontology, which describes very general concepts that are independent of a particular problem or domain;
(ii) domain and task ontologies, which describe the vocabulary (conceptualizations) related to a generic domain or task by specializing the terms introduced in the top-level ontology;
(iii) application ontology, which contains all the definitions needed to model the knowledge required for a particular application;
and finally, representation ontology, proposed by [11], which describes the conceptualizations behind knowledge representation formalisms.

At present, the availability of computational techniques for understanding Quranic knowledge is limited. [12], for example, uses named entity extraction from Quranic text to form an ontology for the Quran, which can be classified as type 1 and (ii). [13] presents the design and implementation of an ontological model and the results of applying it to the "Time nouns" vocabulary of the Quran (type 2 and (i), (ii) & (iii)). Qurany Explorer [14] is a corpus of the Quran augmented with an ontology or index of key concepts, compiled manually from the 'Mushaf Al Tajweed' (Al Tajweed is another name for the Quran). It contains a comprehensive hierarchical index or ontology of nearly 1200 concepts in the Quran (type 1 and (ii)). [15] applies automatic classification techniques to the Quranic verses of certain surahs (chapters), following the classifications made by Islamic scholars (type 1 and (ii)). However, a basic ontology of Islamic knowledge that classifies the main topics in the Quran was built decades ago. All topics were classified manually by domain experts based on their understanding of the context of the Quran. This knowledge can serve as general knowledge to guide the construction of a computer-automated ontology of Islamic knowledge.

III. METHOD

In constructing a relatively complete ontology of Islamic knowledge, semantic technologies and natural language processing techniques are combined to perform (semi-)automatic ontology creation from the Quranic translation text. Tentatively, the construction is based on three layers. The first layer is the meta concept of the Quran (super concept), which is based on the Quranic index texts introduced by [1]. The second layer is a domain ontology that focuses on the ontology of Salah (detailed in [16]); it is built by a semi-automatic process involving the ontology engineer and domain experts, and it forms the lower layer. The third and last layer is the ontology population, which is used to enrich and bridge the ontology built in the first and second layers (shown as dotted lines in Figure 1). Figure 1 shows the levels of ontology construction.

Figure 1. The Layers of Quranic Ontology


The steps for constructing the ontology population from the English translation text are shown in Figure 2. The first step is filtering: all information in brackets is deleted (temporarily, because it might disrupt parsing of the text in focus), all capitalized pronouns and titles such as 'We', 'Us', 'He', 'Lord' and 'My' are converted into Allah, and 'O you who believe' is converted into Mukminun. At this stage, the focus is solely on the translation text, without looking at other meanings and explanations of the original text. Filtering is followed by the NLP processes: part-of-speech tagging, syntactic parsing, and extraction of the terms that match the identified patterns. In the next step, the focus moves to the terms and the information in brackets, in order to extract more information and to resolve ambiguity in some texts. This step examines the synonyms, definitions and explanations of the terms.
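The filtering step can be sketched with a few regular expressions. This is only an illustration under stated assumptions: the function name `filter_verse` is hypothetical, the list of capitalized forms is the one quoted in the text, and the sample sentence is invented for the example rather than taken from a translation. Note that a purely case-based rule also rewrites an ordinary sentence-initial "He" or "We", which a real pipeline would need to handle.

```python
import re

# Capitalised forms that the filtering step maps to "Allah" (list quoted in the text).
DIVINE_FORMS = ("We", "Us", "He", "Lord", "My")
DIVINE_RE = re.compile(r"\b(?:" + "|".join(DIVINE_FORMS) + r")\b")


def filter_verse(text):
    """Apply the filtering step to one English translation sentence."""
    # 1. Drop bracketed explanations such as "(Iqamat-as-Salat)" or "[...]".
    text = re.sub(r"\([^()]*\)|\[[^\[\]]*\]", "", text)
    # 2. Replace the address formula with the concept name.
    text = re.sub(r"\bO you who believe\b", "Mukminun", text)
    # 3. Replace capitalised pronouns and titles (case-sensitive on purpose).
    text = DIVINE_RE.sub("Allah", text)
    # 4. Tidy the whitespace left behind by the deletions.
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)
    return re.sub(r"\s{2,}", " ", text).strip()


# Invented sample sentence, used only to exercise the rules.
sample = "O you who believe! Perform As-Salat (Iqamat-as-Salat), and We will reward you."
print(filter_verse(sample))
```

The deleted bracket contents would be kept aside in practice, since the later step of the process returns to them for synonyms and explanations.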

Figure 2. The Extraction Process of Ontology Population

In order to detect the taxonomy relation, we consider the patterns defined by [17] and those based on noun phrases introduced by [18]. Several tests were conducted to discover which patterns could be obtained. We then propose a way of combining both types of pattern to improve performance, as well as a further pattern that can be used for the taxonomy relation. Most researchers today apply the Hearst patterns [17] to learn hyponyms from a corpus. [19] used the Hearst patterns in a comparison of pattern extraction on Yusof Ali's Quranic translation text. Her research is based on the following Hearst patterns:

H1: < CONCEPT >s such as < INSTANCE >
H2: such < CONCEPT >s as < INSTANCE >
H3: < CONCEPT >s, (especially | including) < INSTANCE >
H4: < INSTANCE > (and | or ) other < CONCEPT >

From her observation and research, the Hearst patterns [17] failed to find matches in the salat verses of the Quranic translation text for the H1, H2 and H4 patterns and found only 1 match for the H3 pattern. She then used the enhanced approach of [6] with the following patterns:

C1: the < INSTANCE > < CONCEPT >
C2: the < CONCEPT > < INSTANCE >
C3: < INSTANCE >, (a|another) < CONCEPT >
C4: < INSTANCE > is < CONCEPT >

From her experiments, she found 6 matches for the C1 and C2 patterns, which correspond to compound nouns. For our experiment, the same extraction is performed based on the above patterns, with small modifications to the patterns themselves. The experiment yielded the same results for H1, H2, H3 and H4 as reported in [19]. For apposition (C3, which uses 'a' as a connector) and copula (C4), the rules have been extended with further elements that still fit the definition of those terms. The copula rule in [6] focuses only on the keyword 'is', while this experiment uses the following rule:

\w +{BE(D)}\w ==> is/are/was/were (verb-to-be)

For apposition, Cimiano's rule uses the keyword 'a'; in this experiment it has been extended with the following pattern:

dt NN dt NN (dt = determiner / 'as a')

Based on the definitions in [20] and [21] and the description of Cimiano's C1 and C2 patterns [6], the pattern has been improved, drawing also on the definition of compound noun in [22]. A compound combines two or more nouns or adjectives to form a single noun, where N1 is a subconcept of N0, possibly with additional prepositional terms attached to the noun. It can be extracted with the following patterns:

i. The N1N0
ii. N0'sN1
iii. NP→JJ, NP0
iv. NP→NP0 (PP NP*)
v. head of noun, where N0 is part-of N1N0 / The N0 of N1

For example, "funeral prayer" is part of "pray".

Another pattern is the apposition style. The pattern that captures this intuition is:

APPOSITION: NP0 a / as a NP1

Apposition can also be derived from two elements, normally noun phrases, that are placed side by side, such as dt NP0 dt NP1 (dt = determiner). This indicates that: NP1 is-a NP0.

Copula is another pattern that can be used. A copula is a word that links the subject of a sentence with a predicate, where the predicate is made up of an adjective and a noun and the linking verb is typically "is". It is probably the most explicit way of expressing, through the verb 'to be', that a certain entity is an instance of a certain concept. The general form is:

COPULA: NP0 verb-to-be NP1

This indicates that: NP1 is-a NP0 (NP0 can be NP or Adv→NP; PP→NP).
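The compound noun, apposition and copula patterns can be illustrated over part-of-speech tagged tokens. The sketch below is not the authors' implementation: it assumes the input is already a list of (word, Penn Treebank tag) pairs, uses a deliberately small noun phrase chunker (optional determiner, adjectives, then nouns), and the names `chunk` and `match_patterns` are hypothetical. Each hit is reported as (pattern name, NP0, NP1), leaving the direction of the resulting is-a relation to the rule that consumes it.

```python
BE_FORMS = {"is", "are", "was", "were"}  # verb-to-be forms named in the copula rule


def chunk(tagged):
    """Group tokens into ("NP", words, has_determiner) or ("TOK", word, tag)."""
    out, i = [], 0
    while i < len(tagged):
        j = i
        has_dt = tagged[j][1] == "DT"
        if has_dt:
            j += 1
        start = j  # the determiner is not kept in the phrase text
        while j < len(tagged) and tagged[j][1].startswith("JJ"):
            j += 1
        k = j
        while k < len(tagged) and tagged[k][1].startswith("NN"):
            k += 1
        if k > j:  # at least one noun: emit a noun phrase
            out.append(("NP", [w for w, _ in tagged[start:k]], has_dt))
            i = k
        else:      # not a noun phrase: emit the single token
            out.append(("TOK", tagged[i][0], tagged[i][1]))
            i += 1
    return out


def match_patterns(tagged):
    """Return (pattern, NP0, NP1) hits for compound noun, apposition and copula."""
    chunks, hits = chunk(tagged), []
    for idx, c in enumerate(chunks):
        if c[0] != "NP":
            continue
        words = c[1]
        nxt = chunks[idx + 1] if idx + 1 < len(chunks) else None
        nxt2 = chunks[idx + 2] if idx + 2 < len(chunks) else None
        if len(words) > 1:  # compound: whole phrase paired with its last word
            hits.append(("compound", " ".join(words), words[-1]))
        if nxt and nxt[0] == "NP" and c[2] and nxt[2]:  # dt NP0 dt NP1
            hits.append(("apposition", " ".join(words), " ".join(nxt[1])))
        if nxt and nxt[0] == "TOK" and nxt2 and nxt2[0] == "NP":
            if nxt[1].lower() in BE_FORMS:               # NP0 verb-to-be NP1
                hits.append(("copula", " ".join(words), " ".join(nxt2[1])))
            elif nxt[1].lower() == "as" and nxt2[2]:     # NP0 as a NP1
                hits.append(("apposition", " ".join(words), " ".join(nxt2[1])))
    return hits


# Hand-tagged, invented examples (not parser output).
copula_example = [("Salat", "NNP"), ("is", "VBZ"), ("a", "DT"),
                  ("prescribed", "JJ"), ("duty", "NN")]
compound_example = [("the", "DT"), ("funeral", "NN"), ("prayer", "NN")]
print(match_patterns(copula_example))
print(match_patterns(compound_example))
```

Matching whole chunks rather than single nouns keeps the modifiers attached to each side of the relation, which is the point of extending the keyword-based C3 and C4 rules.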


From observation of the gold standard defined by the domain experts, we found patterns that can be used to build the taxonomy relation. These patterns identify part-of relationships between concepts that are referred to by two terms in the text. The phrase representing the concept is denoted P0 and the phrases representing its subconcepts are denoted Pn (where n = 1 ….n); a phrase can be either a VP or an NP. NP0 represents a superconcept. A shallow parsing technique (Stanford Parser), based on matching regular expressions over part-of-speech tags, is used to identify these phrases.

Pattern 1 : NP0 [|( i.e P1,P2……Pn ]|) → P1, P2……Pn PartOf NP0

Square brackets [ ] are used in the text to make a sentence more complete or to elaborate on a concept. If the word 'i.e' appears after the opening square bracket, what follows gives examples or components of the NP mentioned before: if NP0 [P1]* then P1, P2…Pn PartOf NP0.

Pattern 2 : NP0 {:,!} (P1 , and)* ……, and Pn

where P* PartOf NP0. The symbol ':' or '!' after the NP signals a 'part of' or 'component of' relation to the NP. This extraction is based on an analysis of the Quranic texts and their use of the exclamation mark. The comma and conjunction operators delimit the statement: the delimiter is based on repeated patterns such as ',' (comma), 'and', the verb in the phrase, and combinations of these. Many Quranic verses follow this pattern. These patterns are explained in detail in [16]. Table I summarizes the pattern extraction for 'is-a' and 'part-of' relations.

TABLE I. THE SUMMARIES OF PATTERN EXTRACTION

Approach     Code                               Pattern
Hearst       H4                                 < INSTANCE > (and | or ) other < CONCEPT >
Cimiano      C1 & C2                            the < INSTANCE > < CONCEPT >; the < CONCEPT > < INSTANCE >
Cimiano      C3                                 < INSTANCE >, (a|another) < CONCEPT >
Cimiano      C4                                 < INSTANCE > is < CONCEPT >
Cimiano++    Compound Noun (C1 & C2 modify)     The N1N0; N0'sN1; NP→JJ, NP0; NP→NP0 (PP NP*); head of noun where N0 is part-of N1N0 / The N0 of N1
Cimiano++    Apposition (C3 modify)             dt NP0 dt NP1
Cimiano++    Copula (C4 modify)                 NP0 verb-to-be NP1
QPattern     Pattern 1                          NP0 [|( i.e P1,P2……Pn ]|) → P1, P2……Pn PartOf NP0
QPattern     Pattern 2                          NP0 {:,!} (P1 , and)* ……, and Pn

IV. RESULT

To compare extraction performance, the output of the current approaches and of their improvements was compared manually with the gold standard. The approaches are classified following [6], as described and defined in Table I. The Hearst approach focuses on the H1, H2, H3 and H4 patterns. The Cimiano pattern [6] is based on C1, C2, C3 and C4. Cimiano++ is an extended pattern based on [6], modified to increase the number of extractions within the described definition. QPattern is a new pattern discovered from the Quranic translation text, which is suitable as a taxonomy relation. The nature of the taxonomic relation for each pattern is reported in Table I. Tables II and III show that the performance of the extracted taxonomy is quite good for each pattern except the Hearst pattern. This finding is in line with [6], who mentions that this pattern rarely appears and that most of the keywords related through an 'is-a' relation do not appear in Hearst-style patterns.

TABLE II. RESULTS OF TAXONOMY EXTRACTION BASED ON CURRENT APPROACHES AND IMPROVEMENTS
(codes refer to the pattern styles listed in Table I)

Approach     Code                               True +ve   False +ve   False -ve
Hearst       H4                                 1          0           0
Cimiano      C1 & C2                            51         13          4
Cimiano      C3                                 3          1           1
Cimiano      C4                                 1          0           0
Cimiano++    Compound Noun (C1 & C2 modify)     34         4           0
Cimiano++    Apposition (C3 modify)             2          1           1
Cimiano++    Copula (C4 modify)                 20         5           12
QPattern     Pattern 1, Pattern 2               18         2           3

TABLE III. PERCENTAGE OF TAXONOMY EXTRACTION BASED ON CURRENT APPROACHES AND IMPROVEMENTS

Total Number of Taxonomic Relation: 130

                  Hearst      Cimiano       Cimiano++     QPattern
True Positive     1 (0.8%)    55 (42.3%)    56 (43.1%)    18 (13.8%)
False Positive    0 (0.0%)    14 (10.8%)    10 (7.7%)     2 (1.5%)
False Negative    0 (0.0%)    5 (3.8%)      13 (10.0%)    3 (2.3%)
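As a rough illustration of the two QPattern rules from Section III (Pattern 1, a bracketed "i.e" list, and Pattern 2, a list introduced by ':' or '!'), the sketch below matches them on raw text. It is an approximation under stated assumptions: the paper identifies NP0 with a parser, whereas this sketch simply takes the single word in front of the bracket or punctuation mark; the name `part_of` and both sample sentences are invented for the example.

```python
import re

# Delimiters between P1 ... Pn: a comma (optionally followed by "and") or a bare "and".
SPLIT = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")

# Pattern 1: NP0 [ i.e P1, P2, ... Pn ]   (round brackets are accepted as well)
PATTERN1 = re.compile(
    r"(?P<np0>[A-Za-z][\w'-]*)\s*[\[(]\s*i\.e\.?,?\s*(?P<parts>[^\])]+)[\])]"
)
# Pattern 2: NP0 followed by ':' or '!' and a comma/'and' separated list
PATTERN2 = re.compile(r"(?P<np0>[A-Za-z][\w'-]*)\s*[:!]\s*(?P<parts>[^.:!]+)")


def part_of(text):
    """Return (part, whole) pairs found by Pattern 1 and Pattern 2."""
    pairs = []
    for pattern in (PATTERN1, PATTERN2):
        for match in pattern.finditer(text):
            for part in SPLIT.split(match.group("parts")):
                part = part.strip()
                if part:
                    pairs.append((part, match.group("np0")))
    return pairs


# Invented sample sentences, used only to exercise the two rules.
print(part_of("the pillars [i.e. faith, prayer, and charity]"))
print(part_of("Believers: those who pray, those who give alms, and those who fast."))
```

Replacing the single-word stand-in for NP0 with a noun phrase taken from the parse tree would bring the sketch closer to the description in the text.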


A machine-readable dictionary is the best resource for applying the Hearst pattern [6]. [23] also mentions that Hearst's pattern is rare and that its coverage is low even in huge collections of documents. Cimiano's pattern gives quite a good result, with true positives amounting to 42.3% of all taxonomy relations, yet there is still room for improvement. The compound noun patterns (C1 and C2) of [6] contribute the most to the extraction of the taxonomy or concept hierarchy. Cimiano++ and QPattern together account for 56.9% of true positives (43.1% and 13.8% respectively), which shows that the patterns need to be amended to follow the extraction process illustrated by the domain experts. This is also in line with the suggestion made by [6] that extraction should not focus only on the noun in a pattern but should also consider the closest noun phrase. For example, for "the main historic area of Ordino is a charming village", a purely lexico-syntactic pattern would derive is-a(Ordino, village), which is wrong; the right extraction is is-a(main historic area of Ordino, charming village). Improvement therefore requires considering the closest NP, which involves NP → DET NP; Compound Noun; N* (Adj N*)*; N* (PP N*)*, when extracting a pattern for the taxonomy or concept hierarchy.
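The percentages quoted above can be recomputed from the counts in Table III. The sketch below does so and, as an addition that the paper does not report, derives precision and recall from the same counts using the usual definitions, TP / (TP + FP) and TP / (TP + FN). The names `COUNTS` and `summarise` are hypothetical.

```python
TOTAL_RELATIONS = 130  # total number of taxonomic relations (Table III)

# (true positive, false positive, false negative) per approach, from Table III
COUNTS = {
    "Hearst": (1, 0, 0),
    "Cimiano": (55, 14, 5),
    "Cimiano++": (56, 10, 13),
    "QPattern": (18, 2, 3),
}


def summarise(tp, fp, fn, total=TOTAL_RELATIONS):
    """Return (share of all relations in %, precision, recall) for one approach."""
    share = 100.0 * tp / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return round(share, 1), round(precision, 3), round(recall, 3)


for name, (tp, fp, fn) in COUNTS.items():
    share, precision, recall = summarise(tp, fp, fn)
    print(f"{name:10s} share={share:5.1f}%  precision={precision:.3f}  recall={recall:.3f}")

# Combined share of true positives for the two improved approaches
combined = 100.0 * (COUNTS["Cimiano++"][0] + COUNTS["QPattern"][0]) / TOTAL_RELATIONS
print(f"Cimiano++ and QPattern combined: {combined:.1f}%")
```

The share column reproduces the true positive row of Table III; the precision and recall values are derived quantities and should be read with the small counts in mind.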
AlBirr(?a) ⟹ AsSabirin(?s)
AlBirr(?a) ⟹ muttaqun(?m)
AlBirr(?a) ⟹ person(?p) ⋀ fulfill(?p, covenant) ⋀ givewealthto(?p, kinsfolk) ⋀ givewealthto(?p, orphan) ⋀ givewealthto(?p, masakin) ⋀ givewealthto(?p, wayfarer) ⋀ givewealthto(?p, thosewhoask) ⋀ givewealthto(?p, setslavefree) ⋀ give(?p, zakat) ⋀ believe(?p, LastDay) ⋀ believe(?p, angel) ⋀ believe(?p, Book) ⋀ believe(?p, Allah) ⋀ perform(?p, salat) ⋀ believe(?p, prophet)
AlBirr(?a) ⟹ person(?p) ⋀ is-a(?p, piety) ⋀ is-a(?p, righteousness) ⋀ is-a(?p, eachEveryActObidientToAllah)
muttaqun(?m) ⟹ pious(?x)
AlKhashiun(?k) ⟹ person(?p) ⋀ is-a(?p, truebeliever)
AlKhashiun(?k) ⟹ person(?p) ⋀ obey(?p, AllahwithFullSubmission) ⋀ fear(?p, AllahPunishment) ⋀ believe(?p, AllahPromise) ⋀ believe(?p, AllahWarning)
mindful(?m) ⟹ person(?p) ⋀ accept(?p, advice)
mukminun(?n) ⟹ person(?p) ⋀ seekHelpin(?p, patient) ⋀ seekHelpin(?p, salat)

Figure 3. Description of Logic on Knowledge Representation for the Extraction of the Salat Verses

In mapping the extracted knowledge to the patterns defined above, some modifications are made. Some rules need to be added in order to fulfil the requirements of the concept definition. This concerns all candidate concepts that involve "those who" and "one who", which mean "person". If the phrase contains a verb, the verb becomes a relation between the concepts. Take the concept of AlBirr as an example. There are several meanings for this concept, among them "AlBirr is AsSabirin" and "AlBirr is Muttaqun". Another meaning is that AlBirr is a person who believes in Allah, the Last Day, the Angels, the Book and the Prophets, and gives his wealth, in spite of love for it, to the kinsfolk, to the orphans, to Al-Masakin (the poor), to the wayfarer, to those who ask, and to set slaves free, performs As-Salat (Iqamat-as-Salat), gives the Zakat, and fulfils his covenant when he makes it (AlBaqarah, 177). Figure 3 describes the rules generated to represent the knowledge extracted from the salat verses.

V. DISCUSSION

There is no doubt that weaknesses occur in the pattern extraction. Three key points were identified as obstacles to reaching 100% precision and recall.

i. Anaphora analysis (based on co-reference) is required to extract information about the context and relationships in the verses of the Quran. This is because, in the gold standard, the domain experts extract the information and patterns based on their knowledge of the Quran. For example, in the verse shown in Figure 4, the last part reads ".... they will have their reward with their Lord". To whom does 'they' refer? According to the experts, the answer is "Mukminun get reward"; "those who perform right actions get reward"; "those who perform prayer get reward"; "those who give alms get reward" (as defined in the gold standard by the domain experts). However, this co-reference analysis is not covered in this research.

Truly those who believe, and do deeds of righteousness, and perform As-Salat (Iqamat-as-Salat), and give Zakat, they will have their reward with their Lord. On them shall be no fear, nor shall they grieve.

Those who believe (Beriman) → get → Reward = no fear, no grieve
Those who do deeds of righteousness → get → Reward = no fear, no grieve
Those who perform As-Salat → get → Reward = no fear, no grieve
Those who give Zakat → get → Reward = no fear, no grieve

Figure 4. Surah Al-Baqarah verse 277
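The expansion that the domain experts perform by hand in Figure 4 can be sketched for this one sentence shape. The sketch is not part of the described system, which does not cover co-reference: it assumes the literal frame "those who ..., they ...", treats the first word of each conjoined clause as the verb (following the rule that a verb becomes the relation and "those who" means person), and takes the consequence get(?p, reward) as hand-supplied rather than extracted. The names `expand_those_who` and `to_rules` are hypothetical.

```python
import re

CLAUSE_SPLIT = re.compile(r"\s*,\s*(?:and\s+)?|\s+and\s+")

VERSE = ("Truly those who believe, and do deeds of righteousness, and perform "
         "As-Salat (Iqamat-as-Salat), and give Zakat, they will have their "
         "reward with their Lord.")


def expand_those_who(sentence):
    """Split the conjoined clauses after 'those who' into (verb, object) pairs."""
    sentence = re.sub(r"\s*\([^()]*\)", "", sentence)  # filtering: drop brackets
    match = re.search(r"\bthose who\s+(?P<clauses>.+?),\s*they\b",
                      sentence, flags=re.IGNORECASE)
    if match is None:
        return []
    facts = []
    for clause in CLAUSE_SPLIT.split(match.group("clauses")):
        words = clause.split(None, 1)
        if words:
            facts.append((words[0], words[1] if len(words) > 1 else None))
    return facts


def to_rules(facts, consequence="get(?p, reward)"):
    """Render each fact as a rule; the consequence is supplied by hand."""
    rules = []
    for verb, obj in facts:
        body = f"{verb}(?p, {obj})" if obj else f"{verb}(?p)"
        rules.append(f"person(?p) ⋀ {body} ⟹ {consequence}")
    return rules


for rule in to_rules(expand_those_who(VERSE)):
    print(rule)
```

A general solution would need real anaphora resolution to decide what "they" refers to, which is the gap identified in point i.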

ii. Another reason is that about 5% of the extracted answers are affected by errors during parsing. For example, for " .. and perform As-Salat (Iqamat-as-Salat).." the parsing result is

(NP (NP (NNP and) (NNP perform) (NNP As-Salat)) (PRN (-LRB- -LRB-) (NP (NNP Iqamat-as-Salat)) (-RRB- -RRB-)))

This parsing error inflates the number of matches for the 'compound noun' pattern. Other errors are due to problems in identifying the head of the compound noun. According to [21]:

"An interesting property of most compounds is that they are headed. This means that one of the words that make up the compound is syntactically dominant. In English the head is normally the item on the right hand of the compound. The syntactic properties of the head are passed on to the entire compound. Thus, . . . if we have a compound like easychair which is made up of the adjective easy and the noun chair, syntactically the entire word is a noun."

Most English compound nouns are noun phrases (nominal phrases) that include a noun modified by adjectives or attributive nouns. Normally the head is identified as shown in Table IV.

TABLE IV. HEAD OF THE COMPOUND NOUN

Modifier       Head         Example
noun           noun         Islamic calendar
adjective      noun         Right path
verb           noun         breakwater
preposition    noun         underworld
noun           adjective    Muhammad Peace
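The right-hand head rule quoted from [21] can be written down in a few lines. This is a minimal sketch with hypothetical names (`split_compound`, `is_a`), and it assumes a compound written as space-separated words. It also makes the limitation visible: for a noun-adjunct title such as "Messenger Muhammad" the default rule picks "Muhammad" as the head.

```python
def split_compound(compound):
    """Apply the default English rule: the rightmost word is the head."""
    words = compound.split()
    if len(words) < 2:
        raise ValueError("expected a compound of at least two words")
    return {"modifier": " ".join(words[:-1]), "head": words[-1]}


def is_a(compound):
    """Derive the taxonomic pair isA(compound, head) from the default rule."""
    return (compound, split_compound(compound)["head"])


print(is_a("Right path"))          # adjective + noun
print(is_a("funeral prayer"))      # noun + noun
print(is_a("Messenger Muhammad"))  # noun adjunct: the default rule picks the wrong head
```

Handling the noun-adjunct case would need extra information, for example a list of titles or a named entity tag, because the part-of-speech pattern alone is "noun, noun" in both situations.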

However, it is quite difficult to identify the attributive noun using the parser. For example, in the phrase "Messenger Muhammad" the noun adjunct "Messenger" modifies the noun "Muhammad". The result should have Messenger as the head and Muhammad as the modifier, that is, isA(Messenger Muhammad, Messenger) and not isA(Messenger Muhammad, Muhammad). In this research, the focus is placed only on the "noun, noun" and "adjective, noun" extraction patterns.

VI. CONCLUSION

This work presents an approach to the automatic generation of ontology concepts/instances and relations from a collection of unstructured documents, the Quran. The approach combines natural language processing, Information Extraction (IE) and text mining techniques. Following traditional IE systems, the authors defined and applied a grammar and extraction rules to obtain the concepts/instances and relations of the ontology. The system then tries to form correct and complete concepts/instances and relations by taking words and entities from the texts and combining them. A novel pattern is presented for the extraction of the Quranic translation texts, with the aim of automatically inducing concepts/instances and relations from the document texts. The approach is evaluated by comparing the resulting concepts and relations with the gold standard identified by the domain experts. Our contributions and conclusions can be summarized as follows:

i. We have generated a new pattern/rule that is suitable for the Quranic text, based on the basic patterns introduced by previous researchers.
ii. From this research, we have also identified the difficulties in extracting information that involves co-reference and the style of the Quran, which has both grammatical shifts and metonymy.

REFERENCES

[1] Abbas N. (2009). Quran 'Search for a Concept' Tool and Website. Thesis. University of Leeds.
[2] al-Tabarī, 1989. The Commentary on the Qur'an. Oxford University Press.
[3] Mubarakpuri. 2003. Tafsir Ibnu Kathir. Maktaba Darus Salam.
[4] Hamka. 2009. Tafsir Al-Azhar. Pustaka Nasional.
[5] Atwell E., Dukes K., Sharaf A. M., Habash N., Louw B., Abu Shawar B., McEnery T., Zaghouani W. and El-Haj M. (2010). Understanding the Quran: A new Grand Challenge for Computer Science and Artificial Intelligence. Grand Challenges for Computing Research. British Computer Society Workshop. Edinburgh.
[6] Cimiano P. 2006. Ontology Learning and Population from Text: Algorithms, Evaluation and Applications. Springer. November 2006.
[7] Maynard D., Li Y. and Peters W. 2008. NLP Techniques for Term Extraction and Ontology Population. In Proceedings of the 2008 conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, Paul Buitelaar and Philipp Cimiano (Eds.). IOS Press, Amsterdam, The Netherlands, 107-127.
[8] Buitelaar P., Cimiano P. and Magnini B. (Eds.). Ontology Learning from Text: Methods, Evaluation and Applications. Frontiers in Artificial Intelligence and Applications, IOS Press, 2005.
[9] Sowa J. F. Knowledge Representation: Logical, Philosophical, and Computational Foundations. Brooks Cole Publishing Co., Pacific Grove, CA.
[10] Guarino N. (ed.). Formal Ontology in Information Systems. Proceedings of FOIS'98, Trento, Italy, 6-8 June 1998. Amsterdam, IOS Press, pp. 3-15.
[11] Heijst G., Schreiber A. and Wielinga B. (1997). Using Explicit Ontologies in KBS Development. International Journal of Human-Computer Studies, Volume 46, pages 183-292.
[12] Kais Dukes. 2010. Ontology of Quranic Concepts. http://corpus.quran.com/ontology.jsp
[13] Yahya M., Khalifa H., Bahanshal A., Odah I. and Helwah N. 2010. An Ontological Model for Representing Semantic Lexicons: An Application on Time Nouns in the Holy Quran. Arabian Journal for Science and Engineering 35 (2), 21.
[14] Abbas N. and Atwell E. 2008. Qurany Explorer. http://quranytopics.appspot.com/.
[15] AL-Kabi M. N., Kanaan G. and Al-Shalabi R. 2005. Statistical Classifier of the Holy Quran Verses (Fatiha and Yaseen Chapters). Journal of Applied Sciences 5(3): 580-583.
[16] Saad S., Salim N. and Hakim Z. 2010. Towards Context Sensitive Domain of Islamic Knowledge Ontology Extraction. International Journal for Infonomics (IJI), Volume 3, Issue 1, March.
[17] Hearst M. (1992). Automatic Acquisition of Hyponyms from Large Text Corpora. Proc. of the Fourteenth International Conference on Computational Linguistics, Nantes, France.
[18] Grefenstette G. 1997. SQLET: Short query linguistic expansion techniques: Palliating one or two-word queries by providing intermediate structure to text. In RIAO'97, Computer-Assisted Information Searching on the Internet, Montreal, Canada.
[19] Chew K. M. (2010). Comparison of ontology learning techniques for Quranic text. Masters thesis, Universiti Teknologi Malaysia, Faculty of Computer Science and Information Systems.
[20] Carter R. and McCarthy M. Cambridge Grammar of English. Cambridge University Press, 2006.
[21] Katamba F. (2005). English Words: Structure, History, Usage, 2nd ed. Routledge, 2005.
[22] Nordquist R. (2010). About.com - Grammar & Composition. http://grammar.about.com/od/c/g/compnounterm.htm.
[23] Brunzel M. and Spiliopoulou M. 2008. Discovering Groups of Sibling Terms from Web Documents with XTREEM-SG. In Journal on Data Semantics XI, Stefano Spaccapietra, Jeff Z. Pan, Philippe Thiran, Terry Halpin, Steffen Staab, Vojtech Svatek, Pavel Shvaiko, and John Roddick (Eds.). Lecture Notes in Computer Science, Vol. 5383. Springer-Verlag, Berlin, Heidelberg, 126-155.
