IEEE --- 5th International Conference on Emerging Technologies ICET 2009

Relational WordNet Model for Semantic Search in Holy Quran

Muhammad Shoaib1, M. Nadeem Yasin1, Hikmat Ullah K.1, M. Imran Saeed1, Malik Sikandar H. Khiyal2

1 Department of Computer Science, International Islamic University, Islamabad, Pakistan.
2 Department of Computer Science, Fatima Jinnah Women University, Rawalpindi, Pakistan.

Abstract- The Holy Quran, owing to its unique style and allegorical nature, needs special attention with regard to searching and information retrieval. Legacy keyword searching techniques are incapable of retrieving semantically relevant verses. In this paper, we address the deficiencies of keyword-based searching and the issues related to semantic search in the Holy Quran, and propose a model that is capable of performing semantic search. The model exploits WordNet relationships in a relational database model. The model has been implemented with current tools, and Surah Al-Baqrah, the largest chapter of the Holy Quran, has been taken as the sample text. The prototype implementation of our model retrieves far more of the relevant verses than simple keyword searching.

Keywords: Semantic Search, Information Retrieval, WordNet Relationships, Database

I. INTRODUCTION

The Holy Quran (1) is an ocean of information and a collection of diverse knowledge and dissimilar subject matters. It discusses almost all fields of life and provides the basics for all areas of knowledge. The Holy Quran has a unique style of explaining different topics. Normally a topic is discussed at different places. For example, the topic of Hazrat Moosa (AS) is discussed in many chapters, and the Oneness of Allah is discussed throughout the Holy Quran. Some consecutive verses, or even a single verse, may contain many topics. For example, the second verse of Surah Al-Baqrah (2), consisting of only seven words (including prepositions etc.), mentions three topics, while lengthy verses may discuss dozens of topics. Normally a chapter does not contain only one topic, and the same is true for many verses. Religious scholars, teachers and students struggle hard to collect the verses that describe a particular topic. We recognize this problem and try to help readers of the Holy Quran perform subject-based (topic) search. We have tried to overcome the problems of keyword searching and to provide a framework for semantic search.

(1) According to Muslims, the Holy Quran is the last Divine book revealed to the Last Prophet Muhammad (SAW).
(2) Surah is an Arabic word that means "chapter". Al-Baqrah is the name of a chapter and means "the cow".

A large number of today's Quranic software packages, like [2-16], provide only electronic audio and text representation. Some of them provide a search facility, but users are not satisfied by the results because these packages rely on query word matching. They lack the idea of semantic search. Some software [25-28] shows almost all the relevant verses against a query word, but it is no more than a book in electronic form.

Keyword search, with special reference to the Holy Quran, faces three basic problems. First, in most cases not all the relevant verses are retrieved. Second, the sequence of retrieved verses does not suit the reader, who wants the most relevant verses shown first and the least relevant last. Third, some irrelevant verses are also retrieved. These problems motivate us to use a semantic search technique to explore the Holy Quran against a query word. Semantic search should be capable of retrieving the related verses from the Holy Quran whether or not the query word is present in them.

Semantic search in Quran

What do we mean by semantic search in Quran? We briefly describe it in two steps: 1) Word sense disambiguation: to identify only one sense of the query word; 2) Synonymy: to perform searching against each and every synonym of the sense identified in step 1. This is depicted in fig. 1.

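[Figure: the polysemic query word "Signs" with the candidate senses "Sun", "Quran" and "Skies"; for the sense "Quran" the synonyms are listed as 1. revelation to Muhammad (SAW), 2. revelation to you, ..., n. Furqan.]

Fig. 1: Semantic search in Quran (an example)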

In fig. 1 the term "(His) signs" is a polysemic word. The system must confirm the intention of the user when s/he searches against this term. If s/he means "Quran" by the query word "signs", then the system must show all the verses related to the Quran by considering all its synonyms, and must not search for verses related to the other senses of "signs".

The rest of the paper is organized as follows. Section II reviews the related work. Section III discusses the issues in developing an algorithm for semantic search in Quran. Section IV focuses on the proposed model. Section V highlights implementation issues, and Section VI compares the results of the simple keyword approach and our proposed approach.

II. RELATED WORK

There are a large number of Quranic databases in digital form [2-16]. A variety of Quranic software is available in the market, and most of it is also available on the Internet. A large number of these digital Quranic stores are in the form of audio and video files. Video files present only the Arabic text with a number of translations in different natural languages [8]. Multi-Language Qur'an Software [12] provides Arabic/English Qur'anic transliterations. It is equipped with a Quranic commentary, an index, and a glossary of more than 500 words. It also provides full keyword-based search and supports plug-in Qur'anic translations. Another Quranic software [13] aims to help lawyers, judges, book authors, and lecturers to work with the Quran. It contains a hierarchical view of all verses in the Quran, sorted either by Surah or by parts. [14] provides root word search: it takes the query word as input and, after morphological analysis, returns the root of the query word as output. [2] is a professional resource of Quranic literature. It provides a keyword-based searchable interface, indexed by Surah number. [5] lets users browse the Quran and search with translation and Tafseer (interpretation). [9] provides original search functionality. Multilingual Qur'an Software [15] provides Arabic/English Qur'anic commentary. Translations in French, German, Spanish, Urdu, Malay, Indonesian, Japanese, Tamil, Hausa and Turkish are also available on this site. It is equipped with a query-word-based search facility. Having gone through the different software packages, we observe that most of them use a query-word-based searching technique.
We could not find a single thesis or paper about semantic search in Quran. The software of Harf Company provides a subject-based search facility, but only in Arabic. This software also provides exact-match search for words/terms, part of a verse and even several consecutive verses. Technically, it searches static files, organized in such a way that verses are pre-linked to a topic or a sub-topic [16]. So far as our topic, "semantic search in Quran", is concerned, we could not find any relevant material. We searched the World Wide Web and found some papers that report work on the Quran. The following is the survey of the literature we carried out in our research.

[17-18] use a new specification language for XML semantics, which is employed to specify semantic rules. The authors applied the XML semantics approach to specify the consistency of Holy books published in XML format. [18] demonstrates the significance of the XML semantics checker approach. This novel approach was used to check the semantic consistency of the Holy Quran mentioned by Jon Bosak on the Religion 2.00 website [29]. The checker model was successfully applied to check that the number of verses in each chapter was correctly written in the Quran XML document. It was also checked that the Quran XML document contained exactly the same number of chapters as the actual Holy Quran.

[19-20] describe ontological work on extracting keyword and key phrase candidates for developing an ontology of Islamic literature. They present an algorithm for automatic extraction of keywords, together with a general, skeletal methodology and life cycle for building the ontology for Islamic literature. They also applied their technique to English text for mining ontologies from natural language.

Our Contribution: We have implemented the WordNet relationships in the relational model. To the best of our knowledge we are the first to exploit a relational schema for the purpose of WordNet. The types of relationships are easily handled in our approach. The implementation of this model has also been carried out on a very sensitive document, i.e. the Holy Quran.

III. ISSUES IN DEVELOPING THE ALGORITHM FOR SEMANTIC SEARCH IN QURAN

To the best of our knowledge, no software provides semantic search in the Holy Quran. Why? There are many problems in developing a semantic search engine for the Holy Quran. For example:

- The Quran has its own versatile sequence of text, different from human literature.
- Change of topic is very frequent from verse to verse and even within a verse.
- A topic is discussed in many consecutive and non-consecutive verses, within a chapter and in different chapters as well; and a verse may contain many topics.
- Most of the verses in a chapter are not relevant to the chapter name.
- There are a large number of allegorical (Mutshabehaat) verses (3).
- Arabic words have many forms. Prefixes and suffixes are also frequently used.
- A word has many meanings (polysemy). Word sense disambiguation sometimes becomes very difficult even for scholars of the Holy Quran.
- The case of holonymy, meronymy, hyperonymy and hyponymy is sometimes very difficult to resolve.
- The language WordNet cannot be used as the knowledge dictionary. The Quran needs its own unique WordNet.
- A sense may be expressed by many words (synonymy).
- The reason and occasion of the revelation of a Surah or of some verses also play an important role in defining the meaning of the text, but most often this is not explicitly included in the Quranic text.
- A technique designed for a particular translation of the Holy Quran may not give the same results for other translations, even in the same language.
- There are a large number of allegorical words.
- For some verses and terms, different scholars/sects have different opinions. Sometimes they even have opposite opinions.

The above are the major obstacles to developing a semantic algorithm or constructing an intelligent knowledge base for semantic search in Quran. The last problem in particular seems as hard as an NP-complete problem to NLP experts. In this paper we do not address all of these issues.

(3) "He it is Who has sent down to thee the Book: In it are verses fundamental (of established meaning); they are the foundation of the Book: others are allegorical. But those in whose hearts is perversity follow the part thereof that is allegorical, seeking discord, and searching for its hidden meanings, but no one knows its hidden meanings except Allah. And those who are firmly grounded in knowledge say: "We believe in the Book; the whole of it is from our Lord:" and none will grasp the Message except men of understanding." (Al-Quran, Surah Al-e-Imran)

IV. PROPOSED MODEL

Topic-based search is quite a complex process. We may summarize its steps as follows. First, the query word is assigned a single meaning, striking out all other meanings. Second, searching is performed against all possible synonyms of that intended meaning within a particular domain. Semantic search, in our context, may now be defined as follows: semantic search in the Holy Quran means that the verses relevant to a certain topic should be retrieved when the Quranic text is queried with a query word or any synonym of it, whether or not the query word is present in those verses.

Our model for semantic search in Quran exploits WordNet relationships implemented in the relational model. The proposed model creates a taxonomy of the related terms. The depth of the taxonomy (the tree constructed for a term and its synsets from the relational schema) is not limited. The parent-child relationship among Quranic terms/phrases is handled through a self join of "Topics", a database relation. The synsets are handled through the "Synonyms" relation, the child relation of "Topics". All synsets of a topic/term inherit the same primary key value of the parent relation. Different types of WordNet relationships are easily handled through the synonym_type attribute of the "Synonyms" relation. The detailed Entity Relationship Diagram of the proposed model can be seen in [31]. Before going into the details of the proposed model, we briefly discuss some WordNet relations mapped onto Quranic terminology.

A. WordNet Relationships

Nowadays WordNets are being developed for many languages of the world. Multilingual (inter-lingual) WordNets are also being constructed. WordNet organizes concepts, not individual word forms, in ontologies [23], [24]. Every concept is expressed through synsets, and synsets are linked through relations. In WordNet the different synonyms and antonyms are handled through WordNet ontologies.

B. Items of WordNet SynSets

A detailed discussion of WordNet relationships [21], [22], [23], [24] and of the items of SynSets used in language WordNets is not possible here. We list only those items of SynSets that we handled in the prototype implementation of our proposed model:

- Synonymy and polysemy
- Hyperonymy and hyponymy
- Holonymy and meronymy

C. Formulation of WordNet SynSet items for the Proposed Framework

The SynSet items listed above need to be explained in the domain of the Holy Quran. Here we briefly discuss them with examples from the Holy Quran. WordNet uses both verbal relations and noun relations. We discuss only a few noun relations of the items of SynSets.

1) Synonymy and Polysemy: The Arabic language is very rich in vocabulary. A word has many synonyms; conversely, a word may have many senses. For example, "Hell" and "The Fire" are synonyms of each other. "Kausar", used in Surah Al-Kausar, is polysemic in nature. In the Quran we find many words that are conceptually synonyms of each other although their dictionary meanings are not synonymous. For example, Ahmad (SAW), Al-Mozammil, Al-Mudassir, Yaseen and Al-Rasool are synonyms of Muhammad (SAW). For the requirements of Quranic text we have further divided synonyms into the following classes:

Exact Synonyms (ES): Words that always denote the same concept/object. For example, Muhammad (SAW) and Ahmad (SAW) are used everywhere in the Holy Quran for our beloved Prophet Muhammad (SAW). Similarly, Al-Mozammil, Al-Mudassir and Yaseen are also exact synonyms because in the Quranic text these words are used only for Muhammad (SAW).

Close or Strong Synonyms (CS/SS): Terms that are mostly used for the same concept/object and are sometimes used for something else that is very close to it. For example, Jehad is a close synonym of Qital (battle) because at many places the word Jehad means Qital and at some places it means any type of struggle for Islam.

Weak Synonyms (WS): Terms that are used for different concepts/objects at different places. For example, Al-Kitab (The Book) does not always mean the Holy Quran. It may mean any other revealed book.


2) Hyperonymy and Hyponymy: Examples of hyperonymy and hyponymy are also found in the Holy Quran. For example, Divine Revelation is a hyperonym of Al-Quran or the Bible; prayer and fasting are hyponyms of worship.

3) Holonymy and Meronymy: We can find many examples of holonymy and meronymy in the Holy Quran. For example, "Hell" is a holonym of "Haviyah", a valley in Hell.

D. The Searching Algorithm

The searching algorithm does two major tasks: it retrieves relevant verses by considering all the synonyms of the query word, and it retrieves the most relevant verses first and the least relevant last. The algorithm is described below.

Start:
Input: Query Word
Step 1: Fetch all the senses of the query word. If the query word has just one sense, perform step 2. Else show the user all the senses and have him/her select only one of them.
Step 2: Fetch all verses that contain the query word.
Step 3: Fetch all the synonyms of the query word and for each synonym repeat steps 4-6.
Step 4: If the synonym is of the highest priority perform step 5, else step 6.
Step 5: Fetch all verses that contain the synonym of the query word.
Step 6: Store the synonym in the list Syn_List in such a way that it comes after the synonyms of the same priority and before the synonyms of lower priority.
Step 7: For every item of Syn_List (starting from the first) repeat step 8.
Step 8: Fetch all verses that contain the synonym.
End

E. Handling the Synonyms in Relational Model

In this section we explain how we establish relationships among the synonyms. A part of the ERD of our proposed model is shown in fig. 2.
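The relational layout used by the proposed model (a self-joined Topics relation, a child Synonyms relation, and a master list of synonym types) can be illustrated with a small schema. The sketch below is only an illustration using Python's built-in sqlite3 module rather than the SQL Server 2005 / VB.Net setup of the prototype; all column names (topic_id, parent_id, term, syn_type, priority), the helper functions, and the classification of the toy rows are assumptions, since the full ERD is given only in [31].

```python
import sqlite3

# In-memory database; the column names below are hypothetical, not taken from [31].
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE MasterSyn (
    syn_type TEXT PRIMARY KEY,      -- e.g. 'Exact Synonym'
    priority INTEGER NOT NULL       -- 1 = shown first
);
CREATE TABLE Topics (
    topic_id  INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES Topics(topic_id)  -- self join: parent topic
);
CREATE TABLE Synonyms (
    topic_id INTEGER NOT NULL REFERENCES Topics(topic_id),
    term     TEXT NOT NULL,
    syn_type TEXT NOT NULL REFERENCES MasterSyn(syn_type)
);
""")

conn.executemany("INSERT INTO MasterSyn VALUES (?, ?)", [
    ("Exact Synonym", 1), ("Close Synonym", 2), ("Hyponym", 3),
    ("Meronym", 4), ("Weak Synonym", 5),
])
# Toy rows based on examples mentioned in the paper; the types are illustrative.
conn.executemany("INSERT INTO Topics VALUES (?, ?, ?)", [
    (1, "Divine Revelation", None), (2, "Quran", 1), (3, "Hell", None),
])
conn.executemany("INSERT INTO Synonyms VALUES (?, ?, ?)", [
    (2, "Al-Kitab", "Weak Synonym"),
    (3, "Haviyah", "Meronym"),
    (3, "The Fire", "Exact Synonym"),
])


def parent_of(name):
    """Return the parent topic via a self join of Topics, or None for a root."""
    row = conn.execute(
        "SELECT p.name FROM Topics c JOIN Topics p ON c.parent_id = p.topic_id "
        "WHERE c.name = ?", (name,)).fetchone()
    return row[0] if row else None


def synonyms_of(name):
    """Return the synonyms of a topic, ordered by the priority of their type."""
    rows = conn.execute(
        "SELECT s.term FROM Topics t "
        "JOIN Synonyms s ON s.topic_id = t.topic_id "
        "JOIN MasterSyn m ON m.syn_type = s.syn_type "
        "WHERE t.name = ? ORDER BY m.priority, s.term", (name,)).fetchall()
    return [r[0] for r in rows]
```

Keeping the synonym type in a separate master relation means the ordering of the retrieved terms can be changed by editing priorities alone, without touching the Synonyms rows.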

Fig 2: Handling the Synonyms in Relational Model

The topics are listed in the Topics relation, the MasterSyn relation lists all types of synonyms, and the Synonyms relation lists all the synonyms of all topics. The Syn_Type field of the Master_Syn relation is used to obtain the desired sequence of the retrieved verses (explained in the next section).

F. Sequence of the Retrieved Verses and the Items of Syn_Set

The reader wants to see the retrieved verses ordered so that the most relevant verses are shown first and the least relevant last. Our proposed model addresses this problem by retrieving the verses containing "Exact Synonyms" first and those containing "Weak Synonyms" last. The priority list is shown below:

1. Exact Synonyms
2. Close Synonyms
3. Hyponyms
4. Meronyms
5. Weak Synonyms

V. IMPLEMENTATION ISSUES

We implemented our model using SQL Server 2005 and VB.Net on Surah Al-Baqrah (the longest chapter of the Holy Quran) for a few words. We used the English translated text [1] in our model. The following issues arose while implementing the model:

- Categorizing the synonyms is a difficult task. Some synonyms fall into many categories and some may not fall exactly into any category. For example, can we consider Al-Raheem a synonym of Allah? If we consider Allah as a super class and Al-Raheem as a sub class, this is not justified, because Allah and Al-Raheem represent the same object. If we say that Allah is a holonym of Al-Raheem, this is also not fair, as Al-Raheem is not a part of Allah.
- A synonym may represent another concept at another place (polysemy), so it will also return irrelevant verses. For example, Al-Jannah is used for paradise and also for worldly gardens.
- Different scholars have different viewpoints about some terms of the Holy Quran. We can follow only one scholar at a time.
- The allegorical nature of the Holy Quran cannot be handled easily through syn_sets because, very often, there is a great difference between the apparent meanings and the actual meanings of such words.
- The synonyms selected for one translation will not show the same results for any other translation, even in the same language. For example, the Syn_Set of the Holy Quran found in Abdullah Yousaf's translation will not show the same results in the Noble Quran (an English translation of the Holy Quran). Similarly, Syn_Sets selected for one language, say Arabic, will not show the same results for the exactly translated terms in any other language.
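Putting the searching algorithm of Section D together with the priority list of Section F, the retrieval procedure might be sketched as follows. This is a minimal sketch in Python, not the VB.Net prototype: the function and parameter names are hypothetical, case-insensitive substring matching is an assumed stand-in for "verse contains the term", the choose_sense callback stands in for asking the user, a stable sort by priority replaces the explicit Syn_List bookkeeping of steps 4-8, and the passages are placeholders rather than Quranic text.

```python
# Synonym types in the priority order given in Section F.
PRIORITY = ["Exact Synonym", "Close Synonym", "Hyponym", "Meronym", "Weak Synonym"]


def contains(text, term):
    """Assumed matching rule: case-insensitive substring test."""
    return term.lower() in text.lower()


def semantic_search(query, senses, synonyms, verses,
                    choose_sense=lambda options: options[0]):
    """Return verse ids, most relevant first.

    senses:   dict mapping a query word to its list of senses
    synonyms: dict mapping a sense to a list of (term, synonym type) pairs
    verses:   list of (verse id, text) pairs
    """
    # Step 1: settle on exactly one sense of the query word.
    options = senses.get(query, [query])
    sense = options[0] if len(options) == 1 else choose_sense(options)

    # Step 2 comes first in the term list: the query word itself.
    terms = [query]
    # Steps 3-7: order synonyms by priority; sorted() is stable, so synonyms
    # of equal priority keep their original relative order.
    ranked = sorted(synonyms.get(sense, []), key=lambda pair: PRIORITY.index(pair[1]))
    terms += [term for term, _ in ranked]

    # Step 8: fetch verses term by term, skipping verses already retrieved.
    hits = []
    for term in terms:
        for verse_id, text in verses:
            if contains(text, term) and verse_id not in hits:
                hits.append(verse_id)
    return hits


# Toy data: placeholder passages, with synonym types chosen for illustration.
toy_senses = {"hell": ["Hell"]}
toy_synonyms = {"Hell": [("Haviyah", "Meronym"), ("the Fire", "Exact Synonym")]}
toy_verses = [
    ("p1", "placeholder passage naming Haviyah"),
    ("p2", "placeholder passage about the Fire"),
    ("p3", "placeholder passage about Hell"),
    ("p4", "placeholder passage about charity"),
]
result = semantic_search("hell", toy_senses, toy_synonyms, toy_verses)
```

With this toy data the passage containing the query word comes first, followed by the passage matching the exact synonym and then the one matching the meronym; the unrelated passage is not returned.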


VI. RESULTS

We tested our proposed system for five different nouns. The translated text [1] of Surah Al-Baqrah was taken as the sample space. We carefully selected the synonyms and their types, and the results were checked manually to determine whether the verses retrieved by the system were relevant or not. The total number of relevant verses was confirmed from different index books of the Holy Quran, like [30].

A. Comparison of No. of relevant retrieved and missed verses

Although this technique retrieves many more verses, some relevant verses are still not retrieved. Table 1 compares both techniques for the number of relevant verses retrieved and the number of relevant verses missed.

TABLE 1. COMPARISON OF NO. OF RELEVANT RETRIEVED AND MISSED VERSES IN BOTH CASES

                | Total No. of    | Simple Search                 | Semantic Search
                | Relevant Verses | (No. of Relevant Verses)      | (No. of Relevant Verses)
Term            |                 | Retrieved | Missed            | Retrieved | Missed
Quran           | 31              | 1         | 30                | 29        | 2
Hell            | 14              | 1         | 13                | 14        | 0
Paradise        | 7               | 1         | 6                 | 6         | 1
Charity         | 22              | 14        | 8                 | 22        | 0
Self-restraint  | 19              | 2         | 17                | 15        | 4
Total           | 93              | 19        | 74                | 86        | 7

B. Comparison of Total No. of retrieved, relevant, irrelevant and missed verses in both cases

Table 2 shows the cumulative result of our queries against the five words mentioned in Table 1. Semantic search returns only one irrelevant verse, whereas it retrieves 67 relevant verses that could not be retrieved by simple search.

TABLE 2. COMPARISON OF TOTAL NO. OF RETRIEVED, RELEVANT, IRRELEVANT AND MISSED VERSES IN BOTH CASES

                | Total No. of     | Total No. of Relevant | Total No. of Irrelevant | Total No. of
                | Verses Retrieved | Verses Retrieved      | Verses Retrieved        | Missed Verses
Simple Search   | 19               | 19                    | 0                       | 74
Semantic Search | 87               | 86                    | 1                       | 7

C. Selecting or Rejecting a Synonym

Selecting a synonym is a very critical task. Some synonyms may cause more irrelevant verses than relevant ones to be retrieved. Table 3 shows the results for "self-restraint" and its synonyms. From the results of Table 3 we can see that "righteousness" causes seven verses to be retrieved; three of them are relevant and four are irrelevant. The synonym "Righteousness" can therefore be dropped from the Synset list of Self-Restraint.

TABLE 3. NUMBER OF RELEVANT AND IRRELEVANT VERSES RETRIEVED AGAINST DIFFERENT SYNONYMS OF "SELF-RESTRAINT"

                | Total No. of Verses Retrieved | No. of Relevant Verses | No. of Irrelevant Verses
Self-Restraint  | 2                             | 2                      | 0
Fear Allah      | 13                            | 13                     | 0
Righteousness   | 7                             | 3                      | 4
Total           | 22                            | 18                     | 4

The process of choosing elements of the Synset list is manual, because the model does not use the language WordNet. The Quran requires its own WordNet, and that is what we are trying to build in the relational model. Once the complete WordNet is constructed, this task will no longer need to be performed manually. In our model we record not only the elements of the Synset list but also their type, so that priority can be assigned to important elements.

VII. CONCLUSION AND FUTURE WORKS

We implemented our proposed model on Surah Al-Baqrah, the longest chapter of the Holy Quran. The task was very sensitive and difficult. The way we applied the WordNet relationships in the relational model makes our work simple, novel and efficient. The results show the accuracy and reliability of our proposed model. We tested this model with five different terms from the Holy Quran. We chose the synonyms very carefully, which is why the results are reliable. Our system performs far better than simple search based on keyword matching: it retrieved 86 of the 93 relevant verses (about 92%), against 19 (about 20%) for simple search, which missed about 80% of them. While retrieving more verses, some irrelevant verses are also retrieved, but their number is negligibly small. The model will also help Quranic software developers. Although this model, with some necessary changes, will be applicable to other holy books and law books, our focus is only the Holy Quran. After successful implementation for the Holy Quran, the model can, with some changes, be applied to Hadith and Fiqh books.

In future, we intend to extend this work to eliminate irrelevant verses. We want to develop an ontology-based intelligent semantic search engine for the Holy Quran. For this purpose we will first develop a Quranic WordNet, which will be the major module of the semantic search engine. We will also try to improve the sequence of the retrieved verses.
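As a cross-check on the totals reported in Table 2, the standard information-retrieval measures of precision (relevant retrieved / total retrieved) and recall (relevant retrieved / total relevant) can be computed from the published counts. The paper does not report these measures explicitly, so the short Python sketch below is an added illustration; the function name is hypothetical and the only inputs are the numbers from Tables 1 and 2.

```python
def precision_recall(retrieved, relevant_retrieved, total_relevant):
    """Return (precision, recall) for one search technique."""
    precision = relevant_retrieved / retrieved
    recall = relevant_retrieved / total_relevant
    return precision, recall


TOTAL_RELEVANT = 93  # total number of relevant verses, from Table 1

# Counts taken from Table 2: (verses retrieved, relevant verses retrieved).
simple = precision_recall(19, 19, TOTAL_RELEVANT)
semantic = precision_recall(87, 86, TOTAL_RELEVANT)

for name, (p, r) in [("Simple search", simple), ("Semantic search", semantic)]:
    print(f"{name}: precision = {p:.3f}, recall = {r:.3f}")
```

On these counts simple search is perfectly precise but recovers only about a fifth of the relevant verses, while semantic search gives up very little precision and recovers most of them; the gain reported in the paper is therefore a gain in recall.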

REFERENCES

[1]. http://www.harunyahya.com/Quran_translation/Quran_translation2.php (Visited on: 18/1/2008)
[2]. http://www.globalquran.com/ (Visited on: 18/1/2008)
[3]. http://www.quranexplorer.com/ (Visited on: 25/1/2008)
[4]. http://www.understandquran.com/u/default.asp (Visited on: 08/1/2008)
[5]. http://www.quran.com/ (Visited on: 12/6/2008)
[6]. http://www.yaquran.com/ (Visited on: 17/4/2008)
[7]. http://www.islamware.com/ (Visited on: 1/11/2008)
[8]. http://www.shaplus.com/free-quran-software/quran-multipletranslation-software/QuranTrans/qurantrans-free-download.htm (Visited on: 1/11/2008)
[9]. http://www.quransource.com/quran/ (Visited on: 1/11/2008)
[10]. www.quraaniclessons.com/ (Visited on: 5/10/2008)
[11]. http://www.hudainfo.com/QuranCD.htm (Visited on: 09/12/2008)
[12]. http://www.allworldsoft.com/software/4-159-qur-an-viewerkoran.htm (Visited on: 15/07/2008)
[13]. http://www.freedownloadmanager.org/downloads/Al_Quran_Explorer_27366_p/ (Visited on: 15/07/2008)
[14]. http://tanzil.info/ (Visited on: 15/07/2008)
[15]. www.topshareware.com/Qur'an-Viewer-(Koran)-download14247.htm (Visited on: 15/07/2008)
[16]. http://www.eislamicsoftware.com/qurandir.htm (Visited on: 25/07/2008)
[17]. Y. Kotb, K. Gondow, and T. Katayama, "The SLXS Specification Language for Describing Consistency of XML Documents," Proc. of the Fourth Workshop on Information and Computer Science (WICS'2002), IEEE Comp. Soc., ElDamam, Saudi Arabia, pp. 289-304, March 2002.
[18]. Y. Kotb, K. Gondow, and T. Katayama, "The XML Semantics Checker Model," Proc. of the Third International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT'02), Kanazawa, Japan, pp. 430-438, September 2002.
[19]. Saidah Saad, Naomie Salim and Nazlia Omar, "Keyphrase Extraction for Islamic Knowledge Ontology," IT Symposium, Vol. 2, pp. 1-6, 26-28 Aug 2008.
[20]. Saidah Saad and Naomie Salim, "Build Islamic Ontology based on Ontology Learning," Postgraduate Annual Research Seminar 2007, 3-4 July 2007.
[21]. Trevor Mansuy and Robert J. Hilderman, "A Characterization of WordNet Features in Boolean Models for Text Classification," Proc. of the Fifth Australasian Data Mining Conference (AusDM2006), vol. 61, Sydney, Australia, pp. 103-109, 2006.
[22]. Roberto Navigli, "Word Sense Disambiguation: A Survey," ACM Computing Surveys, Vol. 41, No. 2, Article 10, 2009.
[23]. S.G. Kolte and S.G. Bhirud, "Exploiting Links in WordNet Hierarchy for Word Sense Disambiguation of Nouns," Int'l Conference on Advances in Computing, Communication and Control (ICAC3'09), IEEE, Mumbai, Maharashtra, India, pp. 20-25, 2009.
[24]. Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, Euripides G.M. Petrakis and Evangelos E. Milios, "Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web," Proc. of the Seventh ACM International Workshop on Web Information and Data Management (WIDM'05), Bremen, Germany, pp. 10-16, 2005.
[25]. http://www.islamicity.com/mosque/TOPICI.HTM (Visited on: 30/11/2008)
[26]. http://www.islamicity.com/Quransearch/ (Visited on: 30/11/2008)
[27]. http://www.quran.org.uk/out.php?LinkID=12 (Visited on: 14/01/2009)
[28]. http://www.submission.org/quran/koran-index.html (Visited on: 22/02/2009)
[29]. http://www.cafeconleche.org/examples/religion/quran/quran.xml (Visited on: 21/05/2009)
[30]. Afzal ur Rehman, "Subject Index of Quran," published by Islamic Publications Private (Ltd), Lahore, Pakistan.
[31]. M. Shoaib, "The Quranic Database for Semantic Search in Holy Quran," Master thesis, Dept. of Comp. Sc., Int'l Islamic University, Islamabad, Pakistan, 2009, unpublished.
