A Framework for Quranic Verses Authenticity Detection in Online Forum

Thabit Sabbah
Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia (UTM), 81310 Skudai, Johor, Malaysia
[email protected]

Ali Selamat
Faculty of Computer Science & Information Systems, Universiti Teknologi Malaysia (UTM), 81310 Skudai, Johor, Malaysia
[email protected]
Abstract: The Quran is the holy book of all Muslims around the world. For more than 1400 years, it has been preserved from distortion in every possible way. The rapid growth of digital media and Internet usage has led many organizations and individuals to introduce websites, services, and applications that spread knowledge related to the Quran, including Quranic verses, translations, explanations (Tafseer), and other Quranic sciences in digital formats; some of these services are less authentic. In this paper we introduce a framework to detect and authenticate Quranic verses in text extracted from online sources, especially forum posts. The proposed detection methodology is based on the assumption that Quranic verses are the parts of the text that contain more diacritics (Harakat). Further assumptions are established to increase the accuracy of detection in text with fewer diacritics. The authentication methodology is based on computing numerical identifiers for the words in the detected text and then comparing these identifiers with the identifiers of the original Quranic manuscript. Experiments show acceptable detection rates on both highly and sparsely diacritical text: the accuracy was 62% on average, while the precision and recall were 75% and 78%, respectively. Future work will focus on the authentication side and on incorporating computational intelligence methods, involving the sound of the words pronounced during the reading of Quranic verses, image processing, and others, to improve detection.
Keywords: Quranic text detection, Quranic text authentication, Quranic verses authenticity, verses detection

I. INTRODUCTION
Muslims around the world read their Holy Book (Al-Quran) in its native language, Arabic, even though they speak many different languages. Muslim writers, authors, and scholars usually cite the authentic original Quranic verses in Arabic, whether the article itself is written in Arabic or in another language, because translations usually do not express the full meaning effectively. With the rapid growth of Internet usage and accessibility, the websites, services, and desktop and mobile applications that offer Quranic script, translations, and search are also increasing rapidly. Some of these websites are less authentic and do not follow the standard writing rules, and many present non-original Quranic script (i.e. Quranic script without diacritics). The wide spread and ease of access of digital Quranic script over the Internet facilitate the citation and inclusion of Quranic verses and quotes (i.e. complete or partial verses) by copy and paste. Consequently, the detection and authentication of Quranic script have become very important to stop or reduce the spread of non-authentic script, especially through forums and social networks.

Native Arabic speakers can usually read Arabic text properly even when it has no diacritics. Diacritics distinguish the vocal pronunciation of words and represent the vowel sounds [1]; sometimes the same word has several different pronunciations and, as a result, different meanings. Muslims who are not Arabs, or who have less knowledge of the rules of Arabic, should pay more attention to diacritics; however, they may not notice the difference in meaning when different diacritics appear. For example, in verse 28 of chapter 35:

إِنَّمَا يَخْشَى اللَّهَ مِنْ عِبَادِهِ الْعُلَمَاءُ

the word (الله) carries the diacritic known as Fatha (ـَ) at the end; the meaning¹ of the passage changes completely if this diacritic is changed to Damma (ـُ). There are many such cases in the Quran. Therefore, when dealing with Quranic script for validation and authentication purposes, it is very important not to omit, remove, or change any part of it, especially the diacritics and special Quranic symbols, as well as the order and structure of words and letters.

In this paper, we propose a framework for detecting and authenticating Quranic quotes from online sources such as ordinary static and dynamic web pages, forums, and social networks. In the rest of this paper, Section II summarizes previous related work. Section III explains our framework and the related methodology, and Section IV presents our experiments and initial results, together with a discussion of these results. Finally, we conclude this work in Section V.

II. RELATED WORK
Previous related work includes the study by Alshareef and El Saddik [2], which aims to verify the fundamental text of specific Quranic quotes. The authors introduce a model to validate Quranic quotes in digital form. The input of the proposed model is a predefined Quranic quote, and the validation method depends on a phrase-based matching pattern. However, their model does not consider diacritics during validation, so it is closer to a Quranic search engine than to an authentication model. Other previous works address digital Quranic script from aspects other than detection and authentication. A statistical study [3] aimed to protect the digital form of the Quran from corruption; it discusses the structure of the Quran in terms of the number of characters in the verses, the number of verses in the chapters, and so on, by representing Quranic text as a codified inference in the form of an AI natural language. Other aspects discussed in previous studies are text information retrieval [4], semantic search [5, 6], and knowledge modeling and retrieval [7-9]. However, none of the previous related works introduced an algorithm to detect and authenticate Quranic quotes from online sources, and the previous authentication methods were based on string matching.

¹ It is only those who have knowledge among His slaves that fear Allah (source: http://www.qurancomplex.org/Quran/Targama/Targama.asp?nSora=35&l=arb&nAya=28#35_28)

Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences December 22 – 25, 2013, Madinah, Saudi Arabia
[Figure 1: Detecting and authenticating Quran quotes from online forums. Components shown in Part A: original Quran text (authenticated e-version), Analyzer, Extractor, letters and diacritics (weights), fully diacritic Quran words, Quran words filtering, filtered Quran words, calculation of diacritic Quran words identifiers, inverse positions index, DB of authenticated Quran text. Components shown in Part B: online forum, web page, social network, collection of documents, Analyzer, Text Extractor, Quran Quote Detector (diacritical ratio, starting/ending brackets, starting/ending phrases), detected Quran quotes, calculation of words identifiers, filter, filtered text.]
This work proposes a framework for accurately detecting Quranic verses in text, whether it is non-diacritical or fully diacritical, and for authenticating the detected quotes without using string matching techniques.
III. FRAMEWORK
In this paper we propose a framework to detect and authenticate Quranic quotes in text extracted from online sources such as ordinary web pages, forums, and social networks; the framework can also be extended to other sources, such as collections of static or dynamic documents. Figure 1 shows the proposed framework. Our framework consists of two main parts. The first part (Part A) is a sequence of processes that is performed once. Its input is an authentic e-version of the original Quranic script, and its output is a database holding an inverse positions index, which is a list of distinctive Quranic word identifiers and their positions in the Quran. The processes performed in this part are as follows:

Analyzer: A process that prepares the input to be compatible with the next process. The analyzer should be flexible enough to deal with different types of authentic e-versions of the Quranic manuscript.

Algorithm 1: Determining Letters, Decorates, Quran Words and Filtered Quran Words Identifiers
Input:
    Letters (L) = {Arabic Letters}
    Decorates (D) = {Arabic Diacritics and Special Symbols}
    Words (W) = {Quran Words} = {W1, W2, W3, …, Wn}
    Filtered Words (FW) = {Filtered Quran Words} = {FW1, FW2, …, FWm}
    Letters Weights = {}
    Decorates Weights = {}
    Words Identifiers = {}
    Filtered Words Identifiers = {}
Start
    Letters Weights ← Initial values
    Decorates Weights ← Initial values
    Loop
        For (Word ∈ Words) loop
            Words Identifiers ← Calculate Wid
        End For
        For (FilteredWord ∈ Filtered Words) loop
            Filtered Words Identifiers ← Calculate FWid
        End For
        If (Wid ∈ Words Identifiers) OR (FWid ∈ Filtered Words Identifiers)
            {L} ← Update values
            {D} ← Update values
        End if
    Until (∀ distinctive W ∈ Words there is distinctive Wid ∈ Words Identifiers) And (∀ distinctive FW ∈ Filtered Words there is distinctive FWid ∈ Filtered Words Identifiers)
Output:
    Letters Weights = {L1id, L2id, …, Lxid}
    Decorates Weights = {D1id, D2id, …, Dyid}
    Words Identifiers = {W1id, W2id, …, Wnid}
    Filtered Words Identifiers = {FW1id, FW2id, …, FWmid}

Figure 2 Calculating DQDW and DFQW identifiers.
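The self-adjusting loop of Figure 2 can be illustrated with a minimal Python sketch. The paper does not state how an identifier is computed from the weights or how weights are updated, so everything specific here is an assumption made for illustration only: the `identifier` formula (weights scaled by `base**position`), the `base` parameter, and the random re-drawing of weights on a collision are all hypothetical.

```python
import random


def identifier(word, weights, base):
    """Hypothetical identifier: symbol weights scaled by base**position."""
    return sum(weights[ch] * base ** i for i, ch in enumerate(word))


def identifiers_if_unique(words, weights, base):
    """Return {word: identifier} when all identifiers differ, else None."""
    ids = {w: identifier(w, weights, base) for w in set(words)}
    return ids if len(set(ids.values())) == len(ids) else None


def assign_weights(words, filtered_words, base=64, max_rounds=100, seed=0):
    """Adjust symbol weights until both word lists get distinct identifiers."""
    symbols = sorted({ch for w in list(words) + list(filtered_words) for ch in w})
    # Initial values: 1, 2, 3, ... in code-point order (an arbitrary choice).
    weights = {ch: k + 1 for k, ch in enumerate(symbols)}
    rng = random.Random(seed)
    for _ in range(max_rounds):
        word_ids = identifiers_if_unique(words, weights, base)
        filtered_ids = identifiers_if_unique(filtered_words, weights, base)
        if word_ids is not None and filtered_ids is not None:
            return weights, word_ids, filtered_ids
        # Collision found: "update values" by drawing fresh distinct weights
        # (an assumed update rule, not taken from the paper).
        fresh = rng.sample(range(1, 10**6), len(symbols))
        weights = dict(zip(symbols, fresh))
    raise RuntimeError("no collision-free weights found")


if __name__ == "__main__":
    w, word_ids, filtered_ids = assign_weights(["بِسْمِ", "ٱللَّهِ"], ["بسم", "الله"])
    print(len(w), "symbols weighted;", len(word_ids), "word identifiers")
```

The loop mirrors the Until condition of Algorithm 1: it stops only when every distinctive word and every distinctive filtered word has its own identifier. With a small `base`, collisions become likely and the update step is exercised.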
Extractor: This process extracts three main lists from the Quranic script. The first is the list of Distinctive Quranic fully Diacritical Words (DQDW). The second is the list of Distinctive Letters, Diacritics, and Symbols (DLDS) that appear in the Quranic script; every letter, diacritic, and symbol in this list is given a distinctive weight, as shown later. The third is the list of Distinctive Filtered Quranic Words (DFQW). The weights of the letters, diacritics, and symbols in the DLDS list are determined by the next process, Calculating Identifiers.

Calculating Identifiers: This process is critical. A distinctive weight is assigned to each letter, diacritic, and symbol in DLDS so that the calculated identifier of every word in the DQDW list is unique. The weights are also used to calculate the identifiers of DFQW. Figure 2 shows the proposed algorithm for determining the weights of letters, diacritics, and symbols, and how this process relates to the calculation of the DQDW and DFQW identifiers. After extracting all the different "characters" used in the authentic Quran manuscript, the set of characters is separated into two sets: the letters, which are the characters that build the word skeleton, and the diacritics, a set that also includes the special characters that appear in Quranic script, such as ۜ and ۫. Each letter in the Letters Set (LS) and each diacritic in the Diacritics Set (DS) is given an initial distinctive weight (w), referred to as Lw for letters and Dw for diacritics. Assigning distinctive weights to letters and diacritics is a self-adjusting process, synchronized with the calculation of Quranic word identifiers, with the goal that every distinctive Quranic word has a distinctive numerical identifier. The final weights and identifiers are saved in a database for later use in calculating word identifiers for text detected in a source.

Quran Words Filtering: Filtering Quranic words means generating a list of the distinctive non-diacritical words of the Quran. These words are used to detect Quranic words and quotes in the sparsely diacritical or non-diacritical parts of a given text. Our filtering is limited to removing diacritics and substituting some letter forms with the standard form of the letter, to increase the retrievability and the accuracy of detection. Table I shows the letters substituted during the filtering process and their standard replacements.

TABLE I. LETTERS TO BE SUBSTITUTED IN FILTERING PROCESS

Letters                  Substituted by
ٱ , إ , أ , ٰ , آ          ا
ة                        ه
ى                        ي
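The filtering step described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the prototype's code: the function name `filter_word` is hypothetical, the substitutions follow Table I, and treating every Unicode nonspacing mark (category `Mn`) as a diacritic is an assumption made here.

```python
import unicodedata

# Substitutions from Table I, written as code points to avoid ambiguity.
SUBSTITUTIONS = {
    "\u0671": "\u0627",  # alef wasla           -> alef
    "\u0625": "\u0627",  # alef with hamza below -> alef
    "\u0623": "\u0627",  # alef with hamza above -> alef
    "\u0670": "\u0627",  # superscript alef      -> alef
    "\u0622": "\u0627",  # alef with madda       -> alef
    "\u0629": "\u0647",  # teh marbuta           -> heh
    "\u0649": "\u064A",  # alef maksura          -> yeh
}


def filter_word(word):
    """Return the filtered (non-diacritical, normalized) form of a word."""
    # NFC keeps precomposed letters such as U+0623 as a single code point.
    word = unicodedata.normalize("NFC", word)
    out = []
    for ch in word:
        if ch in SUBSTITUTIONS:
            out.append(SUBSTITUTIONS[ch])
        elif unicodedata.category(ch) == "Mn":
            continue  # assumed definition of "diacritic": nonspacing mark
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(filter_word("\u0671\u0644\u0644\u0651\u064E\u0647\u0650"))
```

The substitution table is checked before the diacritic test because the superscript alef is itself a nonspacing mark and would otherwise be dropped instead of replaced.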
The second part of our framework (Part B) is the sequence of processes performed whenever the user/agent wants to detect and authenticate verses from a source. The processes performed in this part are as follows:

Analyzer: The process that analyzes the source and prepares the data to be compatible with the next process.

Text Extractor: Extracts the text from the source.

Quran Quote Detector: This is the main process in this part. Its input is the text extracted from the source, and its output is the parts of the text that are nominated as Quranic quotes based on a combination of four detection methods. A detailed description of this process follows.

Starting from the assumption that "among the text, the Quran quotes are highly diacritical", we calculate the Diacritical Ratio (DR) of each word in the text. DR is defined as the ratio of the number of diacritics to the number of letters in the word:

\[ DR_w = \frac{\text{Number of Diacritics}}{\text{Number of Letters}} \]

As the equation shows, the value of DRw is zero or more; for a non-diacritical word, DR is zero. For fully diacritical words, DR can be one or more, because in Arabic several diacritics can be attached to (held by) one letter. Figure 3 shows the DR values of some Quranic words written in different ways. Different ways of writing Quranic words lead to different values of DR as well as different values of the calculated identifiers.
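As a small illustration of the DR definition, the following Python sketch counts diacritics and letters per word and keeps the words whose ratio reaches a threshold T. The function names are hypothetical, classifying characters by Unicode category (`Mn` for diacritics, `Lo` for letters) is an assumption, and so is the choice of a "greater than or equal" comparison against T.

```python
import unicodedata


def diacritical_ratio(word):
    """DR_w = number of diacritics / number of letters (0.0 if no letters)."""
    word = unicodedata.normalize("NFC", word)
    diacritics = sum(1 for ch in word if unicodedata.category(ch) == "Mn")
    letters = sum(1 for ch in word if unicodedata.category(ch) == "Lo")
    return diacritics / letters if letters else 0.0


def highly_diacritical_words(text, threshold=0.25):
    """Return the whitespace-separated words whose DR reaches the threshold."""
    return [w for w in text.split() if diacritical_ratio(w) >= threshold]


if __name__ == "__main__":
    sample = "\u0642\u0627\u0644 \u0643\u0650\u062A\u064E\u0627\u0628"
    for w in sample.split():
        print(w, diacritical_ratio(w))
```

In the usage example the first word has no diacritics and the second has two diacritics on four letters, so only the second would be nominated at T = 0.25.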
[Figure 3: Difference in writing between the standard and some online sources. For each of rows (a), (b), and (c), the figure lists five Quranic words together with their DRw values.]
In Figure 3, row (a) shows the standard and most authentic writing style, known as Uthmani script and used in Mushaf al-Madinah an-Nabawiyyah [10]; row (b) shows the writing style used by the Tanzil.net website [11], which is the basis of many websites, applications, and mobile applications; and row (c) shows the writing style found in less authentic websites, forums, and search engines. Note that some of the words in row (c) are considered distortions of the Quranic manuscript. As Figure 3 shows, different ways of writing lead to different DR values.

The Diacritical Ratio method and its underlying assumption work well for normal writings and articles (i.e. where only the verses in the text are highly diacritical). However, the method is inapplicable to non-diacritical text and to extremely diacritical text in which most words are diacritical, such as formal Arabic and religious articles [1], for example Tafseer, Hadith, and many other religious sciences. To increase the accuracy of detection in these cases, we apply three additional techniques, explained in the following paragraphs.

The first additional technique looks for starting/ending phrases that are commonly used in Arabic articles to indicate that the following part of the text is a quote or a verse. Table II shows some of the most common starting and ending phrases.

TABLE II. SOME OF THE MOST COMMON STARTING AND ENDING PHRASES

Starting Phrases
قال الله تعالى
يقول الله تعالى
قال الله عز وجل
قال عز من قائل
قال الله تبارك وتعالى
قال تعالى

Ending Phrases
صدق الله العظيم
... الآيه
... سورة [اسم السورة]
Each starting phrase in Table II may appear in a different verb tense (past/present), and the verse may end with any ending phrase or, sometimes, with no ending phrase at all; in other words, there are no rules that govern starting and ending phrases. These cases generate a huge number of options and combinations, especially in the most popular case, in which the author cites the verse(s) by the name or number of the chapter (Surah) and the verse number, between brackets and separated by a colon, for example (الحجر: 9) and (15:9).

The second additional technique looks for starting/ending brackets. In general, punctuation such as the different types of ordinary (textual) brackets ( ), [ ], { }, < >, single quotation marks ( ‘ ), double quotation marks ( “ ), and the colon ( : ) is also commonly used to mark verses in diacritical or non-diacritical text. Moreover, in online forums and other web resources, many symbols, pictures, and icons, as shown in Figure 4, are used to indicate the start and end of a Quranic quote. In this paper we focus on textual content; detecting starting and ending brackets that are represented as pictures is left for future work.

Figure 4 Graphical brackets used in online sources as Quranic quote indicators.
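The two textual techniques above (starting/ending phrases and starting/ending brackets) lend themselves to simple pattern matching. The sketch below is one possible way to do it with Python regular expressions, not the prototype's implementation; the function names are hypothetical, only a subset of the phrases and the four textual bracket pairs are covered, and the text is assumed to be non-diacritical (for example, already filtered).

```python
import re

STARTING_PHRASES = [
    "قال الله تعالى", "يقول الله تعالى", "قال الله عز وجل",
    "قال عز من قائل", "قال الله تبارك وتعالى", "قال تعالى",
]
ENDING_PHRASES = ["صدق الله العظيم"]
BRACKET_PAIRS = [("(", ")"), ("[", "]"), ("{", "}"), ("<", ">")]


def _alternation(phrases):
    # Longest phrases first, so a longer phrase is preferred over its prefix.
    return "|".join(re.escape(p) for p in sorted(phrases, key=len, reverse=True))


# Text after a starting phrase, up to an ending phrase or the end of the line.
PHRASE_RE = re.compile(
    r"(?:%s)\s*:?\s*(.+?)(?=\s*(?:%s)|\s*$)"
    % (_alternation(STARTING_PHRASES), _alternation(ENDING_PHRASES)),
    re.MULTILINE,
)
# Numeric chapter:verse citations such as (15:9).
REFERENCE_RE = re.compile(r"\(\s*(\d+)\s*:\s*(\d+)\s*\)")


def find_after_phrases(text):
    """Candidate quotes that follow a starting phrase."""
    return [m.group(1).strip() for m in PHRASE_RE.finditer(text)]


def find_bracketed(text):
    """Candidate quotes enclosed in ordinary textual brackets."""
    found = []
    for open_, close in BRACKET_PAIRS:
        pattern = "%s([^%s%s]+)%s" % (
            re.escape(open_), re.escape(open_), re.escape(close), re.escape(close))
        found.extend(m.group(1).strip() for m in re.finditer(pattern, text))
    return found


def find_references(text):
    """(chapter, verse) pairs cited in the form (15:9)."""
    return [(int(c), int(v)) for c, v in REFERENCE_RE.findall(text)]


if __name__ == "__main__":
    print(find_after_phrases("قال تعالى: إنا نحن نزلنا الذكر صدق الله العظيم"))
    print(find_references("(15:9)"))
```

Because brackets in religious texts often hold verse numbers rather than quotes, a caller would probably want to discard bracketed spans that `find_references` recognizes as citations.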
The third additional technique searches the inverse positions index of filtered Quranic words; this inverse index lists all the positions at which each distinctive filtered Quranic word occurs in the Holy Quran. This technique enhances not only the detection of verses and quotes in non-diacritical text, but also the accuracy of the Diacritical Ratio technique in fully diacritical text, by excluding the non-Quranic words detected by the DR technique. The technique requires the weights of letters, diacritics, and filtered Quranic words to be calculated; however, it can also be applied using ordinary string matching techniques. Moreover, it can be considered a simple authentication method, applicable to non-diacritical text, that verifies that the Quranic words of a quote appear in the same order as in the Holy Quran.
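A minimal sketch of an inverse positions index and of the order check it enables is given below. It uses plain strings as keys, which the text allows as the string matching variant; the paper's own index is keyed by numerical word identifiers. The function names and the toy word list (the filtered words of the opening of the Quran) are chosen here for illustration, and `min_words` plays the role of the MW parameter used in the experiments.

```python
from collections import defaultdict


def build_inverse_index(quran_words):
    """Map each filtered word to the list of positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(quran_words):
        index[word].append(position)
    return index


def locate_quote(quote_words, index, min_words=3):
    """Return start positions where the quote's words occur consecutively.

    Quotes shorter than min_words are rejected (empty result).
    """
    if len(quote_words) < min_words:
        return []
    starts = index.get(quote_words[0], [])
    return [
        s for s in starts
        if all((s + k) in index.get(w, ()) for k, w in enumerate(quote_words))
    ]


if __name__ == "__main__":
    corpus = ["بسم", "الله", "الرحمن", "الرحيم", "الحمد", "لله", "رب", "العالمين"]
    index = build_inverse_index(corpus)
    print(locate_quote(["الله", "الرحمن", "الرحيم"], index))
```

A non-empty result means that every word of the candidate exists in the index and that the words appear in the same order as in the indexed text; an empty result rejects the candidate.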
IV. EXPERIMENTS, RESULTS, AND DISCUSSION

EXPERIMENTS SETUP
To test our framework for Quranic quote detection, we implemented the "Online Quranic Quotes Detector and Authenticator" prototype. The prototype was implemented in Java using the NetBeans IDE. Figure 5 shows the GUI of the implemented prototype.
We then conducted five experiments to detect Quranic quotes from one online source; Table III shows the description and parameters of each experiment. As shown in Table III, we applied individual and combined detection methods with different parameters. In the first experiment we applied the Diacritical Ratio method with a threshold of T=0.25. In the second experiment we applied the S/E Brackets method combined with the S/E Phrases technique. In the third experiment we applied the Inverse Position Index technique with the minimum number of words to be checked in the inverse index set to MW=3. In the fourth experiment we applied all detection methods with parameters T=0.5 and MW=3. Finally, in the fifth experiment we applied the Diacritical Ratio method with T=0.25 together with a filtering mechanism that checks the detected quotes against the Inverse Position Index to verify the existence and the order of the words of each quote. The input of all experiments was a web page containing Tafseer Ibnu-Katheer of verses 1-15 of Chapter 15¹. We use precision, recall, and accuracy to evaluate the results of the experiments, which are presented in the following sub-section.

Figure 5 GUI of the "Online Quranic Quotes Detector and Authenticator" prototype.

TABLE III. EXPERIMENTS DESCRIPTIONS AND PARAMETERS

Exp#   Description                                                                        Parameters
Exp1   Diacritical Ratio Method, without filtering                                        T*(0.25)
Exp2   S/E Brackets + S/E Phrases Methods, without filtering                              --
Exp3   Inverse Position Index Method                                                      MW**(3)
Exp4   Diacritical Ratio + S/E Brackets + S/E Phrases + Inverse Position Index Methods    T(0.25) MW(3)
Exp5   Diacritical Ratio Method, with filtering                                           T(0.25) MW(3)
*T means Threshold. **MW means minimum number of words in filtering.

RESULTS
In our experiments, a human expert reviewed the content of the input online source and then counted, for each experiment, the quotes that the prototype detected correctly, the quotes it detected incorrectly, and the Quranic quotes it failed to detect (rejected). Table IV shows the number of quotes and their detection status for each experiment; the experiments are described in Table III.

TABLE IV. NUMBER OF QUOTES AND STATUS OF IDENTIFICATION/REJECTION

                               Exp1   Exp2   Exp3   Exp4   Exp5
Correctly Identified (TP)       50     28     47     50     54
Incorrectly Identified (FP)     17     33      5     14      9
Incorrectly Rejected (FN)        5     11     20     11     17
TP = True Positive, FP = False Positive, FN = False Negative

Based on the data in Table IV, we calculated precision, recall, and accuracy, the common performance measures from the Information Retrieval (IR) domain, defined as follows:

\[ \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{FN + TP}, \qquad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
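The three measures can be recomputed from the counts of Table IV with a short Python sketch. The function name `metrics` is hypothetical, and the counts are read as (correctly identified, incorrectly identified, incorrectly rejected) per experiment. No true-negative count is reported, so TN = 0 is assumed here; under that assumption the computed values agree with the reported figures to two decimals.

```python
def metrics(tp, fp, fn, tn=0):
    """Precision, recall, and accuracy from detection counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, accuracy


# (correctly identified, incorrectly identified, incorrectly rejected)
TABLE_IV = {
    "Exp1": (50, 17, 5),
    "Exp2": (28, 33, 11),
    "Exp3": (47, 5, 20),
    "Exp4": (50, 14, 11),
    "Exp5": (54, 9, 17),
}

if __name__ == "__main__":
    for name, (tp, fp, fn) in TABLE_IV.items():
        p, r, a = metrics(tp, fp, fn)
        print(f"{name}: precision={p:.2f} recall={r:.2f} accuracy={a:.2f}")
```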
¹ http://www.qurancomplex.org/Quran/tafseer/Tafseer.asp?nSora=15&t=katheer&l=arb&nAya=9#15_9
Figure 6 shows the results of our five experiments:

             Exp1   Exp2   Exp3   Exp4   Exp5
Precision    0.75   0.46   0.90   0.78   0.86
Recall       0.91   0.72   0.70   0.82   0.76
Accuracy     0.69   0.39   0.65   0.67   0.68

Figure 6 Precision, Recall, and Accuracy of experiments of Quranic quotes detection.

DISCUSSION
As can be seen from Table IV and Figure 6, the number of correctly identified Quranic quotes was high for experiments 1, 3, 4, and 5, and lower for experiment 2. The reason for the low number of correctly identified quotes in Exp2 is that the applied methods (S/E Brackets and S/E Phrases) are not well suited to this kind of text without another supporting method. The input was a purely religious text; in such texts Quranic quotes are written directly, without brackets or indicating phrases, and brackets are instead used to indicate verse numbers. This usage explains the large number of incorrectly identified quotes. The smallest numbers of incorrectly identified quotes were in experiments 3 and 5, respectively, because the inverse positions index method used in experiment 3 and the filtering process used in experiment 5 verify the existence of the quote's words in the inverse index. The largest number of incorrectly rejected quotes was in experiment 3, even though its number of correctly identified quotes was high; this is due to the parameter MW, the minimum number of words to be verified in the inverse index: detected quotes with fewer than 3 words are rejected, and most of the incorrectly rejected quotes had fewer than 3 words.

In terms of the common IR measures, the highest precision was in experiment 3, followed by experiment 5, where the Inverse Index method and the Inverse Index filtering technique were applied, respectively. The lowest precision was in experiment 2, because of its large number of incorrectly identified quotes; the average precision over all experiments was 0.75. The highest recall was in experiment 1, where the Diacritical Ratio method was applied: in religious texts such as our experimental input, diacritics are used intensively, so without any filtering technique the recall is high. In experiment 5, which was the same as experiment 1 with the filtering technique added, the recall decreased to an average level of 0.76. For all experiments except experiment 2, the accuracy of detection was in the range between 0.65 and 0.69.

V. CONCLUSION AND FUTURE WORK
In this paper, we proposed a framework for Quranic verses authenticity detection. The detection part of our framework includes four different detection methods, while the authentication part depends on defining the weights of Quranic letters and diacritics and calculating the identifiers of distinctive Quranic words. Experiments conducted with the developed prototype show reasonable and promising results in terms of precision, recall, and accuracy. Our future work will focus on calculating the identifiers of distinctive Quranic words and on incorporating computational intelligence methods, involving the sound of the words pronounced during the reading of Quranic verses and image processing, to improve detection and authentication.

VI. ACKNOWLEDGMENTS

The authors would like to extend their thanks to Universiti Teknologi Malaysia (UTM) under GUP Q.J130000.2510.02H47.

REFERENCES

1. Aabed, M.A., et al. Arabic Diacritics based Steganography. in Signal Processing and Communications, 2007. ICSPC 2007. IEEE International Conference on. 2007.
2. Alshareef, A. and A.E. Saddik. A Quranic quote verification algorithm for verses authentication. in Innovations in Information Technology (IIT), 2012 International Conference on. 2012.
3. Shamsudin, A.F. and A. Farooq. AI natural language in metasynthetics of Al-Qur'an. in TENCON 2000. Proceedings. 2000.
4. Noordin, M.F. and R. Othman. An Information Retrieval System for Quranic Texts: A Proposed System Design. in Information and Communication Technologies, 2006. ICTTA '06. 2nd. 2006.
5. Al-Khalifa, H.S., et al. SemQ: A proposed framework for representing semantic opposition in the Holy Quran using Semantic Web technologies. in Current Trends in Information Technology (CTIT), 2009 International Conference on the. 2009.
6. Shoaib, M., et al. Relational WordNet model for semantic search in Holy Quran. in Emerging Technologies, 2009. ICET 2009. International Conference on. 2009.
7. Baqai, S., et al., Leveraging semantic web technologies for standardized knowledge modelling and retrieval from the Holy Qur'an and religious texts, in Proceedings of the 7th International Conference on Frontiers of Information Technology 2009, ACM: Abbottabad, Pakistan. p. 1-6.
8. Yauri, A.R., et al. Quranic-based concepts: Verse relations extraction using Manchester OWL syntax. in Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on. 2012.
9. Mukhtar, T., H. Afzal, and A. Majeed. Vocabulary of Quranic Concepts: A semi-automatically created terminology of Holy Quran. in Multitopic Conference (INMIC), 2012 15th International. 2012.
10. Qur'an, K.F.C.f.t.P.o.t.H., Mushaf al-Madinah an-Nabawiyyah for Computer Publishing, 2005.
11. Tanzil.net. Who is using Tanzil? 2013 [5/16/2013]; Available from: http://tanzil.net/wiki/Who_is_using_Tanzil%3F.