2014 8th Malaysian Software Engineering Conference (MySEC)
Support Vector Machine based approach for Quranic Words Detection in Online Textual Content
Thabit Sabbah
Faculty of Computing, Universiti Teknologi Malaysia (UTM), 81310 Skudai, Johor, Malaysia
[email protected]
Ali Selamat
Faculty of Computing, Universiti Teknologi Malaysia (UTM), 81310 Skudai, Johor, Malaysia
[email protected]
Quranic style (known as Uthmani) with full or partial diacritics.
Abstract—The Quran is the holy book of Muslims around the world. Since it was revealed to the Prophet Muhammad (PBUH) about 1,400 years ago, the Quran has been preserved from distortion in every imaginable way. The rapid growth of digital media and internet usage has led to the wide spread of Quranic knowledge in digital formats, including Quranic verses, scripts, translations, and many other Quranic sciences. Some online sources, websites, services, and social network users provide less authentic Quranic content, services, and applications, and the ordinary user of such resources cannot detect and authenticate the Quranic verses they provide. In this paper, we propose a machine learning approach to detect Quranic words in text extracted from online sources. The proposed approach uses a Support Vector Machine to generate a learning model of Quranic words by training the learner on a Quranic words dataset. The generated classification model is later used to classify words from online content. Experiments based on different feature categories, such as the diacritics and statistical features, are performed, and a prototype is developed. Results show that the accuracy and other evaluation measurements achieved by the proposed approach are higher than those previously reported in the domain. Future work will focus on incorporating more machine learning and optimization techniques to achieve higher evaluation measurements.
However, authors in less authentic resources, such as online forums and social networks, only occasionally use the scientific method of quoting and citing Quranic verses. In such resources, it becomes hard for the normal reader to distinguish between Quranic and non-Quranic words. Moreover, automatic text processing and natural language processing systems, such as Information Retrieval Systems (IRS) and Knowledge Management Systems (KMS), work less efficiently on such Arabic text. Therefore, it is important to propose techniques that help users and computerized systems analyze such online resources effectively, as they form a significant portion of web content. The proposed approach uses a Support Vector Machine (SVM) to build a learning model of Quranic words, and then uses this model to classify words extracted from an online source. In the rest of this paper, Section II briefly describes previous work related to Quranic verse detection and the basics of SVM. Section III explains our approach, Section IV describes the dataset, the experimental setup, and the evaluation measures, Section V presents and discusses the results, and Section VI describes the prototype. Finally, we conclude this work in Section VII.
Keywords— Quranic words; Arabic words; detection; classification; learning model; Support Vector Machine;
II. RELATED WORK
A. Quranic Words/Verses Identification
Few works address Quranic words identification using machine learning techniques. Our previous work [4] focuses on detecting Quranic verses in online sources using traditional techniques such as the Diacritical Ratio, Starting/Ending Brackets, Starting/Ending Phrases, and the Inverse Positions Index. Such techniques do not employ machine learning methods and require much auxiliary information to be provided. Other related works focus on the authentication of Quranic quotes, such as [2], in which the input is a set of predefined Quranic quotes. Still other works discuss issues related to the digital Quranic script from perspectives other than detection. For example, a statistical study by Shamsudin and Farooq [5] aimed to protect the digital form of the Quran from corruption. The study discusses the structure of the Quran in terms of the number of characters in the verses, the number of verses in the chapters,
I. INTRODUCTION
Quranic words identification can be defined as determining which words of a text belong to the Holy Quran and are written exactly as they are written in the Holy Quran [1]. Consecutive, well-ordered Quranic words form a verse. Since the Quran is not just a book to be recited, Quranic verses are used in Muslims' daily lives for many purposes, such as deducing solutions to social and religious problems and supporting decision making. Verses are also quoted by Muslim authors as evidence to support their conclusions, opinions, and analyses [2]. Verse quoting is common not only in written works but also in oral discussions, conversations, and speeches. This scientific method is popular in Islamic societies in general and among Arabs in particular [3]. In authentic written works, the quoted verses are cited and distinguished by writing them in the standard
978-1-4799-5439-1/14/$31.00 ©2014 IEEE
transformed space; the only thing that needs to be computed efficiently is the similarity of two examples. Redundant features (those that can be predicted from other features) and high-dimensional features are handled well, so SVM does not need aggressive feature selection [17]. Second, it possesses error-estimating formulas, which can help in predicting how well a classifier is functioning and can remove the need for cross-validation techniques.
and so on, by representing the Quranic text as a codified inference in the form of an AI natural language. Other facets discussed by previous studies include semantic search [6, 7], text information retrieval [8], and knowledge modeling and retrieval [9-11]. Lastly, the work in [1] proposed a novel dataset for Quranic words identification and authentication. However, none of the previous related works introduced a machine learning technique to detect Quranic words in online sources.
III. PROPOSED APPROACH
B. Support Vector Machine
Support Vector Machine (SVM) [12] is based on learning, from a training set, a linear hyper-plane that separates positive examples from negative examples. The hyper-plane is placed in the feature space so that it maximizes the distance to the closest positive and negative examples (called support vectors).
The proposed approach, as shown in Fig. 2, starts with the Quranic words dataset provided by [1] (Part A in Fig. 2). Through an iterative cross-validation process, the samples in the dataset are split into training and testing sub-datasets.
Fig. 1. Hyper-plane and support vectors
Thus, SVM is designed to simultaneously minimize the empirical classification error and maximize the geometric margin between positive and negative examples, in our case the Quranic and non-Quranic words. Fig. 1 illustrates the classification procedure of SVM. The line in bold represents the optimal hyper-plane that separates the two classes of examples [13], while the other two lines pass through the support vectors, which are the closest positive and negative examples. The linear classifier is based on two elements: the
Fig. 2. Quranic Words Identification Approach
The training dataset is used to generate the SVM Classification Model (SVM CM), which is tested and evaluated on the testing dataset. The processes in Part A are repeated using different SVM parameters until the highest classification evaluation measurements are recorded. The SVM CM is saved to be used in classifying the words extracted from online sources. On the other hand, preprocessing is applied to the online source content to extract the text, and then the features are extracted. Finally, the SVM CM is used to classify the extracted words, and the Quranic words are highlighted for the user, as in Part B in Fig. 2.
weight vector ( W ) perpendicular to the hyper-plane (learned from the training set, with one component per feature), and a bias ( b ), which determines the offset of the hyper-plane from the origin. An unlabeled example ( x ) is classified as positive if
A. Preprocessing
Two main steps are performed to extract the textual content of the online resource.
$f(x) = W \cdot x + b \geq 0$; otherwise, it is classified as negative. Hence, SVM is a binary classifier. Many existing works have applied SVM as a text classifier, such as [14-17].
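The decision rule of a linear classifier of this kind can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the three-feature weight vector, and the bias value are hypothetical and are not taken from the trained model described in this paper.

```python
from typing import Sequence


def decision_value(w: Sequence[float], x: Sequence[float], b: float) -> float:
    """Return f(x) = w . x + b for a linear classifier."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same length")
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def classify(w: Sequence[float], x: Sequence[float], b: float) -> int:
    """Label +1 (positive, e.g. Quranic) when f(x) >= 0, otherwise -1."""
    return 1 if decision_value(w, x, b) >= 0 else -1


# Hypothetical three-feature model; the numbers are illustrative only.
w = [0.5, -1.0, 2.0]
b = -1.0
print(classify(w, [1.0, 0.0, 1.0], b))  # f = 0.5 + 2.0 - 1.0 = 1.5, so +1
print(classify(w, [0.0, 1.0, 0.0], b))  # f = -1.0 - 1.0 = -2.0, so -1
```

In practice, W and b would be produced by the SVM training step rather than written by hand.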
Filtering, which includes removing HTML tags, non-Arabic characters, numbers, punctuation, and special characters and symbols, except those included in the dataset. Moreover, filtering includes an elimination rule that excludes all words longer than the longest word in the dataset.
There are several advantages of SVM as a text classifier. First, SVM can handle exponentially many or even infinitely many features, because it does not have to represent examples in its
Tokenizing, which means dividing the text into chunks of characters. In this research, we split the text on white space, since the aim is to identify Quranic words.
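The two preprocessing steps can be sketched as follows. This is a minimal illustration, not the prototype's code: the kept character range, the `MAX_WORD_LEN` value, and the function names are assumptions, since the exact set of characters retained in the dataset and the length of its longest word are not listed here.

```python
import html
import re

# Assumed character set: basic Arabic letters plus common diacritics.
# The approach keeps every character that appears in its dataset; that
# exact set is not listed here, so this range is only a stand-in.
ARABIC_CHARS = "\u0621-\u063A\u0641-\u0652\u0670\u0671"
TAG_RE = re.compile(r"<[^>]+>")
NON_ARABIC_RE = re.compile(f"[^{ARABIC_CHARS}\\s]")


def filter_text(raw: str) -> str:
    """Remove HTML tags, then every character outside the kept set."""
    text = html.unescape(TAG_RE.sub(" ", raw))
    return NON_ARABIC_RE.sub(" ", text)


def tokenize(text: str, max_len: int) -> list[str]:
    """Split on white space and drop words longer than max_len characters."""
    return [word for word in text.split() if len(word) <= max_len]


MAX_WORD_LEN = 20  # hypothetical; the approach uses the dataset's longest word

# Two Arabic words (written with escapes) mixed with markup, digits, and Latin text.
sample = "<p>\u0642\u0627\u0644 123 hello: \u0628\u0650\u0633\u0652\u0645\u0650!</p>"
print(tokenize(filter_text(sample), MAX_WORD_LEN))
```

Replacing removed characters with a space, rather than deleting them, keeps adjacent words from being merged before the white-space split.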
these features represent the diacritics count, letters count, diacritics ratio, and letters ratio, respectively. The features are numbered in the dataset as f1, f2, …, f63, f64. The number of any feature in Fig. 3 can be found by treating the numbers in the leftmost column as tens and the numbers in the top row as ones: the position of a feature is the sum of its row's tens and its column's ones. For example, the order of feature “ ” is 41, since it is in the row labeled 4 (i.e., 40) and the column labeled 1 (i.e., 40 + 1 = 41), and so on.
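A possible computation of the four statistical features and of a feature's position in Fig. 3 is sketched below. The text does not define the denominators of the two ratios, so this sketch assumes Dr = Dc / (Dc + Lc) and Lr = Lc / (Dc + Lc); it also assumes a reduced diacritic set of eight basic marks and treats every other character as a letter. All names are hypothetical.

```python
# Assumed diacritic set: the eight basic Arabic marks (tanwin, short vowels,
# shadda, sukun). The dataset's own diacritic list in Fig. 3 is larger.
DIACRITICS = {chr(code) for code in range(0x064B, 0x0653)}


def statistical_features(word: str) -> dict[str, float]:
    """Return Dc, Lc, Dr, Lr under the assumed definitions in the lead-in."""
    dc = sum(1 for ch in word if ch in DIACRITICS)
    lc = len(word) - dc  # assumption: every non-diacritic character is a letter
    total = dc + lc
    dr = dc / total if total else 0.0
    lr = lc / total if total else 0.0
    return {"Dc": dc, "Lc": lc, "Dr": dr, "Lr": lr}


def feature_position(row_label: int, column_label: int) -> int:
    """Position in Fig. 3: the row label counts tens, the column label ones."""
    return 10 * row_label + column_label


word = "\u0628\u0650\u0633\u0652\u0645\u0650"  # three letters, three diacritics
print(statistical_features(word))
print(feature_position(4, 1))  # row 4, column 1, as in the example in the text
```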
B. Features
Many features can be extracted from the text; however, in our approach we are concerned only with features that match those of the dataset. The features in the provided dataset are the Arabic letters, Arabic diacritics, and special symbols that appear in the Quranic text, in addition to four “Statistical” features, as shown in Fig. 3.
IV. DATASET, EXPERIMENTS, AND EVALUATION
A. Dataset
As mentioned before, the dataset provided by [1] is used to train and test the classifier with different parameters. The dataset consists of 93,161 samples, of which 18,994 (20.39%) are Quranic words (this is the number of distinct Quranic words). The remaining 74,167 samples (79.61%) are Arabic words collected from the “Muslim” forum², which is one of the largest Islamic forums on the web, containing more than 2,600,000 posts in 351,429 threads, with more than 66,100 active users. The threads in this forum are distributed over 21 categories covering many aspects of life, such as politics, religion, news, multimedia, computer software and hardware, education, design, and others. The categories are not considered in this research. However, the diversity of topics reflects the diversity of the non-Quranic words, as the Quranic words have approximately the same diversity. More information about the Quranic words dataset can be found in [1].
Fig. 3. Quranic Words Dataset Features
In Fig. 3, the 64 features are divided into three categories as follows:
• The first category contains the “Diacritics” and special symbols (features surrounded by the bold border). This category consists of 20 Arabic diacritics and other special Quranic symbols. Diacritics are used to distinguish the vocal pronunciation of words and represent the vowel sounds [18], whereas the special symbols are used in the Quranic text to indicate special instructions during recitation¹.
B. Experimental Setup
As mentioned in the explanation of Part A of the proposed approach, an iterative cross-validation classification process is applied with different SVM kernels until the best evaluation measurements are recorded. In this research, we used the LIBSVM library [19] to carry out the learning and classification process. Based on this iterative process, we found that the highest evaluation measurements were achieved using the linear kernel. The 10-fold cross-validation technique was applied, in which the dataset is divided into 10 equal parts, and in each iteration nine parts are used for training and building the classification model while the remaining part is used for testing. However, since the dataset is unbalanced and the number of samples in the dataset is not divisible by 10, the stratified sampling method is applied. In stratified sampling, the numbers of samples in the folds are almost equal, and the distribution (class ratio) of samples in each fold is the same as the distribution of samples in the dataset.
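One simple way to obtain such stratified folds is to shuffle the samples of each class and deal them to the folds in turn. The sketch below is an assumption about how this could be done and is not LIBSVM's own procedure; the function name, the seed, and the round-robin scheme are illustrative. Only the two class sizes come from the dataset description.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=10, seed=0):
    """Assign each sample index to one of k folds, class by class.

    Indices of each class are shuffled and dealt round-robin, so every
    fold receives nearly the same number of samples from each class.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    offset = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[(position + offset) % k].append(index)
        offset += len(indices)  # continue dealing where the last class stopped
    return folds


# Class sizes reported for the dataset: 18,994 Quranic and 74,167 other words.
labels = [1] * 18994 + [0] * 74167
folds = stratified_folds(labels)
for test_fold in folds[:2]:
    quranic = sum(labels[i] for i in test_fold)
    print(len(test_fold), quranic)
```

In each of the 10 iterations, one fold serves as the testing set and the other nine are concatenated to form the training set.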
• The second is the “Letters” category, which consists of 35 Arabic letter shapes (features with shaded background in Fig. 3). Although the standard Arabic alphabet has 28 letters, some of these letters are written in different forms according to the position of the letter in the word and other considerations. For example, the first Arabic letter, “Alif”, can be written in many forms, such as “أ”, “إ”, “ا”, and “ﺁ”. In most Arabic text mining, classification, and language identification approaches, these forms are consolidated into one form [2]. However, in Quranic identification and authentication studies, it is inappropriate to change or replace any of the original properties of the Quranic scripture.
C. Evaluation
The quality measures precision, recall, and F-measure are commonly used in classification evaluation, in addition to accuracy. To calculate these evaluation measures, the sets of true positives (TP), false positives (FP), true negatives (TN), and false
• The third category is the “Statistical” features. It consists of four features, namely “Dc”, “Lc”, “Dr”, and “Lr”,
1 http://www.abouttajweed.com/kb/entry/356/ (and) http://www.ilmfruits.com/2008/tajweed-different-stops/
This work is supported by the Research Management Centre (RMC) at the Universiti Teknologi Malaysia under Research University Grant (Vot 01G72) and by the Ministry of Science, Technology & Innovations Malaysia under Science Fund (Vot 4S062).
2 http://www.muslm.org (accessed on May 15, 2014)
negatives (FN) were determined. Based on these sets, the quality measures were calculated as follows:

$$\text{Precision} = \frac{TP}{TP + FP},$$

$$\text{Recall} = \frac{TP}{TP + FN},$$

$$F\text{-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},$$

and

$$\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$

Fig. 4. Features based Evaluation measurement
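The four measures can be computed directly from the confusion counts, as in the following sketch. The function name and the counts used in the example are hypothetical and are not results from the experiments; returning 0.0 when a denominator is zero is a convention chosen here, not one stated in the text.

```python
def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute precision, recall, F-measure, and accuracy from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = precision + recall
    f_measure = 2 * precision * recall / denominator if denominator else 0.0
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
        "accuracy": accuracy,
    }


# Hypothetical confusion counts, for illustration only.
scores = evaluate(tp=70, fp=10, tn=900, fn=20)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

On an unbalanced dataset such as this one, accuracy can stay high even when recall is low, which is why the four measures are reported together.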
As seen in Fig. 4, the accuracy ranges between 80% and 92%, while the precision ranges between 61% and 90%. However, the minimum values of recall and F-measure are below 20%, and their maximum values are about 71% and 77%, respectively.
Where TP is the number of Quranic words correctly identified as Quranic, FP is the number of non-Quranic words incorrectly identified as Quranic, TN is the number of non-Quranic words correctly identified, and FN is the number of Quranic words incorrectly identified as non-Quranic.
V. RESULTS AND DISCUSSION
Four experiments were conducted based on the feature categories. In the first experiment, we considered the “Diacritics” category, which contains 25 features. The “Letters” category, which contains 35 features, is considered in the second experiment. In the third experiment, we considered the “Statistical” category, which consists of four features. Finally, all of the dataset's attributes (64 attributes) are considered in the last experiment. The results of the experiments were as follows:
A. Results
Fig. 4 shows the evaluation measurements of the SVM classifier for the different experiments, while Fig. 5 shows the time required for training and testing on the dataset in each experiment. In addition, Fig. 6 shows the sparsity of the features matrix, where the sparsity of a matrix is calculated as follows:
Fig. 5. Classifier training and testing time
Fig. 5 shows the training and testing time of the SVM classifier for the different feature categories, that is, for different numbers of features. The time required for training and testing in the “Statistical” features experiment and in the experiment using all of the dataset's features is very long (about 700 and 800 seconds, respectively) compared to the time required in the “Letters” and “Diacritics” experiments.
$$\text{Sparsity} = \left[ 1 - \frac{\text{number of non-zero values}}{\text{total number of values}} \right] \times 100\%$$
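As a small worked example, the sparsity of a feature matrix stored as a list of rows can be computed as follows. The function name and the two toy matrices are illustrative assumptions and do not come from the dataset.

```python
def sparsity_percent(matrix) -> float:
    """Return (1 - non-zero values / total values) * 100 for a list of rows."""
    total = sum(len(row) for row in matrix)
    if total == 0:
        raise ValueError("matrix has no values")
    non_zero = sum(1 for row in matrix for value in row if value != 0)
    return (1 - non_zero / total) * 100


# Toy matrices: one fully dense, one with 2 non-zero values out of 8.
dense = [[3, 1, 0.5, 0.5], [2, 4, 0.33, 0.67]]
sparse = [[0, 1, 0, 0], [0, 0, 0, 2]]
print(sparsity_percent(dense))   # 0.0
print(sparsity_percent(sparse))  # 75.0
```

A matrix with no zero entries therefore has 0% sparsity, that is, it is 100% dense.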
VI. PROTOTYPE
Based on the proposed approach, a web-based prototype is developed, as shown in Fig. 7. The Java language is used to implement the back-end processes, such as preprocessing and the
Fig. 6. Category based Matrix Sparsity
Fig. 6 shows the relation between matrix sparsity, the number of features, and the time required for training and testing. The longest time, about 800 seconds, was required for training and testing with all 64 features in the dataset. Although the full features matrix was less sparse than the “Diacritics” and “Letters” matrices, it required more time for training and testing. On the other hand, the “Statistical” features matrix, which consists of four features, was 100% dense; it also required more time than the “Diacritics” and “Letters” matrices, and not much less than the time required by the 64-feature matrix. In addition, Fig. 6 shows that the maximum classification accuracy (91.28%) is achieved using the full features matrix, whereas the accuracies achieved with the other feature categories were in the 80%-90% range. Moreover, Fig. 6 shows that the accuracy achieved with the “Diacritics” features is greater than that achieved with the “Statistical” features, while the required time is significantly less.
Fig. 7. Prototype user interface
SVM classification, while JSP is used for the user interface. In the prototype, after the URL of the online source to be checked is submitted, the back-end process reads the source and performs the preprocessing to extract the words of the text. Next, the features of each word are extracted, and the word is classified using the classification model created in the off-line training phase. The words classified as Quranic are highlighted in the “Detected Verses” section of the interface.
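The prototype itself is written in Java and JSP; the Python sketch below only illustrates the final step of this flow, in which each extracted word is passed to a classifier and the words flagged as Quranic are marked. The `[[...]]` markers, the function names, and the stand-in classifier are assumptions made for the example; the prototype would call the trained SVM model at that point.

```python
from typing import Callable, Iterable


def highlight_quranic(words: Iterable[str], is_quranic: Callable[[str], bool]) -> str:
    """Wrap each word that the classifier flags in [[...]] markers."""
    return " ".join(f"[[{word}]]" if is_quranic(word) else word for word in words)


def has_diacritic(word: str) -> bool:
    """Stand-in classifier: flags words that contain a basic Arabic diacritic.

    This is a placeholder for the trained model, not the paper's method.
    """
    return any("\u064B" <= ch <= "\u0652" for ch in word)


# One word without diacritics and one fully diacritized word.
words = ["\u0642\u0627\u0644", "\u0628\u0650\u0633\u0652\u0645\u0650"]
print(highlight_quranic(words, has_diacritic))
```

Passing the classifier as a callable keeps the highlighting step independent of how the model is loaded or which features it uses.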
B. Discussion
The results above show that the maximum accuracy achieved is 91.28%, which indicates high correctness in distinguishing Quranic from non-Quranic words when all features in the dataset are considered. In these experiments, using more features gave a more accurate separation. Moreover, the results show that the maximum precision, about 90%, is achieved with the “Diacritics” features, which indicates high exactness in detecting Quranic words and suggests that diacritics are significant for this task. In addition, the accuracy achieved by the proposed approach is higher than the accuracy achieved by the traditional techniques proposed in [4], such as the Diacritic Ratio and Starting/Ending brackets and phrases. The achieved results are also better than those of the Naïve Bayes classifier reported in [1].
VII. CONCLUSION AND FUTURE WORK
In this paper, a machine learning approach for detecting Quranic words in online sources is presented. The Quranic words identification dataset is used to build the classification model, which is then used to classify the text extracted from the online source. Several experiments are conducted based on different subsets of features, and a web-based prototype is developed. Results show that the accuracy achieved by the proposed approach is higher than the accuracy achieved by previous methods. However, the sparsity of the features matrix and the number of features play a role in the time required for building the classification model. Our future work will focus on applying more machine learning and optimization techniques in order to achieve higher evaluation measurements, and on incorporating these methods to improve the detection and authentication of Quranic verses in images.
ACKNOWLEDGMENT
The authors would like to thank the colleagues in the Software Engineering Research Group (SERG), Universiti Teknologi Malaysia who provided insight and expertise that greatly assisted the research. Moreover, the thanks are extended to the Research Management Centre (RMC) at the Universiti Teknologi Malaysia and the Ministry of Science, Technology & Innovations Malaysia for supporting this work.
REFERENCES
[1] T. Sabbah and A. Selamat, “A novel dataset for quranic words identification and authentication,” in 2nd International Conference on Interactive Digital Media (ICIDM 2013), 2013, Sarawak, Malaysia.
[2] A. Alshareef and A.E. Saddik, “A Quranic quote verification algorithm for verses authentication,” in Innovations in Information Technology (IIT), 2012 International Conference on, 2012.
[3] E. Alsulamy, “Fundamentalists used Quran and Sunni to extract the rules of fundamentalism,” 1999, Riyadh: Al Rushed library.
[4] T. Sabbah and A. Selamat, “A framework for Quranic verses authenticity detection in online forum,” in Taibah University International Conference on Advances in Information Technology for the Holy Quran and Its Sciences, 2013, Madinah, KSA.
[5] A.F. Shamsudin and A. Farooq, “AI natural language in meta-synthetics of Al-Qur'an,” in TENCON 2000. Proceedings, 2000.
[6] H.S. Al-Khalifa, M.M. Al-Yahya, A. Bahanshal, and I. Al-Odah, “SemQ: A proposed framework for representing semantic opposition in the Holy Quran using Semantic Web technologies,” in Current Trends in Information Technology (CTIT), 2009 International Conference on the, 2009.
[7] M. Shoaib, N. Yasin, U.K. Hikmat, M.I. Saeed, and M.S.H. Khiyal, “Relational WordNet model for semantic search in Holy Quran,” in Emerging Technologies, 2009. ICET 2009. International Conference on, 2009.
[8] M.F. Noordin and R. Othman, “An information retrieval system for Quranic texts: a proposed system design,” in Information and Communication Technologies, 2006. ICTTA '06. 2nd, 2006.
[9] S. Baqai, A. Basharat, H. Khalid, A. Hassan, and S. Zafar, “Leveraging semantic web technologies for standardized knowledge modeling and retrieval from the Holy Qur'an and religious texts,” in Proceedings of the 7th International Conference on Frontiers of Information Technology, 2009, ACM: Abbottabad, Pakistan, p. 1-6.
[10] A.R. Yauri, R.A. Kadir, A. Azman, and M.A.A. Murad, “Quranic-based concepts: Verse relations extraction using Manchester OWL syntax,” in Information Retrieval & Knowledge Management (CAMP), 2012 International Conference on, 2012.
[11] T. Mukhtar, H. Afzal, and A. Majeed, “Vocabulary of quranic concepts: a semi-automatically created terminology of holy Quran,” in Multitopic Conference (INMIC), 2012 15th International, 2012.
[12] B.E. Boser, I.M. Guyon, and V.N. Vapnik, “A training algorithm for optimal margin classifiers,” in Proceedings of the fifth annual workshop on Computational learning theory, 1992, ACM: Pittsburgh, Pennsylvania, USA, p. 144-152.
[13] L. Lee, W. ChinHeng, R. Rajprasad, and I. Dino, “An enhanced Support Vector Machine classification framework by using Euclidean distance function for text document categorization,” Applied Intelligence, 2012. 37(1): p. 80-99.
[14] S. Tong and D. Koller, “Support vector machine active learning with applications to text classification,” J. Mach. Learn. Res., 2002. 2: p. 45-66.
[15] Chen, Bourlard, and Thiran, “Text identification in complex background using SVM,” in Proceedings of the International Conference on Computer Vision and Pattern Recognition, 2001.
[16] Y. Yiming and L. Xin, “A re-examination of text categorization methods,” in Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, 1999, ACM: Berkeley, California, USA, p. 42-49.
[17] T. Joachims, “Text categorization with support vector machines: learning with many relevant features,” in Proceedings of the 10th European Conference on Machine Learning, 1998, Springer-Verlag, p. 137-142.
[18] M.A. Aabed, S.M. Awaideh, A.R.M. Elshafei, and A.A. Gutub, “Arabic diacritics based steganography,” in Signal Processing and Communications, 2007. ICSPC 2007. IEEE International Conference on, 2007.
[19] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector machines,” ACM Trans. Intell. Syst. Technol., 2011. 2(3): p. 1-27.