2015 IEEE Conference on Open Systems (ICOS), August 24-26, 2015, Melaka, Malaysia
Malay Document Clustering using Complete Linkage Clustering Technique with Cosine Coefficient
Nurazzah Abd Rahman, Zainab Abu Bakar, Nurul Syeilla Syazhween Zulkefli
Department of Computer Science, Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, Shah Alam, Selangor, Malaysia
{nurazzah, zainab}@tmsk.uitm.edu.my,
[email protected]
978-1-4673-9434-5/15/$31.00 ©2015 IEEE

Abstract - Finding useful and relevant information is a challenging task for users. A retrieval system usually responds with a long list of documents that are not necessarily relevant to the user's need. Document clustering is a technique that organizes documents so that documents in the same cluster are similar to each other and documents in different clusters are dissimilar. This paper focuses on document clustering for a Malay test collection consisting of 2028 Malay translated Hadith documents from the book Sahih Bukhari. It presents the results of applying the Complete Linkage Clustering algorithm with the Cosine Coefficient to these documents. The experiments are evaluated using the Recall (R), Precision (P) and Effectiveness (E) measures, and are conducted on 100 clusters, 50 clusters and 20 clusters. The results show that as the number of clusters decreases, Recall (R) increases but Precision (P) decreases. Results for the Effectiveness (E) measure, compared with the non-clustered documents, show that applying a clustering algorithm improves the effectiveness of the searching process. For this experiment, 20 clusters is more effective than the other settings.

Keywords - Malay Information Retrieval, Document Clustering, Hierarchical Agglomerative Clustering, Complete Linkage, Cosine Similarity

I. INTRODUCTION
The Web is a major and active source of information for users. As the amount of data increases rapidly, search engines that can keep up in organizing and retrieving data effectively are needed. Most search engines today return large numbers of documents whether or not they are relevant to the user's need. Thus, it is very difficult for users to search for and find the relevant documents [1,2,3]. One reason for this problem is that the indexes are not modernized [2]. Document clustering is one of the methods that could handle this problem [4,5,6]. Based on the literature review, only a limited number of search engines for Malay text documents apply clustering techniques in their algorithms [4,7]. Therefore, this research is conducted to evaluate the effectiveness of applying a clustering technique in retrieving Malay text documents.

II. DOCUMENT CLUSTERING
Document clustering has been studied in the field of Information Retrieval (IR) for many years. Overviews of the existing techniques and applications are given in [4,18]. A clustering technique groups a set of documents into subsets, or clusters. The goal is to create clusters that are internally coherent but clearly different from each other. The use of clustering in IR is based mostly on the Cluster Hypothesis: "closely associated documents tend to be relevant to the same request" [8]. In other words, documents within a cluster should be as similar as possible, and documents in one cluster should be as dissimilar as possible from documents in other clusters. In this context, each group of documents is expected to share a common theme or topic.

All methods of document clustering require several steps of data preprocessing. First, the stopwords are removed. Stopwords are words that do not carry any meaning and do not have any indexing value [4,7,17]. Examples of stopwords in the Malay language are dari, yang, oleh, kepada and many more. Next, the terms are reduced to their basic stems using a stemming algorithm. Stemming algorithms are used to identify morphological variants and are language dependent [4,7,19]. To stem Malay text effectively, not only suffixes but also prefixes and infixes must be removed in the proper order [9,10,11].

The standard clustering algorithms are grouped according to the type of cluster structure they produce. Clustering algorithms fall into two main categories, partitioning and hierarchical. Partitioning algorithms, such as K-means, produce a flat partition of objects without any explicit structure that relates clusters to each other. Hierarchical algorithms, on the other hand, produce a more informative hierarchy of clusters called a dendrogram [4,5].

To measure the degree of association between two documents, a distance measure or a measure of similarity or dissimilarity is needed [5,12]. A number of similarity measures are available. The choice of similarity measure is vital because it affects the clustering results obtained [4]. In cluster-based retrieval, the determination of inter-document similarity depends on both the document representation, in terms of the weights assigned to the indexing terms characterising each document, and the similarity coefficient that is chosen. The coefficient values range between zero and one. The Dice, Jaccard and Cosine coefficients are the most often used measures for document clustering because of their simplicity and normalisation [4,12].
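The three coefficients named above can be illustrated with a minimal sketch. It assumes the usual weighted-vector forms of the Dice, Jaccard and Cosine coefficients over non-negative term weights; the function names and the two toy document vectors are hypothetical and are not taken from the paper's implementation.

```python
import math

def dot(x, y):
    """Inner product of two equal-length term-weight vectors."""
    return sum(a * b for a, b in zip(x, y))

def cosine(x, y):
    """Cosine coefficient: inner product divided by the product of vector lengths."""
    denom = math.sqrt(dot(x, x)) * math.sqrt(dot(y, y))
    return dot(x, y) / denom if denom else 0.0

def dice(x, y):
    """Dice coefficient for weighted vectors."""
    denom = dot(x, x) + dot(y, y)
    return 2.0 * dot(x, y) / denom if denom else 0.0

def jaccard(x, y):
    """Jaccard coefficient for weighted vectors."""
    denom = dot(x, x) + dot(y, y) - dot(x, y)
    return dot(x, y) / denom if denom else 0.0

# Hypothetical term-weight vectors over a three-term vocabulary (illustration only).
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 1.0, 1.0]

print(f"{cosine(doc_a, doc_b):.3f}")   # 0.730
print(f"{dice(doc_a, doc_b):.3f}")     # 0.727
print(f"{jaccard(doc_a, doc_b):.3f}")  # 0.571
```

With non-negative weights all three values fall between zero and one, which matches the range stated above; identical vectors score one and vectors with no shared terms score zero.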
III. RELATED WORK
Owing to the growth of Islamic documents on Islamic websites and the high percentage of Islamic scholars who refer to the Internet to seek knowledge [4,11,13], users are left alone to decide which documents are relevant to their information needs. To help users fulfil their information needs, especially in retrieving relevant Malay Islamic documents, this research investigates the performance of retrieving relevant Malay text documents from the corpus by applying a clustering technique to a Malay test collection. This paper is an extension of previous work [4]. The previous experiments showed that the Complete Linkage clustering technique with the Cosine coefficient gives a better performance for Malay text documents. That experiment was done on a small test set of 50 documents. Hence, this experiment continues on a larger scale using the Complete Linkage clustering technique with Cosine coefficient similarity. The goal of this work is to assess the effectiveness of retrieving Malay text from a larger test collection of 2028 documents, compared with non-clustered documents.

IV. EXPERIMENTS
A. Test Collection
This experiment used the Hadith test collection of Sahih Bukhari, which consists of 2028 Hadith text documents. The original stopword list from Fatimah [9] and Zainab [11] is used, with additional stopwords found by Nurazzah [4] in the process of creating the Hadith text documents.

B. Pre-Processing
Thirty queries are selected based on the popular topics asked by users [14], as listed in Table I. Each query entered by the user is processed by removing all stopwords to eliminate noisy characters and terms. The remaining keywords are then stemmed to obtain their root words. The stemming algorithm for the Malay language written by [9], called Rules Application Order (RAO), is used in this experiment.

TABLE I. LIST OF RELEVANT JUDGMENTS FOR NATURAL MALAY LANGUAGE QUERIES

| Query # | Queries | No. of Relevant Documents |
|---|---|---|
| 1 | Tuntutlah ilmu hingga ke liang lahad | 20 |
| 2 | Halal dan haram mengenai makanan dan minuman | 35 |
| 3 | Adab-adab berkaitan makan dan minum | 18 |
| 4 | Hormati kedua ibu bapa | 8 |
| 5 | Hukum hudud | 16 |
| 6 | Kewajipan menutup aurat dan had menutup aurat | 7 |
| 7 | Beriman kepada Allah swt | 43 |
| 8 | Cara-cara Bertayammum | 7 |
| 9 | Tentang hukum berpuasa | 69 |
| 10 | Jihad dan siapakah yang dituntut untuk melakukannya | 18 |
| 11 | Hukum bermuamalat dengan orang kafir | 30 |
| 12 | Hukum bernazar | 11 |
| 13 | Kedudukan wanita dalam Islam | 59 |
| 14 | Penghijrahan ke Madinah | 17 |
| 15 | Penjagaan binatang pemeliharaan dan tanggungjawab kita terhadap binatang | 28 |
| 16 | Apakah hukum berhias bagi kaum wanita | 11 |
| 17 | Kaum Muhajirin dan Ansar | 51 |
| 18 | Nama-nama malaikat dan tugas-tugasnya | 41 |
| 19 | Kewajipan dan rukun menunaikan haji | 104 |
| 20 | Muzik dalam Islam | 3 |
| 21 | Solat malam atau tahajjud | 42 |
| 22 | Bagaimana cara solat Jenazah? | 14 |
| 23 | Hadis berkaitan azan | 60 |
| 24 | Tafsir mimpi | 26 |
| 25 | Harta yang diwajibkan zakat | 69 |
| 26 | Nikah kahwin dan penceraian | 48 |
| 27 | Pemimpin yang adil | 72 |
| 28 | Pembahagian harta mengikut faraid | 18 |
| 29 | Balasan di hari kiamat | 28 |
| 30 | Haid | 19 |
| Total | | 972 |
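The query pre-processing step (stopword removal followed by stemming) could be sketched as below. This is only an illustration under stated assumptions: the stopword set contains the four examples named in Section II plus "dan", which is an added assumption, and the stemmer is an identity placeholder because the RAO algorithm of [9] is not specified in this paper. The names `STOPWORDS`, `identity_stem` and `preprocess` are hypothetical.

```python
import re

# "dari", "yang", "oleh", "kepada" are the stopword examples given in Section II.
# "dan" is an extra assumption added for this illustration only.
STOPWORDS = {"dari", "yang", "oleh", "kepada", "dan"}

def identity_stem(word):
    """Placeholder stemmer that returns the word unchanged (NOT the RAO algorithm)."""
    return word

def preprocess(query, stopwords=STOPWORDS, stem=identity_stem):
    """Lowercase the query, tokenize it, drop stopwords, then stem each keyword."""
    # Keep hyphenated reduplications such as "adab-adab" as a single token.
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", query.lower())
    return [stem(token) for token in tokens if token not in stopwords]

# Query 10 from Table I.
print(preprocess("Jihad dan siapakah yang dituntut untuk melakukannya"))
# ['jihad', 'siapakah', 'dituntut', 'untuk', 'melakukannya']
```

Passing the stemmer as a parameter keeps the sketch independent of any particular Malay stemming algorithm; a real stemmer would be supplied through the `stem` argument.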
C. Document Clustering
The Complete Linkage Clustering algorithm is applied to the index terms to form clusters of Hadith. To measure the similarity between two documents, document 1 and document 2, represented in the Vector Space Model (VSM), the Cosine Coefficient similarity measure is used, which is defined as the cosine of the angle between the two vectors. It is used to decide whether the retrieved index term is accepted or rejected [6].

D. Relevance Judgment
Experts on Hadith studies, Ustadz Ezani Bin Yaakub and Ustaz Mohd. Takiyuddin Bin Hj. Ibrahim from the Centre for Islamic Thought & Understanding (CITU), Universiti Teknologi MARA (UiTM), Shah Alam, Malaysia [15,16], were consulted, and the main Hadith books were referred to, in order to formulate a relevance judgment list for each query. The relevance judgment list (see Table I) consists of the document numbers that should be clustered for each query, and serves as a guide for evaluating the effectiveness of clustering techniques in retrieving relevant Malay translated Hadith documents.

E. Evaluation Process
Evaluation is the key to determining whether a search is effective. Effectiveness measures the ability of the system to find the right information for the user's need. This experiment used the two most common effectiveness measures, Recall (R) and Precision (P). Recall is the proportion of relevant documents that are clustered, and Precision is the proportion of clustered documents that are relevant. The Effectiveness (E) measure is a weighted combination of R and P [1, 27]. The value of β > 0 indicates how many times more important R is compared to P. If β = 1.0, R and P are equally important, while if β = 2.0, R is twice as important as P [4]. In this experiment, β = 2.0 is used since the aim of this work is to retrieve the relevant documents.

$$\text{Recall } (R) = \frac{\#(\text{relevant documents retrieved})}{\#(\text{relevant documents in collection})} \tag{1}$$

$$\text{Precision } (P) = \frac{\#(\text{relevant documents retrieved})}{\#(\text{documents retrieved})} \tag{2}$$

$$\text{Effectiveness } (E) = 1 - \frac{(1 + \beta^2)\,P R}{\beta^2 P + R} \tag{3}$$

V. RESULTS AND DISCUSSION
The clustering experiments use the complete selected collection of Sahih Bukhari, a Malay translated Hadith book consisting of 2028 documents. This experiment employs the Complete Linkage clustering technique with the Cosine Coefficient similarity measure. The text collection is clustered into different numbers of clusters in different experiments, to examine whether the number of clusters affects the Recall (R), Precision (P) and Effectiveness (E) scores. Three experiments are run to produce three sets of clusters based on different cutoff points: 100 clusters, 50 clusters and 20 clusters. Each set of clusters is then tested on the same set of 30 queries, and the results are compared with those of the non-clustered documents using the same 30 queries. The Recall (R), Precision (P) and Effectiveness (E) measures for each query are recorded in Table II, Table III and Table IV respectively. Recall (R) and Precision (P) are important in the IR community for evaluating the Effectiveness (E) of the searching process of a retrieval system. For these experiments, Recall (R) and Precision (P) are selected from the maximum values evaluated for each cluster. The results show that as the number of clusters decreases, Recall (R) increases but Precision (P) decreases.

TABLE II. RECALL COMPARISON BETWEEN CLUSTERED AND NON-CLUSTERED DOCUMENTS FOR MALAY HADITH RETRIEVAL

Recall (R); N = number of clusters

| Query # | Clustered, N = 100 | Clustered, N = 50 | Clustered, N = 20 | Non-Clustered |
|---|---|---|---|---|
| 1 | 0.15 | 0.25 | 0.50 | 0.65 |
| 2 | 0.11 | 0.11 | 0.17 | 0.89 |
| 3 | 0.13 | 0.13 | 0.39 | 0.75 |
| 4 | 0.25 | 0.25 | 0.50 | 0.25 |
| 5 | 0.19 | 0.19 | 0.19 | 0.94 |
| 6 | 0.29 | 0.29 | 0.38 | 0.57 |
| 7 | 0.23 | 0.23 | 0.23 | 0.74 |
| 8 | 0.14 | 0.29 | 0.45 | 0.14 |
| 9 | 0.12 | 0.16 | 0.20 | 0.70 |
| 10 | 0.06 | 0.06 | 0.48 | 0.44 |
| 11 | 0.13 | 0.13 | 0.27 | 0.73 |
| 12 | 0.14 | 0.21 | 0.21 | 0.71 |
| 13 | 0.12 | 0.12 | 0.12 | 0.93 |
| 14 | 0.06 | 0.12 | 0.18 | 0.88 |
| 15 | 0.11 | 0.11 | 0.17 | 0.96 |
| 16 | 0.18 | 0.27 | 0.32 | 0.18 |
| 17 | 0.08 | 0.08 | 0.21 | 0.98 |
| 18 | 0.07 | 0.12 | 0.12 | 0.85 |
| 19 | 0.35 | 0.39 | 0.43 | 0.53 |
| 20 | 0.33 | 0.33 | 0.69 | 0.33 |
| 21 | 0.10 | 0.10 | 0.18 | 0.98 |
| 22 | 0.07 | 0.07 | 0.33 | 0.79 |
| 23 | 0.07 | 0.07 | 0.17 | 0.33 |
| 24 | 0.08 | 0.19 | 0.21 | 1.00 |
| 25 | 0.04 | 0.09 | 0.15 | 0.48 |
| 26 | 0.08 | 0.08 | 0.13 | 0.40 |
| 27 | 0.04 | 0.07 | 0.17 | 0.25 |
| 28 | 0.11 | 0.11 | 0.16 | 0.68 |
| 29 | 0.07 | 0.08 | 0.16 | 0.54 |
| 30 | 0.21 | 0.21 | 0.32 | 0.68 |
| AVG | 0.14 | 0.16 | 0.27 | 0.64 |
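The clustering step of Section IV.C, producing a fixed number of clusters as in the experiments above, could be sketched as follows. This is a naive illustration and not the authors' implementation: it assumes documents are already represented as term-weight vectors, it stops merging once the requested number of clusters remains, and the function names and toy vectors are hypothetical.

```python
import math

def cosine_similarity(x, y):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0

def complete_linkage(vectors, n_clusters):
    """Agglomerative clustering; returns clusters as lists of document indices.

    Under complete linkage, the similarity between two clusters is the
    similarity of their LEAST similar pair of members.
    """
    n = len(vectors)
    sim = [[cosine_similarity(vectors[i], vectors[j]) for j in range(n)]
           for i in range(n)]
    clusters = [[i] for i in range(n)]  # start with one cluster per document
    target = max(1, n_clusters)
    while len(clusters) > target:
        best = None  # (linkage similarity, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                link = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or link > best[0]:
                    best = (link, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Hypothetical term-weight vectors for five documents (illustration only).
docs = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.9, 0.2],
    [0.0, 0.0, 1.0],
]
print(complete_linkage(docs, 3))  # [[0, 1], [2, 3], [4]]
```

The pairwise search is recomputed at every merge, which keeps the sketch short but is slow for a collection of 2028 documents; a practical run would normally rely on an optimized hierarchical clustering library.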
Table II shows that the Recall (R) scores increase as the number of clusters is reduced from 100 to 50 and then to 20. Based on the Recall (R) scores, 20 clusters produce the best results compared with 100 clusters and 50 clusters.

TABLE III. PRECISION COMPARISON BETWEEN CLUSTERED AND NON-CLUSTERED DOCUMENTS FOR MALAY HADITH RETRIEVAL

Precision (P); N = number of clusters

| Query # | Clustered, N = 100 | Clustered, N = 50 | Clustered, N = 20 | Non-Clustered |
|---|---|---|---|---|
| 1 | 0.89 | 0.85 | 0.50 | 0.81 |
| 2 | 0.85 | 0.78 | 0.11 | 0.19 |
| 3 | 0.33 | 0.33 | 0.10 | 0.09 |
| 4 | 0.33 | 0.13 | 0.04 | 0.09 |
| 5 | 0.85 | 0.17 | 0.07 | 0.58 |
| 6 | 0.33 | 0.08 | 0.07 | 0.06 |
| 7 | 0.75 | 0.70 | 0.11 | 0.06 |
| 8 | 0.84 | 0.29 | 0.08 | 0.03 |
| 9 | 0.50 | 0.33 | 0.33 | 0.41 |
| 10 | 0.50 | 0.50 | 0.14 | 0.16 |
| 11 | 0.29 | 0.20 | 0.10 | 0.02 |
| 12 | 0.33 | 0.33 | 0.33 | 0.43 |
| 13 | 0.74 | 0.65 | 0.09 | 0.29 |
| 14 | 0.75 | 0.70 | 0.09 | 0.19 |
| 15 | 0.95 | 0.85 | 0.20 | 0.32 |
| 16 | 0.25 | 0.08 | 0.08 | 0.01 |
| 17 | 0.70 | 0.56 | 0.20 | 0.31 |
| 18 | 0.33 | 0.33 | 0.13 | 0.29 |
| 19 | 0.27 | 0.29 | 0.47 | 0.89 |
| 20 | 0.25 | 0.20 | 0.03 | 0.01 |
| 21 | 0.33 | 0.33 | 0.09 | 0.07 |
| 22 | 0.50 | 0.50 | 0.13 | 0.02 |
| 23 | 0.68 | 0.62 | 0.67 | 0.36 |
| 24 | 0.76 | 0.65 | 0.07 | 0.96 |
| 25 | 0.84 | 0.74 | 0.21 | 0.21 |
| 26 | 0.18 | 0.18 | 0.10 | 0.41 |
| 27 | 0.33 | 0.33 | 0.10 | 0.45 |
| 28 | 0.50 | 0.25 | 0.17 | 0.05 |
| 29 | 0.65 | 0.40 | 0.25 | 0.09 |
| 30 | 0.54 | 0.57 | 0.33 | 0.59 |
| AVG | 0.55 | 0.43 | 0.18 | 0.28 |

TABLE IV. EFFECTIVENESS COMPARISON BETWEEN CLUSTERED AND NON-CLUSTERED DOCUMENTS FOR MALAY HADITH RETRIEVAL

Effectiveness (E); N = number of clusters

| Query # | Clustered, N = 100 | Clustered, N = 50 | Clustered, N = 20 | Non-Clustered |
|---|---|---|---|---|
| 1 | 0.18 | 0.29 | 0.50 | 0.32 |
| 2 | 0.13 | 0.13 | 0.15 | 0.49 |
| 3 | 0.14 | 0.14 | 0.25 | 0.70 |
| 4 | 0.26 | 0.21 | 0.15 | 0.82 |
| 5 | 0.22 | 0.18 | 0.14 | 0.17 |
| 6 | 0.29 | 0.19 | 0.20 | 0.78 |
| 7 | 0.27 | 0.27 | 0.19 | 0.77 |
| 8 | 0.17 | 0.29 | 0.24 | 0.92 |
| 9 | 0.14 | 0.18 | 0.22 | 0.39 |
| 10 | 0.07 | 0.07 | 0.33 | 0.67 |
| 11 | 0.15 | 0.14 | 0.20 | 0.91 |
| 12 | 0.16 | 0.23 | 0.23 | 0.37 |
| 13 | 0.14 | 0.14 | 0.11 | 0.35 |
| 14 | 0.07 | 0.14 | 0.15 | 0.49 |
| 15 | 0.13 | 0.13 | 0.18 | 0.31 |
| 16 | 0.19 | 0.19 | 0.20 | 0.97 |
| 17 | 0.10 | 0.09 | 0.21 | 0.32 |
| 18 | 0.09 | 0.14 | 0.12 | 0.39 |
| 19 | 0.33 | 0.36 | 0.44 | 0.42 |
| 20 | 0.31 | 0.29 | 0.14 | 0.96 |
| 21 | 0.11 | 0.11 | 0.15 | 0.73 |
| 22 | 0.09 | 0.09 | 0.25 | 0.91 |
| 23 | 0.08 | 0.08 | 0.20 | 0.66 |
| 24 | 0.09 | 0.22 | 0.15 | 0.01 |
| 25 | 0.05 | 0.11 | 0.16 | 0.62 |
| 26 | 0.09 | 0.09 | 0.12 | 0.60 |
| 27 | 0.05 | 0.08 | 0.15 | 0.73 |
| 28 | 0.12 | 0.12 | 0.16 | 0.82 |
| 29 | 0.08 | 0.10 | 0.17 | 0.74 |
| 30 | 0.24 | 0.24 | 0.32 | 0.34 |
| AVG | 0.15 | 0.17 | 0.21 | 0.59 |
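The three evaluation measures of Equations (1) to (3) can be written out as a short sketch. It assumes that the retrieved (clustered) documents and the relevance judgments are available as sets of document numbers; the function names and the document numbers in the example are hypothetical and are not drawn from the test collection.

```python
def recall(retrieved, relevant):
    """Equation (1): share of the relevant documents that were retrieved."""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

def precision(retrieved, relevant):
    """Equation (2): share of the retrieved documents that are relevant."""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def effectiveness(p, r, beta=2.0):
    """Equation (3): E = 1 - (1 + beta^2) * P * R / (beta^2 * P + R)."""
    b2 = beta * beta
    denom = b2 * p + r
    if denom == 0:
        return 1.0  # convention chosen for this sketch when P = R = 0
    return 1.0 - (1.0 + b2) * p * r / denom

# Hypothetical document numbers (illustration only).
relevant = {1, 2, 3, 4}
retrieved = {1, 2, 5}

r = recall(retrieved, relevant)
p = precision(retrieved, relevant)
print(f"R = {r:.3f}")                              # R = 0.500
print(f"P = {p:.3f}")                              # P = 0.667
print(f"E = {effectiveness(p, r, beta=2.0):.3f}")  # E = 0.474
```

The `beta` parameter defaults to 2.0 to mirror the setting stated in Section IV.E, where Recall is treated as twice as important as Precision.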
The results in Table III show that the Precision (P) scores for the selected natural language queries are highest when the Malay corpus is clustered into 100 clusters, compared with 50 or 20 clusters. The number of clusters recommended for this test collection based on Precision (P) scores is therefore 100. The Recall (R) and Precision (P) scores indicate that clustering is more effective than the conventional non-clustering technique. For the Effectiveness (E) measure, the scores are calculated as the weighted combination of Recall (R) and Precision (P) in which Recall (R) is twice as important as Precision (P). The Effectiveness (E) measure shows that 20 clusters obtains the best scores, ahead of 50 clusters and 100 clusters. It is therefore concluded that the clustering technique improves the Malay retrieval system compared with the non-clustering technique.

VI. CONCLUSION AND FUTURE WORK
This research investigates the performance of retrieving relevant Malay text documents from the corpus by applying the Complete Linkage Clustering technique with the Cosine Coefficient similarity measure to the Malay translated Hadith test collection, Sahih Bukhari. The study aims to help users fulfil their information needs, especially in retrieving relevant Malay-language Islamic documents. The experiments in the previous section are evaluated using Recall (R), Precision (P) and Effectiveness (E). The results are discussed for 20 clusters, 50 clusters and 100 clusters, and the scores of clustered and non-clustered documents are compared. In summary, the results show that as the number of clusters decreases, Recall (R) scores increase and Precision (P) scores decrease. Results for the Effectiveness (E) measure show that 20 clusters achieves the best scores compared with 50 clusters and 100 clusters.
Given the evidence presented, the clustering technique can be said to be more effective in enhancing the searching process for a Malay retrieval system. Thus, it fulfils the aim of assisting users in getting the relevant documents effectively. However, the document clustering technique still has limitations. Therefore, we present a possible idea that can improve the effectiveness of the searching process for Malay retrieval systems. Clustering of search results using semantic information for Malay text documents is currently being evaluated to ensure that the relevant text documents appear in a cluster, so that users can find the set of documents relevant to a given query effectively.

REFERENCES
[1] Hungming, H., & Watada, J.: Search Result Clustering through Density Analysis Based K-Medoids Method. In Advanced Applied Informatics (IIAI-AAI), 2014 IIAI 3rd International Conference on (pp. 155-160). IEEE (2014)
[2] Nguyen, H. S., Nguyen, S. H., & Swieboda, W.: Semantic explorative evaluation of document clustering algorithms. In Computer Science and Information Systems (FedCSIS), 2013 Federated Conference on (pp. 115-122). IEEE (2013)
[3] Yang, N., Liu, Y., & Yang, G.: Clustering of Web Search Results Based on Combination of Links and In-Snippets. In Web Information Systems and Applications Conference (WISA), 2011 Eighth (pp. 108-113). IEEE (2011)
[4] Nurazzah, A. R.: Evaluating The Effectiveness of Clustering Techniques In Retrieving Malay Translated Hadith Text. Ph.D. Thesis, Universiti Teknologi MARA (2011)
[5] Beil, F., Ester, M., & Xu, X.: Frequent term-based text clustering. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 436-442). ACM (2002)
[6] Leuski, A., & Allan, J.: Improving Interactive Retrieval by Combining Ranked List and Clustering. In RIAO (pp. 665-681) (2000)
[7] Ab Samat, N., Azmi Murad, M. A., Abdullah, M. T., & Atan, R.: Malay documents clustering algorithm based on singular value decomposition. Journal of Theoretical and Applied Information Technology, 8(2), 180-186 (2009)
[8] van Rijsbergen, C. J.: Information Retrieval (2nd ed.). London: Butterworths (1979)
[9] Fatimah, A.: A Malay Language Document Retrieval System An Experimental Approach And Analysis. Ph.D. Thesis, Universiti Kebangsaan Malaysia (1995)
[10] Tengku Mohamad, T. S., Mohamad, Y., & Fatimah, A.: A Malay Stemming Algorithm for Information Retrieval. Proceedings of the 4th International Conference and Exhibition on Multi-lingual Computing (1994)
[11] Zainab, A. B.: Evaluation of retrieval effectiveness of conflation methods on Malay documents. Ph.D. Thesis, Universiti Kebangsaan Malaysia (1999)
[12] Manning, C. D., Raghavan, P., & Schutze, H.: Introduction to Information Retrieval. New York: Cambridge University Press (2008)
[13] NorShahriza, A. K., & Norselatun, H. R.: Assessing Islamic Information Quality on the Internet: A Case of Information about Hadith. Malaysian Journal of Library & Information Science, 10(2): 51-66 (2005)
[14] Kamarul, A. M.: To Evaluate the Effectiveness and Efficiency of Stemming Algorithm and Biagram Method in Retrieving Hadith Documents. Universiti Teknologi MARA (2002)
[15] Ezani, Y.: Centre for Islamic Thought & Understanding (CITU), Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Malaysia (2007)
[16] Mohd Takiyuddin, I.: Centre for Islamic Thought & Understanding (CITU), Universiti Teknologi MARA (UiTM), 40450 Shah Alam, Malaysia (2007)
[17] Othman, A.: Pengakar Perkataan Melayu Untuk Sistem Capaian Dokumen. MSc Thesis, Universiti Kebangsaan Malaysia (1993)
[18] Abdullah, M. T.: Monolingual and cross-language information retrieval approaches for Malay and English language document. Ph.D. Thesis, Universiti Putra Malaysia (2006)
[19] Nurazzah, A. R., Zainab, A. B., Tengku Mohamad, T. S., & Ismail, N. K.: Cluster-Based Hadith Retrieval System. Proceedings of the International Conference on ICT for the Muslim World (ICT4M), Kuala Lumpur (2006)