Query Expansion using Thesaurus in Improving Malay Hadith Retrieval System

Nurazzah Abd Rahman, Zainab Abu Bakar
Faculty of Computer & Mathematical Sciences
Universiti Teknologi MARA
Shah Alam, Malaysia
[email protected], [email protected]

Abstract—A thesaurus has become a valuable structure in Information Retrieval systems. It is a list of terms and concepts that provides a controlled vocabulary for document indexing, clustering, searching and retrieval. This paper presents the results of expanding users' queries with a Malay thesaurus when searching Malay documents in a Malay Hadith retrieval system. The results show that retrieval effectiveness improves by four percent when the thesaurus is employed in retrieving Malay translated Hadith documents.

Keywords: Thesaurus, Query Expansion, Information Retrieval, Cluster-based Retrieval

I. INTRODUCTION

Research in Malay information retrieval is a relatively new area in ICT, having started only about fifteen years ago [1, 2, 3, 4, 5, 6]. The growth in the number of documents written in the Malay language has triggered the need for Malay document retrieval systems.

For Muslims, the Hadith is the second source of reference after the Holy Book, Al-Quran. Ahadith (singular Hadith) are the documented sayings and doings of the Prophet Muhammad SAW, transmitted by reliable narrators. Searching for a Hadith among the vast number of Hadith books available in the market requires an expert in the Islamic field to verify the authenticity of a particular Hadith. For example, the reliability of the narrator of a Hadith reflects the Hadith's standard. The time consumed in searching manually through the books is another reason why an automatic search engine for Malay Hadith is needed.

The main objective of this research is to evaluate retrieval effectiveness when the user's query is expanded using a thesaurus in the process of retrieving Malay translated Hadith text documents. Its significance lies in improving the retrieval effectiveness of Malay information retrieval (IR) systems through such thesaurus-based query expansion.

II. THESAURUS

To overcome the problem of word variants, computational techniques have been developed to transform both the user's search words and the database words into a single canonical form [7]. This technique is known as conflation. Conflation is an important facility in any text retrieval system because it can find words in the database that are not identical to the vocabulary used by the user but still match it [3]. Generally, conflation methods can be divided into two types. The first type is language independent, such as string similarity, dynamic programming and spelling correction methods. The second type is language dependent, such as stemming and thesaurus.

A thesaurus can capture all types of relationships that exist between words, such as hierarchic, synonym and morphological relationships. An information retrieval thesaurus typically contains a list of terms, where a term is either a single word or a phrase. The relationships between terms are also included to assist in coordinating indexing and retrieval. However, constructing a comprehensive thesaurus for a given subject domain by hand requires a large amount of highly trained effort. Automatic construction, meanwhile, has three approaches. The first builds the thesaurus from the document collection. The second merges existing thesauri. The third builds the thesaurus from information obtained from users; here the objective is to capture the thesaurus from the users' searches.

A. Malay Thesaurus Construction

The first digital Malay thesaurus vocabulary was constructed from 36 natural language queries [8]. The query keywords were obtained after the removal of all stop words and duplicate words. The complete thesaurus, obtained by merging this thesaurus with a thesaurus collection from [9], was scanned and compiled by [31]. The newly updated digital version of this thesaurus was compiled by [10] by referring to the new version of the Malay thesaurus book by [38]. The newly compiled thesauruses have been used in various experiments on different test collections [10]. For the experiment in [10], the entries were sorted and the equivalent terms listed under the same entry were merged, which reduced the number of entries from 11466 to 10780 words.
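The sorting and merging of thesaurus entries described above can be sketched as follows. This is a minimal illustration only: it assumes that each raw entry is a (headword, equivalents) pair, the function name `merge_entries` is hypothetical, and the sample words are chosen for illustration rather than taken from the actual thesaurus files.

```python
from collections import defaultdict


def merge_entries(entries):
    """Merge (headword, equivalents) pairs that share the same headword."""
    merged = defaultdict(list)
    for headword, equivalents in entries:
        key = headword.strip().lower()
        bucket = merged[key]  # creates the entry even if it has no equivalents
        for term in equivalents:
            term = term.strip().lower()
            # Keep first-seen order; drop duplicates and self-references.
            if term != key and term not in bucket:
                bucket.append(term)
    # Return the entries sorted by headword.
    return dict(sorted(merged.items()))


# Illustrative raw entries (assumed data): "ilmu" appears twice.
raw_entries = [
    ("ilmu", ["pengetahuan"]),
    ("hukum", ["peraturan", "undang-undang"]),
    ("ilmu", ["ajar", "pengetahuan"]),
]
thesaurus = merge_entries(raw_entries)
print(len(raw_entries), "raw entries ->", len(thesaurus), "merged entries")
print(thesaurus)
```

Merging by headword is what makes the entry count drop: two raw entries for the same word collapse into one entry that carries the union of their equivalent terms.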

III. QUERY EXPANSION

Query expansion is the process of a search engine adding search terms to a user's weighted search. The intent is to improve precision and/or recall. The additional terms may be taken from a thesaurus and/or from the user's relevance feedback. A standard method of performing query expansion is to use relevance information from the user, that is, the documents the user has assessed as containing relevant information. The content of these relevant documents can be used to form a set of candidate expansion terms, ranked by some measure of how useful each term might be in attracting more relevant documents. For example, a search for "car" may be expanded to: car cars auto autos automobile automobiles. Table I lists part of the 30 queries, expanded using the thesaurus, that were used in the Malay language retrieval experiment.

TABLE I. LIST OF EXPANDED MALAY QUERIES USING THESAURUS

Q# | Original Queries | Expanded Queries using Thesaurus
1 | Tuntutlah Ilmu Hingga ke Liang Lahad | Tuntut Minta Mohon Ilmu Pengetahuan Ajar Tahu Liang Lahad Gua Kubur
2 | Halal dan Haram Mengenai Makanan dan Minuman | Halal Harus Boleh dibenarkan diizinkan dan Haram Pantang Larang Tegah Dosa Mengenai Makanan Rasa Kecap Jamah Minum Telan Ratah Minuman Air Kordial Teh Kopi Koko Susu Jus Arak Tuak Teguk Telan Togak Tuang Tunggang
3 | Adab-Adab Berkaitan Makan dan Minum | Adab-Adab Adat Betul Disiplin Hukum Istiadat Kebiasaan Lembaga Normal Peraturan Resam Santun Sopan Sesuai Susila Tatasusila Tertib Tradisi Berkaitan Makan Minum Teguk Telan Togak Tuang Tunggang
4 | Hormati Kedua Ibubapa | Hormati Luhur Tinggi Agung Besar Kedua Ibubapa
5 | Hukum Hudud | Hukuman Hudud Adab Adat Arahan Disiplin Isitiadat Kanun Larangan Lembaga Manual Penjagaan Pengawasan Penyeliaan Peraturan Perintah Petunjuk Preskripsi Prinsip Resam Rukun Suruhan Susila Tegahan Teras Tertib Tonggak Tradisi Upacara Undang-Undang Hukuman Azab Balasan Deraan Denda Keputusan Penalti Dosa
6 | Kewajipan menutup aurat dan had menutup aurat | Kewajipan menutup aurat had menutup aurat busana hiasan kafan kostum pakaian persalinan
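Thesaurus-based expansion of the kind listed in Table I can be sketched as below. This is a minimal sketch under stated assumptions, not the system's actual implementation: the stopword set and the mini-thesaurus are small assumed samples (the synonym grouping for "hormati" is inferred from query 4 of Table I), the function name `expand_query` is hypothetical, and stemming is left out for brevity.

```python
# Assumed sample stopwords; the real Malay stopword list is not reproduced here.
STOPWORDS = {"dan", "ke", "hingga"}

# Assumed mini-thesaurus; groupings inferred from queries 1 and 4 of Table I.
THESAURUS = {
    "hormati": ["luhur", "tinggi", "agung", "besar"],
    "ilmu": ["pengetahuan", "ajar", "tahu"],
}


def expand_query(query, thesaurus=THESAURUS, stopwords=STOPWORDS):
    """Drop stopwords, then append each term's thesaurus equivalents."""
    expanded = []
    for term in query.lower().split():
        if term in stopwords:
            continue
        # The original term comes first, followed by its equivalents.
        for candidate in [term] + thesaurus.get(term, []):
            if candidate not in expanded:  # avoid repeating a term
                expanded.append(candidate)
    return expanded


print(" ".join(expand_query("Hormati Kedua Ibubapa")))
# hormati luhur tinggi agung besar kedua ibubapa
```

Terms without a thesaurus entry ("kedua", "ibubapa") pass through unchanged, which mirrors how query 4 in Table I only gains synonyms for "Hormati".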

IV. EXPERIMENTAL DETAILS

An information retrieval test collection consists of a document database, a set of queries for the database, and relevance judgments formulated for those queries [24]. Without a test collection, the retrieval effectiveness and efficiency of different information retrieval algorithms cannot be compared and justified. A document database is usually in textual form.

A. Test Collection

A Hadith (plural Ahadith) is a narration about the life of the Prophet Muhammad (pbuh) or what he approved. The Hadith test collection consists of 2028 Malay translated Hadith documents from the book Sahih Bukhari, relevance judgments, a stopword list, morphological rules and a dictionary of Malay root words. The stopword list, morphological rules and dictionary of root words are needed by the stemming algorithm during the creation of the inverted index file.

B. Malay Query Expansion

A total of 30 queries are used in the experiment. Each query entered by the user is processed by removing all stopwords. The remaining terms are then stemmed, and equivalent terms are looked up in the Malay thesaurus. The expanded query is then used to search for relevant documents in the Malay document database.

C. Retrieval Process

Figure 1 gives the complete view of the process of retrieving relevant Hadith documents, based on Baeza-Yates and Ribeiro-Neto [33]. From the user interface, users enter their queries and submit them to the system to get their results. The system passes each query to the query operations to be processed. Clustering algorithms are applied to the Hadith inverted index file to form clusters of Hadiths, and the search for relevant documents is carried out on the clustered Hadith inverted index file. The retrieved Hadith documents are ranked according to the selected ranking procedure and the list is displayed to the user. Lastly, the user selects which Hadith document to open and display for further action.

[Figure 1 components: User Interface, Texts, Text Operations, Indexing, Query Operations, Query, Malay Thesaurus, Malay Hadith Documents, Hadith Indexed, Inverted File, Clustering (Cluster 1 ... Cluster n), Searching, Retrieved Docs, Ranking, Ranked Documents]

Figure 1. Process of Retrieving Clustered Malay Hadith Text Documents

D. Evaluation Process

Evaluation on the Malay text documents was done using the well-known IR metrics recall (R) and precision (P), redefined as the proportion of relevant documents that are clustered and the proportion of clustered documents that are relevant, respectively [34]:

$$R = \frac{\sum_{k=1}^{L} Rel_k}{Rel} \tag{1}$$

$$P = \frac{\sum_{k=1}^{L} Rel_k}{\sum_{k=1}^{L} (Rel_k + Non_k)} \tag{2}$$

$$E = 1 - \frac{(1 + \beta^2) P R}{\beta^2 P + R} \tag{3}$$

where $Rel$ is the number of relevant documents in the clustered set, $L$ is the number of clusters, and $Rel_k$ and $Non_k$ are the numbers of relevant and non-relevant documents in the $k$th cluster. The effectiveness measure, E, is a weighted combination of R and P [39]. The value of β > 0 indicates how many times more important R is compared to P. If β = 1.0, R and P are equally important, while if β = 2.0, R is twice as important as P. In this experiment, β = 2.0 is used.

E. Malay Hadith Retrieval System

The Malay Hadith retrieval system follows the retrieval process shown in Figure 1 [33]. From the user interface, users enter a query and submit it to the system by clicking the cari (search) button. Selecting the thesaurus button makes the system search the Hadith documents using the Malay thesaurus. The system reads the query and passes it to the query operations to be processed. Clustering algorithms are applied to the Hadith inverted index file, and the search for relevant documents is carried out on the clustered Hadith inverted index file. The retrieved Hadith documents are ranked according to the selected ranking procedure and displayed to the user page by page, ten documents per page.
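The cluster-based recall, precision and effectiveness measures of the Evaluation Process subsection can be computed as in the short sketch below. The function names and the per-cluster counts are hypothetical and serve only as an illustration; they are not figures from the experiment.

```python
def recall(rel_per_cluster, total_relevant):
    """R: relevant documents found in the clusters over all relevant documents."""
    return sum(rel_per_cluster) / total_relevant


def precision(rel_per_cluster, non_per_cluster):
    """P: relevant documents in the clusters over all clustered documents."""
    clustered = sum(rel_per_cluster) + sum(non_per_cluster)
    return sum(rel_per_cluster) / clustered


def e_measure(p, r, beta=2.0):
    """E = 1 - (1 + beta^2) P R / (beta^2 P + R)."""
    if p == 0 and r == 0:
        return 1.0  # avoid 0/0 when nothing relevant is found
    b2 = beta ** 2
    return 1 - (1 + b2) * p * r / (b2 * p + r)


# Hypothetical counts for L = 2 clusters (illustration only).
rel_k = [3, 2]     # relevant documents in each cluster
non_k = [10, 5]    # non-relevant documents in each cluster
r = recall(rel_k, total_relevant=10)   # 5 / 10 = 0.5
p = precision(rel_k, non_k)            # 5 / 20 = 0.25
print(r, p, round(e_measure(p, r, beta=2.0), 3))
```

As a sanity check on the formula, `e_measure(0.03, 0.45, beta=2.0)` rounds to 0.88, which agrees with the Complete Linkage values reported for the topic Ilmu in Table II (R = 0.45, P = 0.03, E = 0.88).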

V. RESULTS AND DISCUSSIONS

The results of running two different clustering techniques, Complete Linkage (CL) and Average Linkage (AL), using the Jaccard coefficient on six topics with the thesaurus are presented in Table II.

TABLE II. RECALL (R), PRECISION (P) AND EFFECTIVENESS (E) FOR COMPLETE AND AVERAGE LINKAGE CLUSTERING TECHNIQUES ON SIMULATED DATA SET USING JACCARD COEFFICIENTS WITH THESAURUS

Topics | Recall CL | Recall AL | Precision CL | Precision AL | Effectiveness CL | Effectiveness AL
Ilmu (Knowledge) | 0.45 | 0.89 | 0.03 | 0.03 | 0.88 | 0.87
Sembahyang (Pray) | 0.32 | 0.69 | 0.01 | 0.05 | 0.96 | 0.81
Zakat (Zakah) | 0.45 | 0.74 | 0.05 | 0.04 | 0.83 | 0.84
Wuduk (Ablution) | 0.31 | 0.68 | 0.02 | 0.02 | 0.92 | 0.94
Haid (Menstrual) | 0.67 | 0.33 | 0.04 | 0.01 | 0.84 | 0.96
Tayammum (Tayammum) | 0.55 | 0.65 | 0.02 | 0.02 | 0.91 | 0.91
Average | 0.34 | 0.2 | 0.05 | 0.05 | 0.89 | 0.89
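The Jaccard and Cosine coefficients and the two linkage rules compared in these experiments can be illustrated with the sketch below. It assumes, for simplicity, that each document is represented as a set of stemmed index terms (binary weights); the paper does not state the term weighting used, and the function names and sample term sets are hypothetical.

```python
import math


def jaccard(a, b):
    """Jaccard coefficient: |A and B| / |A or B| for two term sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(a, b):
    """Cosine coefficient for binary term vectors: |A and B| / sqrt(|A| |B|)."""
    a, b = set(a), set(b)
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


def linkage_similarity(cluster_x, cluster_y, sim, method="complete"):
    """Similarity between two clusters of documents under a linkage rule."""
    scores = [sim(d1, d2) for d1 in cluster_x for d2 in cluster_y]
    if method == "complete":
        return min(scores)            # complete linkage: least similar pair
    return sum(scores) / len(scores)  # average linkage: mean over all pairs


# Hypothetical documents as sets of index terms.
d1 = {"ilmu", "pengetahuan", "ajar"}
d2 = {"ilmu", "ajar", "tahu"}
d3 = {"zakat"}

print(jaccard(d1, d2), round(cosine(d1, d2), 3))
print(linkage_similarity([d1], [d2, d3], jaccard, "complete"))
print(linkage_similarity([d1], [d2, d3], jaccard, "average"))
```

Complete linkage judges two clusters by their least similar pair of documents, so it tends to form tight clusters, while average linkage uses the mean similarity over all pairs; this difference is one reason the two techniques can yield different recall and precision on the same topics.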

With the Jaccard coefficient (Table II), effectiveness of employing thesaurus to Complete Linkage clustering technique improves by 4% from 85% [35] to 89%, while employing thesaurus to Average Linkage clustering improves retrieval effectiveness by 1% from 88% [35] to 89%. Table III presents the results of running the two clustering techniques using the Cosine coefficient on six topics with the thesaurus. Effectiveness of employing thesaurus to Complete Linkage clustering technique improves by 4% from 76% to 79%, while employing thesaurus to Average Linkage clustering improves retrieval effectiveness by 2% from 80% to 82% [35].

TABLE III. RECALL (R), PRECISION (P) AND EFFECTIVENESS (E) FOR COMPLETE AND AVERAGE LINKAGE CLUSTERING TECHNIQUES ON SIMULATED DATA SET USING COSINE COEFFICIENTS WITH THESAURUS

Topics | Recall CL | Recall AL | Precision CL | Precision AL | Effectiveness CL | Effectiveness AL
Ilmu (Knowledge) | 0.7 | 0.95 | 0.07 | 0.09 | 0.75 | 0.67
Sembahyang (Pray) | 0.6 | 0.64 | 0.08 | 0.04 | 0.74 | 0.84
Zakat (Zakah) | 0.4 | 0.35 | 0.05 | 0.04 | 0.83 | 0.86
Wuduk (Ablution) | 0.4 | 0.54 | 0.05 | 0.03 | 0.83 | 0.88
Haid (Menstrual) | 0.65 | 0.63 | 0.07 | 0.05 | 0.76 | 0.81
Tayammum (Tayammum) | 0.7 | 0.57 | 0.05 | 0.03 | 0.81 | 0.88
Average | 0.5 | 0.46 | 0.08 | 0.07 | 0.79 | 0.82

[Figure 2 axes and legend: Topics 1 to 6 on the x-axis; series Jaccard w/o Thes, Jaccard w Thes, Cosine w/o Thes, Cosine w Thes]

Figure 2. Graph of Retrieval Effectiveness using Jaccard and Cosine Coefficient, with and without Thesaurus

Figure 2 shows retrieval effectiveness using the Jaccard and Cosine coefficients, with and without the thesaurus. Precision is low because many non-relevant documents are also retrieved for the synonym terms associated with the topics or queries. On average, adding synonymous or related terms does not significantly improve retrieval performance.

TABLE IV. ELEVEN-POINT INTERPOLATED AVERAGE RECALL-PRECISION-EFFECTIVENESS FOR DIFFERENT EXPERIMENTS (INCLUDE THESAURUS AND NOT INCLUDE THESAURUS)

Recall | Precision, Include Thesaurus | Precision, Without Thesaurus | Effectiveness (β = 2.0), Include Thesaurus | Effectiveness (β = 2.0), Without Thesaurus | Effectiveness (β = 1.0), Include Thesaurus | Effectiveness (β = 1.0), Without Thesaurus
0 | 0.524 | 0.582 | 1.000 | 1.000 | 1.000 | 1.000
0.1 | 0.376 | 0.428 | 0.883 | 0.882 | 0.842 | 0.838
0.2 | 0.276 | 0.318 | 0.788 | 0.784 | 0.768 | 0.754
0.3 | 0.226 | 0.265 | 0.718 | 0.708 | 0.742 | 0.719
0.4 | 0.157 | 0.205 | 0.695 | 0.664 | 0.775 | 0.729
0.5 | 0.105 | 0.153 | 0.714 | 0.657 | 0.826 | 0.766
0.6 | 0.074 | 0.103 | 0.752 | 0.695 | 0.868 | 0.824
0.7 | 0.042 | 0.045 | 0.832 | 0.822 | 0.922 | 0.916
0.8 | 0.029 | 0.024 | 0.873 | 0.893 | 0.944 | 0.953
0.9 | 0.012 | 0.015 | 0.945 | 0.931 | 0.977 | 0.971
1 | 0.003 | 0.003 | 0.987 | 0.985 | 0.995 | 0.994
Avg | 0.166 | 0.195 | 0.835 | 0.820 | 0.878 | 0.860
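Table IV reports precision at the eleven standard recall levels. The interpolation rule is not spelled out in the text; a common convention, assumed in the sketch below, takes the interpolated precision at recall level r to be the highest precision observed at any recall greater than or equal to r, and then averages the resulting curves over all queries. The function names and the sample ranking are hypothetical.

```python
def eleven_point_interpolated(ranked_relevance, total_relevant):
    """Interpolated precision at recall 0.0, 0.1, ..., 1.0 for one ranked list.

    ranked_relevance: list of booleans, True where the document at that rank
    is relevant. total_relevant: number of relevant documents for the query.
    """
    points = []  # (recall, precision) after each retrieved document
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
        points.append((hits / total_relevant, hits / rank))

    curve = []
    for i in range(11):
        level = i / 10
        # Assumed rule: best precision at any recall >= this level.
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates) if candidates else 0.0)
    return curve


def average_over_queries(curves):
    """Average several eleven-point curves level by level."""
    return [sum(values) / len(values) for values in zip(*curves)]


# Hypothetical ranking: relevant documents at ranks 1 and 3, two relevant in total.
curve = eleven_point_interpolated([True, False, True, False, False], total_relevant=2)
print([round(p, 3) for p in curve])
```

For this sample ranking the interpolated precision is 1.0 up to recall 0.5 and 2/3 from recall 0.6 onwards, which shows the step shape that interpolation imposes on a recall-precision curve.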

Table IV presents the eleven-point interpolated average recall-precision-effectiveness values for the two experiments (with and without the thesaurus), and Figure 3 plots the corresponding recall-precision curves. Figure 3 shows that the precision of Malay Hadith Sahih Bukhari text retrieval is lower when queries are expanded using the thesaurus than when they are not. As the queries are expanded, more text documents are retrieved from the test collection, which decreases precision. Even though precision decreases, the detailed results show that the effectiveness level increases slightly. The differences in the effectiveness measure are greater when recall is between 0.3 and 0.7 (refer to Figure 4).

[Figure 3 axes and legend: Recall (0 to 1) on the x-axis, Precision on the y-axis; series With Thesaurus, Without Thesaurus]

Figure 3. Graph of Recall-Precision for Two Types of Experiments (With and Without Thesaurus)

[Figure 4 axes and legend: Recall (0 to 1) on the x-axis, Effectiveness on the y-axis; series With Thesaurus, Without Thesaurus]

Figure 4. Graph of Effectiveness for Two Types of Experiments (With and Without Thesaurus)

VI. CONCLUSIONS

The results of the experiments on the Hadith text collection show that the thesaurus does improve retrieval effectiveness in this domain, even though the percentage of improvement is very small. Hersh, Price and Donohoe [36] conclude that thesaurus-based query expansion generally causes a decline in retrieval performance but improves it in specific instances. A thesaurus also helps users retrieve documents that are relevant to the query entered by searching for synonym terms. According to Sihvonen and Vakkari [37], a vital condition for benefiting from a thesaurus in query expansion to improve search results is sufficient familiarity with the search topic.

ACKNOWLEDGMENT

The authors are grateful to the Institute of Research, Development and Commercialization (IRDC), Universiti Teknologi MARA, Malaysia, for financial support (grant # 600-BRC/ST 5/3/635).

REFERENCES

[1] O. Asim, Pengakar Perkataan Melayu untuk Sistem Capaian Dokumen, Tesis Ijazah Sarjana Sains, Universiti Kebangsaan Malaysia, 1993.
[2] F. Ahmad, A Malay Language Document Retrieval System: An Experimental Approach and Analysis, Ph.D. Thesis, Universiti Kebangsaan Malaysia, 1995.
[3] F.C. Ekmekcioglu, M.F. Lynch, A.M. Robertson, T.M.T. Sembok and P. Willett, "Comparison of N-gram Matching and Stemming for Term Conflation in English, Malay, and Turkish Texts", The Journal of Computer Text Processing, 6(1), 1996, pp. 1-14.
[4] A. B. Zainab, T.M.T. Sembok and M. Yusoff, "Experiment on Conflation Algorithms on Malay Texts for Document Retrieval", Proceedings of the 15th IASTED International Conference, 1997, pp. 229-231.
[5] A. B. Zainab, "Evaluation of Retrieval Effectiveness of Conflation Methods on Malay Documents", Ph.D. Thesis, Universiti Kebangsaan Malaysia, 1999.
[6] A. R. Nurazzah, A. B. Zainab and K. I. Normaly, "Experiments on Clustering Techniques in Retrieving Malay Translated Hadith Text Documents", Proceedings of Brunei International Conference of Engineering and Technology (BICET'05), Bandar Sri Begawan, 2005, pp. 129-138.
[7] M. Lennon, D.S. Peirce, B.D. Tarry and P. Willett, "An Evaluation of Some Conflation Algorithms for Information Retrieval", Journal of Information Science, 3, 1981, pp. 177-183.
[8] A. T. Rapizal, "To Improve Malay Document Retrieval System Using Thesaurus Approach Base on User Query", B.Sc. Thesis, Universiti Teknologi MARA, 2000.
[9] Abdullah and Ainon, Tesaurus Bahasa Melayu, Utusan Publication Sdn Bhd, Kuala Lumpur, 1994.
[10] A. B. Zainab and A. R. Nurazzah, "Evaluating the Effectiveness of Thesaurus and Stemming Methods in Retrieving Malay Translated Al-Quran Documents", Lecture Notes in Computer Science 2911, Springer-Verlag, Berlin Heidelberg, Germany, 2003, pp. 653-662.
[11] E. Rasmussen, "Clustering Algorithms", in W.B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, 1992, pp. 419-442.
[12] R.A. Jarvis and E.A. Patrick, "Clustering Using a Similarity Measure Based on Shared Near Neighbors", IEEE Transactions on Computers, Vol. C-22, 1973, pp. 1025-1034.
[13] P. Willett, "Similarity Coefficients and Weighting Functions for Automatic Document Classification: an Empirical Comparison", International Classification, Vol. 10, 1983, pp. 138-142.
[14] A. Jain and R. Dubes, Algorithms for Clustering Data, Prentice Hall, New Jersey, 1988.
[15] G. Salton, Automatic Text Processing, Addison Wesley, Reading, Mass, 1989.
[16] P.H.A. Sneath and R.R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, W.H. Freeman, San Francisco, 1973.
[17] S. Guha, R. Rastogi and K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases", Proceedings of 1998 ACM SIGMOD International Conference on Management of Data, 1998.
[18] S. Guha, R. Rastogi and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes", Proceedings of the 15th International Conference on Data Engineering, 1999.
[19] D. Eppstein, "Fast Hierarchical Clustering and Other Applications of Dynamic Closest Pairs", Proceedings of the 9th Symposium of Discrete Algorithms, 1998, pp. 619-628.
[20] G. Karypis, E.H. Han and V. Kumar, "Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling", IEEE Computer, Vol. 32(8), 1999, pp. 68-75.
[21] T. Sorenson, "A Method of Establishing Groups of Equal Amplitude in Plant Sociology Based on Similarity of Species Content and Its Application to Analyses of the Vegetation on Danish Common", Biologiske Skrifter, Vol. 5, 1948, pp. 1-34.
[22] T. Korenius, J. Laurikkala, K. Jarvelin and M. Juhola, "Stemming and Lemmatization in the Clustering of Finnish Text Document", Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, 2004, pp. 625-633.
[23] M. Lorr, Cluster Analysis for Social Scientists: Techniques for Analyzing and Simplifying Complex Blocks of Data, Jossey-Bass, San Francisco, 1983.
[24] W.B. Frakes, "Introduction to Information Storage and Retrieval Systems", in W.B. Frakes and R. Baeza-Yates (eds.), Information Retrieval: Data Structures & Algorithms, Prentice Hall, Englewood Cliffs, 1992, pp. 1-12.
[25] Z. Hamidy, Hs. Fachruddin, T. Nasharuddin, A. Johar, A Rahman Zainuddin M. A, Al-Imam Al-Bukhary, Terjemahan Hadis Shahih Bukhari Volume I-IV, Darel Fajr Publishing House, Singapore, 2002.
[26] M. Daud, Terjemah Hadis Shahih Muslim, Volume I-IV, Darel Fajr Publishing House, Singapore, 2003.
[27] M. Zuhri, Tarjamah Sunan At-Tirmidzi, Volume I-III, Victory Agencie, Kuala Lumpur, 1993.
[28] B. Arifin and Y. A. Ali, Tarjamah Sunan An-Nasa'iy, Volume I, III, IV, DarulFikir, Kuala Lumpur, 1993a.
[29] B. Arifin and Y. A. Ali, Tarjamah Sunan An-Nasa'iy, Volume II, V, CV. Asy Syifa', Semarang, Indonesia, 1993b.
[30] T. M.T. Sembok, M. Yussoff and F. Ahmad, "A Malay Stemming Algorithm for Information Retrieval", Proceedings of the 4th International Conference and Exhibition on Multi-lingual Computing, 5.1.2.1-5.1.2.10, 1994.
[31] M. Z. M. Abas, Image and Translated Al-Quran Verses Retrieval System Using Thesaurus Approach Base on Malay Query Words, B.Sc. Thesis, Universiti Teknologi MARA, 2001.
[32] M. R. Mokhtar, Incorporating Stemming Algorithms in the Malay Information Retrieval that Employs Thesaurus Approach, B.Sc. Thesis, Universiti Teknologi MARA, 2001.
[33] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, ACM Press-Addison Wesley, New York, 1999.
[34] A. V. Leouski and W. B. Croft, "An Evaluation of Techniques for Clustering Search Results", Technical Report IR-76, Department of Computer Science, University of Massachusetts, Amherst, 1996.
[35] A. R. Nurazzah, A. B. Zainab, T.M.T. Sembok and K. I. Normaly, "Cluster-Based Hadith Retrieval System", Proceedings of International Conference on ICT for the Muslim World (ICT4M), Kuala Lumpur, 2006.
[36] W.R. Hersh, S. Price and L. Donohoe, "Assessing Thesaurus-Based Query Expansion Using the UMLS Metathesaurus", Proceedings of the 2000 Annual AMIA Fall Symposium, 2000, pp. 344-348.
[37] A. Sihvonen and P. Vakkari, "Subject Knowledge Improves Interactive Query Expansion Assisted by a Thesaurus", Journal of Documentation, Vol. 60(6), 2004, pp. 673-690.
[38] M. Ainon and H. Abdullah, Tesaurus Bahasa Melayu, PTS Professional Publishing Sdn Bhd, Kuala Lumpur, 2006.
[39] C.J. van Rijsbergen, Information Retrieval, 2nd edition, Butterworths, London, 1979.