Multi-Language Ontology-based Search Engine

Leyla Zhuhadar and Olfa Nasraoui
Knowledge Discovery and Web Mining Lab, Dept. of Computer Engineering and Computer Science, University of Louisville, KY 40292, USA

Robert Wyatt and Elizabeth Romero
The Office of Distance Learning, Division of Extended Learning and Outreach, Western Kentucky University, KY 42101, USA

Abstract: One of the first Multi-Language Information Retrieval (MLIR) systems was implemented in 1969 by Gerard Salton, who enhanced his SMART system to retrieve multilingual documents in two languages, English and German. However, the research field of MLIR is still struggling, since the majority of information retrieval systems are monolingual and, more precisely, English-based, even though only 6% of the world's population have English as their native language [14]. This paper presents a Multi-Language Information Retrieval (MLIR) approach that falls into the area of Domain Specific Information Retrieval (E-learning being the domain). The approach we followed is a synergistic approach combining (1) a Thesaurus-based Approach and (2) a Corpus-based Approach. This research has been implemented on a real platform called HyperManyMedia¹ at Western Kentucky University.

Index Terms: multi-language information retrieval; cross-language information retrieval; search engine; ontology; e-learning.
I. INTRODUCTION

One of the first Multi-Language Information Retrieval systems was implemented in 1969 by Gerard Salton, who enhanced his SMART system to retrieve multilingual documents (English and German); he used concept lists and demonstrated the effectiveness of multilingual information retrieval [22]. Oard and Dorr [19] defined Multi-Language (Multilingual) text retrieval as “the retrieval of documents or more precisely, electronic texts based on explicit queries formulated by humans using natural language, regardless of the language in which the documents and the query are expressed.” The majority of information retrieval systems are monolingual and, more precisely, English-based, even though only 6% of the world's population use English as their native language [14]. Haddouti [14] provides a complete survey of Multi-Language information retrieval techniques and multilingual processing methods and applications. Oard and Dorr [19] give several reasons for designing a multilingual information retrieval system; we list some of them:

• A repository of documents written in multiple languages, with individual documents containing more than one language, for example: (a) technical documents written in a language other than English that use expressions (jargon terms) written in English, (b) a document that uses quotes written in a language different from the language of the article itself and (c) a document that cites foreign articles whose citations are written in a language different from the language of the article itself.
• A user who is able to read or use documents written in a specific language, but who is not fluent enough in that language to formulate the right query terms to find the document. Oard and Dorr [19] provide three different scenarios for this problem: (a) a user who is searching for images that are tagged and indexed in a language that the user does not understand, (b) a researcher who is interested in a specific research topic and would like to know which individuals or institutes worldwide are working on the same topic and (c) a user who has a system to translate documents into different languages and would like to search for those documents in languages he or she is unfamiliar with.

II. BACKGROUND

The first workshop on Multi-language Information Retrieval was held as part of the SIGIR'96² Conference. MLIR is also a track of the Text REtrieval Conference (TREC). In TREC-8, the organizers divided the ways of approaching the language problem into three approaches [23]:
1) Query translation
2) Document translation
3) Mix of Query & Document translation

Over the last 13 years, Multi-Language Information Retrieval has used different approaches, such as controlled vocabulary, dictionaries, thesauri and free text. In general, MLIR relies on Machine Translation (MT). One of the major contributors to advances in MLIR is the Cross-Language Evaluation Forum (CLEF). CLEF started in 2000; “[it] promotes Research and Development in multilingual information access by (a) developing an infrastructure for the testing, tuning and evaluation of information retrieval systems operating on European languages in both monolingual and Multi-language contexts, and (b) creating test-suites of reusable data which can be employed by system developers for benchmarking purposes.”³ In the third workshop of the Cross-Language Evaluation Forum, Advances in Cross-Language Information Retrieval, Peters, Braschler, and Gonzalo [20] divided the MLIR research field into four distinct categories:

2 SIGIR: http://www.sigir.org/
1 HyperManyMedia platform: http://hypermanymedia.wku.edu
3 http://www.clef-campaign.org/
1) Multilingual Retrieval: The IR system contains documents written in multiple languages, and the goal is to query in one language and retrieve all the documents related to the query, in all languages.
2) Bilingual Retrieval: The query is written in one language and the system is capable of retrieving documents in another language.
3) Monolingual Retrieval: The repository of this system contains documents in multiple languages. When a user writes a query in one language, the system retrieves only the documents related to the query that are written in the language of the query.
4) Domain Specific Retrieval: This research field is related to documents containing scientific text. The goal is to have an IR system that is capable of querying and retrieving those terms in multiple languages.

A. Approaches to Multi-Language IR

Oard and Dorr [19] divided research in MLIR into three approaches:
1) Text Translation Approach
2) Thesaurus-based Approach
3) Corpus-based Approach

1) Text Translation Approach: Machine translation is used to map a query q and the document d into a common language L. Oard and Dorr [19] explained the difficulties of implementing such a system; they noted that the effectiveness of this approach is domain dependent: in some domains the quality is high, while in others it is very low. Among the first implementations of this approach were the following three: (1) in 1993, by Fluhr and Radwan: fulltext databases as lexical semantic knowledge for multilingual interrogation and machine translation [10]; (2) in 1995, by Davis and Dunning: an evaluation of query translation methods for multi-lingual text retrieval [5]; and (3) in 1997, by Fluhr: a multilingual information retrieval system [9].
In general, the Text Translation Approach uses straightforward techniques, but its main weakness is that the quality of translation is sometimes very low.

2) Thesaurus-based Approach: Oard and Dorr [19] defined this approach as an ontology-based approach. Here, the thesaurus is an ontology, a knowledge representation of the domain. Oard and Dorr distinguished four types of thesaurus:
1) Subject Thesaurus: hierarchical representation of a domain with associative relationships between entities
2) Concept List: terms organized by concept into classes and subclasses
3) Term List: list of multi-language synonyms
4) Lexicon: semantics

Among the first implementations of this approach were the following two systems: (1) the automatic processing of foreign language documents by Salton [22], where Salton augmented
his SMART system to retrieve documents in two languages (English and German); it is considered the first MLIR system to have been tested and evaluated. Salton used a Concept List to build the system. In the evaluation stage, he used average precision and found that the results differed between queries written in German and queries written in English; (2) Pigur's system IRRD [21] was based on a Vocabulary Thesaurus and used three languages (English, French and German); there was no evaluation of this system.

3) Corpus-based Approach: These techniques are the same as those used in monolingual information retrieval systems. Instead of using a thesaurus, they exploit statistical information about the corpora. Oard and Dorr's survey [19] distinguished three techniques:

1) Automatic Thesaurus Construction: This approach extracts statistical information about the terms in the corpora and automatically builds a thesaurus based on this information. Van der Eijk [24] used an algorithm to automatically extract terminology from bilingual corpora; Kupiec [15] used an algorithm to find noun phrase (NP) correspondences in bilingual corpora; Daille, Gaussier, and Langé [4] used a method similar to that of Van der Eijk [24] but based on linguistic knowledge: the algorithm identifies the noun phrases in bilingual corpora (English and French) that are most likely to be terms; Gaussier [12] extended the previous model [4] by using word alignment and extracting terminology from bilingual corpora with a flow network model; finally, Gale and Church [11] used a method based on the assumption that there is a probabilistic correlation between the length of a text and the length of its translation; the probabilistic score was used to find the maximum likelihood alignment of sentences. More details about Automatic Thesaurus Construction can be found in [19], [13], [3].
2) Term Vector Translation: Oard and Dorr [19] defined this approach as follows: “We consider statistical multilingual text retrieval techniques in which the goal is to map statistical information about term use between languages... techniques which map sets of tf-idf term weights from one language to another [19].” A variety of techniques have been used to enhance the performance of this method, e.g., relevance feedback. Davis and Dunning [5] used a query translation method to retrieve multi-language documents, whereas Ballesteros and Croft [1] used dictionary methods for multi-lingual information retrieval. They also enhanced their method by using expansion techniques for phrasal translation [2]. Finally, Lavrenko, Choquette, and Croft [17] used a unified formal model based on language modeling; they also integrated query expansion to address one of the most difficult problems in IR (disambiguation), and implemented their model on a parallel bilingual corpus. More details on Term Vector Translation can be found in [19], [13], [3].

3) Latent Semantic Indexing (LSI): This technique was introduced in 1990 by Deerwester, Dumais, et al. [6]. It associates terms with documents based on the semantic structure in order to find documents relevant to a query. This method is also used in MLIR (LSI-CL). Dumais, Letsche, et al. [7] implemented a system that retrieves documents in languages different from the query's language, as well as in the original language of the query; it uses LSI on a French-English collection, and the evaluation showed good performance. Landauer and Littman [16] applied latent semantic indexing to computerized multi-language document retrieval; their method was patented in 1994 (US Patent 5,301,109). More details on Latent Semantic Indexing in MLIR can be found in [19], [13], [3], [18], [8].
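As a rough illustration of the Term Vector Translation idea in item 2 above, the following Python sketch maps a tf-idf weighted query vector from English to Spanish through a bilingual term list. The dictionary entries, the document frequencies, and the even splitting of a weight across several translations are illustrative assumptions, not details taken from the surveyed systems.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Return a {term: tf-idf weight} mapping for one tokenized text."""
    counts = Counter(tokens)
    return {
        term: freq * math.log(num_docs / doc_freq[term])
        for term, freq in counts.items()
        if doc_freq.get(term, 0) > 0  # skip terms unseen in the collection
    }


def translate_vector(vector, dictionary):
    """Map term weights into the other language via a bilingual dictionary.

    Assumption: a term with several translations splits its weight evenly
    among them; terms without a dictionary entry are dropped.
    """
    translated = {}
    for term, weight in vector.items():
        targets = dictionary.get(term, [])
        for target in targets:
            translated[target] = translated.get(target, 0.0) + weight / len(targets)
    return translated


# Hypothetical toy data, for illustration only.
en_es = {"history": ["historia"], "course": ["curso", "asignatura"]}
doc_freq_en = {"history": 2, "course": 5, "the": 10}

query_vec = tfidf_vector(["history", "course"], doc_freq_en, num_docs=10)
spanish_vec = translate_vector(query_vec, en_es)
```

The translated vector can then be matched against Spanish documents with the same similarity measure used for monolingual retrieval; the total weight of the query is preserved by the even split.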
Table I. Terms used for computing the relevance of a query to a document

Term                    Description
coord(q, d)             Score factor based on the number of query terms found in document d
norm(q)                 Normalization factor for query q
tf(t in d)              Term frequency of term t in the document d
idf(t)                  Inverse document frequency of term t over all documents
boost(t field in d)     Boosting factor for a specific field
norm(t, d)              Normalization factor for term t in document d
Score(q, d)             Relevance of query q to document d
III. METHODOLOGY AND IMPLEMENTATION

In this section, we present a multilingual course/lecture retrieval system. By multilingual, we mean that some courses are presented to students in two languages (English and Spanish). Our corpus consists of courses/lectures from WKU presented in English, augmented with courses from the MIT Open Courseware⁴ that contain parallel corpora lectures (the exact lecture presented in both languages, English and Spanish).

Example 1: When a user submits a query in English or Spanish, if the query term exists in the corpora, the search engine retrieves all documents related to this query and ranks them based on the search engine ranking algorithm (more details in Section III-A); all retrieved documents are in the language the query term belongs to. However, if the query term is part of the E-learning ontology (for more details about the design and implementation of our ontology, refer to our previous work [26], [27], [25]), the system retrieves the semantic meaning of this term and shows all the classes/subclasses related to this query; it also shows the translation of the query as a synonym in the alternative language. When a user clicks on the translation of this query term, the search engine retrieves all documents (lectures) related to that term and ranks them based on the search engine ranking algorithm for this specific language. More details are given in Section III-C.

A. Scoring Algorithm

Our search engine is based on Nutch⁵, an open-source search engine based on Apache Lucene⁶, which is a scalable Information Retrieval (IR) library that provides indexing and searching capabilities. Its scoring algorithm is based on a combination of the Vector Space Model and the Boolean Model. It applies the Boolean Model first to select the documents relevant to the query; then it uses the Vector Space Model for content-based ranking.
4 http://ocw.mit.edu/OcwWeb/web/home/home/index.htm
5 http://www.nutch.org
6 Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java: http://lucene.apache.org/java/docs/.

The score of query q for document d is related to the cosine similarity between the document and query vectors in a Vector Space Model (VSM), as shown in Equation (1):

\[
\cos(x, x') = \frac{x^{T} x'}{|x| \cdot |x'|} = \frac{x^{T} x'}{\sqrt{x^{T} x}\,\sqrt{x'^{T} x'}} \tag{1}
\]

where $x \in \mathbb{R}^{|V|}$, $x$ and $x'$ are vector-space representations of two documents, $T$ is the transpose operator, and $x^{T} x'$ denotes the dot product between the two vectors. Our search engine uses several refinements of the VSM, extending it with associated weights for terms and fields. Equation (2) is the scoring equation, and Table I lists the terms used for computing the relevance of a query to a document.

\[
\mathrm{Score}(q, d) = \mathrm{coord}(q, d) \times \mathrm{queryNorm}(q) \times \sum_{t \in q} \Big( \mathrm{tf}(t \text{ in } d) \times \mathrm{idf}(t)^{2} \times t.\mathrm{getBoost}() \times \mathrm{norm}(t, d) \Big) \tag{2}
\]
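To make the structure of Equation (2) concrete, here is a minimal Python sketch of the scoring formula. The paper does not spell out how each factor is computed, so the concrete definitions below (square-root term frequency, logarithmic idf, coord as the fraction of matched query terms, and length-based norm) are assumptions in the style of Lucene's classic TF-IDF similarity; the function name `score` and the toy documents are hypothetical.

```python
import math


def score(query_terms, doc_terms, doc_freq, num_docs, boosts=None):
    """Sketch of Equation (2) for one query and one tokenized document.

    All factor definitions are assumptions modelled on a classic
    TF-IDF similarity, not values taken from the paper.
    """
    boosts = boosts or {}
    unique_query = set(query_terms)
    matched = [t for t in unique_query if t in doc_terms]
    if not matched:
        return 0.0  # Boolean step: the document does not match the query

    def idf(term):
        return 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))

    # coord(q, d): fraction of query terms found in the document
    coord = len(matched) / len(unique_query)
    # queryNorm(q): makes scores comparable across queries
    query_norm = 1.0 / math.sqrt(
        sum((idf(t) * boosts.get(t, 1.0)) ** 2 for t in unique_query)
    )
    # norm(t, d): here a simple length normalization of the document
    norm = 1.0 / math.sqrt(len(doc_terms))

    total = sum(
        math.sqrt(doc_terms.count(t)) * idf(t) ** 2 * boosts.get(t, 1.0) * norm
        for t in matched
    )
    return coord * query_norm * total


# Hypothetical two-document collection.
d1 = ["history", "of", "art"]
d2 = ["art", "course"]
doc_freq = {"history": 1, "of": 1, "art": 2, "course": 1}

s1 = score(["history", "art"], d1, doc_freq, num_docs=2)
s2 = score(["history", "art"], d2, doc_freq, num_docs=2)
```

In this toy collection the document that contains both query terms (d1) receives the higher score, mainly through the coord factor and the additional summand.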
B. Objectives

The following examples illustrate the benefit of implementing a multilingual information retrieval system in the E-learning setting. These examples are not exhaustive, but they present the motivation behind adding MLIR to our platform:
1) A foreign student who can understand a text written in English, but cannot formulate a good enough query in English to search for a document (lecture), can write a query in Spanish and retrieve the lecture in two languages (English and Spanish).
2) An English native speaker may want to read the same document (lecture) in a foreign language (Spanish) to improve his/her foreign language knowledge. In this case, querying for a lecture in English and having the synonym in Spanish gives the learner the ability to pick the language he/she wants.

C. MLIR Approach

Our MLIR research falls into Domain Specific Retrieval (E-learning). The approach we followed is a synergistic approach combining (1) a Thesaurus-based Approach and (2) a Corpus-based Approach.
Figure 1. Cross-Language Search Engine (English vs. Spanish)
Table II. Multilingual Thesaurus

Type                 Characteristics
Subject Thesaurus    College concept as a hierarchical upper level
Concept List         Subclasses represent all the colleges in the HyperManyMedia domain
Term List            List of courses and lectures in cross-language synonyms
Lexicon              Semantics represented in the OWL file to present the relations between all the upper types
1) Thesaurus-based Approach: Thesaurus-based text retrieval allows learners to explore more information during the search process. The information retrieval system can provide more insight into the domain and the relationships between its concepts, and present them in a better-formulated query. This helps learners navigate the system in a way similar to using a multilingual dictionary, but with visual hints, which makes it a powerful tool. Since we had already designed and built a domain ontology, this part can be considered an extension of the original ontology that distinguishes multilingual concepts/subconcepts and the relationships between the entities in the ontology. A multilingual thesaurus can be considered an ontology thesaurus [19]. Therefore, a multilingual ontology is one that defines terms from more than one language. In our case, it is a bilingual ontology thesaurus: similar to a dictionary, it organizes terms with respect to the two languages (English and Spanish). We used a simple bilingual listing of terms, phrases, concepts, and subconcepts.
The hierarchical structure of the ontology is used to define the relationship between concepts/subconcepts. Since our ontology is a domain-specific ontology (E-learning), the terminology used is not a standard terminology. We used a terminology that captures the domain; the terms are associated with college name, course name and lecture name, and are presented in two languages. Table II presents the thesaurus types that we took into consideration in our design. Refer to the survey of multilingual text retrieval by Oard and Dorr [19] for more details on thesaurus types. Our complete extended Cross-Language E-learning Ontology (~40,000 lines of code) is available online⁷; Figure 1 illustrates part of it.

7 http://www.wku.edu/~leyla.zhuhadar/semanticowl.owl

We mentioned in Section II that Schauble and Sheridan [23] distinguished three methods of dealing with translation in the information retrieval domain: (1) Query translation, (2) Document translation and (3) Mix of Query & Document translation. Our approach is a Query translation approach, i.e., whenever a user submits a query in the semantic search interface, the following two parallel processes occur:
1) All documents relevant to the query term are retrieved; the ranking of those documents is based on the scoring algorithm described in Equation (2).
2) An automatic semantic mapping is performed between the query term and the HyperManyMedia ontology, which is resident in memory. If the query term is part of the HyperManyMedia ontology, the information retrieval system automatically presents two semantic entities:
   a) All the subconcepts related to this query term in both languages (English and Spanish).
   b) The synonym of the query term in the alternative language.

Tables 1 through 10, available online⁸, present part of the thesaurus and the categorization that we took into consideration to build our thesaurus. We consider our thesaurus-based approach to be what is called a controlled vocabulary approach. Since the semantic search is provided to the users as a hierarchical structure, the platform presents each “college” concept as an upper-level concept on the right side of the user interface, together with all its subclasses and their synonyms in the alternative language. We consider our approach to be a form of query expansion, which has been shown to increase both precision and recall.
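The semantic mapping step described above (looking up a query term in the in-memory ontology, then returning its subconcepts in both languages and its synonym in the alternative language) can be sketched in Python as follows. This is only an illustrative model: the `Concept` class, the function names, and the sample labels are hypothetical, and the actual platform stores its ontology in an OWL file rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    """One ontology node with a label per language (hypothetical structure)."""
    labels: dict                      # e.g. {"en": "History", "es": "Historia"}
    children: list = field(default_factory=list)


def build_index(root):
    """Index every concept by its lowercased label in each language."""
    index = {}
    stack = [root]
    while stack:
        concept = stack.pop()
        for lang, label in concept.labels.items():
            index[label.lower()] = (concept, lang)
        stack.extend(concept.children)
    return index


def semantic_lookup(term, index):
    """Return the synonym and subconcepts for a query term, or None."""
    hit = index.get(term.strip().lower())
    if hit is None:
        return None  # term is not part of the ontology
    concept, lang = hit
    other = "es" if lang == "en" else "en"
    return {
        "language": lang,
        "synonym": concept.labels[other],
        "subconcepts": [(c.labels["en"], c.labels["es"]) for c in concept.children],
    }


# Hypothetical fragment of a bilingual E-learning ontology.
history = Concept(
    {"en": "History", "es": "Historia"},
    [Concept({"en": "World History", "es": "Historia Universal"})],
)
root = Concept({"en": "College of Arts", "es": "Facultad de Artes"}, [history])

index = build_index(root)
result = semantic_lookup("history", index)
```

Because the index holds labels from both languages, the same lookup serves English and Spanish queries; a term that is absent from the ontology simply yields no semantic suggestions, and only the ordinary ranked retrieval applies.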
Figure 2. Multi-Language Search (English term)
2) Corpus-based Approach: In Section II, we reviewed different techniques for building a Multilingual Information Retrieval system; some of these techniques exploit statistical information about the corpora. Oard and Dorr's survey [19] distinguished three techniques: (1) Automatic Thesaurus Construction, (2) Term Vector Translation and (3) Latent Semantic Indexing (LSI).

Our approach is considered Term Vector Translation. Oard and Dorr [19] defined this approach as: “statistical multilingual text retrieval techniques in which the goal is to map statistical information about term use between languages... techniques which map sets of tf-idf term weights from one language to another [19].” We used a query translation method to retrieve multilingual documents, with an expansion technique for phrasal translation. As mentioned previously, our search engine uses the Vector Space Model to match the query term with the indexed documents, and it uses the scoring Equation (2). The scoring algorithm is based on the vector space model representation of the documents. Each term vector is associated with a field of a document. We discussed the weight associated with each term in Section III-A. We used the vector space model technique for multilingual term vector translation. Algorithm 1 describes the method used to implement this model. When a user submits a query in English or Spanish and clicks on the Cross-Language search engine, if the query is part of our indexed translated terms, the Cross-Language search engine does the following:
1) Translate the query q to the alternative language, q′, as shown in Algorithm 1.
2) Use the vector space model to calculate the dot product between the translated query and the documents in the HyperManyMedia repository. It uses Equation (2), after substituting q′ for q, to retrieve relevant documents and rank them by score.
3) If the query has no translation in our system, the user receives only the retrieved documents in which terms from the original query q appear.

Figures 2 and 3 show the retrieval interface for Cross-Language search generated for the keyword “History”.

Algorithm 1 Multilingual (Cross-language) Term Vector Translation
Input: query q in language L
Output: relevant documents in language L′ only
if q has a translation q′ = translate(q) then
    $\mathrm{Score}(q', d) = \mathrm{coord}(q', d) \cdot \mathrm{norm}(q') \cdot \sum_{t \in q'} \big( \mathrm{tf}(t \text{ in } d) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{boost}(t.\mathrm{field} \text{ in } d) \cdot \mathrm{norm}(t, d) \big)$;
    // retrieve only the documents in language L′ and rank them
end

Figure 3. Multi-Language Search (Spanish term)

8 http://web2.wku.edu/~leyla.zhuhadar/Multi-LanguageInformationRetrieval.php
9 http://ocw.mit.edu/OcwWeb/web/home/home/index.htm

IV. CONCLUSION

In this paper, we presented a multilingual retrieval system. Our corpus consists of courses/lectures from WKU (English only) augmented with courses from the MIT Open Courseware⁹. The MIT courses contain parallel corpora lectures (the exact lecture presented in both languages, English and Spanish). Our MLIR research falls into Domain Specific Retrieval (E-learning being the domain). The approach that we followed was a synergistic approach combining (1) a Thesaurus-based Approach and (2) a Corpus-based Approach. In the case of the Thesaurus-based Approach, we used a simple bilingual listing of terms, phrases, concepts, and subconcepts. The hierarchical structure of the ontology is used to define the relationship between concepts/subconcepts. We also used a specific terminology that captures the domain of E-learning; the terms are associated with college name, course name and lecture name, and are presented in two languages. In the case of the Corpus-based Approach, we used a Term Vector Translation approach, where the goal was to map statistical information about term usage between languages using techniques which
map sets of tf-idf term weights from English to Spanish and vice versa. This research has been implemented on a real platform called HyperManyMedia¹⁰ at Western Kentucky University.

10 HyperManyMedia platform: http://hypermanymedia.wku.edu

REFERENCES

[1] L. Ballesteros and B. Croft, “Dictionary methods for cross-lingual information retrieval,” Lecture Notes in Computer Science, vol. 1134, pp. 791–801, 1996.
[2] L. Ballesteros and W. Croft, “Phrasal translation and query expansion techniques for cross-language information retrieval,” in ACM SIGIR Forum, vol. 31. ACM, New York, NY, USA, 1997, pp. 84–91.
[3] W. B. Croft, D. Metzler, and T. Strohman, Search Engines: Information Retrieval in Practice. Addison-Wesley, 2009.
[4] B. Daille, É. Gaussier, and J. Langé, “Towards automatic extraction of monolingual and bilingual terminology,” in Proceedings of the 15th Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, Morristown, NJ, USA, 1994, pp. 515–521.
[5] M. Davis and T. Dunning, “A TREC evaluation of query translation methods for multi-lingual text retrieval,” in Fourth Text Retrieval Conference, 1995, pp. 483–498.
[6] S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, “Indexing by latent semantic analysis,” Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.
[7] S. Dumais, T. Letsche, M. Littman, and T. Landauer, “Automatic cross-language retrieval using latent semantic indexing,” AAAI Spring Symposium on Cross-Language Text and Speech Retrieval, pp. 115–132, 1997.
[8] R. Feldman and J. Sanger, The Text Mining Handbook. Cambridge University Press, 2006.
[9] C. Fluhr, “Multilingual information retrieval,” Cambridge Studies in Natural Language Processing Series, pp. 261–266, 1997.
[10] C. Fluhr and K. Radwan, “Fulltext databases as lexical semantic knowledge for multilingual interrogation and machine translation,” in Proceedings of the East-West Conference on Artificial Intelligence: EWAIC'93, September 7-9, 1993, Moscow, Russia. Association for Artificial Intelligence of Russia, 1993, p. 124.
[11] W. Gale and K. Church, “Identifying word correspondences in parallel texts,” in Proceedings of the Workshop on Speech and Natural Language, 1991, pp. 152–157.
[12] É. Gaussier, “Flow network models for word alignment and terminology extraction from bilingual corpora,” in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, Volume 1. Association for Computational Linguistics, Morristown, NJ, USA, 1998, pp. 444–450.
[13] D. Grossman and O. Frieder, Information Retrieval: Algorithms and Heuristics. Springer, 2004.
[14] H. Haddouti, “Survey: Multilingual text retrieval and access,” in Working Notes of the AAAI Symposium on Cross Language Text and Speech Retrieval. Citeseer, 1997.
[15] J. Kupiec, “An algorithm for finding noun phrase correspondences in bilingual corpora,” in Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 1993, pp. 17–22.
[16] T. Landauer and M. Littman, “Computerized cross-language document retrieval using latent semantic indexing,” US Patent 5,301,109, Apr. 5, 1994.
[17] V. Lavrenko, M. Choquette, and W. Croft, “Cross-lingual relevance models,” in Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York, NY, USA, 2002, pp. 175–182.
[18] B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data (Data-Centric Systems and Applications). Springer, January 2007. [Online]. Available: http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20&path=ASIN/3540378812
[19] D. W. Oard and B. J. Dorr, “A survey of multilingual text retrieval,” 1996.
[20] C. Peters, M. Braschler, and J. Gonzalo, Advances in Cross-Language Information Retrieval: Third Workshop of the Cross-Language Evaluation Forum, CLEF 2002, Rome, Italy, September 19-20, 2002: Revised Papers. Springer Verlag, 2003.
[21] V. Pigur, “Multilanguage information-retrieval systems: Integration levels and language support,” Automatic Documentation and Mathematical Linguistics, vol. 13, no. 1, pp. 36–46, 1979.
[22] G. Salton, “Automatic processing of foreign language documents,” in Proceedings of the 1969 Conference on Computational Linguistics. Association for Computational Linguistics, Morristown, NJ, USA, 1969, pp. 1–28.
[23] P. Schauble and P. Sheridan, “Cross-language information retrieval (CLIR) track overview,” NIST Special Publication SP, pp. 31–44, 1998.
[24] P. Van der Eijk, “Automating the acquisition of bilingual terminology,” in Proceedings of the Sixth Conference of the European Chapter of the Association for Computational Linguistics, April 1993, pp. 21–23.
[25] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Visual ontology-based information retrieval system,” in Proceedings of the 2009 13th International Conference Information Visualisation. IEEE Computer Society, 2009, pp. 419–426.
[26] L. Zhuhadar and O. Nasraoui, “Semantic information retrieval for personalized e-learning,” in 20th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '08), vol. 1, Nov. 2008, pp. 364–368.
[27] L. Zhuhadar, O. Nasraoui, and R. Wyatt, “Dual representation of the semantic user profile for personalized web search in an evolving domain,” in Proceedings of the AAAI 2009 Spring Symposium on Social Semantic Web, Where Web 2.0 Meets Web 3.0, 2009, pp. 84–89.