TALN 2003, Batz-sur-Mer, 11-14 juin 2003
Cross-Language Information Retrieval using Ontology
Ahmed Abdelali, James Cowie, David Farwell, Bill Ogden, and Stephen Helmreich
Computing Research Laboratory
Box 30001/3CRL
New Mexico State University
Las Cruces, NM 88003 USA
{ahmed}@crl.nmsu.edu
Abstract
In this paper we describe and evaluate an ontology-based approach to Cross-Language Information Retrieval. Earlier systems we developed used bilingual dictionaries to help a user select terms in the language of the documents being retrieved. This leaves the user with the problem of deciding whether the translations carry the senses needed for the query. The system described here replaces the bilingual dictionaries with a pair of language-ontology lexicons. The user can see definitions of the senses in the ontology and then select matching terms in the target language. This allows better control over the generation of the new query in the target language.
Keywords Cross-Language Information Retrieval, Unicode, Ontology, Knowledge Based Information Retrieval.
1 Introduction
The aim of Information Retrieval (IR) is to find and retrieve documents relevant to a given query, usually where documents and query are in the same language. With further advances in research and technology, the goal was extended beyond language barriers to include documents in different languages, which is known as Cross-Language Information Retrieval (CLIR). Using the available capabilities of the Computing Research Laboratory at New Mexico State University, we developed a new approach and produced a cross-language retrieval system with meaning-based alternatives for query translation. This paper includes an overview of the approach, a description of the system, and a preliminary evaluation of performance.
2 Background
The knowledge-based approach described here was intended to solve problems that exist in corpus-based, bilingual-dictionary-based, and machine-translation-based Information Retrieval: for example, problems with query recall, problems with ambiguity, and problems matching the query to actual documents. We started this project by developing tools to support acquisition of knowledge for the ontology and lexicon resources. These tools included an editor with which acquirers could populate the ontology concepts, as well as an interface that allowed lexicographers to acquire and expand the English, Spanish and Chinese lexicons. The tools also supported mapping of acquired lexemes to ontology concepts. After developing these resources, we took Keizai, the cross-language information retrieval system developed during previous CRL projects (Davis & Dunning 1995, Ogden et al. 1999, Ogden & Davis 2000), and modified it to support ontology-based queries for all three languages. The Unicode-based nature of the system facilitated the modification.
The knowledge-based ontology developed at CRL and used in the Mikrokosmos Project (Mahesh & Nirenburg 1995, Mahesh 1996) is language neutral and combines the information in a thesaurus with that in an encyclopedia. It also contains concept-level co-occurrence constraints. The ontology is connected to lexical elements of natural languages via a dictionary, either directly or through the English side of available bilingual dictionaries. The combination of ontological knowledge and its connection to the dictionaries gives the approach a powerful means for resolving IR problems. Through the ontology and its related lexicons and dictionaries, the user of the IR system can do a direct lookup in any of the dictionaries.
For example, the English word “ship” leads to immediate instantiation of the corresponding ontological concept “SHIP”, and also to words in other languages, such as Spanish and Chinese, that are used to express the concept “SHIP” (Figure 1).
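The lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the triple format, the function names, the sample words, and the second concept DELIVER are hypothetical and are not taken from the CRL lexicons or the Mikrokosmos ontology.

```python
from collections import defaultdict

# Hypothetical toy data: (language, word, concept) triples standing in for
# the language-ontology lexicons. The actual lexicon format is not shown in
# this paper, and DELIVER is an invented concept name used for illustration.
LEXICON = [
    ("en", "ship", "SHIP"),
    ("en", "vessel", "SHIP"),
    ("es", "barco", "SHIP"),
    ("es", "buque", "SHIP"),
    ("zh", "船", "SHIP"),
    ("en", "ship", "DELIVER"),
    ("es", "enviar", "DELIVER"),
]


def build_indexes(triples):
    """Index the triples in both directions: word -> concepts, concept -> words."""
    word_to_concepts = defaultdict(set)
    concept_to_words = defaultdict(lambda: defaultdict(set))
    for lang, word, concept in triples:
        word_to_concepts[(lang, word)].add(concept)
        concept_to_words[concept][lang].add(word)
    return word_to_concepts, concept_to_words


def concepts_for(word, lang, word_to_concepts):
    """Return the candidate senses (concepts) of a query word, sorted by name."""
    return sorted(word_to_concepts.get((lang, word), ()))


def equivalents(concept, target_lang, concept_to_words):
    """Return the target-language words mapped to the chosen concept."""
    return sorted(concept_to_words.get(concept, {}).get(target_lang, ()))


if __name__ == "__main__":
    w2c, c2w = build_indexes(LEXICON)
    for concept in concepts_for("ship", "en", w2c):
        print(concept, equivalents(concept, "es", c2w))
```

Because every word is reached through a concept, the user picks a sense first and only then sees target-language terms, which is the main difference from a flat bilingual dictionary lookup.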
Figure 1: Ontological Concept “SHIP” and related natural language entries
Every ontological concept contains a set of features that allow the user to disambiguate the concept from other hyponyms; also the concept, through a set of defined relations, is connected to other concepts. The latter serves as a means for expansion of the IR query. To evaluate the performance of the system we performed experiments that would explore the advantages and problems with the approach. The evaluation was conducted in a comparative fashion, as an evaluation of the ontology-based versus dictionary-based approaches to information retrieval.
3 Tool Description
The system is based on Keizai, a cross-language, interactive retrieval and summarization system that uses URSA (Unicode Retrieval System Architecture) and MINDS (Multilingual Interactive Document Summarization), both developed at CRL. Keizai uses a combination of automatic and user-assisted methods to build and improve cross-language queries. It sends the modified queries to language-specific query modules to retrieve documents, and displays various types of English summaries of the retrieved documents (see Figures 2, 3 and 4).
Figures 2, 3 and 4 illustrate the steps taken in retrieving Chinese documents that contain the Chinese equivalents of the word “ship”. In this task, the user enters the English word in the English Query Interface-Interactive Selection. The result of the request is the set of matches found in a set of bilingual dictionaries, with the entries sorted by language. The user then chooses the translations that most closely match the original query. In the final step, after constructing the new query in the target language, the system returns a set of documents relevant to the query. The user can also translate a returned document back to English. Depending on the source language, we use either an internal translation system, MEAT (Amtrup et al. 2000, Zajac et al. 2001), for Chinese or an external translation engine, Systran, for Spanish.
Figure 2: Query on Keizai using regular lexicon
Figure 3: Chinese and Arabic “ship” using regular lexicon
Figure 4: Chinese text retrieved by Keizai using regular lexicon
The new approach using the ontology gives the user another route to retrieve the data. Since the ontology is connected to the different lexicons available via semantic dictionaries, the interface provides a wider variety of lexical choices in the target languages, and these are organized by concept. Figure 5 illustrates a query for the word “building” using the ontology lexicon. The search outputs a list of ontology concepts containing the word “building”, one for each sense of the word. As shown in Figure 6, the user selects the intended meaning of the word. The editor then shows the ontology entry for “building” in the right frame, with one or more English and Spanish equivalents of the word mapped to it. The user then selects one or more appropriate equivalents, in this case “construcción”, which is entered into the query automatically (Figure 6).
Figure 5: Using ontology lexicon through the new Keizai interface
Figure 6: Spanish equivalent for “building” selected for the new query
Keizai retrieves a list of Spanish texts that contain the word “construcción” and outputs them in thumbnail form. The user can then click on each of the texts retrieved for further analysis.
As a visual aid, the occurrences of a keyword in the texts retrieved can be highlighted by selecting color options on the interface and mouse-over action on the keyword.
4 Evaluation
We evaluated two versions of the IR system: one using the ontology and one using the regular lexicon. We chose Chinese as the target language for this test. For each test, we took the first 20 documents from the results for comparison. The testing procedures are as follows:

Information retrieval using the ontology:
1. An English query is entered;
2. The "Use Ontology lexicon" option is selected to generate the list of concepts;
3. One concept from the list is chosen;
4. All related Chinese words connected to the concept are selected and specified as Chinese queries;
5. The retrieval results are displayed;
6. The first 20 documents are checked against the queries for relevance.

Information retrieval using the regular lexicon:
1. An English query is entered;
2. The "Use Regular lexicon" option is selected to generate the list of Chinese-English pairs;
3. All related Chinese words in the list are selected and entered as Chinese queries;
4. The retrieval results are displayed;
5. The first 20 documents are checked against the queries for relevance.

Figure 7 shows the results.

Query     Method    Words      Words     Total retrieved  Relevant documents
                    available  selected  documents        in the first 20
Buy       Ontology  27         17        352              20
Buy       Regular   10         5         316              20
Money     Ontology  40         17        413              20
Money     Regular   21         8         324              20
Market    Ontology  10         10        728              20
Market    Regular   22         7         1256             20
Guest     Ontology  2          2         488              13
Guest     Regular   6          3         218              19
Discover  Ontology  1          1         21               10
Discover  Regular   7          2         1430             19
Share     Ontology  8          8         177              20
Share     Regular   199        2         16               4

Figure 7: Table comparing results of ontology versus regular lexicon IR
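As a worked reading of Figure 7, the relevant-document counts can be turned into precision at the 20-document cutoff and averaged per method. The sketch below is an illustration, not part of the original evaluation: the variable and function names are hypothetical, and it assumes that each count is divided by the cutoff of 20 even when fewer than 20 documents were retrieved (as for the regular-lexicon "Share" query, which returned 16).

```python
# Relevant documents among the first 20, copied from Figure 7.
RELEVANT_IN_TOP_20 = {
    "Buy": {"ontology": 20, "regular": 20},
    "Money": {"ontology": 20, "regular": 20},
    "Market": {"ontology": 20, "regular": 20},
    "Guest": {"ontology": 13, "regular": 19},
    "Discover": {"ontology": 10, "regular": 19},
    "Share": {"ontology": 20, "regular": 4},
}

CUTOFF = 20  # number of top-ranked documents inspected per query


def precision_at_cutoff(relevant, cutoff=CUTOFF):
    """Fraction of the inspected documents that were judged relevant."""
    return relevant / cutoff


def mean_precision(method, results=RELEVANT_IN_TOP_20):
    """Average precision-at-cutoff over all queries for one method."""
    values = [precision_at_cutoff(row[method]) for row in results.values()]
    return sum(values) / len(values)


if __name__ == "__main__":
    for method in ("ontology", "regular"):
        print(f"{method}: {mean_precision(method):.3f}")
```

Under these assumptions the two methods average roughly 0.86 (ontology) and 0.85 (regular lexicon), with the per-query differences concentrated in "Guest", "Discover" and "Share".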
5 Discussion
The purpose of ontology-based IR is to narrow down the search by reducing the number of meanings (senses/concepts) of the query. Only a relevant concept is chosen to find the Chinese words. The user selects the Chinese words as queries for retrieval. To reach high-quality results the following supporting components are necessary:
• An ontology with a wide range of coverage of world knowledge;
• Large lexicons in Chinese, English and Spanish with accurate mappings to the ontology concepts;
• A large corpus of on-line documents in the three languages;
• A sophisticated IR strategy.
Even with the above supports, ontology-based IR only controls the meaning of the English query. Once the Chinese words are found, there is no control over the Chinese queries. That is, each Chinese word may have several meanings, which results in irrelevant documents being selected.
The ontology-based IR approach relies heavily on the ontology, and particularly on the accuracy of the lexicon mapping in the various languages. If no appropriate concept exists, or if the constraints on the concept are neglected, the translated query can be far from the original meaning of the English query. For example, in the current ontology there are 832 English words mapping to the concept OBJECT and 787 English words mapping to the concept EVENT. This will generate a Chinese query with a very general sense. Similarly, there are 968 Spanish words mapping to OBJECT and 921 Spanish words mapping to EVENT. This will result in unmanageable Spanish IR. Much work is needed in building high-quality lexicons in the three languages.
The lexicon mapping format needs to be consistent. Currently there are two different ways of mapping a lexical item to a PROPERTY (as opposed to an EVENT or an OBJECT):
Case 1. Mapping a noun to a PROPERTY, such as "price":
COST[DOMAIN COMMODITY]
COMMODITY[DOMAIN-OF COST]
The IR system extracts COST as the concept in the first mapping and COMMODITY in the second, which results in a wrong query.
Case 2. Mapping an adjective to a PROPERTY, such as "equal" or "equality":
EQUAL[DOMAIN OBJECT]
OBJECT[DOMAIN-OF EQUAL-TO]
The IR system extracts EQUAL as the concept in the first mapping and OBJECT in the second, again resulting in a wrong query.
Since the lexicon files are shared among various projects for different purposes, the lexicon-mapping algorithm cannot be based only on IR needs. Therefore different strategies of concept extraction must be taken into account.
Problems particular to Chinese: Because Chinese texts are presented as a sequence of characters without segmentation, word boundaries are not indicated. Efficient segmentation may be needed to improve the IR results.
Problems in the method of extracting concepts: Currently the system extracts only the head concept, regardless of constraints. As a result the concept is generalized, so that, for example, 'man', 'woman' and 'child' are all mapped to HUMAN. The translated query is therefore too general in comparison to the original English query. The system is also not aware that the meaning of an adjective is represented in the constraint. In most cases the translations of adjective queries are incorrect. The case is similar when mapping a noun to a PROPERTY. As a result, irrelevant documents are sometimes collected.
The current Chinese lexicon is too small and covers a specific domain, while the Chinese corpus collection is in the general domain. This limits the number of documents retrieved. The Chinese lexicon needs to be extended. The English lexicon needs to be checked with respect to concept accuracy and format consistency, in order to reduce the number of irrelevant documents.
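The effect of head-only extraction on the two mapping styles can be illustrated with a short Python sketch. The parser below handles only the single-constraint notation shown in the cases above (such as COST[DOMAIN COMMODITY]); the function names and the constraint-aware rule, which takes the filler of a DOMAIN-OF slot as the concept, are assumptions made for illustration and are not the system's actual extraction method.

```python
import re
from typing import NamedTuple, Optional


class Mapping(NamedTuple):
    head: str
    slot: Optional[str]
    filler: Optional[str]


# Matches "HEAD" or "HEAD[SLOT FILLER]", the only forms shown in the text.
_PATTERN = re.compile(r"^([A-Z-]+)(?:\[([A-Z-]+) ([A-Z-]+)\])?$")


def parse_mapping(text):
    """Split a lexicon mapping into its head concept and optional constraint."""
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized mapping: {text!r}")
    return Mapping(*match.groups())


def head_only(mapping):
    """Behavior described in the text: keep the head, ignore the constraint."""
    return mapping.head


def constraint_aware(mapping):
    """Hypothetical alternative: a DOMAIN-OF filler names the property itself."""
    if mapping.slot == "DOMAIN-OF" and mapping.filler:
        return mapping.filler
    return mapping.head


if __name__ == "__main__":
    for text in ("COST[DOMAIN COMMODITY]", "COMMODITY[DOMAIN-OF COST]"):
        m = parse_mapping(text)
        print(text, head_only(m), constraint_aware(m))
```

With head-only extraction the two mappings for "price" yield COST and COMMODITY respectively, which is the inconsistency described above; the hypothetical constraint-aware rule yields COST for both.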
6 Conclusion and future work
In this simple attempt to replace conventional lexically-based IR with ontology-based IR, we demonstrated that the ontology-based IR performed equivalently to the regular lexicon IR. Improving the quality and size of the ontology could improve results. Promising results could be achieved with little effort by fixing the ontology inconsistencies and populating the attached lexicons. Future work to improve the system also includes correcting the wrong concept mappings in both English and Spanish and changing the method of extracting concepts from adjectives.
Acknowledgments
Keizai was originally developed by Mark Davis. The principal designers of the Mikrokosmos Ontology are Sergei Nirenburg and Victor Raskin.
References
Davis, Mark, and Ted Dunning. (1995). Cross-Language Text Retrieval using Evolutionary Optimization. (EP95 in San Diego).
Amtrup, Jan W., Hamid Mansouri Rad, Karine Megerdoomian, and Rémi Zajac. (2000). Persian-English Machine Translation: An Overview of the Shiraz Project. NMSU CRL Technical Report. MCCS-00-319.
Mahesh, Kavi, and Sergei Nirenburg. (1995). Semantic Classification for Practical Natural Language Processing. Proceedings of the 6th ASIS SIG/CR Classification Research Workshop: An Interdisciplinary Meeting. Chicago, Illinois.
Mahesh, Kavi. (1996). Ontology Development for Machine Translation: Ideology and Methodology. NMSU CRL Technical Report. MCCS-96-292.
Ogden, William, James Cowie, Mark Davis, Eugene Ludovik, Sergei Nirenburg, Hugo Molina-Salgado, and Nigel Sharples. (1999). Keizai: An Interactive Cross-Language Text Retrieval System. Paper presented at the Workshop on Machine Translation for Cross-Language Information Retrieval, Machine Translation Summit VII, September 13-17, 1999, Singapore.
Ogden, William, and Mark Davis. (2000). Improving Cross-Language Text Retrieval with Human Interactions. Hawaii International Conference on System Sciences, HICSS-33, January 4-7, 2000.
Zajac, Rémi, Ahmed Malki, Ahmed Abdelali, James Cowie, and William Ogden. (2001). Arabic-English NLP at CRL. Proceedings of the Arabic NLP Workshop, ACL/EACL, July 2001, Toulouse (France).