
USING EUROWORDNET IN A CONCEPT-BASED APPROACH TO CROSS-LANGUAGE TEXT RETRIEVAL

JULIO GONZALO, FELISA VERDEJO, and IRINA CHUGUR
UNED, Ciudad Universitaria, Madrid, Spain

We present an approach to cross-language text retrieval based on the EuroWordNet (EWN) multilingual semantic database. EuroWordNet is a multilingual, WordNet-like database with basic semantic relations between words for several European languages (English, Dutch, Spanish, Italian, German, French, Czech, and Estonian). In addition to the relations in WordNet 1.5, EWN includes domain labels, cross-language, and cross-part-of-speech relations, which are directly useful for multilingual information retrieval. In our approach, documents in any language covered by EuroWordNet are indexed in a space of language-independent concepts (the EuroWordNet Inter Lingual Index), thus turning term weighting and query/document matching into language-independent tasks. We report on the results of a number of experiments that measure the potential benefits of the approach and its tolerance to word sense disambiguation errors. In our monolingual experiments, the classical vector space model for text retrieval is shown to give better results (up to 29% better in our experiments) if WordNet synsets are chosen as the indexing space, instead of word forms. This result is obtained for a manually disambiguated test collection derived from the SEMCOR annotated corpus. The sensitivity of retrieval performance to (automatic) disambiguation errors is also measured. Our preliminary bilingual experiments, also reported here, show that our approach can noticeably outperform a naive, dictionary-based translation of the query terms into the target language.

Applied Artificial Intelligence, 13:647–678, 1999
Copyright © 1999 Taylor & Francis
0883-9514/99 $12.00 + .00

Final version received November 1998. This research is being supported by the European Community, project LE #4003, and also partially by the Spanish government, project TIC-96-1243-CO3-O1. We are indebted to J. Ignacio Mayorga, Anselmo Peñas, Fernando Ostenero, and David Fernández for their help building up the test collection. Thanks also to Carol Peters for many fruitful discussions.
Address correspondence to Julio Gonzalo, Departamento de Ingeniería Eléctrica, Electrónica y de Control, UNED, Ciudad Universitaria, s.n., 28040 Madrid, Spain. E-mail: julio@ieec.uned.es

Text retrieval deals with the problem of finding all the relevant documents in a text collection for a given user's query, stated in a natural language. For a human, this involves reading and understanding documents and query, and judging the relevance of each document for the query. Such tasks seem to fall naturally within the scope of knowledge representation and natural language processing (NLP) techniques. However, both scientific communities (information retrieval and NLP people) have largely evolved in isolation from each other. There are two powerful reasons: one is that statistical approaches that neglect the linguistic properties of the texts they manipulate have been quite successful for information retrieval. The other is that the attempts to introduce natural language techniques (part-of-speech tagging, morphological analysis and stemming, word-sense disambiguation, etc.) have largely failed to improve statistical approaches significantly.

However, the increasing relevance of cross-language and multilingual text retrieval seems to be changing this landscape. The explosive growth of universally accessible information over the international networks (information that is unstructured, heterogeneous, and multilingual by nature) has made cross-language text retrieval (CLTR) one of the currently most compelling challenges for the software industry. In principle, a user of a WWW search engine wants to find the information relevant to his query, regardless of the languages used to write documents and query. Thus, the search has to be able to find documents that are expressed in different languages.

But the cross-language text retrieval task has proved to be much harder than its monolingual counterpart. In Grefenstette (1998), the three problems that CLTR must solve are identified as:

a. knowing how a term expressed in one language might be written in another;
b. deciding which of the possible translations are appropriate in a given context; and
c. deciding how to weight different translation alternatives when more than one is retained.

Problem a is related to the use of bilingual dictionaries and other language resources, and b and c to word-sense disambiguation and machine translation issues. In addition, it has been generally observed that traditional IR models and techniques suffer a loss of performance of around 50% when adapted naively to cross-language retrieval. It seems, therefore, that a point of convergence between IR, AI, and NLP-based techniques must be found to deal satisfactorily with the problem of cross-language text retrieval. The main approaches to CLTR being experimented with today use either knowledge-based or corpus-based techniques (Oard, 1997).

Knowledge-based approaches. These apply bilingual or multilingual dictionaries, thesauri, or general-purpose ontologies to get appropriate equivalents in the target language for the original terms of the query.

USING THESAURI: So far, the best known and tested approaches to CLTR are thesaurus-based, although these are generally used in controlled-text retrieval, where each document is indexed (mainly by hand) with keywords from the thesaurus. A thesaurus is an ontology specializing in organizing terminology; a multilingual thesaurus organizes terminology for more than one language. ISO 5964 gives specifications for the incorporation of domain knowledge in multilingual thesauri and identifies alternative techniques. There are now a number of multilingual thesaurus-based systems available commercially. However, controlled-text retrieval demands resource-consuming thesaurus construction and maintenance, and user training for optimum usage. In addition, domain-specific thesauri are not very useful outside of the particular domain for which they have been designed. The remainder of the article will implicitly refer to free-text retrieval, where queries are compared against full documents, rather than prebuilt keyword descriptions of the documents.

USING DICTIONARIES: Some of the first methods attempting to match the query to the document for free-text (as opposed to controlled-text) retrieval have used bilingual dictionaries. It has been shown that dictionary-based query translation, where each term or phrase in the query is replaced by a list of all its possible translations, represents an acceptable first pass at cross-language information retrieval, although such relatively simple methods clearly show performance below that of monolingual retrieval. Automatic machine-readable dictionary (MRD) query translation, on its own, has been found to lead to a drop in effectiveness of 40–60% with respect to monolingual retrieval (Hull & Grefenstette, 1996; Ballesteros & Croft, 1996). There are three main reasons for this: general-purpose dictionaries do not normally contain specialized vocabulary; failure to translate multiword terms; and the presence of spurious translations.
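As a rough illustration of the dictionary-based query translation just described, the following sketch replaces each query term by all of its listed translations. The dictionary, its entries, and the function name are hypothetical and chosen only for the example; no weighting of alternatives is attempted.

```python
from collections import Counter

# Hypothetical toy English-to-Spanish dictionary; entries are illustrative only.
TOY_DICTIONARY = {
    "spring": ["primavera", "muelle", "manantial"],
    "water": ["agua"],
}


def translate_query(terms, dictionary):
    """Replace each query term by all of its dictionary translations.

    Terms missing from the dictionary are kept unchanged, which mimics the
    specialized-vocabulary gap mentioned in the text.
    """
    translated = Counter()
    for term in terms:
        for candidate in dictionary.get(term, [term]):
            translated[candidate] += 1
    return translated


query = translate_query(["spring", "water", "aquifer"], TOY_DICTIONARY)
```

In this toy query, "spring" contributes three candidate translations of which at most one fits the intended sense, which is the kind of spurious translation named above, and "aquifer" passes through untranslated.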

Corpus-based approaches. The above considerations have encouraged an interest in corpus-based techniques in which information about the relationship between terms over languages is obtained from observed statistics of term usage. Corpus-based approaches analyze large collections of texts in multiple languages and automatically extract the information needed to construct application-specific translation techniques. The collections analyzed may consist of parallel (translation equivalent) or comparable (domain-specific) sets of documents. The main approaches that have been tried using corpora are vector space and probabilistic techniques. A recent comparative evaluation of some representative approaches to corpus-based cross-language free-text retrieval (Carbonell et al., 1997) showed that such approaches, and in particular some applications of example-based machine translation, significantly outperformed the simple dictionary-based term translation used in the evaluation.

The first tests with parallel corpora were on statistical methods for the extraction of multilingual term equivalence data, which could be used as input for the lexical component of MT systems. Some of the most interesting recent experiments, however, are those using a matrix reduction technique known as latent semantic indexing (LSI) to extract language-independent terms and document representations from parallel corpora (Dumais et al., 1996). Latent semantic indexing applies a singular value decomposition to a large, sparse term-document co-occurrence matrix (including terms from all parallel versions of the documents) and extracts a subset of the singular vectors to form a new vector space. Thus queries in one language can retrieve documents in the other (as well as in the original language).

The problem with using parallel texts as training corpora is that test corpora are costly to acquire: it is difficult to find already existing translations of the right kind of documents, and translated versions are expensive to create. For this reason, there has been a lot of interest recently in the potential of comparable corpora. A comparable document collection is one in which documents are aligned on the basis of the similarity between the topics they address rather than because they are translation equivalent. Methods have been studied to extract information from such corpora on cross-language equivalences in order to translate and expand a query formulated in one language with useful terms in another (Sheridan & Ballerini, 1996; Picchi & Peters, 1996). Again, as with the parallel corpus method reported above, it appears that such strategies are very application dependent. A new reference corpus would have to be built to perform retrieval on a new topic.

From this discussion, we can conclude that any single method currently being tried presents limitations.
Existing resources, such as electronic bilingual dictionaries, are normally inadequate or insufficient for the purpose; the building of resources like domain-specific thesauri and training corpora is expensive, and such resources are generally not fully reusable; a new multilingual application will require the construction of new resources or considerable work for the adaptation of previously built ones. It should also be noted that most of the systems and methods in use so far concentrate on pairs rather than multiples of languages. This is hardly surprising. The situation is far more complex when an attempt is made to achieve effective retrieval over a number of languages than over a single pair; it is necessary to study some kind of interlingual mechanism, at a more or less conceptual level, in order to permit multiple cross-language transfer.

The EWN project (Vossen, 1998) aims at building a multilingual, WordNet-like database with basic semantic relations between words for several European languages (English, Dutch, Spanish, Italian, German, and French), and it is scheduled to produce the final database in 1999. Such a large-scale, multilingual semantic database offers an interesting knowledge-based alternative to query expansion techniques: performing conceptual, language-neutral retrieval without requiring either training or parallel corpora. We present such an approach here, together with a number of experiments to determine whether it can enhance retrieval and whether it is a feasible technique.

First of all, we review previous approaches to text retrieval using WordNet, finding that retrieval strategies and word-sense disambiguation problems have not been properly isolated from each other. Then we describe a set of monolingual retrieval experiments with a hand-disambiguated test collection derived from Semcor (Miller et al., 1994), a subset of the Brown Corpus annotated with WordNet senses. These experiments indicate that retrieval with WordNet can be more efficient than it had been before, provided that word-sense disambiguation can be performed to a certain degree of accuracy. Then we state our proposal for language-independent text retrieval with the EWN database, and perform some bilingual experiments with a preliminary version of the database to test the feasibility of the approach and identify how the database should be improved during the last building stages to permit better cross-language retrieval.

WORDNET AND TEXT RETRIEVAL: PREVIOUS APPROACHES

WordNet (Miller, 1990) is a freely available lexical database for English. It consists of semantic relations between English words, which can be accessed as a kind of thesaurus, in which words with similar meanings are grouped together into so-called synsets (synonym sets). Besides synonymy (implicit in the definition of synset), other relations are established between synsets (or, exceptionally, between word forms): hyponymy/hyperonymy (IS-A relation), which gives the network a hierarchical structure; meronymy/holonymy (HAS-A relation) in its part, member, and substance variants; and antonymy (between opposite word forms). With these relations, the WordNet lexical database is configured as a web of 168,000 synsets (concepts) that contain 126,000 different word forms.

A large-scale semantic database such as WordNet seems to have a great potential for text retrieval. There are, at least, two obvious reasons:

• It offers the possibility to discriminate word senses in documents and queries. This would prevent matching spring in its “metal device” sense with documents mentioning spring in the sense of springtime, and retrieval accuracy could thus be improved.
• WordNet provides the chance of matching semantically related words. For instance, spring, fountain, outflow, outpouring, in the appropriate senses, can be identified as occurrences of the same concept, “natural flow of ground water.” And beyond synonymy, WordNet can be used to measure semantic distance between occurring terms to get more sophisticated ways of comparing documents and queries.

However, the general feeling within the information retrieval community is that dealing explicitly with semantic information does not significantly improve the performance of text retrieval systems. This impression is founded on the results of some experiments measuring the role of word sense disambiguation (WSD) for text retrieval, on one hand, and some attempts to exploit the features of WordNet and other lexical databases, on the other hand.

In Sanderson (1994), word sense ambiguity is shown to produce only minor effects on retrieval accuracy, apparently confirming that query/document matching strategies already perform an implicit disambiguation. Sanderson also estimates that if explicit WSD is performed with less than 90% accuracy, the results are worse than not disambiguating at all. In his experimental setup, ambiguity is introduced artificially in the documents, substituting randomly chosen pairs of words (for instance, banana and kalashnikov) with artificially ambiguous terms (banana/kalashnikov). While his results are very interesting, it remains unclear, in our opinion, whether they would be corroborated with real occurrences of ambiguous words. There is also another minor weakness in Sanderson's experiments. When he “disambiguates” a term such as spring/bank to get, for instance, bank, he has done only a partial disambiguation, as bank can be used in more than one sense in the text collection.

Besides disambiguation, many attempts have been made to exploit WordNet for text retrieval purposes. Mainly, two aspects have been addressed: the enrichment of queries with semantically related terms, on one hand, and the comparison of queries and documents via conceptual distance measures, on the other.
Query expansion with WordNet has been shown to be potentially relevant to enhance recall, as it permits matching relevant documents that might not contain any of the query terms (Smeaton et al., 1995). However, it has produced few successful experiments. For instance, Voorhees (1994) manually expanded 50 queries over a TREC-1 collection (Harman, 1993) using synonymy and other semantic relations from WordNet 1.3. Voorhees found that the expansion was useful with short, incomplete queries, and rather useless for complete topic statements, where other expansion techniques worked better. For short queries, there remained the problem of selecting the expansions automatically: doing it badly could degrade retrieval performance rather than enhancing it. In Richardson & Smeaton (1995), a combination of rather sophisticated techniques based on WordNet, including automatic disambiguation and measures of semantic relatedness between query/document concepts, resulted in a drop of effectiveness. Unfortunately, the effects of WSD errors could not be discerned from the accuracy of the retrieval strategy.

However, in Smeaton and Quigley (1996), retrieval on a small collection of image captions (that is, on very short documents) is reasonably improved using measures of conceptual distance between words based on WordNet 1.4. Previously, captions and queries had been manually disambiguated against WordNet. The reason for such success is that with very short documents (e.g., boys playing in the sand) the chance of finding the original terms of the query (e.g., of children running on a beach) is much lower than for average-size documents (that typically include many phrasings for the same concepts). These results are in agreement with Voorhees (1994), but the question remains of whether the conceptual distance matching would scale up to longer documents and queries. In addition, the experiments in Smeaton and Quigley (1996) only consider nouns, while WordNet offers the chance to use all open-class words (nouns, verbs, adjectives, and adverbs).

MONOLINGUAL EXPERIMENTS WITH WORDNET

Our essential retrieval strategy in the experiments reported here is to adapt a classical vector model-based system, using WordNet synsets as indexing space instead of word forms. This approach combines two benefits for retrieval, regardless of multilinguality:

i. terms are fully disambiguated as synsets representing word senses (this should improve precision);
ii. equivalent terms can be identified, as terms with the same sense map to the same synset (this should improve recall).

Note that query expansion does not satisfy the first condition, as the terms used to expand a query are, themselves, words and, therefore, can be in their turn ambiguous. On the other hand, plain word sense disambiguation does not satisfy the second condition, as equivalent senses of two different words are not recognized. Thus, indexing by synsets enables a maximum of word sense matching while reducing spurious matching, and seems to be a good starting point to study text retrieval using either WordNet or EuroWordNet.

Given this approach, our goal is to test two main issues that, to our knowledge, are not clearly answered by the experiments mentioned above:

• Abstracting from the problem of sense disambiguation, what potential does WordNet offer for text retrieval? In particular, we would like to extend experiments with manually disambiguated queries and documents to average-size texts.
• Once the potential of WordNet is known for a manually disambiguated collection, we want to test the sensitivity of retrieval performance to disambiguation errors introduced by automatic WSD.
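The difference between indexing by word forms and indexing by synsets can be sketched with a small example. The synset identifiers below are invented labels, not real WordNet offsets, and the vectors use raw term counts with cosine similarity rather than the weighting schemes of an actual vector-model system; both are simplifying assumptions.

```python
import math
from collections import Counter

# Hypothetical sense-tagged tokens: (word form, synset identifier).
# The identifiers are invented for illustration and are not real WordNet offsets.
DOC = [("fountain", "n-water-source"), ("village", "n-village")]
QUERY = [("spring", "n-water-source")]


def index_by_words(tagged_tokens):
    """Vector over word forms (raw counts)."""
    return Counter(word for word, _ in tagged_tokens)


def index_by_synsets(tagged_tokens):
    """Vector over synset identifiers (raw counts)."""
    return Counter(synset for _, synset in tagged_tokens)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(weight * b[key] for key, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


word_score = cosine(index_by_words(QUERY), index_by_words(DOC))
synset_score = cosine(index_by_synsets(QUERY), index_by_synsets(DOC))
```

Here the query and the document share no word form, so the word-based score is zero, while the synset-based score is positive because spring and fountain are assumed to be tagged with the same concept.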

The Test Collection

The best-known publicly available corpus hand-tagged with WordNet senses is SEMCOR (Miller et al., 1993), a subset of the Brown Corpus of about 100 documents that occupies about 11 Mb (including tags). The collection is rather heterogeneous, covering politics, sports, music, cinema, philosophy, excerpts from fiction novels, scientific texts, etc. A new, bigger version has been made available recently (Landes et al., 1998), but we have not yet adapted it for our collection. We have adapted SEMCOR in order to build a test collection that we call IR-SEMCOR in four manual steps:

• We have split the documents to get coherent chunks of text for retrieval. We have obtained 171 fragments that constitute our text collection, with an average length of 1,331 words per fragment.
• We have extended the original TOPIC tags of the Brown Corpus with a hierarchy of subtags, assigning a set of tags to each text in our collection. This is not used in the experiments reported here.
• We have written a summary for each of the fragments, with lengths varying between 4 and 50 words and an average of 22 words per summary. Each summary is a human explanation of the text contents, not a mere bag of related keywords. These summaries serve as queries on the text collection, so there is exactly one relevant document per query.
• Finally, we have hand-tagged each of the summaries with WordNet 1.5 senses. When a word or term was not present in the database, it was left unchanged. In general, such terms correspond to groups (e.g., Fulton–County–Grand–Jury), persons (Cervantes), or locations (Fulton).

We also generated a list of “stop-senses” and a list of “stop-synsets,” automatically translating a standard list of stop words for English.

Such a test collection offers the chance to measure the adequacy of WordNet-based approaches to IR independently from the disambiguator being used, but also offers the chance to measure the role of automatic disambiguation by introducing different rates of “disambiguation errors” in the collection. The only disadvantage is the small size of the collection, which does not allow fine-grained distinctions in the results. However, it has proved large enough to give meaningful statistics for the experiments reported here.

Although designed for our concrete text retrieval testing purposes, the resulting database could also be useful for many other tasks. For instance, it could be used to evaluate automatic summarization systems (measuring the semantic relation between the manually written and hand-tagged summaries of IR-SEMCOR and the output of text summarization systems) and other related tasks. For the bilingual experiments reported in this article, we also extended the database to include manually translated and indexed versions in Spanish of the summaries.

The Monolingual Experiments

We have performed a number of experiments using a standard vector-model-based text retrieval system, SMART (Salton, 1971), and three different indexing spaces: the original terms in the documents (for standard SMART runs), the word senses corresponding to the document terms (in other words, a manually disambiguated version of the documents), and the WordNet synsets corresponding to the document terms (roughly equivalent to concepts occurring in the documents). These are all the experiments considered here:

1. The original texts as documents and the summaries as queries. This is a classic SMART run, with the peculiarity that there is only one relevant document per query.
2. Both documents (texts) and queries (summaries) are indexed in terms of word senses. That means that we disambiguate manually all terms. For instance, “debate” might be substituted with “debate%1:10:01::.” The three numbers denote the part of speech, the WordNet lexicographer's file, and the sense number within the file. In this case, it is a noun belonging to the noun.communication file. With this collection we can see if plain disambiguation is helpful for retrieval, because word senses are distinguished but synonymous word senses are not identified.
3. In the previous collection, we substitute each word sense for a unique identifier of its associated synset. For instance, “debate%1:10:01::” is substituted with “n04616654,” which is an identifier for

   {argument, debate1} (a discussion in which reasons are advanced for and against some proposition or proposal; “the argument over foreign aid goes on and on”)

   This collection represents conceptual indexing, as equivalent word senses are represented with a unique identifier.
4. We produced different versions of the synset-indexed collection, introducing fixed percentages of erroneous synsets. Thus we simulated a word-sense disambiguation process with 5%, 10%, 20%, 30%, and 60% error rates. The errors were introduced randomly in the ambiguous words of each document. With this set of experiments we can measure the sensitivity of the retrieval process to disambiguation errors.
5. To complement the previous experiment, we also prepared collections indexed with all possible meanings (in their word sense and synset versions) for each term. This represents a lower bound for automatic disambiguation: we should not disambiguate if performance is worse than considering all possible senses for every word form.
6. We also produced a nondisambiguated version of the queries (again, both in its word sense and synset variants). This set of queries was run against the manually disambiguated collection.

FIGURE 1. Different indexing approaches.
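The error-injection procedure of experiment 4 can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure actually used: the function name, the data layout, and the choice to compute the error rate over ambiguous tokens only are all hypothetical.

```python
import random


def inject_sense_errors(tagged_tokens, candidate_synsets, error_rate, seed=0):
    """Replace the synset of a fixed fraction of ambiguous tokens with a wrong one.

    tagged_tokens: list of (word, correct_synset) pairs.
    candidate_synsets: dict mapping each word to all of its possible synsets.
    Assumption: error_rate is measured over ambiguous tokens only.
    """
    rng = random.Random(seed)
    # Only words with more than one candidate synset can be mis-disambiguated.
    ambiguous = [i for i, (word, _) in enumerate(tagged_tokens)
                 if len(candidate_synsets.get(word, [])) > 1]
    n_errors = round(error_rate * len(ambiguous))
    corrupted = list(tagged_tokens)
    for i in rng.sample(ambiguous, n_errors):
        word, correct = tagged_tokens[i]
        wrong = [s for s in candidate_synsets[word] if s != correct]
        corrupted[i] = (word, rng.choice(wrong))
    return corrupted


# Hypothetical sense inventory and a tiny sense-tagged "document".
SENSES = {"spring": ["s1", "s2", "s3"], "bank": ["b1", "b2"], "water": ["w1"]}
tokens = [("spring", "s1"), ("bank", "b1"), ("water", "w1"), ("spring", "s2")] * 5
noisy = inject_sense_errors(tokens, SENSES, error_rate=0.2)
changed = sum(1 for a, b in zip(tokens, noisy) if a != b)
```

With 15 ambiguous tokens and a 20% rate, three tokens receive a wrong synset; the monosemous word "water" is never altered.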

Discussion of Results

Indexing Approach

In Figure 1 we compare different indexing approaches: indexing by synsets, indexing by words (basic SMART), and indexing by word senses (experiments 1, 2, and 3). The leftmost point in each curve represents the percentage of documents that were successfully ranked as the most relevant for their summary/query. The next point represents the documents retrieved as the first or the second most relevant to their summary/query, and so on. Note that, as there is only one relevant document per query, the leftmost point is the most representative of each curve. Therefore, we have included these results separately in Table 1. The results are encouraging:

TABLE 1. Monolingual Experiments

Experiment                                               % correct document retrieved in first place
Indexing by synsets                                      62.0
Indexing by word senses                                  53.2
Indexing by words (basic SMART)                          48.0

Indexing by synsets with a 5% errors ratio               62.0
Id. with 10% errors ratio                                60.8
Id. with 20% errors ratio                                56.1
Id. with 30% errors ratio                                54.4
Indexing with all possible synsets (no disambiguation)   52.6
Id. with 60% errors ratio                                49.1

Synset indexing with nondisambiguated queries            48.5
Word-sense indexing with nondisambiguated queries        40.9

• Indexing by WordNet synsets produces a remarkable improvement on our test collection. 62% of the documents are retrieved in first place by their summary, against 48% for the basic SMART run. This is 14 percentage points more documents, a 29% relative improvement with respect to SMART. This is an excellent result, although we should keep in mind that it is obtained with manually disambiguated queries and documents. Nevertheless, it shows that WordNet can greatly enhance text retrieval: the problem resides in achieving accurate automatic word sense disambiguation.
• Indexing by word senses improves performance when considering up to four documents retrieved for each query/summary, although it is worse than indexing by synsets. This confirms our intuition that synset indexing has advantages over plain word sense disambiguation, because it permits matching semantically similar terms. Taking only the first document retrieved for each summary, the disambiguated collection gives a 53.2% success rate against 48% for the plain SMART query, which represents an 11% improvement. For recall levels higher than 0.85, however, the disambiguated collection performs slightly worse. This may seem surprising, as word sense disambiguation should only increase our knowledge about queries and documents. But we should bear in mind that WordNet 1.5 is not the perfect database for text retrieval, and indexing by word senses prevents some matchings that can be useful for retrieval. In particular, we have confirmed the negative effects of:

- The lack of cross-part-of-speech relations. This means, for instance, that design as a verb is not related at all with design as a noun in the WordNet 1.5 database. Thus, one of our documents summarized using shoes design cannot be recovered using the three appearances of design as a verb in the document. The same occurs in other documents with tempt/temptation, American/America, indifferent/indifference, disarm/disarming, etc. Remarkably, many of these relations can be captured by a naive stemmer that does not distinguish parts of speech. In Krovetz (1997), it is shown that the Porter stemmer, which does not use a lexicon, is surprisingly good at separating unrelated morphological variants and conflating related ones. Cross-part-of-speech relations are even more important in multilingual settings, as many words shift category when translated in context; this is discussed in the next section.
- Lack of topic or domain information. For instance, a document in our database is summarized including the word soldier, as it is a story about soldiers. But the word soldier itself does not appear in the whole document (it is evident from the context), and thus the word or concept soldier is not used for retrieval. If WordNet synsets were tagged with domain information, soldier and words in the document such as battle, enemy, etc., would relate summary and document successfully.
- Too many fine-grained sense distinctions. For instance, in these two paragraphs of a Semcor document:

  1. It got the kind of scrambled, coarsened performance that can happen to the best of orchestras when the man with the baton lacks technique and style.
  2. Not the noblest performance we have heard him play, or the most spacious, or even the most eloquent.

  The word performance is tagged with two distinct meanings. The first one corresponds to

  {performance, public presentation}: a dramatic or musical entertainment; “the play ran for 100 performances” or “the frequent performances of the symphony testify to its popularity”

  and the second to

  {performance}: the act of presenting a play or a piece of music or other entertainment; “we congratulated him on his performance at the recital.”

  However, they are unrelated in the WordNet noun hierarchy (one belongs to the “act” subhierarchy and the other one to “communication”). At least from the point of view of information retrieval, this is an annoying distinction.
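Because each query has exactly one relevant document, the curves of Figure 1 and the first-place percentages of Table 1 can be read as success rates at successive rank cutoffs. A minimal sketch of that computation follows; the function name and the toy rankings are hypothetical, and only the two Table 1 figures in the last line come from the text.

```python
def success_at_rank(rankings, relevant, max_rank):
    """Percentage of queries whose single relevant document is within the top k.

    rankings: dict query_id -> list of document ids, best first.
    relevant: dict query_id -> id of the one relevant document.
    Returns a list whose element k-1 is the percentage for cutoff k.
    """
    totals = [0] * max_rank
    for query_id, ranked_docs in rankings.items():
        target = relevant[query_id]
        if target in ranked_docs[:max_rank]:
            position = ranked_docs.index(target)
            # A hit at this position counts for every cutoff from here on.
            for k in range(position, max_rank):
                totals[k] += 1
    return [100.0 * t / len(rankings) for t in totals]


# Toy rankings for four hypothetical queries.
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d3", "d2", "d1"],
    "q3": ["d1", "d3", "d2"],
    "q4": ["d4", "d1", "d2"],
}
relevant = {"q1": "d1", "q2": "d2", "q3": "d2", "q4": "d4"}
curve = success_at_rank(rankings, relevant, max_rank=3)

# Relative improvement of synset indexing over basic SMART, from Table 1.
relative_gain = 100.0 * (62.0 - 48.0) / 48.0
```

The first element of `curve` corresponds to the leftmost point of a curve in Figure 1, and `relative_gain` reproduces the roughly 29% improvement quoted for synset indexing.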

Sensitivity to Disambiguation Errors

Figure 2 shows the sensitivity of the synset indexing system to degradation of disambiguation accuracy (corresponding to experiments 4 and 5 described above). From the plot, it can be seen that:

Less than 10% disambiguation errors does not substantially affect performance. This is roughly in agreement with Sanderson (1994).

For error ratios over 10%, the performance degrades quickly. This is also in agreement with Sanderson (1994).

However, indexing by synsets remains better than the basic SMART run up to 30% disambiguation errors. From 30% to 60%, the data do not show significant differences from standard SMART word indexing. This finding differs from Sanderson's (1994) result (namely, that it is better not to disambiguate below 90% accuracy). The main difference is that we are using concepts rather than word senses. In addition, it must be noted that Sanderson's setup used artificially created ambiguous pseudo-words (such as "bank/spring"), which are not guaranteed to behave as real ambiguous words. Moreover, what he understands as disambiguating is selecting (in the example) bank or spring, which remain ambiguous words themselves.

If we do not disambiguate, the performance is slightly worse than disambiguating with 30% errors, but remains better than term indexing, although the results are not definitive. An interesting conclusion is that, if we can reliably disambiguate the queries, WordNet synset indexing could improve performance even without disambiguating the documents. This could be confirmed on much larger collections, as it does not involve manual disambiguation.

FIGURE 2. Sensitivity to disambiguation errors.

It is too soon to say whether state-of-the-art WSD techniques can perform with less than 30% errors, because each technique is currently evaluated in fairly different settings. Some of the best results on a comparable setting (namely, disambiguating against WordNet, evaluating on a subset of the Brown Corpus, and treating the 191 most frequently occurring and ambiguous words of English) are reported in Ng (1997). They reach 58.7% accuracy on a Brown Corpus subset and 75.2% on a subset of the Wall Street Journal Corpus. A more careful evaluation of the role of WSD is needed to know whether this is good enough for our purposes.

FIGURE 3. Performance with nondisambiguated queries.


In any case, we have only emulated a WSD algorithm that picks one sense and discards the rest. A more reasonable approach could be to assign a probability to each sense of a word and to use these probabilities to weight synsets in the vector representation of documents and queries.
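The probabilistic alternative mentioned above can be illustrated with a minimal sketch. It is not part of the experiments reported here: the sense probabilities, the synset identifiers, and the function name `weighted_synset_vector` are hypothetical, and a real system would take the probabilities from a WSD component.

```python
from collections import defaultdict


def weighted_synset_vector(tokens, sense_probs):
    """Spread each word occurrence over its candidate synsets.

    Every occurrence contributes one unit of weight, divided among the
    candidate synsets according to the (assumed) sense probabilities,
    instead of committing to a single sense.
    """
    vector = defaultdict(float)
    for word in tokens:
        for synset_id, prob in sense_probs.get(word, {}).items():
            vector[synset_id] += prob
    return dict(vector)


# Hypothetical sense distributions; the synset identifiers are made up.
sense_probs = {
    "performance": {"performance#act": 0.6, "performance#show": 0.4},
    "orchestra": {"orchestra#group": 1.0},
}
vec = weighted_synset_vector(
    ["performance", "orchestra", "performance"], sense_probs
)
```

With such a representation, a wrong but low-probability sense only adds a small amount of noise to the vector, whereas a hard wrong choice removes the correct synset altogether.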

Performance for Nondisambiguated Queries

In Figure 3 we have plotted the results of runs with a nondisambiguated version of the queries, both for word sense indexing and synset indexing, against the manually disambiguated collection (experiment 6). The synset run performs approximately as well as the basic SMART run. It therefore seems useless to apply conceptual indexing if no disambiguation of the query is feasible. This is not a major problem in an interactive system that can help users disambiguate their queries, but it must be taken into account if the process is not interactive and the query is too short for reliable disambiguation.

EUROWORDNET FEATURES FOR TEXT RETRIEVAL

The aim of the EWN project is to develop (semiautomatically) a multilingual database resembling WordNet that stores semantic relations between words in several languages of the European community: Dutch, Italian, Spanish, English, French, and German. The project began in March 1996 and had a duration of 36 months; the first public release of EuroWordNet was scheduled for Spring 1999.

The major feature of the EWN database, compared to WordNet 1.5, is obviously its multilingual nature. We summarize here the multilingual architecture of the database.

Monolingual wordnets. Each language has its individual wordnet with internal relations that reflect specific properties of that language. However, each monolingual wordnet is being built from a common set of 1,024 base concepts (concepts that are relatively high in the semantic hierarchies and that have many relations with other concepts). These have been verified manually to fit all monolingual wordnets. This is one of the measures that guarantee overlap and compatibility between wordnets, reducing spurious mismatches in the hierarchy.

Interlingual index (ILI). A superset of all concepts occurring in the monolingual wordnets. The ILI began as a collection of records that matched WordNet 1.5 synsets, and is growing as new concepts are added. Using WordNet 1.5 as a starting point for the ILI is just a pragmatic decision, as it was already available and has wide coverage. However, the ILI will also be modified with respect to WordNet 1.5, as overly fine-grained sense distinctions will be collapsed. Peters et al. (1998b) describe this process in detail. All interlingual relations and language-independent information are linked to the ILI, as explained below.

Cross-language relations. Each wordnet is linked to the ILI via cross-language equivalence relations, namely:

cross-language synonymy: It:anitra EQ-NEAR-SYNONYM duck
cross-language hypernymy: Dutch:hoofd (human head) EQ-HAS-HYPERNYM head
cross-language hyponymy: Sp:dedo (finger or toe) EQ-HAS-HYPONYM finger; Sp:dedo EQ-HAS-HYPONYM toe

Cross-language complex relations (hypernyms and hyponyms) indicate potentially new ILI records. After each building stage, all complex relations are collected and compared across languages, and new ILI records will be added if appropriate. These relations facilitate cross-language retrieval.
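As an illustration of how such equivalence links could be stored and queried, the sketch below keeps them in a small in-memory index. This is only an assumed representation, not the EuroWordNet database format: real ILI records are synsets with identifiers, which are simplified here to English head words, and the language codes and function names are hypothetical.

```python
from collections import defaultdict

# (language, word, relation, ILI record) tuples for the examples above.
# ILI records are simplified to English head words for illustration.
EQUIVALENCE_LINKS = [
    ("it", "anitra", "EQ-NEAR-SYNONYM", "duck"),
    ("nl", "hoofd", "EQ-HAS-HYPERNYM", "head"),
    ("es", "dedo", "EQ-HAS-HYPONYM", "finger"),
    ("es", "dedo", "EQ-HAS-HYPONYM", "toe"),
]


def build_ili_index(links):
    """Group equivalence links by (language, word)."""
    index = defaultdict(list)
    for lang, word, relation, ili in links:
        index[(lang, word)].append((relation, ili))
    return index


def ili_records(index, lang, word):
    """Return the sorted ILI records reachable from a word, if any."""
    return sorted(ili for _, ili in index.get((lang, word), []))


index = build_ili_index(EQUIVALENCE_LINKS)
spanish_dedo = ili_records(index, "es", "dedo")
```

In this toy index, the Spanish word dedo reaches both finger and toe, which is the kind of complex relation that signals a candidate new ILI record.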

Top-concept ontology. A hierarchy of 63 language-independent concepts reflecting explicit opposition relations (e.g., object versus substance). This ontology is linked to the base concepts through the ILI (see Rodriguez et al., 1998).


Hierarchy of domain labels. Also linked to the ILI and thus inherited by every monolingual wordnet.

Besides the multilingual nature of EWN, there are a number of additional features (compared to WordNet 1.5) that are relevant from the point of view of text retrieval:

EuroWordNet will contain about 50,000 word meanings correlating the 20,000 most frequent words (only nouns and verbs in the first stage) in each language. This size should be sufficient to experiment with generic, domain-independent text retrieval in a multilingual setting without the need for training with bilingual parallel corpora. The individual monolingual databases will be considerably smaller than WordNet 1.5, but the difference in coverage is only for specific subdomains; the coverage of the most frequent words and of more generic terms will be similar in both databases.

The EWN database will be expanded to a higher level of detail for one specific domain, in order to test its adequacy to incorporate domain-specific thesauri.

Synsets have domain labels that relate concepts on the basis of topics or scripts rather than classification. This means that tennis shoes and tennis racquets will be related through a common domain labeled tennis. Such relations are very important for text retrieval and many other tasks, including word-sense disambiguation.

Nouns and verbs do not form separate networks. EuroWordNet includes cross-part-of-speech relations:

- noun-to-verb-hypernym: angling → catch (from angling: sport of catching fish with a hook and line)
- verb-to-noun-hyponym: catch → angling
- noun-to-verb-synonym: adornment → adorn (from adornment: the act of adorning)
- verb-to-noun-synonym: adorn → adornment

Again, these relations establish links that are significant from the point of view of text retrieval. In particular, adorn and adornment are nearly equivalent for retrieval purposes, regardless of their different parts of speech.

Now we turn to our proposal to exploit the EWN database for language-independent text retrieval.

LANGUAGE-INDEPENDENT TEXT RETRIEVAL WITH EWN

Our proposal, first introduced in Gilarranz et al. (1997), is to index documents in terms of the ILI records (which, in practice, serve as a language-independent ontology). The only difference from the monolingual approach described above is that the indexing space is not exactly WordNet 1.5, but a refined version of WordNet that serves as a link between all the individual wordnets in the EWN database. Thus, the indexing space and its associated tasks, such as weighting terms, become language-independent. Two major processes have to be considered: document indexing and query/document matching.

Document Indexing

Document indexing is performed in two stages: a language-dependent one that maps terms to ILI records, and a language-independent one that assigns weights to the representation.

Language-dependent Stage

1. Part-of-speech tagging. This is a first step toward disambiguation and should not cause problems. Part-of-speech tagging can be performed with more than 96% precision for many languages; see, for example, Brill (1992) and Márquez and Padró (1997). Note that we do not assume that words with different categories have different meanings. Although tagging is a necessary step for disambiguation, EuroWordNet cross-part-of-speech relations may link close meanings that belong to different lexical categories.


2. Term identification. This step includes stemming and reconstruction, and the identification of multiwords. The detection of multiwords is known to be beneficial to text retrieval tasks; WordNet is rich in multiword information, thus offering a potential for retrieval refinement that should be exploited. However, an appropriate treatment of multiwords from a multilingual perspective is not at all simple.

As has been stated, the detection of lexicalized multiwords in a monolingual setting can enhance precision. For instance, hot spring can be identified in a document as a lexicalized multiword simply by inspecting WordNet 1.5 entries. We can thus assign a single meaning to hot spring, avoiding a separate inclusion of meanings for hot and spring, which would not reflect the content of the document. Even when WordNet 1.5 includes nonlexicalized phrases such as a great distance or fasten with a screw, it would seem helpful to use these in order to refine term identification and matching for monolingual text retrieval. In fact, such nonlexicalized phrases are very common in WordNet 1.5, which oscillates between lexical and conceptual criteria when constructing the synsets. However, with many such phrases the best solution is probably to search for the lexically significant words in close co-occurrence, e.g., for fasten near screw.

The handling of nonlexicalized phrases is not a simple task in the cross-language setting, partly because the situation is not symmetric across languages, and this asymmetry frequently reflects important differences in conceptualization between languages that must not be lost. Consider, for instance, lexical items in one language that do not have equivalents in another. In order to provide an exact translation equivalent, recourse is normally made to a phrase. An example is toe, which does not have a direct equivalent in Spanish. The closest lexical item is dedo, which means finger or toe. Thus, going from one language to the other we appear to lose information on specificity; a solution could be to introduce a Spanish synset containing a phrase, even if it is not lexicalized, to describe the concept in Spanish. The appropriate phrase in Spanish would be dedo del pie (del pie = of the foot). However, we have to consider whether this is the most correct way to deal with this kind of situation. When a Spanish document is talking about toes, it will probably just use the term dedo. A retrieval system looking for dedo del pie as a single bound item could miss relevant information. The best solution is probably the one already suggested above for monolingual retrieval: to search for both dedo and pie in close proximity and also for dedo alone; pie can be used as a weight for document ranking.

The question of the treatment of multiwords and lexicalized/nonlexicalized translation equivalents is one that affects other possible applications of the EuroWordNet database. The decision taken by the project has been to include only lexicalized concepts in each monolingual wordnet. For CLTR, this means that we should look for cross-language hyponyms or hypernyms when a lexical item does not have a lexicalized equivalent in some target language.

3. Word-sense disambiguation. It is usually assumed that information retrieval systems perform an implicit disambiguation when comparing queries and documents, because the adequate senses for a term are reinforced by the terms in the context (Krovetz & Croft, 1992; Sanderson, 1994). So how should we index in terms of ILI records? Is it better to disambiguate with a certain error ratio, or can we assign all possible ILI records for each word form? Would conceptual indexing improve retrieval in a monolingual setting, or would it have only a subtle effect, as previous experiments suggest? These and other issues have been addressed in the experiments reported in the next section.

Mapping into the Inter Lingual Index. Once the terms in the documents have been disambiguated in terms of the relevant monolingual wordnet, they can be mapped to the Inter Lingual Index via cross-language equivalence relations; in EuroWordNet, there will be at least one equivalence relation per synset, ensuring a complete mapping.

Language-independent Stage

Weighting. Using a classical vector-space model, synset weighting can be done employing language-independent criteria. Standard weighting schemes combine within-document term frequency (TF), where a term is more relevant in a document if it appears repeatedly, and inverse document frequency (IDF), where a term is more relevant if its frequency in the document is significantly higher than its frequency in the collection. Such weighting schemes (nnn, atc, etc.) can be rendered language-independent when WordNet synsets are used as indexing terms for the documents in each language.

Besides standard weighting, depth in the conceptual hierarchy can also be used to weight synsets, as synsets deeper in the hierarchy are more specific and therefore more informative. It follows that the uppermost synsets are the least informative and can probably be removed, thus providing a list of stop synsets. This is an interesting possibility provided by the WordNet hierarchy, but its effectiveness has to be carefully evaluated, as it may well depend on the homogeneity of the database. It is known that the WordNet hierarchy is not well balanced, and thus a simple measure of hierarchical depth might not be reliable for weighting. The building strategy used for the EuroWordNet database is expected to provide a more evenly balanced hierarchy (Rodriguez et al., 1998), but only an evaluation of the final database will be able to guarantee this.


The same process will be applied to queries, although performing disambiguation is more difficult because queries are very short compared with documents and thus offer little contextual information.
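The language-independent weighting stage can be sketched as follows. This is a generic TF × IDF formulation over synset identifiers, not the SMART nnn or atc schemes used in the experiments; the function name, the toy documents, and the stop-synset list are all hypothetical.

```python
import math
from collections import Counter


def tfidf_synset_vectors(docs, stop_synsets=frozenset()):
    """Weight synsets with tf * log(N / df), ignoring stop synsets.

    docs is a list of documents, each one a list of ILI/synset
    identifiers; the language of the original text no longer matters.
    """
    docs = [[s for s in doc if s not in stop_synsets] for doc in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each synset occurs.
    df = Counter(s for doc in docs for s in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({s: tf[s] * math.log(n_docs / df[s]) for s in tf})
    return vectors


# Toy collection; the identifiers stand in for ILI records.
docs = [
    ["entity", "soldier", "battle", "soldier"],  # e.g. an English text
    ["entity", "battle", "enemy"],               # e.g. a Spanish text
    ["entity", "tennis", "racquet"],
]
vectors = tfidf_synset_vectors(docs, stop_synsets={"entity"})
```

Here the very general synset entity is treated as a stop synset, in the spirit of removing the uppermost nodes of the hierarchy; whether depth is a reliable criterion for this is exactly the open question raised above.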

Query/document Matching

We will experiment with three approaches to query/document comparison. Each approach adds some information to the previous one:

a. Cosine comparison. As formally we have a classical vector model, we use classical cosine comparison as a baseline. Thus we can evaluate separately the impact of the indexing process and the methods for comparison, as has been discussed.

b. Weighted expansion. The vector can be expanded, still in a language-independent manner, by including related ILI records. The first candidates are cross-POS synonyms, which usually have strongly related meanings (see previous sections). Meronyms also seem to be good candidates, as they are likely to appear in context. However, we are aware that expansions beyond synonymy are not guaranteed to improve performance, and so careful evaluation of all kinds of expansion is required.

c. Measure of semantic relatedness. Instead of simply matching identical concepts, it is possible to measure the semantic relatedness of query and document indexing concepts. A similar approach gave good results for monolingual retrieval in Smeaton and Quigley (1996). In addition, the domain labels could be used to score occurrences of words related to the same topics.
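Approaches (a) and (b) can be sketched together. The cosine function is the standard one for sparse vectors; the expansion step, its down-weighting factor of 0.5, and the toy relation table are assumptions made for illustration only and have not been evaluated.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand(vector, related, factor=0.5):
    """Add related concepts to a vector with a reduced weight."""
    expanded = dict(vector)
    for concept, weight in vector.items():
        for other in related.get(concept, ()):
            expanded[other] = expanded.get(other, 0.0) + factor * weight
    return expanded


# Cross-POS synonym pair taken from the adorn/adornment example.
related = {"adorn": ["adornment"]}
query = {"adorn": 1.0}
document = {"adornment": 1.0, "house": 1.0}

plain_score = cosine(query, document)
expanded_score = cosine(expand(query, related), document)
```

Without expansion the query and the document share no concept and the score is zero; after adding the cross-POS synonym with a reduced weight, the document receives a positive score.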

Summing Up

This proposal for cross-language text retrieval has attractive advantages over other techniques:

It performs language-independent indexing, providing
- a semantic structure to perform explicit WSD for indexing;
- language-independent weighting criteria.

It permits language-independent retrieval, by
- concept comparison rather than term comparison;
- topic comparison.

It does not require training or the availability of parallel corpora (a great advantage when thinking of more than two languages or when performing retrieval on unrestricted texts, such as WWW searches).

The EuroWordNet architecture seems better suited, a priori, for text retrieval than WordNet 1.5:
- words can be conceptually related even if they have different POS;
- besides classification relations, synsets also have topic information (domain labels), which is especially useful for text retrieval.

BILINGUAL EXPERIMENTS WITH EUROWORDNET

In any cross-language setting for text retrieval, some kind of query translation from the query language to the document language is required. When going from the query to the target language, query expansion techniques with bilingual dictionaries introduce a genuine cross-language mechanism that degrades retrieval effectiveness. However, in our approach, indexing with many languages does not involve any operation beyond those performed in our monolingual experiments above. It is reasonable to expect that the degradation in our framework when going to a cross-language scenario should be less pronounced than for query expansion techniques.

At the time of doing this research, the EuroWordNet database was not yet available. The Spanish wordnet covers only nouns and verbs (with a limited number of noun-to-verb relations), has not reached its full coverage, and needs further refining and filtering. However, it is interesting to perform cross-language retrieval experiments in order to understand how the retrieval process works and which features of the database must be improved or newly considered. It also permits one to establish a qualitative reference framework to evaluate its potential for cross-language retrieval, as compared to other approaches.

English-Spanish Test Collection

We have experimented with Spanish queries to retrieve English documents. To do so, we have prepared manual translations (into Spanish) of the 171 IR-Semcor summaries described previously. Then we have manually indexed the occurrences of nouns and verbs in terms of the Spanish WordNet. When the appropriate sense of a term was missing in the Spanish WordNet, we added it manually and linked it to the EWN interlingual index.

Spanish-English Experiments

The main experiment uses the Spanish queries to retrieve English documents, indexing queries and documents in terms of the EuroWordNet Inter Lingual Index. As we currently have only nouns and verbs in the Spanish database, only these word classes have been used to index the Spanish queries. In order to evaluate the results, we have performed some complementary experiments.


Dictionary-based term translation. We performed a naive dictionary-based translation of the terms in the Spanish query, using the VOX-Harraps Spanish-English Dictionary and picking up every possible English translation for each term. The VOX-Harraps contains around 28,000 entries in its Spanish-English version; if a word was not included in the dictionary, it was not considered, except for proper nouns, which were manually translated into their English equivalents. The queries were then used in a normal SMART run with the text database. This run, being the simplest term translation of the query, is used as a baseline against EWN-based retrieval.

Monolingual retrieval with nouns and verbs. In order to discriminate possible sources of cross-language degradation, we have used only nouns and verbs in the English queries for retrieval, to compare this result directly with the Spanish-English experiment.

Retrieval with different POS. Finally, as the relative coverage of each open-class word category will not be the same for English and the rest of the languages, we have complemented the previous experiment using different combinations of word classes in the queries, in order to know which classes are more relevant for retrieval.

Results

Again, the most relevant figure is the number of documents correctly retrieved as the most relevant document for their summary, which is displayed separately in Table 2. These are the main results.

Cross-Language Degradation of Synset Indexing

In Figure 4, monolingual and cross-language synset indexing are compared to measure the degradation of our EWN-based approach to text retrieval in a cross-language setting, comparing queries where only nouns and verbs are processed. The monolingual experiment gives 60.2% of the appropriate documents retrieved in first place, while the cross-language one gives 48%. This represents a 20% degradation, which is a promising result: note that comparing (Table 2) the monolingual SMART run with the dictionary-based term translation (48% against 24%) gives a 50% degradation, which is standard behavior for naive cross-language retrieval.

However, this 20% cannot be explained in terms of translation ambiguity, as both English and Spanish queries are manually disambiguated. We analyze this result in the section Analysis of Translated Queries.

TABLE 2 Bilingual Experiments

Experiment                                      % correct document retrieved in first place
Bilingual indexing by synsets                   48.0
Monolingual indexing by synsets                 62.0
Dictionary-based term translation               24.0
Monolingual indexing by words (basic SMART)     48.0

FIGURE 4. Cross-language degradation of synset indexing.
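The two degradation figures quoted above are relative drops with respect to the corresponding monolingual run. The short sketch below reproduces the arithmetic from the reported percentages; the function name is ours.

```python
def relative_degradation(baseline, degraded):
    """Percentage of the baseline score that is lost."""
    return 100.0 * (baseline - degraded) / baseline


# Synset indexing, nouns and verbs only: monolingual vs. cross-language.
synset_drop = relative_degradation(60.2, 48.0)

# Word indexing: monolingual SMART run vs. dictionary-based translation.
word_drop = relative_degradation(48.0, 24.0)
```

The first value is about 20.3% and the second is exactly 50%, which matches the rounded figures in the text.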

Comparison to Dictionary-Based Term Translation

The performance of synset indexing and dictionary-based term translation is compared in Figure 5. Indexing in terms of the Inter Lingual Index improves cross-language retrieval effectiveness from 24% to 48%, which represents a 100% improvement over dictionary-based term translation, even matching the monolingual results of the standard SMART run. These results strongly suggest that language-neutral indexing in terms of the EWN Inter Lingual Index may improve cross-language effectiveness, provided that this indexing can be performed automatically with enough precision. We are currently testing WSD algorithms in this CLTR environment to find out what "enough precision" means for our retrieval task, beyond the experiments described in this paper.

It must be noted that, as in the monolingual experiments described earlier, these results are obtained for the simplest approach to synset indexing. We have considered only nouns and verbs, and we have not used any semantic relation other than synset membership of the query terms (ignoring hyponymy/hypernymy and cross-part-of-speech relations). There are many ways in which these results can be further refined, in order to improve retrieval and, at the same time, tune the design of our multilingual database for retrieval purposes.

FIGURE 5. Comparison between cross-language retrieval strategies.

Relevance of Different Word Classes

The precision/recall figures for the monolingual experiments with different word classes are represented in Figure 6, and the most significant data are given in Table 3. Retrieving only with the nouns in the queries gives 59.6% of the correct documents in first place, while retrieving with all synsets (nouns, verbs, adjectives, and adverbs) gives 62%. Apparently, treating nouns gives a good first approximation for retrieval. It must be noted, however, that the rest of the open-class words (verbs, adjectives, and adverbs) are also meaningful for retrieval, but they are simply less frequent. In Table 4, the retrieval effectiveness of each word class is compared to its frequency in the queries. The ratio of documents correctly retrieved to the mean number of occurrences per query is very similar for nouns and verbs, and it is higher for adjectives.
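The ratio discussed above can be recomputed from the values reported in Table 4. The sketch below simply divides the percentage of documents correctly retrieved by the mean number of words of that class per query; the dictionary layout and variable names are ours.

```python
# word class -> (mean words per query, % docs correctly retrieved),
# copied from Table 4.
TABLE_4 = {
    "nouns": (6.1, 59.6),
    "adjectives": (2.5, 37.4),
    "verbs": (2.2, 21.6),
    "adverbs": (0.37, 0.1),
}

# Percentage points of first-place retrieval per word of each class.
ratios = {cls: pct / words for cls, (words, pct) in TABLE_4.items()}
```

Nouns and verbs both come out at roughly 9.8 percentage points per word, while adjectives reach about 15, which is the pattern described in the text.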

FIGURE 6. Performance of synset indexing with different word classes.

TABLE 3 Experiments with Different Word Classes

Experiment                                  % correct document retrieved in first place
Monolingual, all classes                    62.0
Monolingual, only nouns                     59.6
Monolingual, only adjectives                37.4
Monolingual, only verbs                     21.6
Monolingual, only adverbs                   0.1
Monolingual, all classes except nouns       40.4
Monolingual, nouns and verbs                60.2
Cross-Language, nouns and verbs             48.0
Cross-Language, only nouns                  46.2
Cross-Language, only verbs                  16.4

TABLE 4 Retrieval Effectiveness with Different Word Classes in English

Word class      Words per query     % docs. correctly retrieved
nouns           6.1                 59.6
adjectives      2.5                 37.4
verbs           2.2                 21.6
adverbs         0.37                0.1

The results for the bilingual experiments with nouns and verbs present a similar pattern: nouns give 46.2%, verbs 16.4%, and nouns with verbs 48%. However, adjectives possibly play a more important role in Spanish than in English, as what are noun compounds in English are often expressed as nouns with adjectives in Spanish. Thus, relations between adjectives and nouns may be necessary to identify phrases, which is crucial for cross-language retrieval.

Analysis of Translated Queries

The 20% degradation for our cross-language experiment with synsets cannot be explained in terms of cross-language ambiguity, as both sets of queries are manually disambiguated. Thus, it is interesting to examine the correlation between English queries and their translations into Spanish, to find out the relevant sources of mismatches.

Table 5 shows the percentage of overlapping Inter Lingual Index records between English and Spanish summaries. For nouns, 63% of the synsets present in the English summaries appeared in their Spanish counterparts. For verbs, this percentage is even lower. A small part of the mismatches may be due to annotation errors, but after a manual inspection of selected queries we have found that the most important sources of mismatches are the excessive fine-grainedness of the interlingual index and terms shifting category from one language to another.
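The overlap measure reported in Table 5 can be read as the share of ILI records in an English summary that also occur in its Spanish translation. A minimal sketch, assuming each summary is reduced to a set of ILI identifiers (the identifiers below are made up):

```python
def overlap_percentage(english_ili, spanish_ili):
    """Percentage of English ILI records also found in the Spanish summary."""
    english_ili = set(english_ili)
    if not english_ili:
        return 0.0
    shared = english_ili & set(spanish_ili)
    return 100.0 * len(shared) / len(english_ili)


# Hypothetical annotations of one summary and its translation; the two
# annotators chose different ILI records for one of the five terms.
english = {"debate", "increase#act", "aid", "education", "georgia"}
spanish = {"debate", "increase#event", "aid", "education", "georgia"}
overlap = overlap_percentage(english, spanish)
```

In this toy case four of the five English records are shared, giving an overlap of 80%; a single divergent sense choice is enough to lower the figure.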

Too Much Fine-Grainedness of the Inter Lingual Index

The EWN Inter Lingual Index is still very close to WordNet 1.5, and thus the sense distinctions are too fine-grained, even for a human annotator. This is not a genuine cross-language problem, but pervades every annotation with WordNet 1.5. For instance, in "Debate on the increase of federal aids for education in Georgia," the term increase was annotated in its "act" sense during the manual annotation of English queries. But during the annotation of the Spanish translations of the queries, the equivalent term, incremento, was indexed in its "event" sense, thus pointing to a different interlingual index record. Unfortunately, this is a very common situation, especially for verbs.

TABLE 5 Index Overlapping of English Summaries with Spanish Summaries

                    % synsets in Spanish summary    % synsets only in English
Nouns               63                              37
Nouns + verbs       60                              40
All                 45                              55


This problem interacts with a different one, namely, the different coverage and granularity of the Spanish WordNet, compared to the English one. The Spanish WordNet has fewer senses. When the appropriate sense was not found in the database, it was manually introduced during the annotation process. But many times, the senses already in the database were good enough. As there were more refined senses for English, the chance of getting the same annotation was lower.

These kinds of mismatches should be less relevant when the ongoing sense-clustering task for WordNet 1.5 is accomplished (Peters et al., 1998a). Up to now, the following clusterings have been considered:

sisters, word senses that share the same hypernym, such as table in the sense of "piece of furniture" and table in the sense of "piece of furniture for a meal laid out on it".

autohyponyms, words whose senses are each other's direct hypernyms or hyponyms, such as variety in the sense of "specific kind of something" and in the sense of "category of things distinguished by some common quality."

twins, synsets that have at least three members in common. For instance, {violate, fail to agree with, go against, break, be in violation of} and {violate, go against, breach, break, be in violation of}.

cousins, pairs of top nodes whose hyponyms exhibit a specific relation to each other. For instance, hyponyms of the container node and the containerful node, such as bag, cup, spoon, etc.

systematic polysemic patterns as they have been identified in the CoreLex database (Buitelaar, 1998).

However, cases such as the increase example stated above are not covered by any of these phenomena. The text retrieval testbed is therefore an excellent way of testing whether the clustering is effective or not. With increase and many other cases, the problem is that meanings that are identical for text retrieval are found in totally different parts of the hierarchy. It seems necessary to detect polysemy regularities related to ontological distinctions to handle these cases.
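Of the clustering criteria listed above, the twins test is the easiest to state operationally. The sketch below checks it on the violate example; it is only an illustration of the definition, not the clustering procedure of Peters et al., and the function name and the `min_shared` parameter are ours.

```python
def are_twins(synset_a, synset_b, min_shared=3):
    """Two synsets are twins if they share at least min_shared members."""
    return len(set(synset_a) & set(synset_b)) >= min_shared


violate_1 = {"violate", "fail to agree with", "go against", "break",
             "be in violation of"}
violate_2 = {"violate", "go against", "breach", "break",
             "be in violation of"}

twins = are_twins(violate_1, violate_2)
```

The two synsets share four members (violate, go against, break, be in violation of), so they qualify as twins. The increase case would not be caught by any such membership test, since the two senses sit under different top nodes ("act" and "event").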

Terms Shifting Category

Apart from disambiguation problems, the most pronounced effect is the translation of English noun compounds into Spanish adjectives. For instance, "bone growth centers" is translated into "centros de crecimiento óseo," where óseo is the Spanish adjective for "pertaining to the bone." In this way, we are currently losing the most significant word in the expression "bone growth centers," because we still have not considered adjectives in the Spanish database and, of course, we do not have cross-POS relations between adjectives and nouns yet. This can be a very significant problem, as it is related to the proper matching of phrases across languages, which has proven to be essential for cross-language text retrieval (Ballesteros & Croft, 1998). This fact has led us to give more attention to adjectives in EWN than was previously foreseen.

In general, there are also shifts from nouns to verbs and vice versa, due to rephrasing in the translation. Such effects should not be a problem if the noun, verb, and adjective hierarchies are highly interconnected, and thus the situation should improve with the forthcoming version of the EWN database.

CONCLUSIONS

We have presented a novel approach to cross-language text retrieval. The rationale behind it is to perform conceptual indexing and retrieval of documents in a space of language-independent concepts, exploiting the EuroWordNet large-scale multilingual semantic database.

Previous work on monolingual settings had reported empirical evidence of the limited gain in retrieval performance when adding semantic information to standard IR processes. However, we found that such results fail to distinguish two related but different issues: indexing strategies on the one hand and problems of word sense disambiguation on the other.

To investigate the viability of our approach, we have first built a collection of fully disambiguated documents and queries (adapted from Semcor) to evaluate empirically the effect of concept indexing in contrast with standard text retrieval techniques. Then we have designed and performed a number of experiments in order to test and compare a variety of strategies. The experiments not only show a clear improvement in performance when indexing by synsets, but also establish a range of disambiguation error rates within which this approach can still enhance standard retrieval results.

Finally, we have performed a preliminary experiment on cross-language text retrieval using Spanish queries against English documents. Although the EWN database is far from its final form, indexing by Inter Lingual Index records gives much better results than a naive dictionary-based term translation run used as a baseline. The degradation of the synset indexing approach when going from English-English to Spanish-English retrieval is 20%, which is very promising considering all the problems that still have to be solved concerning the coverage and quality of the database.

Although there is still much work to be done, we believe that these results are strong evidence in favor of using multilingual ontologies for cross-language text retrieval. In particular, they seem very appropriate for multilingual searches over heterogeneous domains (such as WWW searches), where the lack of adequate multilingual parallel corpora makes corpus-based approaches hard to apply.

APPENDIX

We transcribe here an example from our IR-Semcor collection. The text is the sixth fragment of Semcor document br-c01, which we have divided into six fragments corresponding to six different reviews of movies, concerts, etc.

It is truly odd and ironic that the most handsome and impressive film yet made from Miguel_de_Cervantes "Don_Quixote" is the brilliant Russian spectacle, done in wide_screen and color, which opened yesterday at the Fifty-fifth_Street and Sixty-eighth_Street_Playhouses. More_than a beautiful visualization of the illustrious adventures and escapades of the tragi-comic knight-errant and his squire, Sancho_Panza, in seventeenth century Spain, this inevitably abbreviated rendering of the classic satire on chivalry is an affectingly warm and human exposition of character. Nikolai_Cherkasov, the Russian actor who has played such heroic roles as Alexander_Nevsky and Ivan_the_Terrible, performs the lanky Don_Quixote, and does so with a simple dignity that bridges the inner nobility and the surface absurdity of this poignant man. His addle-brained knight-errant, self appointed to the ridiculous position in an age when armor had already been relegated to museums and the chivalrous code of knight-errantry had become a joke, is, as Cervantes no_doubt intended, a gaunt but gracious symbol of good, moving soberly and sincerely in a world of cynics, hypocrites and rogues. Cherkasov does not caricature him, as some actors have been_inclined to do. He treats this deep-eyed, bearded, bony crackpot with tangible affection and respect. Directed by Grigory_Kozintsev in a tempo that is studiously slow, he develops a sense of a high tradition shining brightly and passing gravely through an impious world. The complexities of communication have been considerably abetted in this case by appropriately stilted English_language that has been excellently dubbed in_place_of the Russian dialogue. The voices of all the characters, including that of Cherkasov, have richness, roughness or color to conform with the personalities. And the subtleties of the dialogue are most helpfully conveyed. Since Russian was being spoken instead of Spanish, there is no violation of artistry or logic here. Splendid, too, is the performance of Yuri_Tolubeyev, one of Russia's leading comedians, as Sancho_Panza, the fat, grotesque "squire". Though his character is broader and more comically rounded than the don, he gives it a firmness and toughness - a sort_of peasant dignity - too. It is really as though the Russians have seen in this character the oftentimes underlying vitality and courage of supposed buffoons. The episode in_which Sancho_Panza concludes the joke that is played on him when he is facetiously put in command of an "island" is one of the best in the film.


True, the pattern and flow of the drama have strong literary qualities that are a_bit wearisome in the first_half, before Don_Quixote goes to the duke's court. But strength and poignancy develop thenceforth, and the windmill and deathbed episodes gather the threads of realization of the wonderfulness of the old_boy. There are other good representations of peasants and people of the court by actors who are finely costumed and magnificently photographed in this last of the Russian films to reach this country in the program of joint cultural exchange. Also on the bill at the Fifty-fifth_Street is a nice ten minute color film called "Sunday_in_Greenwich_Village", a tour of the haunts and joints.

The summary used as query is:
A new Russian film based on the novel Don_Quixote turns_out to be the most impressive rendering of this classic.

As an example of indexing by word senses, the summary is indexed as:
new%3:00:00:: russian%3:01:00:: film%1:10:01:: base%2:31:00:: novel%1:10:00:: don_quixote%1:18:01:: turn_out%2:42:01:: be%2:42:03:: most%4:02:01:: impressive%3:00:00:: rendering%1:04:01:: classic%1:06:00::
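
The sense identifiers above appear to follow the WordNet sense-key layout lemma%ss_type:lex_filenum:lex_id:head_word:head_id. As a minimal sketch (not part of the original system; the names SenseKey and parse_sense_key are hypothetical), such keys can be split into their fields as follows:

```python
from typing import NamedTuple

# WordNet ss_type codes as they appear after the '%' in a sense key.
SS_TYPES = {
    1: "noun",
    2: "verb",
    3: "adjective",
    4: "adverb",
    5: "adjective satellite",
}


class SenseKey(NamedTuple):
    """Hypothetical container for the fields of one sense key."""
    lemma: str
    pos: str
    lex_filenum: int
    lex_id: int


def parse_sense_key(key: str) -> SenseKey:
    """Split 'lemma%ss_type:lex_filenum:lex_id:head_word:head_id' into fields."""
    lemma, _, rest = key.partition("%")
    # head_word and head_id are empty except for adjective satellites.
    ss_type, lex_filenum, lex_id, _head_word, _head_id = rest.split(":")
    return SenseKey(lemma, SS_TYPES[int(ss_type)], int(lex_filenum), int(lex_id))


# The summary from the appendix, indexed by word senses.
summary = (
    "new%3:00:00:: russian%3:01:00:: film%1:10:01:: base%2:31:00:: "
    "novel%1:10:00:: don_quixote%1:18:01:: turn_out%2:42:01:: be%2:42:03:: "
    "most%4:02:01:: impressive%3:00:00:: rendering%1:04:01:: classic%1:06:00::"
)
parsed = [parse_sense_key(k) for k in summary.split()]

for sense in parsed:
    print(sense.lemma, sense.pos, sense.lex_filenum, sense.lex_id)
```

Each index term keeps its lemma and part of speech, so two senses of the same word form (for instance two different senses of "film") would be distinct terms in the index.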

Indexing the summary by WordNet synset identifiers gives:
a01256444 a02130100 n04323474 v00358438 n04198190 n05837444 v01490020 v01472320 r00055410 a00973705 n00056790 n01993371

where, for instance, n04323474 is a unique identifier for the WordNet synset: {movie, film1, picture2, moving picture, motion picture, picture show, flick} -- (a form of entertainment provided by a sequence of images giving the illusion of continuous movement; "they went to a movie every Saturday night")

The Spanish translation is:
Una nueva película rusa basada en la novela Don_Quijote resulta ser la versión más impresionante de este clásico.

and indexed by synsets:
n04323474 v00358438 n02818521 n05837444 v01489871 v01506899 n04220780 n01993371
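
Because both versions of the query are indexed in the same synset space, comparing them reduces to comparing lists of identifiers. The following sketch is an illustration only, not the term weighting used in the experiments: it computes the shared identifiers and a binary-weight cosine between the English and Spanish index lists of the summary. The function name cosine_binary is hypothetical.

```python
import math

# Synset identifiers of the English summary (from the appendix).
english = (
    "a01256444 a02130100 n04323474 v00358438 n04198190 n05837444 "
    "v01490020 v01472320 r00055410 a00973705 n00056790 n01993371"
).split()

# Synset identifiers of the Spanish translation (from the appendix).
spanish = (
    "n04323474 v00358438 n02818521 n05837444 "
    "v01489871 v01506899 n04220780 n01993371"
).split()


def cosine_binary(a, b):
    """Cosine similarity between two term lists using 0/1 weights."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0
    return len(set_a & set_b) / math.sqrt(len(set_a) * len(set_b))


shared = sorted(set(english) & set(spanish))
similarity = cosine_binary(english, spanish)

print("shared synsets:", shared)
print("binary cosine: %.3f" % similarity)
```

For the two lists above, four identifiers (n01993371, n04323474, n05837444, v00358438) are shared, although the queries are written in different languages.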


The dictionary-based term translation gives:
film base be_based novel don_quijote result turn_out_to_be come_out be belong come_from be being core version classic classical classic
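
To give a rough sense of the noise introduced by term-by-term translation, the sketch below compares the translated terms above with the lemmas of the original English summary. This comparison is only illustrative and was not part of the reported experiments; the variable names are hypothetical.

```python
# Lemmas of the original English summary, taken from its sense keys.
english_lemmas = (
    "new russian film base novel don_quixote turn_out be "
    "most impressive rendering classic"
).split()

# Output of the dictionary-based term translation of the Spanish query.
translated = (
    "film base be_based novel don_quijote result turn_out_to_be come_out "
    "be belong come_from be being core version classic classical classic"
).split()

distinct = set(translated)
matching = distinct & set(english_lemmas)   # terms also in the English summary
noise = distinct - matching                 # alternative translations only

print("translated terms:", len(translated))
print("distinct terms:  ", len(distinct))
print("matching:        ", sorted(matching))
print("not matching:    ", sorted(noise))
```

Of the 16 distinct translated terms, 5 coincide with lemmas of the English summary, and the remaining 11 are alternative translations that do not appear in it.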

