Multilingual Information Retrieval Using English and Chinese Queries
Aitao Chen
School of Information Management and Systems University of California, Berkeley CLEF 2001 Workshop: 3-4 Sept, 2001, Darmstadt, Germany
Outline
• Overview of what we did at CLEF-2001
• German decompounding
• Chinese topics translation
• Merging strategies and alternative methods
• Conclusions
Participation in CLEF-2001
• Monolingual task (German and Spanish)
• Bilingual task (Chinese to English)
• Multilingual task (English and Chinese)
Overview of Multilingual Information Retrieval Using English Queries
[Diagram: the English query is translated by SYSTRAN and L&H into French, German, Italian, and Spanish. For each of the five languages, an IR system runs that language's query against that language's documents (English docs, French docs, German docs, Italian docs, Spanish docs). A merger then produces a combined ranked list of documents.]
Overview of Multilingual Information Retrieval Using Chinese Queries
[Diagram: the Chinese query is translated into English using a bilingual dictionary, parallel texts, and a search engine. SYSTRAN and L&H then translate the query into French, German, Italian, and Spanish. For each of the five languages, an IR system runs that language's query against that language's documents (English docs, French docs, German docs, Italian docs, Spanish docs). A merger then produces a combined ranked list of documents.]
German Decompounding Procedure
• Create a German base dictionary consisting of single words only (compounds are excluded).
• Decompose a compound into component words found in the German base dictionary.
• Choose the decomposition with the minimum number of component words.
• If more than one decomposition has the minimum number of component words, choose the one with the highest probability.
German Decompounding: Example 1
Compound: filmfestspiele (film festival)
1. Base dictionary: … film fest fests festspiele piele s …
2. Decompositions:
   1. film fest s piele
   2. film fest spiele
   3. film fests piele
   4. film festspiele
3. Result: filmfestspiele = film festspiele
German Decompounding: Example 2
Compound: hungerstreiks (hunger strike)
1. Base dictionary: erst hung hunger hungers hungerst reik reiks s streik streiks
2. Decompositions:
   Decomposition             log p(D)
   1. hung erst reik s       -55.2
   2. hung erst reiks        -38.0
   3. hunger streik s        -38.7
   4. hunger streiks         -21.4
   5. hungerst reik s        -52.1
   6. hungerst reiks         -34.9
3. Result: hungerstreiks = hunger streiks
German Decompounding: Probability of Decomposition
For a compound decomposed as $C = W_1 W_2 W_3 W_4$:
$$p(C) = p(W_1) \cdot p(W_2) \cdot p(W_3) \cdot p(W_4)$$
$$p(w) = \frac{\mathit{tfc}(w)}{\sum_{i=1}^{n} \mathit{tfc}(w_i)}$$
tfc(w) is the number of times word w occurs in a corpus. n is the number of unique words (including compounds) in a corpus.
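The procedure and the probability model can be sketched together in a few lines of Python. This is a minimal illustration, not the system's actual implementation: the function names, the word counts, and the corpus total are all hypothetical, and only the base-dictionary entries come from Example 2.

```python
import math

def decompositions(word, base_dict):
    """Return every split of `word` into words found in `base_dict`."""
    if not word:
        return [[]]
    results = []
    for i in range(1, len(word) + 1):
        head = word[:i]
        if head in base_dict:
            for tail in decompositions(word[i:], base_dict):
                results.append([head] + tail)
    return results

def log_prob(parts, tfc, total):
    """log p(C) = sum of log p(W_i), with p(w) = tfc(w) / total."""
    return sum(math.log(tfc[w] / total) for w in parts)

def decompound(word, tfc, total):
    """Pick the split with the fewest parts; break ties by probability."""
    candidates = decompositions(word, tfc)
    if not candidates:
        return [word]  # no split possible: keep the word as is
    fewest = min(len(c) for c in candidates)
    shortest = [c for c in candidates if len(c) == fewest]
    return max(shortest, key=lambda c: log_prob(c, tfc, total))

# Hypothetical corpus counts for the base dictionary of Example 2.
TFC = {"erst": 400, "hung": 3, "hunger": 50, "hungers": 8, "hungerst": 1,
       "reik": 1, "reiks": 1, "s": 900, "streik": 60, "streiks": 20}
TOTAL = 100_000  # hypothetical sum of tfc over all unique corpus words

result = decompound("hungerstreiks", TFC, TOTAL)
```

With these made-up counts, six decompositions are found, two of them have two components, and the tie is broken in favor of `hunger streiks` because both of its parts are more frequent than `hungerst` and `reiks`.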
Overview of Chinese to English Retrieval
[Diagram: Chinese topics go through preprocessing (segmentation, stopword removal, de-segmentation) to yield Chinese words. The words are translated with three translation resources: the LDC bilingual wordlist, a bilingual dictionary built from parallel texts, and a Chinese search engine, each followed by term selection. Term merging and weighting produces the English queries. For monolingual IR, the IR system runs the English queries against the English docs (in words) and returns the retrieval results.]
Chinese Topics Preprocessing: De-segmentation
Translation Resources: Creation of Bilingual Dictionary From Parallel Texts
• Parallel texts: Hong Kong news (4/98-4/2001) and FBIS Chinese collection.
• Document alignment: IR + LDC wordlist.
• Paragraph & sentence alignment: adapted from Gale and Church’s length-based model.
• Association measure: Dunning’s maximum likelihood ratio statistic.
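As an illustration of the association measure, Dunning's likelihood ratio statistic is commonly computed as G² over a 2x2 contingency table of co-occurrence counts. The sketch below assumes that form. The slides do not say how the table was built, so the function name, the cell layout (aligned sentence pairs containing both words, only one of them, or neither), and the counts are assumptions.

```python
import math

def log_likelihood_ratio(k11, k12, k21, k22):
    """G^2 statistic for a 2x2 contingency table.

    Assumed cell layout (hypothetical):
      k11: aligned pairs containing both the Chinese and the English word
      k12: pairs with the Chinese word only
      k21: pairs with the English word only
      k22: pairs with neither word
    """
    total = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2

# Made-up counts: a strongly associated pair and a weakly associated pair.
strong = log_likelihood_ratio(30, 5, 5, 960)
weak = log_likelihood_ratio(10, 25, 25, 940)
```

A higher score means the two words co-occur more often than independence would predict, so candidate translations can be ranked by this value.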
Term Translation Using Search Engine
Term Selection, Merging, and Weighting
(1) Top-3 English translations of Chinese word ‘C1’ from LDC wordlist, ranked by occurrence frequency in the LA Times collection: E1 1, E2 1, E3 1
(2) Top-2 English translations of Chinese word ‘C1’ from parallel texts, ranked by association weight: E3 1, E4 1
(3) Merged translations with counts: E1 1, E2 1, E3 2, E4 1
(4) Normalized weights: E1 .20, E2 .20, E3 .40, E4 .20
(5) Original query term frequency of C1: C1 2
(6) Final term weights for translations of C1: E1 .40, E2 .40, E3 .80, E4 .40
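The arithmetic in this example can be reproduced with a short Python sketch. The merging rule (sum the counts, normalize so they add up to 1, then scale by the query term frequency of C1) is inferred from the numbers on the slide rather than stated by it, and the function name is hypothetical.

```python
from collections import Counter

def translation_weights(candidate_lists, query_tf):
    """Merge translation candidates from several resources and weight them.

    Each resource contributes a count of 1 per selected translation.
    Counts are summed, normalized to sum to 1, and multiplied by the
    frequency of the source term in the original query.
    """
    counts = Counter()
    for translations in candidate_lists:
        counts.update(translations)
    total = sum(counts.values())
    return {term: query_tf * n / total for term, n in counts.items()}

ldc_top3 = ["E1", "E2", "E3"]   # step (1): from the LDC wordlist
parallel_top2 = ["E3", "E4"]    # step (2): from parallel texts
weights = translation_weights([ldc_top3, parallel_top2], query_tf=2)
```

E3 ends up with twice the weight of the other translations because both resources proposed it.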
Translation Resources Versus Chinese-to-English IR Performance
[Figure: precision on the vertical axis, 0.0 to 0.9.]
Conclusions
• German decompounding can significantly improve retrieval performance. Keeping only component words in the query works better than keeping both compounds and component words.
• A Chinese search engine is a valuable resource for translating Chinese proper nouns into English.
• Merging documents by adjusted probability of relevance works reasonably well.