Multilingual Information Retrieval Using English and Chinese Queries

Aitao Chen

School of Information Management and Systems, University of California, Berkeley
CLEF 2001 Workshop: 3-4 Sept, 2001, Darmstadt, Germany

Outline
• Overview of what we did at CLEF-2001
• German decompounding
• Chinese topics translation
• Merging strategies and alternative methods
• Conclusions

Participation in CLEF-2001
• Monolingual task (German and Spanish)
• Bilingual task (Chinese to English)
• Multilingual task (English and Chinese)

Overview of Multilingual Information Retrieval Using English Queries
[Diagram: the English query is translated by SYSTRAN and L&H into French, German, Italian, and Spanish. The English query and each translated query are run by an IR system against the document collection in the same language (English, French, German, Italian, and Spanish docs). A merger combines the five result lists into one combined ranked list of documents.]

Overview of Multilingual Information Retrieval Using Chinese Queries
[Diagram: the Chinese query is translated into English using a bilingual dict, parallel texts, and a search engine. The English query is then translated by SYSTRAN and L&H into French, German, Italian, and Spanish. Each query is run by an IR system against the document collection in the same language (English, French, German, Italian, and Spanish docs). A merger combines the five result lists into one combined ranked list of documents.]

German Decompounding Procedure
• Create a German base dictionary consisting of single words only (compounds are excluded).
• Decompose a compound into component words found in the German base dictionary.
• Choose the decomposition with the minimum number of component words.
• If more than one decomposition has the minimum number of component words, choose the one with the highest probability.

German Decompounding: Example 1
Compound: filmfestspiele (film festival)
1. Base dictionary: … film fest fests festspiele piele s …
2. Decompositions:
   1. film fest s piele
   2. film fest spiele
   3. film fests piele
   4. film festspiele
3. Result: filmfestspiele = film festspiele
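The enumeration in Example 1 can be sketched in a few lines of Python. This is only an illustration of the first three steps of the procedure, not the system's actual code: the function name `decompositions` is hypothetical, and the word "spiele" is added to the toy dictionary on the assumption that it sits in the elided part of the base dictionary, since decomposition 2 uses it.

```python
def decompositions(word, dictionary):
    """Return every way to split `word` into a sequence of dictionary words."""
    if not word:
        return [[]]  # one way to split the empty string: no components
    results = []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in dictionary:
            # Keep this prefix and recursively split the remainder.
            for tail in decompositions(word[end:], dictionary):
                results.append([head] + tail)
    return results


# Toy base dictionary taken from Example 1.
# "spiele" is an assumption (see the note above).
base = {"film", "fest", "fests", "festspiele", "piele", "s", "spiele"}

candidates = decompositions("filmfestspiele", base)

# Keep only the decompositions with the fewest component words.
fewest = min(len(c) for c in candidates)
shortest = [c for c in candidates if len(c) == fewest]

for c in candidates:
    print(" ".join(c))
print("chosen:", " ".join(shortest[0]))
```

Here a single decomposition has the minimum length, so no probability is needed to break a tie.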

German Decompounding: Example 2
Compound: hungerstreiks (hunger strike)
1. Base dictionary: erst hung hunger hungers hungerst reik reiks s streik streiks
2. Decompositions:
   No.  Decomposition        log p(D)
   1.   hung erst reik s     -55.2
   2.   hung erst reiks      -38.0
   3.   hunger streik s      -38.7
   4.   hunger streiks       -21.4
   5.   hungerst reik s      -52.1
   6.   hungerst reiks       -34.9
3. Result: hungerstreiks = hunger streiks

German Decompounding: Probability of Decomposition
For a decomposition $C = W_1 W_2 W_3 W_4$:

$$p(C) = p(W_1)\, p(W_2)\, p(W_3)\, p(W_4)$$

$$p(w) = \frac{tfc(w)}{\sum_{i=1}^{n} tfc(w_i)}$$

tfc(w) is the number of times word w occurs in a corpus. n is the number of unique words (including compounds) in a corpus.
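As a rough sketch of how the probability breaks ties, the snippet below scores the six decompositions of hungerstreiks from Example 2. The corpus counts in `tfc` are invented for illustration, so the scores will not match the log p(D) values on the slide. The use of natural logarithms and the helper names `log_prob` and `choose` are assumptions as well.

```python
import math


def log_prob(decomposition, tfc):
    """log p(C) = sum of log p(W_k), with p(w) = tfc(w) / sum of all tfc."""
    total = sum(tfc.values())
    return sum(math.log(tfc[w] / total) for w in decomposition)


def choose(decomps, tfc):
    """Fewest component words first; highest probability breaks ties."""
    fewest = min(len(d) for d in decomps)
    shortest = [d for d in decomps if len(d) == fewest]
    return max(shortest, key=lambda d: log_prob(d, tfc))


# Hypothetical corpus frequencies (NOT from the slides).
tfc = {
    "hung": 3, "erst": 400, "reik": 1, "reiks": 1, "s": 2,
    "hunger": 120, "streik": 90, "streiks": 60, "hungerst": 1,
}

# The six decompositions listed in Example 2.
decomps = [
    ["hung", "erst", "reik", "s"],
    ["hung", "erst", "reiks"],
    ["hunger", "streik", "s"],
    ["hunger", "streiks"],
    ["hungerst", "reik", "s"],
    ["hungerst", "reiks"],
]

best = choose(decomps, tfc)
print(" ".join(best))
```

Two decompositions have two components (hunger streiks and hungerst reiks), so the choice between them falls to the probability, as in the slide.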

German Decompounding: Failed Cases
1. erdatmosphäre = erde + atmosphäre (earth atmosphere)
2. mittagessenzeit = mittag essen zeit (noon meal time); expected: mittagessenzeit = mittagessen zeit (lunch time)
3. And others

German Decompounding and Monolingual Retrieval Performance
Each cell gives average precision, followed by (relevant retrieved/total relevant). Both runs use no stemming and no expansion.

Test collection (topics/docs)   -Decompounding       +Decompounding       Change
CLEF-2001 (49/225K)             .3673 (1877/2130)    .4314 (1949/2130)    +17.45%
CLEF-2000 (37/154K)             .3189 (673/821)      .4112 (770/821)      +28.94%
TREC-6/7/8 (73/252K)            .2993 (1907/2626)    .3368 (2172/2626)    +12.53%

Only component words of compounds are kept in the queries.
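The Change column is the relative improvement in average precision over the run without decompounding. A quick check of that arithmetic, with a hypothetical helper `relative_change`:

```python
def relative_change(before, after):
    """Relative improvement of `after` over `before`, in percent."""
    return (after - before) / before * 100


# Average precision without and with decompounding, from the table.
rows = {
    "CLEF-2001": (0.3673, 0.4314),
    "CLEF-2000": (0.3189, 0.4112),
    "TREC-6/7/8": (0.2993, 0.3368),
}

for name, (before, after) in rows.items():
    print(f"{name}: {relative_change(before, after):+.2f}%")
```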

German Monolingual Retrieval Performance
[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision) for runs BK2GGA1 (.4050), BK2GGA2 (.3551), and bk2gga1* (.4436).]
Features: +stemming, +decompounding, -expansion

Overview of Chinese to English Retrieval
[Diagram with four stages. Preprocessing: Chinese topics go through segmentation, stopwords removal, and de-segmentation to yield Chinese words. Translation resources: LDC bilingual wordlist, bilingual dict (parallel texts), and Chinese search engine. Term selection & weighting: term selection for each resource, followed by term merging & weighting to produce English queries. Monolingual IR: an IR system runs the English queries against English docs (in words) and returns retrieval results.]

Chinese Topics Preprocessing: De-segmentation

Translation Resources: Creation of Bilingual Dictionary From Parallel Texts
• Parallel texts: Hong Kong news (4/984/2001) and FBIS Chinese collection.
• Document alignment: IR + LDC wordlist.
• Paragraph & sentence alignment: adapted from Gale and Church’s length-based model.
• Association measure: Dunning’s maximum likelihood ratio statistic.
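The slide names Dunning's likelihood ratio statistic as the association measure but does not show the formula or how counts were collected. The sketch below uses the standard G² form over a 2x2 contingency table. Reading the four cells as counts of aligned sentence pairs (Chinese word present or absent, English word present or absent) is an assumption made for illustration, and `llr` is a hypothetical name.

```python
import math


def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G^2) for a 2x2 contingency table.

    Assumed cell meanings, counted over aligned sentence pairs:
      k11: both words present      k12: Chinese word only
      k21: English word only       k22: neither word present
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                g2 += o * math.log(o / expected)
    return 2.0 * g2


# Made-up counts: a pair that co-occurs often scores higher than
# a pair that co-occurs only a little more than chance.
strong = llr(50, 5, 5, 940)
weak = llr(10, 45, 45, 900)
print(strong > weak)
```

Candidate translations of a Chinese word can then be ranked by this score, which is one way to read "ranked by association weight" later in the deck.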

Term Translation Using Search Engine

Term Selection, Merging, and Weighting
(1) Top-3 English translations of Chinese word ‘C1’ from LDC wordlist. Translations are ranked by occurrence frequency in the LA Times collection. Each gets a count of 1: E1 1, E2 1, E3 1.
(2) Top-2 English translations of Chinese word ‘C1’ from parallel texts. Translations are ranked by association weight. Each gets a count of 1: E3 1, E4 1.
(3) Merged counts: E1 1, E2 1, E3 2, E4 1.
(4) Normalized weights (each count divided by the sum of counts): E1 .20, E2 .20, E3 .40, E4 .20.
(5) Original query term frequency of C1: 2.
(6) Final term weights for translations of C1 (normalized weight times query term frequency): E1 .40, E2 .40, E3 .80, E4 .40.
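The six steps can be reproduced with a short Python sketch. The function name `weight_translations` and its argument layout are hypothetical. The arithmetic follows the worked numbers on the slide: count how many resources propose each translation, normalize the counts to sum to one, then scale by the query term frequency of the Chinese word.

```python
from collections import Counter


def weight_translations(ldc_terms, parallel_terms, query_tf):
    """Merge translations from two resources and weight them."""
    # Steps (1)-(3): each resource contributes a count of 1 per term.
    merged = Counter(ldc_terms) + Counter(parallel_terms)
    total = sum(merged.values())
    # Steps (4)-(6): normalize, then multiply by the query term frequency.
    return {term: query_tf * count / total for term, count in merged.items()}


weights = weight_translations(
    ldc_terms=["E1", "E2", "E3"],   # top-3 from the LDC wordlist
    parallel_terms=["E3", "E4"],    # top-2 from parallel texts
    query_tf=2,                     # C1 occurs twice in the original query
)
for term in sorted(weights):
    print(term, f"{weights[term]:.2f}")
```

E3 ends up with twice the weight of the other terms because both resources propose it.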

Translation Resources Versus Chinese-to-English IR Performance
[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision) for LDC+HKF+YAHOO (.4112), LDC+HKF (.3599), LDC (.2679), HKF (.2675), and English Mono (.5553).]

Multilingual Information Retrieval: Merging Strategy
Each collection returns a ranked list of 1000 documents with weights, e.g. E1 … E1000 with weights e1 … e1000 for the English docs. The weights are adjusted as follows, where k is the rank:

Collection     Ranks 1-50    Ranks 51-1000
English docs   .8*ek + 1     .8*ek
French docs    fk + 1        fk
Italian docs   ik + 1        ik
German docs    gk + 1        gk
Spanish docs   sk + 1        sk

(1) combine lists; (2) sort by adjusted weight; (3) take top 1000 docs
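A minimal sketch of this merging strategy follows, assuming each run is a list of (document id, weight) pairs already sorted by decreasing weight. The function names, the `scales` dictionary, and the toy weights are all illustrative; only the .8 factor for English, the +1 bonus for the top 50, and the cut-off of 1000 come from the slide.

```python
def adjust(ranked, scale=1.0, bonus=1.0, top_k=50):
    """Scale every weight and add a bonus to the top_k documents."""
    return [
        (doc, scale * weight + (bonus if rank < top_k else 0.0))
        for rank, (doc, weight) in enumerate(ranked)
    ]


def merge(runs, scales, top_k=50, limit=1000):
    """(1) combine lists; (2) sort by adjusted weight; (3) take top docs."""
    combined = []
    for lang, ranked in runs.items():
        combined.extend(adjust(ranked, scale=scales.get(lang, 1.0), top_k=top_k))
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return combined[:limit]


# Toy runs with made-up weights; top_k=1 keeps the example small.
runs = {
    "english": [("E1", 0.9), ("E2", 0.5)],
    "french": [("F1", 0.6), ("F2", 0.55)],
}
merged = merge(runs, scales={"english": 0.8}, top_k=1)
print([doc for doc, _ in merged])
```

The bonus guarantees that the top documents of every collection rank above all lower-ranked documents, so no single collection can crowd the others out of the head of the merged list.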

Performance of Multilingual Information Retrieval Using English Long Queries
[Diagram: same pipeline as the overview for English queries (translation by SYSTRAN and L&H, one IR run per language, merger), annotated with average precision: English docs (.5553), French docs (.4776), German docs (.3789), Italian docs (.3934), Spanish docs (.4703), combined ranked list of documents (.3424).]

Performance of Multilingual Information Retrieval Using Chinese Long Queries
[Diagram: same pipeline as the overview for Chinese queries (bilingual dict, parallel texts, and search engine, then SYSTRAN and L&H, one IR run per language, merger), annotated with average precision: English docs (.4122), French docs (.2874), German docs (.2619), Italian docs (.2509), Spanish docs (.2942), combined ranked list of documents (.2217).]

Multilingual Information Retrieval: Alternative Merging Strategy
Each collection returns a ranked list of 1000 documents with weights, e.g. E1 … E1000 with weights e1 … e1000 for the English docs. Each weight is divided by the weight of the top-ranked document in the same list, where k is the rank:

Collection     Adjusted weight (ranks 1-1000)
English docs   ek/e1
French docs    fk/f1
Italian docs   ik/i1
German docs    gk/g1
Spanish docs   sk/s1

(1) combine lists; (2) sort by adjusted weight; (3) take top 1000 docs
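The normalized variant can be sketched the same way. As before, the data layout (lists of (document id, weight) pairs sorted by decreasing weight), the function names, and the toy weights are assumptions. The sketch also assumes every list is non-empty and has a positive top weight.

```python
def normalize(ranked):
    """Divide every weight by the weight of the top-ranked document."""
    top = ranked[0][1]
    return [(doc, weight / top) for doc, weight in ranked]


def merge_normalized(runs, limit=1000):
    """(1) combine lists; (2) sort by adjusted weight; (3) take top docs."""
    combined = [pair for ranked in runs.values() for pair in normalize(ranked)]
    combined.sort(key=lambda pair: pair[1], reverse=True)
    return combined[:limit]


# Toy runs with made-up weights.
runs = {
    "english": [("E1", 0.9), ("E2", 0.45)],
    "french": [("F1", 0.6), ("F2", 0.5)],
}
merged = merge_normalized(runs)
print([(doc, round(weight, 2)) for doc, weight in merged])
```

After normalization the top document of every list has weight 1, so the lists become comparable even when the raw weight ranges differ between collections.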

Multilingual Information Retrieval: Alternative Method 1
[Diagram: a translator turns the query into a single multilingual query (English, French, German, Italian, Spanish). One IR engine runs it against the multilingual document collection (English, French, German, Italian, Spanish) and returns a ranked list of docs in multiple languages.]

Multilingual Information Retrieval: Alternative Method 2
[Diagram: the original French, German, Italian, and Spanish documents are each passed through a translator to produce English translated documents; the English documents are used as they are. One IR engine runs the English query against all the English-language documents and returns a ranked list of docs in English.]

Multilingual Information Retrieval: Alternative Method 3
[Diagram: a translator turns the query into English, French, German, Italian, and Spanish, and an IR system runs each query against the documents in the same language. In addition, the French, German, Italian, and Spanish docs are each passed through a translator to produce English docs, which are also searched by an IR system. The results form a combined ranked list of documents.]

Performance of Different MLIR Methods
[Figure: precision-recall curves (x-axis: Recall, y-axis: Precision) for BK2MUEAA1 (.3424), NormalizedMerging (.3286), MLIR Alternative 1 (.3126), and MLIR Alternative 3 (.3648).]

Conclusions
• German decompounding can significantly improve retrieval performance. Keeping only component words in the query works better than keeping both compounds and component words.
• A Chinese search engine is a valuable resource for translating Chinese proper nouns into English.
• Merging documents by adjusted probability of relevance works reasonably well.