Document not found! Please try again

Across Languages: A Dictionary-Based Approach to ... - CiteSeerX

13 downloads 147 Views 1MB Size Report
multilingual domain. Therefore, we are able to leverage ...... transfer dictionary. Q53: rachat fmanc6 par emprunt. + leveraged buyout. Q55: dt%t d'initi6. ~ imider.
Across

Querying

Languages:

A Dictionary-Based

to Multilingual David

A.

Information

Rank 6 chemin

Retrieval

Gregory

Hull

Xerox

Grefenstette

Research

de Maupertuis,

Approach

Centre

38240

Meylan

France

{huU,grefen}@grenoble.rxrc.xerox.com

Abstract

We

demonstrate

feasible The

multilingual

will

need

information

to be able

boundaries.

This

particularly

in

the

translation. IR

system

transfer

lartguage lags

suggest

also

translated

although

mance.

From

primary

a series

of

some

and

that

crossThe

and

transla-

y in

restrict

our

IR

and

trans-

to

performance.

resources

vances

to more

overcome

tem

differences

will

become

such

and

network,

which

have

more

important Therefore,

language.

Beyond

and

performing

ing

for

we have tors

After tion

begun

We then queries

a series

to

language

in making our

we introduce

an

have

ad-

text

multilingual

Defining

important

pair

and

the

can

system

the

to

implementatwe

begin

are

for our future and the reasons than

we

try

multilingual

a simple

of

re-

Rather,

research,

directions of when

Multilingual

most

research. why the

numerical

Information

There

is no common

currently

lingual

information

retrieval

sets

in

ferent

At

Xerox,

used

fac-

information

to

results

Retrieval

In

this

problem.

lined

a broad retrieval

section, of the

previous

different

for has

(IR)

we will and

definition The

term

range

of

using

present

multilingual

authors Five

accepted (MLIR).

cover

information

descriptions by

the

past

to

languages.

search-

what

the

proaches

in

to explore

this

ideal

and

experiments.

used

retrieval

help

boundaries.

it

native

character

the

2

to explore

in their

provide

the

time

effective

from

components

two

suggesting

build

language

arise

which

that

the

ap-

or

more

a number

our

definitions

been

dMerent one

IR task

outline

multi-

that

own

for

of dif-

have

been

approach

MLIR

to

are

out-

below.

work.

discuss

describe and

written

to

as the

for

From

are more

by

are considerable.

requirements

find some examples

are

sys-

access,

wish

extended

of experiments

important

presenting and

are not

to

is not

a single

that

perfor-

we learn

problems. system,

a system.

and the failed

task

is a sig-

become

computer

information

identification,

will

Web

technological

and

searchers

accepting

across

systems

retrieval,

problem

future

information

are most

ret rieval

that

language

of the

interface, for

merely

Wide and

impeded

common

of documents

World

countries,

the

collections

systems

as the

more

these

to

basic

understand

system

conclude

we

there

results,

and

IR

this

problems

such

of the

experiments

for

that

and

multilingual

terminology

resolving

attention

the

of

error,

these

the

but and

is

dictionary

missing

multilingual

required

of the As Internet

in

sources

ion

Introduction

accessible

for

goal

understand

important

amblguit

methods

and

of

fully-functional

performance

most

ambiguity

Our

tools,

analysis

retrieval

bfingual

monolingual

a failure

sources

information

language

analysis

between

translation to

mrdtiliugual

a general

linguistic

gap

working

standard.

single

basic

nificant

important

learned

identification

to poor

is

queries

although

is the

system,

contributes

are most

monolingual

correct

some

are required

conducting

we have

future

language problem

we are

factors

terminology in the

Xerox,

is feasible,

that

IR

resources

Using

the

of the

across

classical

and

what

behind

of error

lation

system

IR

of multi-word

source

At

work.

multilingual

experiments

the

dictionary,

considerably

tion

of

to understand

making

system

documents

ax significant

a multilingual

a bilingual

1

extension

query

experiments

retrieval

retrieve

challenging,

to perform build

to

that

using

the

our first English

definition

of multilingual

several previous round document

basic research

informa-

approaches work

of experiments collection

in this using

to the

(1)

Et in any language

(2)

Et on a parallel document document collection where the query language.

(3)

Ill on queried

(4)

II% on a mnMlingual retrieve documents

(5)

Et on multilingurd documents, can be present in the individual

area. French

(TIPSTER).

Permission to make digital/hard copy of all part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. ACM 0-89791-792Switzer1and@1996 Zurich, SIGIR’96, 8/96/08.$3.50

ern

IR

to

Engfish.

collection the search

document in multiple

comes where

as the

systems

information

49

(1) 1995],

referred

than

a monolingual document in multiple languages.

Definition [Harman,

other

that

they

because

collection

collection, languages.

which

where

i.e. more than documents

the IR

multilingual

claim

retrieval

from the

or on a mukilingual space is restricted to

recent track.

A

in

are

capable

be

can

one language

conferences Spanish

number

are performing they

queries

TREC

experiments

can

of

are mod-

multilingual of handling

extended

character

text

retrieval

this

definition

sets

in

other

is (2),

is working

with

documents

in each

dependent the

language

are

document

the

the

and

in-

single

is entered, parallel

in any

and

languages

documents IR that

plied

effectively beginning

document

access MLIR

requires tirely

new

and

requires

rated

into

using

form

order

to

machine

ordy

and

the

entire

Therefore problems

et al.,

query

translation

relevant

can

that

it

is necessary

However,

suggests

[Davis issues

be possible

by relying that

& Dunning, which

This

Our

research

concentrates

First,

that

are both

reliable

test

document

For

non-existent

in

to leverage if we rely

lowing

and

the the

to

obtain

the is

a reasonDavis

with

such

multilingual

resources

are

research,

several

process. scores

different

a single

ranked

document

There

obt airted languages list

of

about

volves

the

ing

and

the

direct

will

translation

text

be comparable,

documents

requires

numis op-

from

the

body

availability

of current

language

translation

less than

entirely

tor

target

space

we attack

resent

generalis-

raises

language.

ation

of the

a number

weighting

language on

searnlessly

translation

setting

enough

the

of gen-

to make

Experiments

by

which

all

of its

th~

op-

by Radwan

consider

each

with

term

in

the

definitions

based this

term

word

possible

on the

vec-

extended

However,

issues

using

each

strategies

query.

Should

strat-

advantage

However,

the

evaluate

of important

strategies.

This

machine

we can

to

translated

struc-

text.

in

by

using

to take

products.

Retrieval

cart

in-

process

suspicions.

a process

model

ap-

is characterised

researcher

alternative,

is mapped

high-end generally

language

It

systems

these

di-

a machine

in translation

is dismal

confirm

the

the

natural

La-

language

source

translation

satisfactory.

robust

and

describes

represents

research

MT

eral

1994]

the

y of commercial

which

translation,

using

of

the

of

approaches

languages

source

allow

major

correspondence,

techniques.

tion

this

rep-

approach

respect

to

be weighted

term

accord-

ing to the number of translations? For example, a term with four translations may have its importance artificially inflated with

respect

to

a term

2 Latent Semantic a psrallel document

so cre research

3 Oard

1 A paratlel document documents are presented

the

a realistic

translation

of ambiguity

might

extensive

in the

docu-

use

generation

resolution

the

target

sophisticated

language

fol-

more

This

source

query

document

with

currently

from

machine

MLIR

the

query

since

we are

that

top

having

for

three

Text

system.

translation,

of the

on

without

translation

vector

(MT)

information for

term

the

or

vector

collections.

is no guarantee by

built

document

term

MLIR,

As a more

Definition (4) adds art additional layer of complexity to IR problem because of the languagedependent nature

to multilingual

later

the

collection,

of efficiency.

be

linearly

defines

Coindexing2.

to

Second,

docHow-

translating

package

is not

1994]

translation,

one

[Radwan,

in traditional

collection.

worry

to have

the

sting

to

relevance

Therefore, work

sev-

scarce

that

in the

Furthermore,

of interlingual

of mapping

proach

and

our

into

and

for

ordy

context

possible

in terms

(unless

trans-

(often much

easily

increase

which

et al.,

text

into

tural

results

of

previous test

above

numbers

of experimental first,

(3)

domain.

a monolingual

tradition

translation

Semantic

of

expanding

it is necessary

large

moment,

problem

similarity

tent

iru-

of the

iug

ment

call

egy

simplest

of the

they

not

is one

languages

[Oard

translation

on statistical

in this

costs

problem

rectly

the

results

the

process

for identifying

is

system.

storage

dynamically,

Oard

direct

by

entire

the

to why

document

between

software

trans-

performance.

realistic

the

of supported

to

being.

experimental

quantifiable,

off the extensive

upon

a long

on problem

in order

collection

judgments. able

redesign

IR

language

designed

short

document can

of

principle

provide

better

choice

more

in

is quite

to

module

existing

been

using

not

every

much

across

very

to

to MLIR,

or document has

be

so it

the

translation

already

to

lead

with

according

approach

query

does

or translating

performance

reasons.

a

minimize

tion).

of research.

eral

of

to

cooccurrence

is no reason

tend

which

seems

on the

be approached

process,

performed

ma-

alone

not

faced

documents

system

There

could

query

ber

retrieval

research

to be explored

to

order

documents

MLIR

translation

translation,

repr~

and

Our

words),

former

to

across

whether

information

1996].

needs

the The

tractable

to achieve

recent

such

in

rank

for

translation once

of ambiguity

primarily

the

when

problem

a human

optimal

level

ourselves

based

relevant

Queries

two

query

automatic

clear

or

ever,

(3-s)

translated

Information

is not

the

restrict

For our

mechtim

could

ap

be incorpo-

resolution

on

Def-

experiments.

documents.

translation.

sug-

an en-

generation

or even

It may

information.

Dunning

sufficient

syntactic

it

adds

systems

capture

some

instead.

of an

more

by

for

199s].

MLIR

measures

is required.

ument

researchers

IR

for

read

be avoided.

science

of performance

similarity

are

incorrect

on

be

access

as the be

of

199s]

base

also

of our

to

and

to

problem

for

work

define

is a far

queries

never

issue

documents.

level

or

the

descriptions

document

MLIR

be re-

can

tries

use query

any

systems.

knowledge

be combmed

We

Approaches

in queries

in

one

traditional

to happen

et al.,

(English/French)

Retrieval

lation

to mul-

to

pair

similarity

which

further

This

scores

requirements

boundaries,

of cross-language

like

to the support

with

1994]

an inexact

would

terms

lation

However, problem

translation,

need

associated

such

Basic

search.

et al.,

translation.

However,

of documents

use

3

a mul-

can

techniques

retrieval

as we

system

than

sent ations

full-text

an extensive

barriers.

problem

the

of query

that

resource

that

beginning

~oorhees

documents.

language

the

a single

Recent

Buckley

languages.

of complexity

the

language

problem

and

to address

to

complexity.

1995,

to

documents

MLIR

systems

problem

some

requires

statistical

and retrieving (4), a simple

be searched

still

is

of process-

move

definition

The

other

can

where

of their

et al., IR

to

are only The

(5).

modern

we can

in response

this

in ord”er

[Broglio

gests

(5) individual

is ordy

of their

collection

is capable

which

languages

are listed

Spanish

system

that problems

in that

covers

document

definition.

we can generalise

tilingual

the

retrieval

collection

in multiple

Finally,

IR

also

be searched

that

rmevious .

of its component

field

query

inition

system but

documents

definition can

the

strategies

merging

merging

monolingual

of

Information

the

of the

tiliugual

portant

the

only

in a number of diferent languages across language boundaries. With

extension

and

This

assumes

but

queries documents

able

that

as separate

Once and

into

perform version

collection

viewed

is fixed, , which

(3)

ing

[Oard

is assumed

are

to

expanded

languages.

monolingual,

chine

modilied

document

of the collection.

searched.

Definition

above

be An

it

language

collections

component

trieved

where

of interest

document

can

a multilingual

parts

language

and

languages.

extracted

collection is defined as one where the same in two or more languages.

useful

50

[Oard from

for

et

Indexing

a single

~andauer

translation & Lktman,

urdess

this

1990] applied

to

collection. al.,

1994]

a parallel

describing

with

expected aligned

that corpus,

our dictionary-based

the

term

but

we

approach

vectors find

also.

this

would concept

be

one

to many

Perhaps

mapping

translations more

of each

be much

mon

term

tions

are Iinked than

be weighted that

term

each

would

be

variety

more

methods

MLIR

Whale

direct

translation

that

derive

corpus. and

to

query

pelling

and

et

al.

techniques

which

uses

the

document which lection.

tested

obtain

[Davis

& Dunning, approaches

lection.

For in the

tences

in other

corpora

problem.

can

match

and

are

query

would

strongly

se.sociated using

and

corpora

for direct

line

(2)

Resource

requirements

are one allel

sen-

terms

sen-

indirect

main of

information

with

requires

that

additionrd

where the

researchers

resources,

the

query

same

and

depen-

the

character

and

multilingual

from

some

sets for

& Spitz,

1994,

cented

character

sets

are some

many

text

ized

when

much

more

these

Efficient veloped work

will

not

generalize stemmers

Stemmers

are being

not

been

have of

which guity

for

it

English

same

works

Spanish

which

can

a new

already

as well 1996]

be

used

been

built

have

but

translation de-

ties

for

of

this

the

training

not

but

the

It

other

is

create

than

tion,

high-

but

based

the

lin-

on

matching

the

gested

is particularly

conflates

word

so no additional

to

construct

in

8-10 most

a morphological

person/months, Western

and

tionary

European

in

1994]

Much

effort. for

requires

immense

models

of great not

the

IBM

not

from

aligned not

from

ordy

term of

with

cau-

and

they

work

been

approaches,

have

1993]

compelling

domains

The

quansophis-

distribution

be applied

has

vector

probabti-

is extremely

simpler

Dunning

term

et al.,

to other

terms

and

two

have

languages.

51

resources,

are more similar

is not the

IBM

bilingual

prevalent

is de-

tested such

been

on

as the

sentences

yet

sug-

proven

to

makes

the

former

task

our a sense,

tides

but

coverage

experiments one The

can

use consider

transfer

of the

dic-

of sufficient

a transfer

required

as complementary. shallow

creating effort

approaches broad

While

language

texts

the

Therefore, In

general parallel

process,

approach

dictionary.

than

coverage.

a simple

comparison.

transfer

analyzer tools

problem.

that dictio-

be used

of context.

associated

vocabulary

translation

and

gen-

definitions

a bilingual

must

generalize

translation

Davis

of

model

probabiLities not

Bilingual

generate

approach

independent

size to have

algorithms improvements

[Chanod,

They

dictionary

verbose

et al. [Brown

accurately

This

machine

these

tionaries

ment

methodology

statistical

of each

transfer

corresponding

which

may

of by

Of

ambi-

1996]. a

link

also

for

trans-

translations

is a non-trivial can

strategy. but

vector

be effective.

forms

of IR perfor-

stemming similar

MLIR

Converting

of Brown

this

translation

for

amounts

IR.

and

corpus.

as they

signed

most

Xerox

each

are generated

English.

languages

work

vectors

been

most

to

The

large

tend

to be adapted term

more

read-

are designed

thesaurus).

so accurately

text

epitomizes

direct

contain for

doing

of training

IBM

is

the

a bilingual

collections

but

need

as a bilingual

dictionary

text

tication.

capital-

process.

In terms

provides

tities

In

approach

solution

as traditional and

and Our

to a transfer

translation,

but

called

be suitable

Parallel

ac-

surface,

not

nary

for

be addressed.

approach

root,

would

recognition

Fortunately,

translation.

sup bene-

Support

on the

as it ordy

et al.,

for

including

other

This

developed

language

examples

is

may

English

alternative

inflectionrd before

[Hearst has

for

translation,

[Hull,

Xerox for

the

and

that

required

tested.

an

dictionaries

languages

analysis.

is introduced

mance, in

developed

for term have

for

language

sometimes

ordy

is

they

they For

of

be required.

rdthough

thus

field

in machine

available, and

define

(also

languages.

be

needs

we will

eral

years,

other

developed

extensively

inflectional

valuable

to would

system

to

stemming

for

which

in

must

twenty-five

the

entry,

in the

may

systems.

be an

as a sur-

t rauslations

dictionaries

in mind

able

accents.

algorithms

past

lation,

available

the

previ-

can

be used

Unfortunately,

retrieval

both

language

the

readers

by automatic

be

is inconsistent,

their

to obtain.

for

corpora

effort

more

Parbetween

in the

of extracting

language

are

must

collections

that

during

investment

performance

use

lose

problems

the

what

guists

issues

number

are not

manual

and

returned

simple

accenting

stemming

over

known

have

important

general are more

human

use

and

mapping.

as a reference

can

are

dictionaries

described and

process

approaches,

1995].

seem

difiicnlt

form

or

in-

systems

vector

monolingual

corpora the

(3)

relationships

strategies

terminology

parallel

although

term

ap sys-

monolingual

translation

translation

for

and/or

extract

Domain-specific

Bilingual

language

automatic

may

sometimes

to handle

each

Grefenstette,

collections,

letters

of

vector

to

each

the

translation

Bilingual

for

used

for on

dictionaries,

texts

Machine

translation

source

rogate

do-

be

transfer

translation.

simplest

system

document

facilities

[Sibun there

MLIR

btigual

resources

a machine

or

of definitions

term

interest,

or indirect

a significant

the

documents

The

handle

for

recogni-

in

Depending

parallel

query

query

ous section.

techniques.

multilingual

subsequent

phrase

entries

extensive

(1)

from

to be expensive

the

obtain

even the

language.

ported fit

to

term

consideration. thesauri

can

important

MLIR

retrieval

Genosse, stemming,

in any

as noun

finding

corpora.

source for

with Extending

for

include

direct

corpora

of research.

for

may

bilingual

for

able 4

feminine

Gartner,

step

3rd word

composed

principled

such

requires

derived

useful

training

context

by other

as the the

a

word

German

as

Wein,

is a necessary

under

domain-specific

col-

against

in matching

be missed

parallel

to be a promising

query

domain

se-

alter-

terms

analyzed

providing

processing,

translation

formation

Dunning

document

is

The

of

for

French

analyzed

words

to

crucial

pair this

tem, col-

been

the

of joindre.

of the

is also

speech form

Genosse(n)#schajt

addition analysis

Query preach,

yet

example,

ften

of

dictionaries.

a parallel

and

translation

to capture

that

not

It

language

interesting

the

find

that

Indirect

exploiting

translation

several a parallel

tion.

com-

the

In

language

Indexing,

of

Davis

use

one

be able

has

also

natural

representations

languages

examine

languages

relationships

We consider

technique

language

topic. may

this 1996]

same

vector

the

MLIR

which

example,

tences

query

the

of

form

agglutination

schaft.

parts

a normalized

For

Wein#Gurtner#

morphological

1990]

very

Semantic

term all

most

possible

provide

is morphologically

preterite

noun

of the

a training

present

decomposition

across

the

& Littman,

1991]

(joined) plural

the

and

forms.

Weingar-tnergenossenscha

methods

using

Latent

value

to

for

native

dent

on

Unfortunately,

riously

the

based

like

surface

plural

should

consider

[Landauer

et al.,

comparable

seems

also

indirectly

Llttman

singular

query can

[Evans

collection are

one

translations

Landauer Evans

of the

MLIR,

a text

ques-

and

approach

of

person

be considered. viable

recognize

in

What

These

for

token

joignaient

com-

higher?

information?

analyzers

all

of a term

Should

proportionally this

These

where

translations

others.

corpus-driven

weights.

model

by disjunction

some

common

do we use to obtain

suggest

in the

Boolean

Furthermore,

more

translations

resources

for

or weighted

appropriate.

will

is accounted

an extended

seem

simple

a btigual these

dictionary

language.

dic-

to imple-

two pr~

AU major

words

in the

ology

language

is missing

able.

On

but

the

deep

single

hand,

parallel

technical

MLIR

of these

results

on a

query

will

lowing

future

with

resources.

Previous

Experiments

in

against

is

the

a long

context

[Salton, the

of

1970]

early

form

oped

based systems

most,

reasons

translated

these

thesaurus ten

not large

much

systems

fully

automated),

modern

more

dynamic boolean

siderable

skill

robust

and

uments more

to

be utilized

information review

[Oard

Experimental

on

systems

the

Christian

Fluhr

developed

a ranked

boolean

compound,

French wan the

and

and

and

Fluhr

Cranfield the

- 0.345,

MLIR

dictionaries TRAN)

using

-0.215, These

vector

translation

tion.

Davis of

and

a Spanish

per

El

line

has

system, their

yet

been

worse

than

the

appears to

build

without

in the

dictionary

construction

Xerox

Multilingual much

the

4 Much

IR same of

http://www.newc-tle.

thk

Davis

entirely

effectively

The

an

considerable

systems fashion

tunately, which

from

the

IR

nunciation, not

intervention

EMIR

systems, European

Although allowed

in

Project:

rese~ch.ec.org/espayn/text/53l2.html.

coherent

our translation

and

termi-

language

French

dictionary

was

which

involved automatic

formatted these

creating In

En-

Unfor-

a fully

in

of sections

examples,

assignments.

~

a word-based

information.

nor

filtering

or correct,

to

retrieval,

of excess a simple

and

that trans-

one

some

SGML,

such

as pro-

markings

were

source

cases

it

of errors was

possi-

ble to detect these marking label for manual treatment

errors. For example, we would a.uy translation containing per-

sonal

pronouns

such

we.

were

manually

treated

for

and

complex

due

to

example,

52

etymology,

always

pecially

using

the

to lan-

short

and

of the

1994)

for text

for automated

with words)

acronyms

btigual

Hachette,

neither

likely

native

removed.

amounts

was

transqueries

more

to work

majority

been

suitable

of large this

process.

per-

automatically

removal

real

experiments

some

on-line

(Oxford

dictionary

to

Approach be evaluated

dictionary

TREC

While

query

of seven

be

a more

with

in their

length

may

most

is even

we decided

overwhehnhg an

contrast

that

French. has

the

gener-

issue,

previous

the

this

to attain

this

to obtain

that

working

(average

translated

was able

associated

not

most FDIC,

translation.

While

we wish

and

are

not

unique

system

therefore

Iran-Contra,

are

one

recognized

evidence

also

MLIR

domain,

converted

to

No one

for

into

has

was

Bush,

of performance.

We transfer

that

created

remain,

glish

base

Apache)

independent

in

as monolingual is

newspa-

(Reagan,

AH-64

long,

he queries

them

nology

process.

can

been

lated

tried

queries

Dunning

human

had

transla-

OPEC,

names

due fields

terminology,

II,

words this

oft

Concept

of specific

SALT

problems

To address

sufficient

system

a few

the

example,

have

guage.

translated

In

and

ordy

of the

versions

automatically.

automatic

Experimental

work

MLIR.

are

term

suggests be

Researchers

that

monolingual

not

picture

at 10-90%

have

their

general

queries

ret rieval

For

a technical

lation.

prelimi-

translator.

levels

users

Mexican of

for

amounts

the

high

some

original

human

information,

cssw when

English

research

will for

that

forms

6

Their

alone

translations

able

(the all

performance it

query

NMSU on

that

1996].

and

transfer

machine

from

methods

found

strategies

acceptable

Fluhr’s ated

far

& Dunning,

learning

provide

Dunning

this

and the

information

terminology

We transla-

and

Abrams,

be the

evidence

t hart

technical

most

query

dictionary.

that

OSHA,

or trans-

translator.

translation

proper

and

reasonable

(SYS-

precision

and

were

good

LBO,

M-1

unrealistically

system

translation

strong

(GATT,

Wall

Mercury

documents

with

In particular,

large

no

by the

LAN)

Using

content.

has

Toshiba,

clear

the

Jose

queries

queries

multilingual

cent ain

in English

NRA,

using

and

and

which

acronyms

Rad-

queries

translation

collection

have

performed

[Davis

corpus

Ted learning

queries

on EngIish-

monolingual

length

for

it

the

collec-

component

from

a million

transfer of

made

suitable

of

bilingual

experiments

not

much

ex-

a bilingual

experiments

news

San

model,

English

text

experiments

by a professional

examination

of the

full

by

the

the

half

translation

and

TREC

50 selected

externally

methodol-

queries

articles

and

The

vector

careful

and

have

retneva14. French

effective

generated

to their

notable

uses

French

of

to roughly

strategy.

well-tested

recent

to the perfordescribed

of this

to use only

newswire,

by the

relative

the TIPSTER

consists

of text.

term

For

to

INSTN

to work

by average

more

document and

two at

machine

provide

and

corpus

Norte) sets

vector

tions

were

com-

MLIR

which

244]:

using

experiments

Mark

a variety

query

term MLIR

is

with

translated

as measured

recall.

of

colleagues

~.

amounted GB

into

chose

AP

1.6 the

from

which

and

nary

doc-

the

News,

left

conducted

results

-0.270,

Oard’s

dictionaries

and

following

research

indeed,

1994]

collection

obtained

retrieval

cross-language

[Radwan,

rank

51-100

about used

and used

language. by human

experiments

French

fol-

queries,

be compared

of the

the

experiments

with

a single

variant

translated

We

Journal,

lated

All

the

are retranslated

sense

use some

if the

to

language

then

well-motivated

queries

in

the the

system

rise

Start

another can

system.

with

collection,

A

be less

a good

1995].

the

con-

to consult

system

idiom

based

to

can

a

1996].

and

French-German

in

techniques.

application

is sparse

ceptions. term,

text

& Dorr,

work

retrieval

which

are encouraged

feasible

get

Our

and

Street

in

require

tend

matching

on multilingual

not

also

of

is of-

language

and

systems

effort

is generally

effectively

than

fore-

(which

use

systems

statistical

readers

prehensive

these

and

is just”

which

As retrieval

model,

full-text

applications,

text

fashion.

less effective

using

that

collections

for

to

this

[Harman,

vocabu-

human

indexing

tion

devel-

results

section

MLIR.

queries

the

we worked

documents.

of

into

translated

MLIR

Following

retrieval

First

considerable

something

text

a simple

1996].

document

been

text

of the previous

ogy,

a

number

controlled

modern

mance

gives

judgments

and

queries

in the

per-

given

have

for

in

systems

applications

& Dorr,

and

Salton

counterparts

However,

require

construction

for

retrieval

These

system,

in

in

to compare

of the

This

to evaluation relevance

However, desire

performance

translated.

and

original

in

demonstrated

A large

ideal

[Oard

retrieval

thesaurus.

approach.

are less than of

text

commercial

on this

a number

1972]

single-language btigual

and

text systems.

[Pevzner,

cross-language

as their

constructed

lary

multilingual

judgments. is a strong

optimal

the queries

translators.

vocabulary

Pevzner

70’s that

governmental

of

controlled

and

as well

carefully

on

history

the

approach

MLIR There

there

perfectly

documents,

MLIR

relevance

context,

were

Have

5

known

multilingual

narrow

concentrate of the

queries

avail-

provide

they

system

age of both

termin-

are not

corpora

when

ideal

a.dvant

most

probabtities

particularly The

to take

but

translation

other

coverage,

domain.

want

are defined

and

bugs

common that in

the

automatic

as I or based words,

the

462

on the

or

extraction

filtering

misplaced of

34000

filter.

dictionary

automatic

filter

of the

this

definitions

Sometimes, entry

es-

was so long

would

fail,

either

SGML

markers.

For

translations

of the

com-

English:

politically

French:

motivated

troubles

Term

vector

– turmoil

– civil

discord

civilian – character

politique

– political

Table

1:

disturbances

mation

politique

An

unrest

disturbance

the

disorder

cult

diplomatic of the

politician query

policy

translation

process

for

the

ously

discovered

In

French

success,

word

break,

oneself,

lay,

prendre(

catch, idea,

one ‘s, arm,

initions

as they

of the

would

entries

translations.

pick,

but

a serious

common terms

removed

from

lection,

or more each

A further use

and

translation are not

clues

These and

example,

the

ation,

still

ister.

loss.

alents.

left

perhaps

the

terms

ject

that

more

The placed each

root

the

translations. are

translated

query

This

issues

ing

is the

ambiguity

serve

Table

These the

final

using function

the to

term

weights

are

whale

documents

the

unique

translated

for

a sample

form

warse

guity

introduced

link

term

model.

and

document by

The IR since resolv-

is designed

the

1985],

which

query

and

to

upon

this

SMART

in-

of

(particularly

would

queries

unit

square-

IDF inner

factor Docuproduct

expect ones)

to errors

process.

preckion ss the

only

3rd (nat

online)

5

may

rachat

Q55:

dt%t

d’initi6

fmanc6

Q56:

taux

d’emprunt

Q58:

chemin

par

imider

coup

d’6tat

Q64:

prise

d’otage

Q65:

syst~me

trieval

system

terms

used

shart

with

some

and

same

terms

to look

Collins

work.

French-

In for

hindsight, autamatic

inconsistency

may

in more

in the

be mare

definitions

campre for

is generated

some from

dictionary.

+

leveraged

buyout

trading

pr6f6rentiels

=+

coup

~

+

prime

lending

rate

d’etat

hostage-taking

de recherche

Q66:

hmgage

Q70:

m.?me porteuse

Q82:

g6nie

naturel

g.h$tique brut

* ~

~

genetic

p6trole

gas naturel

~

Q96:

programme

informatique of the

crude natural

s

information

re-

language

surrogate

Q88:

A list

do-entaire

natural

+

Q90:

2:

is

version

de fer =$_ railroad

Q62:

query

definition

query

representation

emprunt

~

the have

are

this

transfer

in

entries

the

dictionaries

cause

resulting query

constructed

Q53:

for

the

dictionary

a clean

unique

one dictionary

another,

manually

300

Robert

dWerent

because

build

for

about

the

to

of

described

correct

also

queries

edition

usin

second

the to

the

uses

dictionary

manually

are

translation

than

many

queries

versions

mother engineering

oil gas

important

=+

computer multi-word

program expressions

in

sample

specific

in general will

language

use

English

pracess

sometimes

Since

the

largely The

the

judgments.

suggest

each

retrieved

version

we decided

that

terms.

Table

is evaluated

relevance

due

Query

length.

of their

MLIR

short

by the translation

the

performance

to known

One

apply

dacu-

frequencies. traditional

order

measures each

col-

necessitate

to

diiferent

the

that

and

dictionary

hensive this

inflectional

version

We

the

and

list

original

query.

irnprave

to have

to the query

the

the

space

in descending ranked

a

transfer

and

in

original

by

dictionary

used

realized

results,

term

to MLIR. It

will

[Buckley,

evaluation.

than

ignored.

between

multiplied

vs. manual

query.

characteristics

queries

acmroach

efforts

are normalized

respect

strategies

are

is re-

monolingual to their

probably

choose

by

first

Given

there

We

English

the

The

experiment.

we

all

the

term-weighting

future

vectar

are ranked

by comparing

normalized

Dossible

system

the

to a traditional

use a moditied

of the

with

sent

and

retrieval

The

to

unchanged

translation

ment

score

from

to ;peci&zed

root

ments

missing

transfer

the

First,

builds

of

with

document

documents

generated

terms

up.

up

compares

section.

duplication,

laaks

and

are

1 shows

strength

dictionary

20

automatically

an ob-

term

system

which

experiments

formation

the

concatenation

simrdest

in

each

the

are also

as a baseline,

model.

Second,

transfer

is then

relating

and

and

these

Results

completely,

of the in

that

steps.

taking

passed

Documents

roots. all

by

Terms

dictionary

root.

bilingual

query

basic

analyzed

inflectional

in

a dictionary.

of three

missed

considerable

than

a thesaurus

15,

dictionary.

extraneous

reg-

expu~ion,

created

10,

retranslation

previous

stop-

is evident

add

we

for

measure.

transfer

equiv-

filtering

document

to doc-

is working

a multilingual

reasons,

one

that

user

would

resort

relevant

Note

the

relevant

5,

constructed

the

striking,

it

goal.

that

a an

relevant

can

mare

least

experiment

three

mdiof

out

radiation,

setting, medicine

consists

is morphologically

translated

system.

and automatic

process

by its

retain

Our

to: loss

aa translation

final

find

be

a few

system

feedback

Experimental

Our

For

radiation,

an IR

license,

resembles

MLIR

query

only

7

of the

tiltered

at

evaluation

but

heart

register,

to

in

shauld

Once

the

read

documents

precision

system.

For

these

a particularly ta quickly

of

relevance For

im-

previ-

word

filtering.

Filtering

medicine

In

results.

was the

ezpulsion,

one-wauld

register,

in the

automatic

from

rwactice.

disbarring.

to the

the

disbarring.

disbarring,

Ideally,

and

where

mdiation

medicine,

license.

how

and

at

which

that

word,

embedded

ofl

words

high

is the

be

lots

feedback

recall

boundaries. from

be expected

collected,

collection.

good

averaged

the

cannot

MLIR

the dMi-

set,ting.

we are assuming

of interest.

was

on

head

after

word

striking

to pmctice

words

are often

system

dictionary

about

reappear

French

ezpulsion,

license

the

clues thus

our

elaborate

contextual

of the user

for

which

introducing

to a human

definition

error

definitions

translations

is used.

noise

pernicious

of ~he ~ord, proper

rather

more

of encvcloDedic

high

obtaining

defi-

nition. the

if

a monolingual

transfer

ten

an

been

language

will

help in

more

techniques,

translated)

relevance

experiments,

clean-

521 of the had

uments

def-

for

have

(with

much

information

Therefore,

goal

cross

feedback documents

user

user

appears

becomes

multilingual

have

language.

monolingual

put,

noisy

manual

words)

were

client,

such

the (or

of the which

incorporating

in the

evaluate

documents

jind,

task must

relevant

tool

impartant

including

thicken,

We left

be beneficial.

(mostly

Duplicate

words

take,

handle,

customer.

produced,

dicionary

dictionary

stiflen,

charge,

waist,

were

gave23

sink,

bring,

accent, ing

totake)

set,

This

by

ability

vocabulary

relevance

query

addition,

and

on the

find

terminology that

prove

foreign mon

the

suggests

important

Q67

to

of interest.

when

This

depends

system)

documeuts

nature

example

query

trouble

retrieval

from

courteous

caract.?xe

short

civil

& caract&e

retranslation:

trouble civil

civils

tend and

Success

that

5 We remind the reader that in thk context placing a source language term with all entries tionary. No lexical amliguity is resolved.

to perambi-

in infor-

53

translation means r~ from the transfer dic-

During

the

the

translation

ual

unit

of

ated

transfer Therefore,

(MWE)

query,

2 for

tion.

of these

expressions.

(a)

queries

original

based

transfer transfer

ficial,

in the

dictionary, sense for

that

this

a diferent

ations

of

the

current

averaged

level

over

this

built

and

(d)

without

to help

not

by

sample

out

I

word-based I transfer

0.393 Table

3:

three

Average for

presented

perform

that

the

real

Test

in

to the guages

as it

can

sufficiently makes

in

There

biggest are

results.

word

transfer

Building

had

dictionary

a similar

guage

would

be

of Table

mance

represents

attainable.

features,

such

performance are scored

baseline

best

which IR

query

performance.

has none term

of these matching

to

lancol-

ways

better

in which

such suited

we might

as adopting for

short

be able

term-weighting

to improve

the

number

diet

trans

Manual multi-word

dlct

trans

diet

12

9

9

12

8

3

0.0

original

of the (Orig)

of queries

performance

English in each

from

of the

queries.

translated

Values

given

are

category.

having

improvement

is applied

difference

at least

to

in performance

the

dictionary

is reflected

as more

construction

primarily

in queries

no

relevant

documents

ranked

in

some

relevant

documents

scoring

well,

Recovering better

at

than

feedback

queries

least

one

finding

becomes

relevant

none,

a viable much

better

to gain

a better

understanding

top

20

which

is

document

option.

perform

promoving

the

because

which

with

which

did

based

transfer

worse

(i.e.

is

monolingual There

are even

in their

translated

the are

English are

queries.

to

to

suffered

in but

information

This

is in

not

noun produce

phrase

our given

phrase

translation

multilingual in Table

on a word

the it

by

happens the

experiments. 2 lose word

vital basis.

from

one

that

single

is often

to

pairs

content

suc-

generally perfor-

components in translation,

lost.

import

half

to

monolingual

in average

enough

most

in

recognizing

crucial

the individual meanings

About semantic

is

gains

a loss

17 because

problem.

or word

often

extraneous

than

contrast

phrases

the

of

th~

dramatic

entire

but

4 suffered is greater

distinct

always

sense

(i.e.

expressions

so the

case,

8 had

translation

demonstrate

is that diferent

wordfailure of the

more

mance. The key difference of phrases often have very

manual

a detailed correctly,

total

results

the

17 queries

as a result

and

the

problems

the

expressions in

from

identifying does

9 lost

multi-word

MLIR.

where

performed

query), that

experimental

cess

using

and

amblgnity

Note

translating

rect

54

translated

that

of the

we selected

multi-word

added

and IR,

when

translate

queries

helps

translation,

dictionary

due

Our

or

query

We found to

some

Were

that

This

definitions

originrd

only).

cess.

problems

many

features

a steady

effort

retranslation.

expansion

the

Manual word-based

Comparison

analysis.

currently

have

strategies

4: and

failure

perfor-

such techniques applied, the gap in performance between original and translated queries could well increase. There baseline,

acgreater

10

=

asociated

results.

Our

Automatic word-based

Orig

In order

final

is not

translation

precision

26

Suggest Documents