multilingual domain. Therefore, we are able to leverage ...... transfer dictionary. Q53: rachat fmanc6 par emprunt. + leveraged buyout. Q55: dt%t d'initi6. ~ imider.
Across
Querying
Languages:
A Dictionary-Based
to Multilingual David
A.
Information
Rank 6 chemin
Retrieval
Gregory
Hull
Xerox
Grefenstette
Research
de Maupertuis,
Approach
Centre
38240
Meylan
France
{huU,grefen}@grenoble.rxrc.xerox.com
Abstract
We
demonstrate
feasible The
multilingual
will
need
information
to be able
boundaries.
This
particularly
in
the
translation. IR
system
transfer
lartguage lags
suggest
also
translated
although
mance.
From
primary
a series
of
some
and
that
crossThe
and
transla-
y in
restrict
our
IR
and
trans-
to
performance.
resources
vances
to more
overcome
tem
differences
will
become
such
and
network,
which
have
more
important Therefore,
language.
Beyond
and
performing
ing
for
we have tors
After tion
begun
We then queries
a series
to
language
in making our
we introduce
an
have
ad-
text
multilingual
Defining
important
pair
and
the
can
system
the
to
implementatwe
begin
are
for our future and the reasons than
we
try
multilingual
a simple
of
re-
Rather,
research,
directions of when
Multilingual
most
research. why the
numerical
Information
There
is no common
currently
lingual
information
retrieval
sets
in
ferent
At
Xerox,
used
fac-
information
to
results
Retrieval
In
this
problem.
lined
a broad retrieval
section, of the
previous
different
for has
(IR)
we will and
definition The
term
range
of
using
present
multilingual
authors Five
accepted (MLIR).
cover
information
descriptions by
the
past
to
languages.
search-
what
the
proaches
in
to explore
this
ideal
and
experiments.
used
retrieval
help
boundaries.
it
native
character
the
2
to explore
in their
provide
the
time
effective
from
components
two
suggesting
build
language
arise
which
that
the
ap-
or
more
a number
our
definitions
been
dMerent one
IR task
outline
multi-
that
own
for
of dif-
have
been
approach
MLIR
to
are
out-
below.
work.
discuss
describe and
written
to
as the
for
From
are more
by
are considerable.
requirements
find some examples
are
sys-
access,
wish
extended
of experiments
important
presenting and
are not
to
is not
a single
that
perfor-
we learn
problems. system,
a system.
and the failed
task
is a sig-
become
computer
information
identification,
will
Web
technological
and
searchers
accepting
across
systems
retrieval,
problem
future
information
are most
ret rieval
that
language
of the
interface, for
merely
Wide and
impeded
common
of documents
World
countries,
the
collections
systems
as the
more
these
to
basic
understand
system
conclude
we
there
results,
and
IR
this
problems
such
of the
experiments
for
that
and
multilingual
terminology
resolving
attention
the
of
error,
these
the
but and
is
dictionary
missing
multilingual
required
of the As Internet
in
sources
ion
Introduction
accessible
for
goal
understand
important
amblguit
methods
and
of
fully-functional
performance
most
ambiguity
Our
tools,
analysis
retrieval
bfingual
monolingual
a failure
sources
information
language
analysis
between
translation to
mrdtiliugual
a general
linguistic
gap
working
standard.
single
basic
nificant
important
learned
identification
to poor
is
queries
although
is the
system,
contributes
are most
monolingual
correct
some
are required
conducting
we have
future
language problem
we are
factors
terminology in the
Xerox,
is feasible,
that
IR
resources
Using
the
of the
across
classical
and
what
behind
of error
lation
system
IR
of multi-word
source
At
work.
multilingual
experiments
the
dictionary,
considerably
tion
of
to understand
making
system
documents
ax significant
a multilingual
a bilingual
1
extension
query
experiments
retrieval
retrieve
challenging,
to perform build
to
that
using
the
our first English
definition
of multilingual
several previous round document
basic research
informa-
approaches work
of experiments collection
in this using
to the
(1)
Et in any language
(2)
Et on a parallel document document collection where the query language.
(3)
Ill on queried
(4)
II% on a mnMlingual retrieve documents
(5)
Et on multilingurd documents, can be present in the individual
area. French
(TIPSTER).
Permission to make digital/hard copy of all part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication and its date appear, and notice is given that copying is by permission of ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or fee. ACM 0-89791-792Switzer1and@1996 Zurich, SIGIR’96, 8/96/08.$3.50
ern
IR
to
Engfish.
collection the search
document in multiple
comes where
as the
systems
information
49
(1) 1995],
referred
than
a monolingual document in multiple languages.
Definition [Harman,
other
that
they
because
collection
collection, languages.
which
where
i.e. more than documents
the IR
multilingual
claim
retrieval
from the
or on a mukilingual space is restricted to
recent track.
A
in
are
capable
be
can
one language
conferences Spanish
number
are performing they
queries
TREC
experiments
can
of
are mod-
multilingual of handling
extended
character
text
retrieval
this
definition
sets
in
other
is (2),
is working
with
documents
in each
dependent the
language
are
document
the
the
and
in-
single
is entered, parallel
in any
and
languages
documents IR that
plied
effectively beginning
document
access MLIR
requires tirely
new
and
requires
rated
into
using
form
order
to
machine
ordy
and
the
entire
Therefore problems
et al.,
query
translation
relevant
can
that
it
is necessary
However,
suggests
[Davis issues
be possible
by relying that
& Dunning, which
This
Our
research
concentrates
First,
that
are both
reliable
test
document
For
non-existent
in
to leverage if we rely
lowing
and
the the
to
obtain
the is
a reasonDavis
with
such
multilingual
resources
are
research,
several
process. scores
different
a single
ranked
document
There
obt airted languages list
of
about
volves
the
ing
and
the
direct
will
translation
text
be comparable,
documents
requires
numis op-
from
the
body
availability
of current
language
translation
less than
entirely
tor
target
space
we attack
resent
generalis-
raises
language.
ation
of the
a number
weighting
language on
searnlessly
translation
setting
enough
the
of gen-
to make
Experiments
by
which
all
of its
th~
op-
by Radwan
consider
each
with
term
in
the
definitions
based this
term
word
possible
on the
vec-
extended
However,
issues
using
each
strategies
query.
Should
strat-
advantage
However,
the
evaluate
of important
strategies.
This
machine
we can
to
translated
struc-
text.
in
by
using
to take
products.
Retrieval
cart
in-
process
suspicions.
a process
model
ap-
is characterised
researcher
alternative,
is mapped
high-end generally
language
It
systems
these
di-
a machine
in translation
is dismal
confirm
the
the
natural
La-
language
source
translation
satisfactory.
robust
and
describes
represents
research
MT
eral
1994]
the
y of commercial
which
translation,
using
of
the
of
approaches
languages
source
allow
major
correspondence,
techniques.
tion
this
rep-
approach
respect
to
be weighted
term
accord-
ing to the number of translations? For example, a term with four translations may have its importance artificially inflated with
respect
to
a term
2 Latent Semantic a psrallel document
so cre research
3 Oard
1 A paratlel document documents are presented
the
a realistic
translation
of ambiguity
might
extensive
in the
docu-
use
generation
resolution
the
target
sophisticated
language
fol-
more
This
source
query
document
with
currently
from
machine
MLIR
the
query
since
we are
that
top
having
for
three
Text
system.
translation,
of the
on
without
translation
vector
(MT)
information for
term
the
or
vector
collections.
is no guarantee by
built
document
term
MLIR,
As a more
Definition (4) adds art additional layer of complexity to IR problem because of the languagedependent nature
to multilingual
later
the
collection,
of efficiency.
be
linearly
defines
Coindexing2.
to
Second,
docHow-
translating
package
is not
1994]
translation,
one
[Radwan,
in traditional
collection.
worry
to have
the
sting
to
relevance
Therefore, work
sev-
scarce
that
in the
Furthermore,
of interlingual
of mapping
proach
and
our
into
and
for
ordy
context
possible
in terms
(unless
trans-
(often much
easily
increase
which
et al.,
text
into
tural
results
of
previous test
above
numbers
of experimental first,
(3)
domain.
a monolingual
tradition
translation
Semantic
of
expanding
it is necessary
large
moment,
problem
similarity
tent
iru-
of the
iug
ment
call
egy
simplest
of the
they
not
is one
languages
[Oard
translation
on statistical
in this
costs
problem
rectly
the
results
the
process
for identifying
is
system.
storage
dynamically,
Oard
direct
by
entire
the
to why
document
between
software
trans-
performance.
realistic
the
of supported
to
being.
experimental
quantifiable,
off the extensive
upon
a long
on problem
in order
collection
judgments. able
redesign
IR
language
designed
short
document can
of
principle
provide
better
choice
more
in
is quite
to
module
existing
been
using
not
every
much
across
very
to
to MLIR,
or document has
be
so it
the
translation
already
to
lead
with
according
approach
query
does
or translating
performance
reasons.
a
minimize
tion).
of research.
eral
of
to
cooccurrence
is no reason
tend
which
seems
on the
be approached
process,
performed
ma-
alone
not
faced
documents
system
There
could
query
ber
retrieval
research
to be explored
to
order
documents
MLIR
translation
translation,
repr~
and
Our
words),
former
to
across
whether
information
1996].
needs
the The
tractable
to achieve
recent
such
in
rank
for
translation once
of ambiguity
primarily
the
when
problem
a human
optimal
level
ourselves
based
relevant
Queries
two
query
automatic
clear
or
ever,
(3-s)
translated
Information
is not
the
restrict
For our
mechtim
could
ap
be incorpo-
resolution
on
Def-
experiments.
documents.
translation.
sug-
an en-
generation
or even
It may
information.
Dunning
sufficient
syntactic
it
adds
systems
capture
some
instead.
of an
more
by
for
199s].
MLIR
measures
is required.
ument
researchers
IR
for
read
be avoided.
science
of performance
similarity
are
incorrect
on
be
access
as the be
of
199s]
base
also
of our
to
and
to
problem
for
work
define
is a far
queries
never
issue
documents.
level
or
the
descriptions
document
MLIR
be re-
can
tries
use query
any
systems.
knowledge
be combmed
We
Approaches
in queries
in
one
traditional
to happen
et al.,
(English/French)
Retrieval
lation
to mul-
to
pair
similarity
which
further
This
scores
requirements
boundaries,
of cross-language
like
to the support
with
1994]
an inexact
would
terms
lation
However, problem
translation,
need
associated
such
Basic
search.
et al.,
translation.
However,
of documents
use
3
a mul-
can
techniques
retrieval
as we
system
than
sent ations
full-text
an extensive
barriers.
problem
the
of query
that
resource
that
beginning
~oorhees
documents.
language
the
a single
Recent
Buckley
languages.
of complexity
the
language
problem
and
to address
to
complexity.
1995,
to
documents
MLIR
systems
problem
some
requires
statistical
and retrieving (4), a simple
be searched
still
is
of process-
move
definition
The
other
can
where
of their
et al., IR
to
are only The
(5).
modern
we can
in response
this
in ord”er
[Broglio
gests
(5) individual
is ordy
of their
collection
is capable
which
languages
are listed
Spanish
system
that problems
in that
covers
document
definition.
we can generalise
tilingual
the
retrieval
collection
in multiple
Finally,
IR
also
be searched
that
rmevious .
of its component
field
query
inition
system but
documents
definition can
the
strategies
merging
merging
monolingual
of
Information
the
of the
tiliugual
portant
the
only
in a number of diferent languages across language boundaries. With
extension
and
This
assumes
but
queries documents
able
that
as separate
Once and
into
perform version
collection
viewed
is fixed, , which
(3)
ing
[Oard
is assumed
are
to
expanded
languages.
monolingual,
chine
modilied
document
of the collection.
searched.
Definition
above
be An
it
language
collections
component
trieved
where
of interest
document
can
a multilingual
parts
language
and
languages.
extracted
collection is defined as one where the same in two or more languages.
useful
50
[Oard from
for
et
Indexing
a single
~andauer
translation & Lktman,
urdess
this
1990] applied
to
collection. al.,
1994]
a parallel
describing
with
expected aligned
that corpus,
our dictionary-based
the
term
but
we
approach
vectors find
also.
this
would concept
be
one
to many
Perhaps
mapping
translations more
of each
be much
mon
term
tions
are Iinked than
be weighted that
term
each
would
be
variety
more
methods
MLIR
Whale
direct
translation
that
derive
corpus. and
to
query
pelling
and
et
al.
techniques
which
uses
the
document which lection.
tested
obtain
[Davis
& Dunning, approaches
lection.
For in the
tences
in other
corpora
problem.
can
match
and
are
query
would
strongly
se.sociated using
and
corpora
for direct
line
(2)
Resource
requirements
are one allel
sen-
terms
sen-
indirect
main of
information
with
requires
that
additionrd
where the
researchers
resources,
the
query
same
and
depen-
the
character
and
multilingual
from
some
sets for
& Spitz,
1994,
cented
character
sets
are some
many
text
ized
when
much
more
these
Efficient veloped work
will
not
generalize stemmers
Stemmers
are being
not
been
have of
which guity
for
it
English
same
works
Spanish
which
can
a new
already
as well 1996]
be
used
been
built
have
but
translation de-
ties
for
of
this
the
training
not
but
the
It
other
is
create
than
tion,
high-
but
based
the
lin-
on
matching
the
gested
is particularly
conflates
word
so no additional
to
construct
in
8-10 most
a morphological
person/months, Western
and
tionary
European
in
1994]
Much
effort. for
requires
immense
models
of great not
the
IBM
not
from
aligned not
from
ordy
term of
with
cau-
and
they
work
been
approaches,
have
1993]
compelling
domains
The
quansophis-
distribution
be applied
has
vector
probabti-
is extremely
simpler
Dunning
term
et al.,
to other
terms
and
two
have
languages.
51
resources,
are more similar
is not the
IBM
bilingual
prevalent
is de-
tested such
been
on
as the
sentences
yet
sug-
proven
to
makes
the
former
task
our a sense,
tides
but
coverage
experiments one The
can
use consider
transfer
of the
dic-
of sufficient
a transfer
required
as complementary. shallow
creating effort
approaches broad
While
language
texts
the
Therefore, In
general parallel
process,
approach
dictionary.
than
coverage.
a simple
comparison.
transfer
analyzer tools
problem.
that dictio-
be used
of context.
associated
vocabulary
translation
and
gen-
definitions
a bilingual
must
generalize
translation
Davis
of
model
probabiLities not
Bilingual
generate
approach
independent
size to have
algorithms improvements
[Chanod,
They
dictionary
verbose
et al. [Brown
accurately
This
machine
these
tionaries
ment
methodology
statistical
of each
transfer
corresponding
which
may
of by
Of
ambi-
1996]. a
link
also
for
trans-
translations
is a non-trivial can
strategy. but
vector
be effective.
forms
of IR perfor-
stemming similar
MLIR
Converting
of Brown
this
translation
for
amounts
IR.
and
corpus.
as they
signed
most
Xerox
each
are generated
English.
languages
work
vectors
been
most
to
The
large
tend
to be adapted term
more
read-
are designed
thesaurus).
so accurately
text
epitomizes
direct
contain for
doing
of training
IBM
is
the
a bilingual
collections
but
need
as a bilingual
dictionary
text
tication.
capital-
process.
In terms
provides
tities
In
approach
solution
as traditional and
and Our
to a transfer
translation,
but
called
be suitable
Parallel
ac-
surface,
not
nary
for
be addressed.
approach
root,
would
recognition
Fortunately,
translation.
sup bene-
Support
on the
as it ordy
et al.,
for
including
other
This
developed
language
examples
is
may
English
alternative
inflectionrd before
[Hearst has
for
translation,
[Hull,
Xerox for
the
and
that
required
tested.
an
dictionaries
languages
analysis.
is introduced
mance, in
developed
for term have
for
language
sometimes
ordy
is
they
they For
of
be required.
rdthough
thus
field
in machine
available, and
define
(also
languages.
be
needs
we will
eral
years,
other
developed
extensively
inflectional
valuable
to would
system
to
stemming
for
which
in
must
twenty-five
the
entry,
in the
may
systems.
be an
as a sur-
t rauslations
dictionaries
in mind
able
accents.
algorithms
past
lation,
available
the
previ-
can
be used
Unfortunately,
retrieval
both
language
the
readers
by automatic
be
is inconsistent,
their
to obtain.
for
corpora
effort
more
Parbetween
in the
of extracting
language
are
must
collections
that
during
investment
performance
use
lose
problems
the
what
guists
issues
number
are not
manual
and
returned
simple
accenting
stemming
over
known
have
important
general are more
human
use
and
mapping.
as a reference
can
are
dictionaries
described and
process
approaches,
1995].
seem
difiicnlt
form
or
in-
systems
vector
monolingual
corpora the
(3)
relationships
strategies
terminology
parallel
although
term
ap sys-
monolingual
translation
translation
for
and/or
extract
Domain-specific
Bilingual
language
automatic
may
sometimes
to handle
each
Grefenstette,
collections,
letters
of
vector
to
each
the
translation
Bilingual
for
used
for on
dictionaries,
texts
Machine
translation
source
rogate
do-
be
transfer
translation.
simplest
system
document
facilities
[Sibun there
MLIR
btigual
resources
a machine
or
of definitions
term
interest,
or indirect
a significant
the
documents
The
handle
for
recogni-
in
Depending
parallel
query
query
ous section.
techniques.
multilingual
subsequent
phrase
entries
extensive
(1)
from
to be expensive
the
obtain
even the
language.
ported fit
to
term
consideration. thesauri
can
important
MLIR
retrieval
Genosse, stemming,
in any
as noun
finding
corpora.
source for
with Extending
for
include
direct
corpora
of research.
for
may
bilingual
for
able 4
feminine
Gartner,
step
3rd word
composed
principled
such
requires
derived
useful
training
context
by other
as the the
a
word
German
as
Wein,
is a necessary
under
domain-specific
col-
against
in matching
be missed
parallel
to be a promising
query
domain
se-
alter-
terms
analyzed
providing
processing,
translation
formation
Dunning
document
is
The
of
for
French
analyzed
words
to
crucial
pair this
tem, col-
been
the
of joindre.
of the
is also
speech form
Genosse(n)#schajt
addition analysis
Query preach,
yet
example,
ften
of
dictionaries.
a parallel
and
translation
to capture
that
not
It
language
interesting
the
find
that
Indirect
exploiting
translation
several a parallel
tion.
com-
the
In
language
Indexing,
of
Davis
use
one
be able
has
also
natural
representations
languages
examine
languages
relationships
We consider
technique
language
topic. may
this 1996]
same
vector
the
MLIR
which
example,
tences
query
the
of
form
agglutination
schaft.
parts
a normalized
For
Wein#Gurtner#
morphological
1990]
very
Semantic
term all
most
possible
provide
is morphologically
preterite
noun
of the
a training
present
decomposition
across
the
& Littman,
1991]
(joined) plural
the
and
forms.
Weingar-tnergenossenscha
methods
using
Latent
value
to
for
native
dent
on
Unfortunately,
riously
the
based
like
surface
plural
should
consider
[Landauer
et al.,
comparable
seems
also
indirectly
Llttman
singular
query can
[Evans
collection are
one
translations
Landauer Evans
of the
MLIR,
a text
ques-
and
approach
of
person
be considered. viable
recognize
in
What
These
for
token
joignaient
com-
higher?
information?
analyzers
all
of a term
Should
proportionally this
These
where
translations
others.
corpus-driven
weights.
model
by disjunction
some
common
do we use to obtain
suggest
in the
Boolean
Furthermore,
more
translations
resources
for
or weighted
appropriate.
will
is accounted
an extended
seem
simple
a btigual these
dictionary
language.
dic-
to imple-
two pr~
AU major
words
in the
ology
language
is missing
able.
On
but
the
deep
single
hand,
parallel
technical
MLIR
of these
results
on a
query
will
lowing
future
with
resources.
Previous
Experiments
in
against
is
the
a long
context
[Salton, the
of
1970]
early
form
oped
based systems
most,
reasons
translated
these
thesaurus ten
not large
much
systems
fully
automated),
modern
more
dynamic boolean
siderable
skill
robust
and
uments more
to
be utilized
information review
[Oard
Experimental
on
systems
the
Christian
Fluhr
developed
a ranked
boolean
compound,
French wan the
and
and
and
Fluhr
Cranfield the
- 0.345,
MLIR
dictionaries TRAN)
using
-0.215, These
vector
translation
tion.
Davis of
and
a Spanish
per
El
line
has
system, their
yet
been
worse
than
the
appears to
build
without
in the
dictionary
construction
Xerox
Multilingual much
the
4 Much
IR same of
http://www.newc-tle.
thk
Davis
entirely
effectively
The
an
considerable
systems fashion
tunately, which
from
the
IR
nunciation, not
intervention
EMIR
systems, European
Although allowed
in
Project:
rese~ch.ec.org/espayn/text/53l2.html.
coherent
our translation
and
termi-
language
French
dictionary
was
which
involved automatic
formatted these
creating In
En-
Unfor-
a fully
in
of sections
examples,
assignments.
~
a word-based
information.
nor
filtering
or correct,
to
retrieval,
of excess a simple
and
that trans-
one
some
SGML,
such
as pro-
markings
were
source
cases
it
of errors was
possi-
ble to detect these marking label for manual treatment
errors. For example, we would a.uy translation containing per-
sonal
pronouns
such
we.
were
manually
treated
for
and
complex
due
to
example,
52
etymology,
always
pecially
using
the
to lan-
short
and
of the
1994)
for text
for automated
with words)
acronyms
btigual
Hachette,
neither
likely
native
removed.
amounts
was
transqueries
more
to work
majority
been
suitable
of large this
process.
per-
automatically
removal
real
experiments
some
on-line
(Oxford
dictionary
to
Approach be evaluated
dictionary
TREC
While
query
of seven
be
a more
with
in their
length
may
most
is even
we decided
overwhehnhg an
contrast
that
French. has
the
gener-
issue,
previous
the
this
to attain
this
to obtain
that
working
(average
translated
was able
associated
not
most FDIC,
translation.
While
we wish
and
are
not
unique
system
therefore
Iran-Contra,
are
one
recognized
evidence
also
MLIR
domain,
converted
to
No one
for
into
has
was
Bush,
of performance.
We transfer
that
created
remain,
glish
base
Apache)
independent
in
as monolingual is
newspa-
(Reagan,
AH-64
long,
he queries
them
nology
process.
can
been
lated
tried
queries
Dunning
human
had
transla-
OPEC,
names
due fields
terminology,
II,
words this
oft
Concept
of specific
SALT
problems
To address
sufficient
system
a few
the
example,
have
guage.
translated
In
and
ordy
of the
versions
automatically.
automatic
Experimental
work
MLIR.
are
term
suggests be
Researchers
that
monolingual
not
picture
at 10-90%
have
their
general
queries
ret rieval
For
a technical
lation.
prelimi-
translator.
levels
users
Mexican of
for
amounts
the
high
some
original
human
information,
cssw when
English
research
will for
that
forms
6
Their
alone
translations
able
(the all
performance it
query
NMSU on
that
1996].
and
transfer
machine
from
methods
found
strategies
acceptable
Fluhr’s ated
far
& Dunning,
learning
provide
Dunning
this
and the
information
terminology
We transla-
and
Abrams,
be the
evidence
t hart
technical
most
query
dictionary.
that
OSHA,
or trans-
translator.
translation
proper
and
reasonable
(SYS-
precision
and
were
good
LBO,
M-1
unrealistically
system
translation
strong
(GATT,
Wall
Mercury
documents
with
In particular,
large
no
by the
LAN)
Using
content.
has
Toshiba,
clear
the
Jose
queries
queries
multilingual
cent ain
in English
NRA,
using
and
and
which
acronyms
Rad-
queries
translation
collection
have
performed
[Davis
corpus
Ted learning
queries
on EngIish-
monolingual
length
for
it
the
collec-
component
from
a million
transfer of
made
suitable
of
bilingual
experiments
not
much
ex-
a bilingual
experiments
news
San
model,
English
text
experiments
by a professional
examination
of the
full
by
the
the
half
translation
and
TREC
50 selected
externally
methodol-
queries
articles
and
The
vector
careful
and
have
retneva14. French
effective
generated
to their
notable
uses
French
of
to roughly
strategy.
well-tested
recent
to the perfordescribed
of this
to use only
newswire,
by the
relative
the TIPSTER
consists
of text.
term
For
to
INSTN
to work
by average
more
document and
two at
machine
provide
and
corpus
Norte) sets
vector
tions
were
com-
MLIR
which
244]:
using
experiments
Mark
a variety
query
term MLIR
is
with
translated
as measured
recall.
of
colleagues
~.
amounted GB
into
chose
AP
1.6 the
from
which
and
nary
doc-
the
News,
left
conducted
results
-0.270,
Oard’s
dictionaries
and
following
research
indeed,
1994]
collection
obtained
retrieval
cross-language
[Radwan,
rank
51-100
about used
and used
language. by human
experiments
French
fol-
queries,
be compared
of the
the
experiments
with
a single
variant
translated
We
Journal,
lated
All
the
are retranslated
sense
use some
if the
to
language
then
well-motivated
queries
in
the the
system
rise
Start
another can
system.
with
collection,
A
be less
a good
1995].
the
con-
to consult
system
idiom
based
to
can
a
1996].
and
French-German
in
techniques.
application
is sparse
ceptions. term,
text
& Dorr,
work
retrieval
which
are encouraged
feasible
get
Our
and
Street
in
require
tend
matching
on multilingual
not
also
of
is of-
language
and
systems
effort
is generally
effectively
than
fore-
(which
use
systems
statistical
readers
prehensive
these
and
is just”
which
As retrieval
model,
full-text
applications,
text
fashion.
less effective
using
that
collections
for
to
this
[Harman,
vocabu-
human
indexing
tion
devel-
results
section
MLIR.
queries
the
we worked
documents.
of
into
translated
MLIR
Following
retrieval
First
considerable
something
text
a simple
1996].
document
been
text
of the previous
ogy,
a
number
controlled
modern
mance
gives
judgments
and
queries
in the
per-
given
have
for
in
systems
applications
& Dorr,
and
Salton
counterparts
However,
require
construction
for
retrieval
These
system,
in
in
to compare
of the
This
to evaluation relevance
However, desire
performance
translated.
and
original
in
demonstrated
A large
ideal
[Oard
retrieval
thesaurus.
approach.
are less than of
text
commercial
on this
a number
1972]
single-language btigual
and
text systems.
[Pevzner,
cross-language
as their
constructed
lary
multilingual
judgments. is a strong
optimal
the queries
translators.
vocabulary
Pevzner
70’s that
governmental
of
controlled
and
as well
carefully
on
history
the
approach
MLIR There
there
perfectly
documents,
MLIR
relevance
context,
were
Have
5
known
multilingual
narrow
concentrate of the
queries
avail-
provide
they
system
age of both
termin-
are not
corpora
when
ideal
a.dvant
most
probabtities
particularly The
to take
but
translation
other
coverage,
domain.
want
are defined
and
bugs
common that in
the
automatic
as I or based words,
the
462
on the
or
extraction
filtering
misplaced of
34000
filter.
dictionary
automatic
filter
of the
this
definitions
Sometimes, entry
es-
was so long
would
fail,
either
SGML
markers.
For
translations
of the
com-
English:
politically
French:
motivated
troubles
Term
vector
– turmoil
– civil
discord
civilian – character
politique
– political
Table
1:
disturbances
mation
politique
An
unrest
disturbance
the
disorder
cult
diplomatic of the
politician query
policy
translation
process
for
the
ously
discovered
In
French
success,
word
break,
oneself,
lay,
prendre(
catch, idea,
one ‘s, arm,
initions
as they
of the
would
entries
translations.
pick,
but
a serious
common terms
removed
from
lection,
or more each
A further use
and
translation are not
clues
These and
example,
the
ation,
still
ister.
loss.
alents.
left
perhaps
the
terms
ject
that
more
The placed each
root
the
translations. are
translated
query
This
issues
ing
is the
ambiguity
serve
Table
These the
final
using function
the to
term
weights
are
whale
documents
the
unique
translated
for
a sample
form
warse
guity
introduced
link
term
model.
and
document by
The IR since resolv-
is designed
the
1985],
which
query
and
to
upon
this
SMART
in-
of
(particularly
would
queries
unit
square-
IDF inner
factor Docuproduct
expect ones)
to errors
process.
preckion ss the
only
3rd (nat
online)
5
may
rachat
Q55:
dt%t
d’initi6
fmanc6
Q56:
taux
d’emprunt
Q58:
chemin
par
imider
coup
d’6tat
Q64:
prise
d’otage
Q65:
syst~me
trieval
system
terms
used
shart
with
some
and
same
terms
to look
Collins
work.
French-
In for
hindsight, autamatic
inconsistency
may
in more
in the
be mare
definitions
campre for
is generated
some from
dictionary.
+
leveraged
buyout
trading
pr6f6rentiels
=+
coup
~
+
prime
lending
rate
d’etat
hostage-taking
de recherche
Q66:
hmgage
Q70:
m.?me porteuse
Q82:
g6nie
naturel
g.h$tique brut
* ~
~
genetic
p6trole
gas naturel
~
Q96:
programme
informatique of the
crude natural
s
information
re-
language
surrogate
Q88:
A list
do-entaire
natural
+
Q90:
2:
is
version
de fer =$_ railroad
Q62:
query
definition
query
representation
emprunt
~
the have
are
this
transfer
in
entries
the
dictionaries
cause
resulting query
constructed
Q53:
for
the
dictionary
a clean
unique
one dictionary
another,
manually
300
Robert
dWerent
because
build
for
about
the
to
of
described
correct
also
queries
edition
usin
second
the to
the
uses
dictionary
manually
are
translation
than
many
queries
versions
mother engineering
oil gas
important
=+
computer multi-word
program expressions
in
sample
specific
in general will
language
use
English
pracess
sometimes
Since
the
largely The
the
judgments.
suggest
each
retrieved
version
we decided
that
terms.
Table
is evaluated
relevance
due
Query
length.
of their
MLIR
short
by the translation
the
performance
to known
One
apply
dacu-
frequencies. traditional
order
measures each
col-
necessitate
to
diiferent
the
that
and
dictionary
hensive this
inflectional
version
We
the
and
list
original
query.
irnprave
to have
to the query
the
the
space
in descending ranked
a
transfer
and
in
original
by
dictionary
used
realized
results,
term
to MLIR. It
will
[Buckley,
evaluation.
than
ignored.
between
multiplied
vs. manual
query.
characteristics
queries
acmroach
efforts
are normalized
respect
strategies
are
is re-
monolingual to their
probably
choose
by
first
Given
there
We
English
the
The
experiment.
we
all
the
term-weighting
future
vectar
are ranked
by comparing
normalized
Dossible
system
the
to a traditional
use a moditied
of the
with
sent
and
retrieval
The
to
unchanged
translation
ment
score
from
to ;peci&zed
root
ments
missing
transfer
the
First,
builds
of
with
document
documents
generated
terms
up.
up
compares
section.
duplication,
laaks
and
are
1 shows
strength
dictionary
20
automatically
an ob-
term
system
which
experiments
formation
the
concatenation
simrdest
in
each
the
are also
as a baseline,
model.
Second,
transfer
is then
relating
and
and
these
Results
completely,
of the in
that
steps.
taking
passed
Documents
roots. all
by
Terms
dictionary
root.
bilingual
query
basic
analyzed
inflectional
in
a dictionary.
of three
missed
considerable
than
a thesaurus
15,
dictionary.
extraneous
reg-
expu~ion,
created
10,
retranslation
previous
stop-
is evident
add
we
for
measure.
transfer
equiv-
filtering
document
to doc-
is working
a multilingual
reasons,
one
that
user
would
resort
relevant
Note
the
relevant
5,
constructed
the
striking,
it
goal.
that
a an
relevant
can
mare
least
experiment
three
mdiof
out
radiation,
setting, medicine
consists
is morphologically
translated
system.
and automatic
process
by its
retain
Our
to: loss
aa translation
final
find
be
a few
system
feedback
Experimental
Our
For
radiation,
an IR
license,
resembles
MLIR
query
only
7
of the
tiltered
at
evaluation
but
heart
register,
to
in
shauld
Once
the
read
documents
precision
system.
For
these
a particularly ta quickly
of
relevance For
im-
previ-
word
filtering.
Filtering
medicine
In
results.
was the
ezpulsion,
one-wauld
register,
in the
automatic
from
rwactice.
disbarring.
to the
the
disbarring.
disbarring,
Ideally,
and
where
mdiation
medicine,
license.
how
and
at
which
that
word,
embedded
ofl
words
high
is the
be
lots
feedback
recall
boundaries. from
be expected
collected,
collection.
good
averaged
the
cannot
MLIR
the dMi-
set,ting.
we are assuming
of interest.
was
on
head
after
word
striking
to pmctice
words
are often
system
dictionary
about
reappear
French
ezpulsion,
license
the
clues thus
our
elaborate
contextual
of the user
for
which
introducing
to a human
definition
error
definitions
translations
is used.
noise
pernicious
of ~he ~ord, proper
rather
more
of encvcloDedic
high
obtaining
defi-
nition. the
if
a monolingual
transfer
ten
an
been
language
will
help in
more
techniques,
translated)
relevance
experiments,
clean-
521 of the had
uments
def-
for
have
(with
much
information
Therefore,
goal
cross
feedback documents
user
user
appears
becomes
multilingual
have
language.
monolingual
put,
noisy
manual
words)
were
client,
such
the (or
of the which
incorporating
in the
evaluate
documents
jind,
task must
relevant
tool
impartant
including
thicken,
We left
be beneficial.
(mostly
Duplicate
words
take,
handle,
customer.
produced,
dicionary
dictionary
stiflen,
charge,
waist,
were
gave23
sink,
bring,
accent, ing
totake)
set,
This
by
ability
vocabulary
relevance
query
addition,
and
on the
find
terminology that
prove
foreign mon
the
suggests
important
Q67
to
of interest.
when
This
depends
system)
documeuts
nature
example
query
trouble
retrieval
from
courteous
caract.?xe
short
civil
& caract&e
retranslation:
trouble civil
civils
tend and
Success
that
5 We remind the reader that in thk context placing a source language term with all entries tionary. No lexical amliguity is resolved.
to perambi-
in infor-
53
translation means r~ from the transfer dic-
During
the
the
translation
ual
unit
of
ated
transfer Therefore,
(MWE)
query,
2 for
tion.
of these
expressions.
(a)
queries
original
based
transfer transfer
ficial,
in the
dictionary, sense for
that
this
a diferent
ations
of
the
current
averaged
level
over
this
built
and
(d)
without
to help
not
by
sample
out
I
word-based I transfer
0.393 Table
3:
three
Average for
presented
perform
that
the
real
Test
in
to the guages
as it
can
sufficiently makes
in
There
biggest are
results.
word
transfer
Building
had
dictionary
a similar
guage
would
be
of Table
mance
represents
attainable.
features,
such
performance are scored
baseline
best
which IR
query
performance.
has none term
of these matching
to
lancol-
ways
better
in which
such suited
we might
as adopting for
short
be able
term-weighting
to improve
the
number
diet
trans
Manual multi-word
dlct
trans
diet
12
9
9
12
8
3
0.0
original
of the (Orig)
of queries
performance
English in each
from
of the
queries.
translated
Values
given
are
category.
having
improvement
is applied
difference
at least
to
in performance
the
dictionary
is reflected
as more
construction
primarily
in queries
no
relevant
documents
ranked
in
some
relevant
documents
scoring
well,
Recovering better
at
than
feedback
queries
least
one
finding
becomes
relevant
none,
a viable much
better
to gain
a better
understanding
top
20
which
is
document
option.
perform
promoving
the
because
which
with
which
did
based
transfer
worse
(i.e.
is
monolingual There
are even
in their
translated
the are
English are
queries.
to
to
suffered
in but
information
This
is in
not
noun produce
phrase
our given
phrase
translation
multilingual in Table
on a word
the it
by
happens the
experiments. 2 lose word
vital basis.
from
one
that
single
is often
to
pairs
content
suc-
generally perfor-
components in translation,
lost.
import
half
to
monolingual
in average
enough
most
in
recognizing
crucial
the individual meanings
About semantic
is
gains
a loss
17 because
problem.
or word
often
extraneous
than
contrast
phrases
the
of
th~
dramatic
entire
but
4 suffered is greater
distinct
always
sense
(i.e.
expressions
so the
case,
8 had
translation
demonstrate
is that diferent
wordfailure of the
more
mance. The key difference of phrases often have very
manual
a detailed correctly,
total
results
the
17 queries
as a result
and
the
problems
the
expressions in
from
identifying does
9 lost
multi-word
MLIR.
where
performed
query), that
experimental
cess
using
and
amblgnity
Note
translating
rect
54
translated
that
of the
we selected
multi-word
added
and IR,
when
translate
queries
helps
translation,
dictionary
due
Our
or
query
We found to
some
Were
that
This
definitions
originrd
only).
cess.
problems
many
features
a steady
effort
retranslation.
expansion
the
Manual word-based
Comparison
analysis.
currently
have
strategies
4: and
failure
perfor-
such techniques applied, the gap in performance between original and translated queries could well increase. There baseline,
acgreater
10
=
asociated
results.
Our
Automatic word-based
Orig
In order
final
is not
translation
precision
26