A System for Discovering Relationships by Feature Extraction from Text Databases¹

Jack G. Conrad
West Publishing Company
St. Paul, MN 55164
conrad@research.westlaw.com

Mary Hunter Utt
Digital Equipment Corporation
Littleton, MA 01460
utt@netcur.enet.dec.com
Abstract

A method for accessing text-based information using domain-specific features rather than documents alone is presented. The basis of this approach is the ability to automatically extract features from large text databases, and identify statistically significant relationships or associations between those features. The techniques supporting this approach are discussed, and examples from an application using these techniques, named the Associations System, are illustrated using the Wall Street Journal database. In this particular application, the features extracted are company and person names. A series of tests run on the Associations System demonstrate that the feature extraction can be quite accurate, and that the relationships generated are reliable. In addition to conventional measures of recall and precision, evaluation measures are currently being studied which will indicate the usefulness of the relationships identified in various domain-specific contexts.

1 Introduction
Information retrieval systems have traditionally been document-oriented; that is, IR systems have been designed to store, retrieve, and display text documents such as newspaper articles or legal summaries. Furthermore, many of today's hypertext systems have inherited this paradigm of information representation to the extent that hypertext nodes are typically short text documents, possibly derived from longer sources. In some instances, browsing of index terms associated with the text documents is also supported [1, 2], but this is generally regarded as a secondary activity in relation to the primary task of identifying relevant documents (or text nodes). From the user's standpoint, however, it is usually the information contained in the text documents that is the goal of the search, not the documents themselves. In some application domains, the target information is well-defined, for example, financial figures, transaction dates, product types, etc. In these domains, it may be possible to construct information retrieval and hypertext-like browsing systems based on the internal information rather than exclusively on the text documents that embody it. As a result, systems like these should be able to answer important categories of queries and support alternative means of access currently impossible with standard document-based systems. For example, a traditional text retrieval system could not be expected to satisfy a real-time query which requests a list of companies with which Ross Perot had business dealings in 1988.
By contrast, a feature-based system could.

The techniques described in this paper have formed the basis for the Associations System. This implementation is an information retrieval system which pursues a concept-oriented rather than document-oriented approach; it focuses on the recognition of domain-specific features in a textual database and relationships identified between those features. In the following sections, we describe techniques and experiments in three major areas needed to support this application:
● Automatic feature extraction - techniques used to recognize features in large, free-text databases.
● Generating direct links - techniques for quantifying the relationship between features based on their association in free text.
● Generating indirect links - techniques for indexing features based on shared classifications, and identifying possible indirect relations between features, as well as offering starting points for browsing in a feature network.

¹ This research was performed at the Center for Intelligent Information Retrieval at the University of Massachusetts at Amherst.
Our experiments with the company and person name recognizers use a database of one year (1987) of Wall Street Journal articles. It consists of 46,449 articles containing 249 words on average, and is a part of the TIPSTER document collection [3]. Subsequent recall and precision experiments with the Associations System use as a database a more recent year (1991) of Wall Street Journal articles. It contains 42,652 articles averaging 232 words each, also from the TIPSTER collection. We have found that evaluating some of these proposed techniques is more difficult than a typical information retrieval experiment, and this issue will be discussed in the sections that follow.

2 Automatic Feature Extraction

The problem of feature or fact extraction from unrestricted text has been studied by a number of researchers in the context of the Message Understanding Conferences [4] and the TIPSTER project [3].
The basic approach has been to use a variety of natural language processing and statistical techniques to extract predetermined types of facts for a specific domain. In the TIPSTER joint venture domain, for example, the goal of extraction is to identify information such as the companies forming the joint venture, the name of the new company, the location of the new company, the products of the new company, and the amount of money involved.
Accurate extraction of some types of information requires either sophisticated analysis or significant amounts of training data. There are, however, a number of important and fairly general features which can be recognized using relatively simple techniques. These include the names of companies, the names of people, locations, monetary amounts, and dates. The task of collecting this information could be described as the recognition and categorization of certain noun phrases. In other words, a feature is essentially an object which falls into a special word grouping and has certain attributes associated with it [5]. High rates of accuracy are possible because of the relative simplicity of the task. It is, for example, much easier to recognize the presence of a company name in an article about a joint venture than to identify the role that company is playing. The ability to recognize these simple features can be used to develop powerful new approaches to accessing information [6].

For the application we address in this paper, the two feature recognizers required are for company names and person names. The techniques used for these feature recognizers involve a combination of lexical scanners built using lex [7], or a similar tool, and table lookup.

2.1 The Company Name Recognizer

The company name recognizer scans the text for proper nouns (capitalized words) that have the appropriate format for a company name. Company names often include special words such as Inc., Corporation, Ltd. or Pty. that are particularly useful for recognition [8]. In a given document, the company name recognizer will use these special words to recognize the first mention of a company name and store it in a temporary table. This table permits the recognition of subsequent uses of that company name, even if the special words are not used. In newspaper articles, the first use of a company name in a story generally uses the full form.

In a simple test of the company name recognizer, we applied it to a sample of the Wall Street Journal database and compared the results to company names identified manually. The sample database consisted of 139 articles containing 29,000 words. In this test, 334 company names were identified by the manual scan. The recall (percentage of names in the database that were identified as companies) was 89% and the precision (percentage of names identified as companies that actually were companies) was 79%.² Many of the errors were caused by two difficult company name formats where names are combined using 'and' and 'of', such as, for example, X, Y and Z Corporation and X of Y Inc. Although these can be valid formats (e.g., Mutual of Omaha), they tend to introduce too many errors. We are currently revising the company name recognizer to improve recall by introducing a company name table that will contain common names and synonyms (e.g., for American Telephone & Telegraph/AT&T and Digital Equipment/DEC). This modification is based on the observation that references to well-known companies are more likely not to use the full form of the name.

² The tests performed were name-based rather than occurrence-based.
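The two-step behavior described in Section 2.1 (recognize the first full-form mention by its special word, then use a temporary per-document table to catch later short forms) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper uses scanners built with lex, whereas the sketch below swaps in Python regular expressions, and the names `recognize_companies` and `name_based_scores` are hypothetical. Only the four designators quoted in the text are used, and known weaknesses (capitalized sentence-initial words, initials such as "L. L.", and the 'and'/'of' formats) are deliberately left unhandled.

```python
import re

# Capitalized words followed by one of the designators named in the text.
# Assumption: a real designator list would be much longer.
_FULL_NAME = re.compile(
    r"(?<!\w)(?P<stem>(?:[A-Z][\w&'-]*\s+)+)"
    r"(?P<designator>Inc\.|Corporation|Ltd\.|Pty\.)(?!\w)"
)


def recognize_companies(document: str) -> list[str]:
    """Return the full company name for each mention, in document order."""
    # Pass 1: build the temporary per-document table from full-form mentions.
    table: dict[str, tuple[str, int]] = {}  # stem -> (full name, first position)
    for m in _FULL_NAME.finditer(document):
        stem = m.group("stem").strip()
        table.setdefault(stem, (f"{stem} {m.group('designator')}", m.start()))
    if not table:
        return []

    # Pass 2: find each use of a known stem at or after its first full mention,
    # so short forms are only accepted once the full form has been seen.
    stems = sorted(table, key=len, reverse=True)  # prefer the longest stem
    any_stem = re.compile(
        "|".join(rf"(?<!\w){re.escape(s)}(?!\w)" for s in stems)
    )
    mentions = []
    for m in any_stem.finditer(document):
        full_name, first_pos = table[m.group(0)]
        if m.start() >= first_pos:
            mentions.append(full_name)
    return mentions


def name_based_scores(found: set[str], truth: set[str]) -> tuple[float, float]:
    """Name-based (not occurrence-based) recall and precision over distinct names."""
    correct = len(found & truth)
    recall = correct / len(truth) if truth else 0.0
    precision = correct / len(found) if found else 0.0
    return recall, precision
```

Scoring over sets of distinct names mirrors the name-based evaluation mentioned in the footnote; an occurrence-based variant would instead compare the per-mention lists.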
The
2.2
Person
Name
Many
application-dependent
three
decades
similar sequence
of words names
company
but
places
last
that
used.
to personal
application
more
with
such
are used
subsequent Checks
L. L. Bean
are
Inc.,
name
domain,
emphasis
a title
names
recognize,
is not
example,
of our
begins
and
name
name
approaches
Because
to the above,
of first full
[9].
Recognize
on table
references
to people
to ensure to later
recognize
name A name
Chairman,
as ifs.,
names
referred
have
person
lookup.
to identify
made
identification
the
that
do not
in the
same
as “L.
in a story
and
relies
techniques a capitalized
In addition,
As in the
is not
even
a company
is not
lists
case of the
are recognized,
name
L. Bean”
when
titles.
story
over the last
upon
so forth.
contain
a recognized
developed
is recognized
President,
that
been
if the
name
recognized
(for
as a person
name). Finally, list
because
of common
sentences
with
whether
the
recognize
last
name
Santa
are
currently
the
same,
tables.
Monica,
Carson
recognized
City)
as two
specified
same.
Given
that
needs
to be addressed.
two
the
names
GM,
it is likely they
involved
3
that
are in
recognizing
to each
other
names
relationship). the
measure
By
contrast,
of words these
the
number of size
11 would
window
sizes
The
occurrences a single quently
and
include here of
[10],
Bill
company
The
name
person
name
names
in the
identifying
Clinton
any)
can provide Roger
locations
the
name
the
it is not
B. Smith
are both
experiments
have
as
be the
of variant solution.
evidence
highly
that
correlated
with
person-company
Despite that
show
to
complete
additional
section.
Clinton
resolution
the
to conflate
next
William
to be recognized
last
significant
is used
in
and
but
and
stories
companies,
help,
the next
be used
as the
Associations when
word
the
can
the
these
links
complexities
names
dist ante study
and/or
in the has been
the
can
be
person
in
names
direct
occur (how
or
‘close)
far
apart
or in a subject-object
to be the
concentrated
a hypertext
either
measure
sentence
shown
relationships
links using
distance
same
we have
is to identify
for
be identified
by a simple
example,
this
step
basis
company
either
(for
so for
found
section,
the
additional
we used
side of a target the
target
were
window of GM
and how
and
IBM
by presumed
the
strongest
on name
preceding
201 words. two
evidence
distance
as the
in
a text
the
importance), phi-squared
and
sizes were
have
can
and
similar
be used
people.
links.
A window
refers
name).
For example,
the
succeeding
five
chosen
types
to derive
empirically,
to the
a window words. and
The
roughly
associations.
features
common
direct
or person
names
contexts
of companies
words,
These
person
word
to define
(i.e. , company five
and
how
retrieval
windows
depends features
window
in the database.
[1 I] and
company
we discuss
document-level
between
are mentioned
(ElkfDi4)
the
and
and
text
feature
feature,
51 words
when
next
support
paragraph-level
a text
measure
occurs
In the study,
association
companies
relationships
information
on either
co-occurrence these
to avoid
and
will
in the text,
be measured
association
associations
average in
text.
due to unsuitable
people
our
can
context
relationships,
associations
strength
were
recognize
example,
technique
people
occurs
can
linguistic
them.
used
approximate
This
texts.
association
with
of words
of the
sample
(if
and
names,
and
experiments
an indirect
direct
in the
of association.
associated
indirect
In
evaluation
initial
described
associations source
Closeness
of phrasal
the
Links
and
some
In previous
presence
primary
text.
or using
Smith
person
These
A direct
in the
are)
purpose.
for
in different
themselves
person.
to companies
them. people,
associations.
a stop begins
Jane Doe ... We are investigating
this
names
synonyms
techniques and
to a name, frequently
databases.
references between
between
Roger
same
the
two
middle
of common
Direct
of companies,
indirect
connections
company
in textual
identified
network
the
are the
For
For
name,
connections
using
Generating
Having
for
they
recognized
first
A table
the
Jmmnal
for
of the errors
a problem.
people.
If, for example,
generated
or associations
the
to make
company-person
are the same.
after reliably
we want
Many
in addition
Street
names.
is also the
words
FVali
names
we are modifying
different
that
used
269 person
93% recall.
as person
names
names
Fortunately,
and
general)
database
identified
other the
... or InvestoT
(and
same
As an example,
synonymous
we have
the
scan
contain
example,
Joe Smith
effective
recognize,
manual
may
For
as Added
be more
92~o precision
name
Recognizing
may
The
achieved
and
(e.g.,
person
words
is maintained. such
lookup
was used.
recognize
of capitalized
words
constructions
dictionary
To test
first
sequences
problem
is not
the
number
the
whole
in likely
To determine
we use two (~z
on are
) [12].
statistical
of
associations
database.
For
to be significant, significant measures,
given
associations the
or
co-
example, how
fre-
(or to rank
expected
mutual
The expected mutual information measure compares the probability of observing two features, x and y, together to the probability of observing the two features independently. In this paper, we use a simplified version of this measure that ignores the terms involving probabilities that features do not occur. The measure used is:

\[ \mathrm{EMIM}(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)} \]

When a strong relationship exists between the features, the joint probability (P(x,y)) will be greater than chance and EMIM(x, y) will be greater than 0. The calculation of both EMIM and φ² makes use of a contingency table.
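As a minimal sketch of how the simplified measure could be computed from the four counts of a 2x2 contingency table, consider the function below. The function name `emim` and the cell convention are assumptions made for illustration: a counts contexts where x and y co-occur, b and c count contexts where exactly one of the two occurs, and d counts contexts where neither occurs. Because P(x) and P(y) enter only as a product, swapping the roles of b and c does not change the result.

```python
import math


def emim(a: int, b: int, c: int, d: int) -> float:
    """Simplified EMIM: log2(P(x, y) / (P(x) * P(y))) from contingency counts.

    Assumed cells: a = x and y co-occur, b and c = one occurs without
    the other, d = neither occurs.
    """
    n = a + b + c + d
    if a == 0:
        return float("-inf")  # the features never co-occur
    # P(x, y) = a/n, P(x) = (a+b)/n, P(y) = (a+c)/n; the ratio simplifies to
    # a*n / ((a+b)*(a+c)), which avoids intermediate rounding.
    return math.log2((a * n) / ((a + b) * (a + c)))
```

With counts that match statistical independence, for example `emim(4, 16, 16, 64)`, the ratio is exactly 1 and the measure is 0; counts with more co-occurrence than chance, such as `emim(10, 10, 10, 70)`, give a positive value.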
This table can be represented
z and
y co-occur
as follows:
YY
E44 The
upper-left-hand
[b] records y occurs table
the but
cell
number
z does
to estimate
[a] records
of times
not.
the
~ occurs
Finally,
cell
probabilities,
The
+2 measure more.
has been
This
[d] records
suggested
is calculated
The
two
people) Two there
measures that
different
were
number
of windows
Table and
2 lists
Evaluating appear
the
or person
for
is not
about
question clear
relevant
68 EMIM
EMIM
documents
d)’
compute the
For window
second
of documents
0