A System for Discovering Relationships by Feature Extraction from Text Databases ¹

Jack G. Conrad
West Publishing Company
St. Paul, MN 55164
conrad@research.westlaw.com

Mary Hunter Utt
Digital Equipment Corporation
Littleton, MA 01460
utt@netcur.enet.dec.com

Abstract

A method for accessing text-based information using domain-specific features rather than documents alone is presented. The basis of this approach is the ability to automatically extract features from large text databases, and identify statistically significant relationships or associations between those features. The techniques supporting this approach are discussed, and examples from an application using these techniques, named the Associations System, are illustrated using the Wall Street Journal database. In this particular application, the features extracted are company and person names. A series of tests run on the Associations System demonstrate that this feature extraction can be quite accurate, and that the relationships generated are reliable. In addition to conventional measures of recall and precision, evaluation measures are currently being studied which will indicate the usefulness of the relationships identified in various domain-specific contexts.

1 Introduction

Information retrieval systems have traditionally been document-oriented; that is, IR systems have been designed to store, retrieve, and display text documents such as newspaper articles or legal summaries. Furthermore, many of today's hypertext systems have inherited this paradigm of information representation to the extent that hypertext nodes are typically short text documents, possibly derived from longer sources. In some instances, browsing of index terms associated with the text documents is also supported [1, 2], but this is generally regarded as a secondary activity in relation to the primary task of identifying relevant documents (or text nodes).

From the user's standpoint, however, it is usually the information contained in the text documents that is the goal of the search, not the documents themselves. In some application domains, the target information is well-defined, for example, financial figures, transaction dates, product types, etc. In these domains, it may be possible to construct information retrieval and hypertext-like browsing systems based on the internal information rather than exclusively on the text documents that embody it. As a result, systems like these should be able to answer important categories of queries and support alternative means of access currently impossible with standard document-based systems. For example, a traditional text retrieval system could not be expected to satisfy a real-time query which requests a list of companies with which Ross Perot had business dealings in 1988. By contrast, a feature-based system could.

The techniques described in this paper have formed the basis for the Associations System. This implementation is an information retrieval system which pursues a concept-oriented rather than document-oriented approach; it focuses on the recognition of domain-specific features in a textual database and relationships identified between those features. In the following sections, we describe techniques and experiments in three major areas needed to support this application:

● Automatic feature extraction - techniques used to recognize features in large, free-text databases.

● Generating direct links - techniques for quantifying the relationship between features based on their association in free text.

● Generating indirect links - techniques for indexing and identifying indirect relations between features based on shared classifications, as well as offering possible starting points for browsing in a feature network.

¹ This research was performed at the Center for Intelligent Information Retrieval at the University of Massachusetts at Amherst.

Our experiments with the company and person name recognizers use a database of one year (1987) of Wall Street Journal articles. It consists of 46,449 articles containing 249 words on average, and is a part of the TIPSTER document collection [3]. Subsequent recall and precision experiments with the Associations System use as a database a more recent year (1991) of Wall Street Journal articles. It contains 42,652 articles averaging 232 words each, also from the TIPSTER collection. We have found that evaluating some of these proposed techniques is more difficult than a typical information retrieval experiment, and this issue will be discussed in the sections that follow.

2 Automatic Feature Extraction

The problem of feature or fact extraction from unrestricted text has been studied by a number of researchers in the context of the Message Understanding Conferences [4] and the TIPSTER project [3].

The basic approach has been to use a variety of natural language processing and statistical techniques to extract predetermined types of facts for a specific domain. In the TIPSTER joint venture domain, for example, the goal of extraction is to identify information such as the companies forming the joint venture, the name of the new company, the location of the new company, the products of the new company, and the amount of money involved.

Accurate extraction of some types of information requires either sophisticated analysis or significant amounts of training data. There are, however, a number of important and fairly general features which can be recognized using relatively simple techniques. These include the names of companies, the names of people, locations, monetary amounts, and dates. The task of collecting this information could be described as the recognition and categorization of certain noun phrases. In other words, a feature is essentially an object which falls into a special word grouping and has certain attributes associated with it [5]. High rates of accuracy are possible because of the relative simplicity of the task. It is, for example, much easier to recognize the presence of a company name in an article about a joint venture than to identify the role that company is playing. The ability to recognize these simple features can be used to develop powerful new approaches to accessing information [6].

For the application we address in this paper, the two feature recognizers required are for company names and person names. The techniques used for these feature recognizers involve a combination of lexical scanners built using lex [7], or a similar tool, and table lookup.

2.1 The Company Name Recognizer

The company name recognizer scans the text for proper nouns (capitalized words) that have the appropriate format for a company name. Company names often include special words such as Inc., Corporation, Ltd. or Pty. that are particularly useful for recognition [8]. In a given document, the company name recognizer will use these special words to recognize the first mention of a company name and store it in a temporary table. This table permits the recognition of subsequent uses of that company name, even if the special words are not used. In newspaper articles, the first use of a company name in a story generally uses the full form.

In a simple test of the company name recognizer, we applied it to a sample of the Wall Street Journal database and compared the results to company names identified manually. The test database consisted of 139 articles containing 29,000 words. In this test, 334 company names were identified in the manual scan. The recall (percentage of the names in the sample that were identified as companies) was 89% and the precision (percentage of names identified as companies that actually were companies) was 79%.² Many of the errors were caused by difficult company name formats where names are combined using 'and' and 'of', such as in X, Y and Z Corporation and X of Y Inc. Although these can be valid formats (e.g., Mutual of Omaha), they tend to introduce too many errors. We are currently revising the company name recognizer to improve recall by introducing a company name table that will contain common names and synonyms (e.g., for American Telephone & Telegraph/AT&T and Digital Equipment/DEC). This modification is based on the observation that references to well-known companies are more likely not to use the full form of the name.

² The tests performed were name-based rather than occurrence-based.
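The recognition strategy of Section 2.1 (find a full name by its special words, store it in a temporary per-document table, then match later short-form mentions) can be illustrated with a minimal Python sketch. This is not the authors' lex-based scanner: the regular expression, the four-word designator list, and the function names `recognize_companies` and `precision_recall` are all assumptions made for illustration. The second function shows one way a name-based precision and recall could be computed against a manually identified set.

```python
import re

# Designator words named in the text; a real system would need a fuller table.
DESIGNATORS = ["Inc.", "Corporation", "Ltd.", "Pty."]

# One capitalized word (illustrative pattern, not the paper's lex grammar).
CAP_WORD = r"[A-Z][A-Za-z&.\-]*"

# A run of capitalized words followed by a designator marks a full company name.
FULL_NAME = re.compile(
    r"\b((?:" + CAP_WORD + r"\s+)+)("
    + "|".join(re.escape(d) for d in DESIGNATORS)
    + r")"
)


def recognize_companies(document):
    """Return {full company name: mention count} for a single document."""
    table = {}  # temporary per-document table: short form -> full form
    for match in FULL_NAME.finditer(document):
        short = match.group(1).strip()
        table.setdefault(short, short + " " + match.group(2))

    counts = {}
    for short, full in table.items():
        # Later mentions may drop the designator, so search for the short form.
        pattern = r"(?<!\w)" + re.escape(short) + r"(?!\w)"
        counts[full] = len(re.findall(pattern, document))
    return counts


def precision_recall(found, manual):
    """Name-based precision and recall against a manually identified set."""
    correct = len(found & manual)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(manual) if manual else 0.0
    return precision, recall


story = ("Acme Widget Corporation said profits rose. "
         "Analysts expect Acme Widget to expand.")
names = recognize_companies(story)
```

In this sketch the second mention, "Acme Widget", is credited to the full name because the short form was stored when the designator "Corporation" was first seen. A sentence-initial capitalized word directly before a company name would be absorbed into the name, which is one reason a production recognizer needs more than a single pattern.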

The

2.2

Person

Name

Many

application-dependent

three

decades

similar sequence

of words names

company

but

places

last

that

used.

to personal

application

more

with

such

are used

subsequent Checks

L. L. Bean

are

Inc.,

name

domain,

emphasis

a title

names

recognize,

is not

example,

of our

begins

and

name

name

approaches

Because

to the above,

of first full

[9].

Recognize

on table

references

to people

to ensure to later

recognize

name A name

Chairman,

as ifs.,

names

referred

have

person

lookup.

to identify

made

identification

the

that

do not

in the

same

as “L.

in a story

and

relies

techniques a capitalized

In addition,

As in the

is not

even

a company

is not

lists

case of the

are recognized,

name

L. Bean”

when

titles.

story

over the last

upon

so forth.

contain

a recognized

developed

is recognized

President,

that

been

if the

name

recognized

(for

as a person

name). Finally, list

because

of common

sentences

with

whether

the

recognize

last

name

Santa

are

currently

the

same,

tables.

Monica,

Carson

recognized

City)

as two

specified

same.

Given

that

needs

to be addressed.

two

the

names

GM,

it is likely they

involved

3

that

are in

recognizing

to each

other

names

relationship). the

measure

By

contrast,

of words these

the

number of size

11 would

window

sizes

The

occurrences a single quently

and

include here of

[10],

Bill

company

The

name

person

name

names

in the

identifying

Clinton

any)

can provide Roger

locations

the

name

the

it is not

B. Smith

are both

experiments

have

as

be the

of variant solution.

evidence

highly

that

correlated

with

person-company

Despite that

show

to

complete

additional

section.

Clinton

resolution

the

to conflate

next

William

to be recognized

last

significant

is used

in

and

but

and

stories

companies,

help,

the next

be used

as the

Associations when

word

the

can

the

these

links

complexities

names

dist ante study

and/or

in the has been

the

can

be

person

in

names

direct

occur (how

or

‘close)

far

apart

or in a subject-object

to be the

concentrated

a hypertext

either

measure

sentence

shown

relationships

links using

distance

same

we have

is to identify

for

be identified

by a simple

example,

this

step

basis

company

either

(for

so for

found

section,

the

additional

we used

side of a target the

target

were

window of GM

and how

and

IBM

by presumed

the

strongest

on name

preceding

201 words. two

evidence

distance

as the

in

a text

the

importance), phi-squared

and

sizes were

have

can

and

similar

be used

people.

links.

A window

refers

name).

For example,

the

succeeding

five

chosen

types

to derive

empirically,

to the

a window words. and

The

roughly

associations.

features

common

direct

or person

names

contexts

of companies

words,

These

person

word

to define

(i.e. , company five

and

how

retrieval

windows

depends features

window

in the database.

[1 I] and

company

we discuss

document-level

between

are mentioned

(ElkfDi4)

the

and

and

text

feature

feature,

51 words

when

next

support

paragraph-level

a text

measure

occurs

In the study,

association

companies

relationships

information

on either

co-occurrence these

to avoid

and

will

in the text,

be measured

association

associations

average in

text.

due to unsuitable

people

our

can

context

relationships,

associations

strength

were

recognize

example,

technique

people

occurs

can

linguistic

them.

used

approximate

This

texts.

association

with

of words

of the

sample

(if

and

names,

and

experiments

an indirect

direct

in the

of association.

associated

indirect

In

evaluation

initial

described

associations source

Closeness

of phrasal

the

Links

and

some

In previous

presence

primary

text.

or using

Smith

person

These

A direct

in the

are)

purpose.

for

in different

themselves

person.

to companies

them. people,

associations.

a stop begins

Jane Doe ... We are investigating

this

names

synonyms

techniques and

to a name, frequently

databases.

references between

between

Roger

same

the

two

middle

of common

Direct

of companies,

indirect

connections

company

in textual

identified

network

the

are the

For

For

name,

connections

using

Generating

Having

for

they

recognized

first

A table

the

Jmmnal

for

of the errors

a problem.

people.

If, for example,

generated

or associations

the

to make

company-person

are the same.

after reliably

we want

Many

in addition

Street

names.

is also the

words

FVali

names

we are modifying

different

that

used

269 person

93% recall.

as person

names

names

Fortunately,

and

general)

database

identified

other the

... or InvestoT

(and

same

As an example,

synonymous

we have

the

scan

contain

example,

Joe Smith

effective

recognize,

manual

may

For

as Added

be more

92~o precision

name

Recognizing

may

The

achieved

and

(e.g.,

person

words

is maintained. such

lookup

was used.

recognize

of capitalized

words

constructions

dictionary

To test

first

sequences

problem

is not

the

number

the

whole

in likely

To determine

we use two (~z

on are

) [12].

statistical

of

associations

database.

For

to be significant, significant measures,

given

associations the

or

co-

example, how

fre-

(or to rank

expected

mutual

The expected mutual information measure compares the probability of observing two features, x and y, together to the probability of observing the two features independently. In this paper, we use a simplified version of this measure that ignores the terms involving probabilities that features do not occur. The measure used is:

\[ \mathrm{EMIM}(x, y) = P(x, y) \cdot \log_2 \frac{P(x, y)}{P(x)\,P(y)} \]

When a strong relationship exists between the features, the joint probability (P(x, y)) will be greater than chance and EMIM(x, y) will be greater than 0.

The calculation of both EMIM and φ² makes use of a contingency table. This table can be represented as follows:

YY

E44 The

upper-left-hand

[b] records y occurs table

the but

cell

number

z does

to estimate

[a] records

of times

not.

the

~ occurs

Finally,

cell

probabilities,

The

+2 measure more.

has been

This

[d] records

suggested

is calculated

The

two

people) Two there

measures that

different

were

number

of windows

Table and

2 lists

Evaluating appear

the

or person

for

is not

about

question clear

relevant

68 EMIM

EMIM

documents

d)’

compute the

For window

second

of documents
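The way a contingency table feeds the two association measures can be sketched in Python as follows. Several points are assumptions rather than details confirmed by the text: the cell layout (a = windows containing both x and y, b = x only, c = y only, d = neither), the reading of the simplified EMIM as the single co-occurrence term of the full measure, and the use of the standard textbook phi-squared coefficient. The function names `emim` and `phi_squared` are hypothetical.

```python
import math


def emim(a, b, c, d):
    """Simplified EMIM: only the term where both features occur is kept.

    Assumed cells: a = both x and y, b = x only, c = y only, d = neither.
    """
    n = a + b + c + d
    if n == 0 or a == 0:
        return 0.0  # no co-occurrence, so the single retained term is zero
    p_xy = a / n          # estimated joint probability P(x, y)
    p_x = (a + b) / n     # estimated marginal probability P(x)
    p_y = (a + c) / n     # estimated marginal probability P(y)
    return p_xy * math.log2(p_xy / (p_x * p_y))


def phi_squared(a, b, c, d):
    """Standard phi-squared coefficient for a 2x2 table (textbook form)."""
    denom = (a + b) * (a + c) * (b + d) * (c + d)
    if denom == 0:
        return 0.0
    return (a * d - b * c) ** 2 / denom


# Features that co-occur more often than chance score above zero on both
# measures; statistically independent features score approximately zero.
associated = (emim(10, 10, 10, 70), phi_squared(10, 10, 10, 70))
independent = (emim(4, 16, 16, 64), phi_squared(4, 16, 16, 64))
```

Both functions take raw counts rather than probabilities so that the same table can serve either measure, which matches the text's remark that the calculation of both makes use of a contingency table.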
