Automating a Classification Task Based on an Augmented Thesaurus

Eunok Pack
Hye-Jeong Jeon

Information Technology Lab., LG Electronics Research Center
16 Woomyeon-dong, Seocho-gu, Seoul 137-140, Republic of Korea
{pack, hjeong}@crown.crl.goldstar.co.kr

ABSTRACT

ignored

classification tasks that have been tackled for automation are ones involving complex chains of causal reasoning. However, our daily lives are filled with sim-

much

ple

sis,

Most

classification

lution

to

tions,

and

tasks.

the

personal

problem

hence

inherit far

provide

our

ante,

from

that

to augment kinds

the

for

tion

regarding

and

objects

the

our

face

to the

lexical

with

provides

personal

with

information

var-

lexical

user’s

domain

expert’s [4].

On

do not

involve

they

and

We

believe

and easy-to-use

inter-

management

this

such

paper,

heavily

In our ing

daily

The

an address

diagnosis. tion

lives,

often.

in

book,

Keeping the

range

sense

the

task.

tools,

lexical

an address that

in

alphabetical

numbers

as

“travel

medical

diagnosis

because

of symptoms

we not

classifica-

only

write

findings

grouping

from

of causal

rea-

classification

tasks

if any.

Instead,

and commonsense classification,

it

classification

lot

of commonsense

group

part

phone

on

simple

classification

tasks

simple

classification

relies

we suggest

in

tasks

a semantic found,

In

however,

being

It

here

is well

knowledge

that

necessary

limited

and

by extending

because require

a

understood

is a formidable

however,

knowledge

au-

a generic

purpose

considered

new

commonly

that

our

than this

for the

lexicon,

to serve

We believe,

of a thesaurus

automation

necessary

knowledge.

is fairly

a lexicon-based

its

approach.

commonsense [3].

task

for

knowledge

We

the

sification

for

can

the

amount

a simple

clas-

be encoded

as a

its expressiveness.

certain In this

Also,

of our

age of computerized information

formation

combinations

ing and

(Artificial Intelligence), tasks has been mostly

right notice, the title of the publication and its date appear, and notice is given that copyright is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires specific


We want Much

not just

a simple following is to

constructing sification nal

time

on-line

in order task

the have

guidelines

warrants

for that

simple

is

internal

classification. interface requires

to have family

of au-

however,

has some user

information chosen

inand

of creat-

by means

information,

information

for

that

the process

It usually

We

much

of on-line

on its creation

an intelligent

an on-line

organization.

is spent information

certain

provide

we have

amount

to facilitate

collection.

Our

goal

As the

on-line of our

structure

a specific

..$3.50

more

maintaining

tomation.

information,

on-line.

grows,

maintenance.

a particular

Permission to make digital/hard copies of all or part of this material for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copy-

permission and/or fee. IUI 97, Orlando, Florida, USA. © 1997 ACM 0-89791-839-8/96/01

in

complex

Since

sufficient

of commonsense

as a classification certain

as coming

cause. Within the field of AI automating simple classification

also

down

of the

is not

encoding

“restaurants.”

the

encoded

solution.

rule-based

a thesaurus.

in itself

involves

can be understood

it involves and

called

task

book we

suc-

thus,

for

chains

it a simple

suitable

is encoded

that

as medical

and

and

inference,

semantics,

is more

much

tasks

such

order,

agents”

one,

focus

aforementioned

one, like keep-

classification

a simple

to a complex

numbers

task

with

from

diagnomost

been

semantics

we call

the

suitable

simple

to a simple

we

that

thesaurus

we are faced

tasks

were

causal

keeping.

on lexical

approach

INTRODUCTION very

received

as medical

problem

require

on lexical

itself

as book

tomation

knowledge-based

has

been

has usually

hand,

much

lend

the

as they

Though

not

such have

systems

other

rely

knowledge.

In

actions

items,

Keywords classification, inheritance

solving

knowledge

the

heavily

does

to

tasks

soning.

tasks systems

Rule-based

approach,

Automatic semantics,

ones

informathe

knowledge.

a simple

was

it neces-

as the

about

certain

of commonsense

system

content we felt

personal

knowledge

approach

classification

back-

expert

cessful

the

semantic

complex

classification

rule-based

rules

for

such

criteria,

user,

for

automating

For complex

a classical

database

information

associated

kinds

that

a basis

information

existing

classification

interface

constitutes

In particular,

of contextual

preference

its

sufficient.

ious

certain

database

so-

classifica-

Although

as it provides

we found

sary

user

management.

system

being

simple

an intelligent

of a lexical

for

a lexicon-based

of automating

information

organization bone

We propose

whilst attention.

for clas-

an inter-

accounting

classification,

as be-

cause

its

the same

automation

is feasible

time,

for the everyday

useful

application

interface

FAA

FAA

a simple

natural

takes

family its

income

records

In the tion

and

following

are analyzed cessing)

by

tion

In

relies

thesaurus

four

ways.

and

matic

the

illustrate

classification

to FAA

describe the

the

in

augmented

the-

supports

1997.

TRANSPORTATION

unit,

of its income

and

expenses,

omy

as to

as well

penditure

[2].

it is important

help

unit

Inputs

to

guage

only.

a balanced for

its

matter

which

HOUSING,

PUBLIC

EDUCATION,

SOCIAL

RECREATION,

and

MISCELLANEOUS,

his/her

thirteen

categories,

it

might

the record into a work a spreadsheet program. t age in using nance. matic It

is, however,

to customize Even still

as well a very

if it is already has to know thirteen

FAA

relieve

helps the

program natural that

case that for

there

The

exists

simply

user needs.

the

user

all these

not

expense

into

not might

overheads.

customizes It

not

to

know

even

also

It is

provides

a

program,

how

have

to

so

use

to know

the that

program. In order to hide the from the user, FAA automatically input into one of the thirteen catewhen “bus fare, $2.00” is given as enters this information in the appro-

a spreadsheet

(International

Labor

Organization)

as an input

change

no

medium.

LANGUAGE

to

isolate

a head

suggests

noun

definition, phrase:

man

with

items

from

a head

“man”

phrase

way

noun

we need

input.

According

is what

makes

noun

Although to decide

np is testing

is a kind

automatically,

the

is a head

a book.”

a useful

“np

PROCESSING

input

if the

to a

a phrase

in an expression

there

is no clear-cut

if h is a head following

noun

in

sentence

is

of h.”

According

to repair

unit

this

definition,

is repair.

is a potential

a head

For FAA, head

in

an

however,

expression any mean-

if it has a categorization

tag

using

only

principle.

must be noted that our definition of a head includes not only a set of nouns, but any meaning unit as well. In particular, we allow certain suffixes to be a head because some suffixes often carry a significant meaning in Korean and are only attached to a noun.
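The definition of a head as any meaning unit carrying a categorization tag, noun or suffix alike, can be sketched as a small lookup. This is only an illustrative sketch: the `MeaningUnit` type, the `LEXICON` table, and the `find_heads` function are hypothetical names, and the "-fare" suffix entry with its tag is invented for illustration (only the clock and repair entries reflect the text).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class MeaningUnit:
    text: str
    kind: str                  # "noun" or "suffix"
    tag: Optional[str] = None  # categorization tag, or None if the unit has none


# Hypothetical mini-lexicon. "clock" and "repair" follow the paper's example;
# the "-fare" suffix and its tag are assumptions made for this sketch.
LEXICON = {
    "clock": MeaningUnit("clock", "noun", "HOUSEHOLD GOODS"),
    "repair": MeaningUnit("repair", "noun", None),
    "-fare": MeaningUnit("-fare", "suffix", "TRANSPORTATION"),
}


def find_heads(units: List[str]) -> List[MeaningUnit]:
    """Return every known meaning unit that carries a categorization tag.

    Both nouns and suffixes qualify; untagged or unknown units are skipped.
    """
    heads = []
    for text in units:
        unit = LEXICON.get(text)
        if unit is not None and unit.tag is not None:
            heads.append(unit)
    return heads
```

For example, `find_heads(["clock", "repair"])` yields only the `clock` unit, since `repair` has no tag. Treating suffixes as ordinary lexicon entries keeps the head test uniform across unit kinds.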

a spreadsheet

to a spreadsheet have

in-

does not

It

user

program,

every

accounting.

interface does

spreadsheet program classifies the user’s gories. For instance, an input to FAA, it 1ILO

classify

auto-

his/her

appropriately,

the user from

household

user

program.

must

mainte-

categories.

FAA

language the

for

to fit

lan-

accounting

and

forms

task

program

is used

to

a naive

make

provide

to use a spreadsheet

he/she

one of the

update

as diverse

tasks

Korean

an intelligent

of summary.

these

to

idea

programs

customized

how

that

easy

challenging

a spreadsheet

to mention

not

for

spreadsheet

computation

a good

wants

into

sheet that can be processed by There certainly is an advan-

a computer

In addition,

TAX, GOODS,

a user

classified

be

restricted

of providing

associated, where each categorization tag is a label for one of the thirteen expenditure categories. For example, clock is a head in the expression clock repair, as it is tagged with the label HOUSEHOLD GOODS. We associate the tag with clock because it is important to know that the money was spent on a clock. On the other hand, repair does not have any categorization tag, because we can hardly decide which category is right just from the fact that the money was spent on some kind of repair. In family accounting, expenses made for a certain object all belong to the same category, whether the expense was in fact a purchase, a repair, or an upgrade of the object. We call this criterion a family

CARE,

When

expenditure

idea

language

to classify

clock

HOUSEHOLD

SAVINGS1.

daily

fare

ex-

to have all

TRANSPORTA-

PERSONAL/MEDICAL

EXPENSES,

to record

UTILITY,

7.

classification

order

true:

Let us suppose that we want to classify our household expenses into the following thirteen categories: FOOD, CLOTHING,

1.

currently

the

In

ing

TION,

are

But

to simple

a noun

econ-

future

however, and categorized.

It is time-consuming,

the expenses recorded

FAA

terface

a noun

to keep the records

plan

as follows:

$2.00

definition,

to practice

the

bus

item:

linguistic

auto-

AGENT

For any economic

classification

category:

NATURAL

an ordi-

accounting.

ACCOUNTING

right

date:

“a

FAMILY

the

knowledge

is augmented

thesaurus

family

the

classifica-

lexical

thesaurus

pro-

follows

automatic our

We

with

classifica-

of information

how for

keeps

language

of which

FAA’s

our

the

expressions

(natural

kinds

provides,

different

saurus

).

regarding and

worksheet

amount:

describe

of NLP

task.

to

priate

expense

Input

on a thesaurus,

addition

at this

Agent

memo

as an input

illustration

on FAA’s

heavily

nary

an

Accounting

we first

a series

and,

We call

manner.

by FAA.

modules,

description base.

section,

user.

language

expenditure

performed

challenging

(Family

in an appropriate

task

but

4

food, clothing, housing, and miscellaneous. We believe that our 13 categories are specific enough so that some categories can later be merged.

categories:


In this paper, inputs are limited to noun phrases consisting of a sequence of nouns. For FAA, the goal of NLP is to extract a head from an input expression, which is achieved in three stages. First, a collection of low-level text specialists, designed to recognize numeric expressions, works on the user’s input to identify dates and revenue objects. Second, a morphological analyzer transforms each word into a sequence of meaning units. For instance, a Korean word for “barbershop” is transformed into a noun for “haircut” followed by a second meaning unit [Korean text and its gloss illegible in source]. Finally, a semantic analyzer tries to isolate a head from the sequence of meaning units. The semantic analyzer is closely coupled with the automatic classifier and its augmented thesaurus, so it can be viewed as a part of the automatic classifier.
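The three stages described above can be sketched as a small pipeline. This is a rough illustration under stated assumptions, not FAA's implementation: the function names, the `MORPHOLOGY` and `TAGS` tables, the English glosses standing in for Korean words, the "-shop" unit, and the sample input amounts are all hypothetical. Only amounts are recognized in stage 1 (dates are omitted), and conflicts between several tagged units are not resolved here.

```python
import re

# Stage 1 helper: a numeric "specialist" that recognizes a money amount.
AMOUNT = re.compile(r"\$?(\d+(?:\.\d{1,2})?)")

# Hypothetical morphological table mapping a word to its meaning units.
# English glosses stand in for Korean; "-shop" is an assumed suffix gloss.
MORPHOLOGY = {"barbershop": ["haircut", "-shop"]}

# Hypothetical categorization tags; "clock" follows the paper's example.
TAGS = {"clock": "HOUSEHOLD GOODS"}


def extract_amount(text):
    """Stage 1: split the input into (remaining text, amount or None)."""
    match = AMOUNT.search(text)
    if match is None:
        return text.strip(" ,"), None
    rest = (text[:match.start()] + text[match.end():]).strip(" ,")
    return rest, float(match.group(1))


def to_meaning_units(text):
    """Stage 2: expand each word into its sequence of meaning units."""
    units = []
    for word in text.split():
        units.extend(MORPHOLOGY.get(word, [word]))
    return units


def isolate_heads(units):
    """Stage 3: keep the units that carry a categorization tag."""
    return [u for u in units if u in TAGS]


def process(text):
    """Run the three stages in order and collect their results."""
    rest, amount = extract_amount(text)
    units = to_meaning_units(rest)
    return {"amount": amount, "units": units, "heads": isolate_heads(units)}
```

With these tables, `process("clock repair, $15.00")` separates the amount from the expression and isolates `clock` as the only head, because `repair` carries no tag.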

we consider

But,

for the purpose

it as a separate

NLP

of illustration,

antonymy,

module.

hypernymy,

contains

only

We will The

semantic

match

first

a user’s

input

between

thesaurus.

If the search

for each head the

analyzer

meaning

noun

the

information.

conflicting times,

there

are

the

last

difficulty

exceptions

these

such noun

to

exceptions

is a potential

as “bread

rule

follow

user

the

case

knife.”

was for

We “clothes”

is not

head.

must

When

these

other

potential

there

is a potential for to

cleaning”

must

the

“carpet”

assumption type

of conflict,

have

implicit

by

targets

over

ation

Another

pattern

“glass.”

Each

these

for

paper”.

Our

“carpet

word

means

and

When

them last

like

other

it

from word

objects

“paper”

of

and which

word

in the

hierarchy

in

of

following

tomatically mentation

is made

we will

user to the

inputs

and

how

In

constructing

amount

of semantic

to be a very 6000

a system

overall

process

general

rule

words

was

four

kinds

tool.

rule

ways.

First,

as our

of relations

au-

of aug-

between

content

is not

ficulties,

we

generic

word

vehicle

tricycle,

this

tag.

we

can

use the the

For with

which

case,

than

is a hy-

simply

tag

put

associated

one with

vehicle.

classification But

for

four

thesaurus

in two

of multiple

the

different

inheritance, Another

is that

FAA.

that

and the inher-

for FAA

kinds

follows

we find

tree-structured.

sufficient made

its

diffi-

information

To overcome

these

dif-

augmentation

to

the

of

thesaurus.

first

at ion type

of augmentation and

problem.

In

glove

FAA,

results

-+ handwear

clothing

with item

and

if those

vehicle,

they

we prefer

the

Other

types

generic

fol-

[1]. At contains

2X of

X.

Ofx.

synonymy,


are

level

is a holonym

of

sports

Y Y

FAA

+ tags

equipment

with

tags

a sin-

potential tagged

for

words

by hypernymy

are made

insufficiency of

glove glove

are not decide and

links,

thus

tag.)

of augmentations

is a hypernym

links

batting

batting equipment.

two

in-

hypernymy

links, then we cannot (In the case of tricycle

connected

lower

thesaurus’

X

and are two

statistical

a multiple

paths:

sports

CLOTHING

using

solve

following

in two

*

If there

lexical

to

-+ clothing;

equipment

RECREATION.

involves

is introduced

a fair

pair:

the

thesaurus

conduct

is known

a word

level

and

is not

the

AUG-

research,

not

levels

lower

is a problem

USING

of about

this

or

we can

multiple

the

of automatic

connected by hypernymy which is the right tag.

organization in WordNet

at

are insufficient

there

thesaurus

of batting

thesaurus

as part

in its dictionary noun dictionary

is en-

HOUSEHOLD

hierarchy,

tagging

of a generic

alone

with

when

a thesaurus

A Korean

constructed

to

with

of inheritance.

hierarchy

itance

gle

has

understanding,

useful

lowing WordNet the moment, the

that

hierarchy

by

to tricycle

this

heritance

thesaurus.

AUTOMATIC CLASSIFICATION MENTED THESAURUS

we would

we do not tag chair

to tag

In

The

words

FAA

kind

in-

chairs

for furniture

inheritance

preferably

information,

case of “wall

what

are words

furniture

want but

tricycle

--+ glove

describe

using that

a

in a compound.

section,

classifies

classifier

of furniture,

preferring

vehicle.

baseball In the

tag

is exbecause

telling

on semantic

any

and

tag

Augment

heads

kinds as there

exceptions

we might

The

of candidate

by the

hypernyms,

with

encode

generic

are raw

for these

some tags

only

is common

that out

are

but

in categorization,

as in the looks

the list

ex-

kind

they

thepairs.

a thesaurus

a thesaurus

information

easily

culty

last

word

classification

be exploited

in each word’s

semantic

nouns

from

is the

a conflict

analyzer

this

material

used to specify

semantic

and removes

some

causes

resolve

thesaurus.

words

generic

GOODS.

with

ob-

is one

automatic

the

in

automatic

Without

as many

a different

implicit

verbal

FAA

As the

FAA’s

between

encoded

for

can

As the

coded

how

as Inheritance

[5].

items.

for

of relation

hierarchy

important

to put

describe

classification.

this

by preferring

This

thesaurus antonymy.

thesaurus.

supplemented

kinds

bookshelves

have

first for

sufficient

have

new

semantic

ponym

its target

can

these

then

generic

our

that

that ones.

are made.

they

so that We

and

In

The

we will

is not

TRANSPORTATION,

with

carpet.

namely

to the

certain

are the

principle,

from

specified

implicit

are usually

material

HOUSING.

has to do with

noun

words

with

specifying

made

objects

compound

cleaning

with

example,

“carpet

be associated

tag

objects,

noun,

an expression

saurus

As

with

may

be overridden.

target

together

a compound

accounting its

“cleaning,” can

augment

other

categorization

and

is explicitly about

is a house,

plicit

proper

Given

family

retrieve

as

with

are used

Our except

as a generic

thesaurus

we

bookshelf,

noun such

by themselves

carpet

GOODS

it

objects

to form

conflict.

HOUSEHOLD

case,

nouns

whether nouns,

thesaurus

and

by is

target nouns

example,

According

ject

verbal

verbal

generic

heritance

pat-

case, the verbal

some

their

head

“washing”

In this

be associated

tags.

cleaning,”

from

But

do indicate

thus,

a tag

tell

or “cars.”

a potential

“parking,” and

cannot

generic

its hierarchy

but

found

One

the

tremely

Most

We

patterns.

of

tern has to do with verbal nouns. Most verbal nouns themselves do not tell us on which object the action performed.
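One way to handle verbal nouns that do not name their object is to link each action to its usual target objects and fall back to the target's tag when no explicit object appears. The sketch below assumes this reading of the text: the links from cleaning to house and from subscription to newspaper and magazine come from the paper, while the tag table, the function name `classify`, and the rule that an explicitly named object overrides the implicit target are assumptions made for illustration.

```python
# Assumed categorization tags for a few object nouns (illustrative only).
TAGS = {"house": "HOUSING", "carpet": "HOUSEHOLD GOODS"}

# Links from an action (verbal noun) to its usual target objects.
ACTION_LINKS = {
    "cleaning": ["house"],
    "subscription": ["newspaper", "magazine"],
}


def classify(units):
    """Return a category tag for a sequence of meaning units, or None.

    An explicitly named, tagged object wins. Otherwise the implicit target
    objects of an action are consulted, and a tag is returned only when
    those targets agree on exactly one tag.
    """
    explicit = [TAGS[u] for u in units if u in TAGS]
    if explicit:
        return explicit[0]
    for unit in units:
        linked = {TAGS[t] for t in ACTION_LINKS.get(unit, []) if t in TAGS}
        if len(linked) == 1:
            return linked.pop()
    return None
```

Under these assumptions, `classify(["cleaning"])` falls back to the tag of house, while `classify(["carpet", "cleaning"])` uses the tag of the explicitly named carpet instead.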

thesaurus

Classification

for

is a head,

of thumb.

certain

an as-

of deciding

in

in a sequence

this

has

to the

lies

meronymy2.

of relations

section,

utilizes

classification,

we consider

no way

resort

this

In the following

in the

searches

If no unit

we have

The

entries

agree,

call

kinds

maximal

separate

tags

we must

categorization,

of the that

tag, thus

the

one of them

to be successful.

category,

further

it makes

If only

categorization

right

find

and the word

fails,

unit.

to

or if all categorization

classification

sociated

tries

and

three

(Y (Y

is is

because

of information a hyponym a meronym

of the content.

of

X)

if

Y

is

of

X)

if

Y

is a part

a kind

The

first

tries

in the generic

word

augmentation

senses.

tween

Secondly,

actions

type

of link

as post types

for

a new

one over lexical

the

and

context ous

in

mind.

easily

encode

that

is not

clear.

sociated

class

after

it

the hard

a spectrum

of preference,

link

a frequency

hypernymy

with

selects one link on the selected

over link.

have to take

When

contextual

ambiguation. tual

alone.

we can

the

information.

amount

of money

provides

associated

with

with

for

a single

test

words

found

for

rental,

Our

knowledge

kind

of purchase and

Similarly, the

or

the

can

help

can

help

help

pencils

of money

cer-

we have

spent

for

each

Objects

common

categorization.

For

make

it

gory.

For example,

equipment

for

but

the

us with

any

user

which

in

for

to MISCELLANEOUS

to

telephone

clue for sense

decide

could

communication amount

3 g.he~e ~~age~ of the language

FAA

words

of actions target

as their

spent

disambiguation.

word

mean

used “video”

does FAA

the were

word.

the

is used

to over-

is performed

on the

case, must

be

as links

right

must

this

paper,

input

performed location

office

$2.79.” it must

locations

We encode words

vertheir

correctly,

at these

if any.

we propose

this

and their

a lexicon-based

classification. classification While

This

as infor-

actions

tasks the

that

current

do not involve

provides

a backbone

we found

that

its information

ficient

for

achieving

our

goal,

cation

for

family

means

for that

accounting.

for

semantic

a natural

language

supports

automatic

for

classification,

current

interfaces

et.

Five

by

classifi-

four

kinds

providing By

of var-

providing

accounting

we have

in-

is insuf-

of automatic

family

rea-

on-line

content

disambiguation.

interface

an

classification

thesaurus,

for

causal

of

We propose

existing

to

is suitable

organization

thesaurus

to the

approach

approach

heritance,

1. G.

Miller

cal

Report

Reports,

omy.

user and

that

significantly

to personal

information

al.

Cognitive Princeton

Edward

papers

Univ., and

on wordnet.

Science

Techni-

Laboratory

Technical

New

Jersey,

1993.

F. Williams.

The

Family

Brothers,

Inc.,

Ann

Arbor,

Econ-

Michigan,

1973. 3. D.

Lenat

and

R.

Knowledge-Based it-

4.

provide

Edward

H. Shortliffe.

szdtat ions:

ask the

obs.erwxl ~0%

as “post

this

with

to specify

REFERENCES

be

cate-

For

such

objects,

between

simple

an electronic not

Similarly users

objects.

soning.

cate-

or telecommunication

of money he/she

the

of a

Then,

in this

for

to classify

well

a child,

attachment as a mechanism for words are genuinely ambiguous

impossible

are used

is a part

inare

must

without

FAA

enhanced the management.

about

notebooks

location

for

2. C. Fitzsimmons

FAA also uses test prompting. Certain

rean

and

is very

be informed

ious

of

gory.

self,

to Actions

augmentation

meaning

decision.

a family

belong

helpful

omit

recorder3.

and

action noun,

performed

words

noun

by

actions

CONCLUSION

statements

and the expense

For

notebooks

right

the

a head

it

with

automatic

information

pencils

children

as EDUCATION. and

the

Link

nouns,

our

the

that

word.

mation

a videotape,

the right

of personal

us make for

the

or event

examples,

amount

assumption

In order

we

and

often

us determine

has children,

to be bought

classified

usual

a verbal

non-action

to exploit

to mean

when

Hence,

In

a videotape

us make

knowledge

if a user

world

used

even

about thus

our

user

likely

In real

is often

consider

object.

Place’s

we

for dis-

It is very

users

these

noun

are

amount

words.

because

brevity.

“video”

videotape

st ante,

polysemous

are in fact

when

compound

meanings

conditional

nouns

of the

the

of contex-

object

order

the

we have

attachments

that

“video”

about

item,

to certain

to have

In

verbal

part

our

the

objects

non-action linked

occurs,

spent

multiple

to

described

that

cleaning.”

ride

other,

kinds

a certain

word.

knowledge

a lexical

attached

for when

to assume

as “carpet

criterion we as-

these

objects

is safe

linked

the

But,

It

such

and

clue

functions

noun

bal

of money

spent

a good

commonsense spent

amount

prompting

compound

to imag-

account

two

user

record.

the

word

into

we adopt

personal

often

tain

information

Currently,

information:

user’s The

a polysemous

with

on or for

input Test Attachment

are

ambiguity.

nouns.

associated

a

obvi-

item

preference

attached

it can be used for classifying

compound

In the

and

is not

this

so that

the

all,

time.

it is often

tagging

However,

the user the count

is,

be preferred, by

differ-

to prefer use for

same

however,

cases where

To encode

each

Whenever increment

glove

at the

preference

are

a specific

batting

should

problem

to two

tests

semantic

Action’s Link to Objects

Another augmentation to the generic thesaurus is linking actions to their target objects. Links are added from cleaning to house, and from subscription to newspaper and to magazine. It might seem simpler to tag these verbal nouns with appropriate categories than to add links from actions to objects. But we found it useful to keep this information

recorded

such

on various

it is hard

we have

A

category. there

time,

accounting,

this

a preferred

locations

belongs

words, resolve

be-

a new

inheritance

item

equipment

super

added

description

same

unless

baseball

which

en-

among

is added

we

for specifying

lexical

at the

of family

of link

If multiple

other

item

glove

type

detailed

a single

classes

to lexical

follows.

Information

super

tests

disambiguation

Finally,

used

More

because

for

objects.

words

office.

Frequency

ine

and

of augmentation

arises ent

is attaching

thesaurus,

MYCIN.

V.

Systems.

Guha.

Building Addison-Wesley,

Computer-based American

Elsevier,

Large 1989.

Medical New

ConYork,

1976.

these 5.

Ko-

David

S. Touretzky.

Systems.

speakers.


Morgan

The Mathematics Kaufmann,

California,

of Inheritance 1986.
