Automating a Classification Task Based on an Augmented Thesaurus
Eunok Pack, Hye-Jeong Jeon
Information Technology Lab., LG Electronics Research Center
16 Woomyeon-dong, Seocho-gu, Seoul 137-140, Republic of Korea
{pack, hjeong}@crown.crl.goldstar.co.kr
ABSTRACT
ignored
classification tasks that have been tackled for automation are ones involving complex chains of causal reasoning. However, our daily lives are filled with sim-
much
ple
sis,
Most
classification
lution
to
tions,
and
tasks.
the
personal
problem
hence
inherit far
provide
our
ante,
from
that
to augment kinds
the
for
tion
regarding
and
objects
the
our
face
to the
lexical
with
provides
personal
with
information
var-
lexical
user’s
domain
expert’s [4].
On
do not
involve
they
and
We
believe
and easy-to-use
inter-
management
this
such
paper,
heavily
In our ing
daily
The
an address
diagnosis. tion
lives,
often.
in
book,
Keeping the
range
sense
the
task.
tools,
lexical
an address that
in
alphabetical
numbers
as
“travel
medical
diagnosis
because
of symptoms
we not
classifica-
only
write
findings
grouping
from
of causal
rea-
classification
tasks
if any.
Instead,
and commonsense classification,
it
classification
lot
of commonsense
group
part
phone
on
simple
classification
tasks
simple
classification
relies
we suggest
in
tasks
a semantic found,
In
however,
being
It
here
is well
knowledge
that
necessary
limited
and
by extending
because require
a
understood
is a formidable
however,
knowledge
au-
a generic
purpose
considered
new
commonly
that
our
than this
for the
lexicon,
to serve
We believe,
of a thesaurus
automation
necessary
knowledge.
is fairly
a lexicon-based
its
approach.
commonsense [3].
task
for
knowledge
We
the
sification
for
can
the
amount
a simple
clas-
be encoded
as a
its expressiveness.
certain In this
Also,
of our
age of computerized information
formation
combinations
ing and
(Artificial Intelligence), tasks has been mostly
We want Much
not just
a simple following is to
constructing sification nal
time
on-line
in order task
the have
guidelines
warrants
for that
simple
is
internal
classification. interface requires
to have family
of au-
however,
has some user
information chosen
inand
of creat-
by means
information,
information
for
that
the process
It usually
We
much
of on-line
on its creation
an intelligent
an on-line
organization.
is spent information
certain
provide
we have
amount
to facilitate
collection.
Our
goal
As the
on-line of our
structure
a specific
more
maintaining
tomation.
information,
on-line.
grows,
maintenance.
a particular
IUI 97, Orlando, Florida, USA. © 1997 ACM 0-89791-839-8/96/01
in
complex
Since
sufficient
of commonsense
as a classification certain
as coming
cause. Within the field of AI automating simple classification
also
down
of the
is not
encoding
“restaurants.”
the
encoded
solution.
rule-based
a thesaurus.
in itself
involves
can be understood
it involves and
called
task
book we
suc-
thus,
for
chains
it a simple
suitable
is encoded
that
as medical
and
and
inference,
semantics,
is more
much
tasks
such
order,
agents”
one,
focus
aforementioned
one, like keep-
classification
a simple
to a complex
numbers
task
with
from
diagnomost
been
semantics
we call
the
suitable
simple
to a simple
we
that
thesaurus
we are faced
tasks
were
causal
keeping.
on lexical
approach
INTRODUCTION
very
received
as medical
problem
require
on lexical
itself
as book
tomation
knowledge-based
has
been
has usually
hand,
much
lend
the
as they
Though
not
such have
systems
other
rely
knowledge.
In
actions
items,
Keywords classification, inheritance
solving
knowledge
the
heavily
does
to
tasks
soning.
tasks systems
Rule-based
approach,
Automatic semantics,
ones
informathe
knowledge.
a simple
was
it neces-
as the
about
certain
of commonsense
system
content we felt
personal
knowledge
approach
classification
back-
expert
cessful
the
semantic
complex
classification
rule-based
rules
for
such
criteria,
user,
for
automating
For complex
a classical
database
information
associated
kinds
that
a basis
information
existing
classification
interface
constitutes
In particular,
of contextual
preference
its
sufficient.
ious
certain
database
so-
classifica-
Although
as it provides
we found
sary
user
management.
system
being
simple
an intelligent
of a lexical
for
a lexicon-based
of automating
information
organization bone
We propose
whilst attention.
for clas-
an inter-
accounting
classification,
as be-
cause
its
the same
automation
is feasible
time,
for the everyday
useful
application
interface
FAA
FAA
a simple
natural
takes
family its
income
records
In the tion
and
following
are analyzed cessing)
by
tion
In
relies
thesaurus
four
ways.
and
matic
the
illustrate
classification
to FAA
describe the
the
in
augmented
the-
supports
1997.
TRANSPORTATION
unit,
of its income
and
expenses,
omy
as to
as well
penditure
[2].
it is important
help
unit
Inputs
to
guage
only.
a balanced for
its
matter
which
HOUSING,
PUBLIC
EDUCATION,
SOCIAL
RECREATION,
and
MISCELLANEOUS,
his/her
thirteen
categories,
it
might
the record into a work a spreadsheet program. t age in using nance. matic It
is, however,
to customize Even still
as well a very
if it is already has to know thirteen
FAA
relieve
helps the
program natural that
case that for
there
The
exists
simply
user needs.
the
user
all these
not
expense
into
not might
overheads.
customizes It
not
to
know
even
also
It is
provides
a
program,
how
have
to
so
use
to know
the that
program. In order to hide the from the user, FAA automatically input into one of the thirteen catewhen “bus fare, $2.00” is given as enters this information in the appro-
a spreadsheet
(International
Labor
Organization)
as an input
change
no
medium.
LANGUAGE
to
isolate
a head
suggests
noun
definition, phrase:
man
with
items
from
a head
“man”
phrase
way
noun
we need
input.
According
is what
makes
noun
Although to decide
np is testing
is a kind
automatically,
the
is a head
a book.”
a useful
“np
PROCESSING
input
if the
to a
a phrase
in an expression
there
is no clear-cut
if h is a head following
noun
in
sentence
is
of h.”
According
to repair
unit
this
definition,
is repair.
is a potential
a head
For FAA, head
in
an
however,
expression any mean-
if it has a categorization
tag
using
only
principle.
must be noted that our definition of a head includes not only a set of nouns, but any meaning unit as well. In particular, we allow certain suffixes to be a head because some suffixes often carry a significant meaning in Korean and are only attached to a noun.
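The head criterion can be illustrated with a minimal sketch, assuming a meaning unit is a potential head exactly when it carries a categorization tag. The table `CATEGORY_TAGS` and the functions `potential_heads` and `classify` are hypothetical names introduced here; only the "clock" entry and the untagged "repair" follow the paper's own example, and the agreement rule in `classify` is a simplifying assumption.

```python
from typing import Optional

# Hypothetical tag table: meaning unit -> expenditure category.
# "clock" -> HOUSEHOLD GOODS follows the paper's example; "repair" is
# deliberately left untagged, as in the paper.
CATEGORY_TAGS: dict[str, str] = {
    "clock": "HOUSEHOLD GOODS",
}


def potential_heads(units: list[str]) -> list[str]:
    """Return the meaning units that carry a categorization tag."""
    return [u for u in units if u in CATEGORY_TAGS]


def classify(units: list[str]) -> Optional[str]:
    """Return a category only if all tagged units agree on a single one.

    Assumption: with no tagged unit, or with conflicting tags, this sketch
    gives up and returns None rather than resolving the conflict.
    """
    tags = {CATEGORY_TAGS[u] for u in potential_heads(units)}
    return tags.pop() if len(tags) == 1 else None
```

For the expression "clock repair", `potential_heads(["clock", "repair"])` keeps only "clock", so the expense is classified by the object the money was spent on rather than by the action.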
a spreadsheet
to a spreadsheet have
in-
does not
It
user
program,
every
accounting.
interface does
spreadsheet program classifies the user’s gories. For instance, an input to FAA, it 1ILO
classify
auto-
his/her
appropriately,
the user from
household
user
program.
must
mainte-
categories.
FAA
language the
for
to fit
lan-
accounting
and
forms
task
program
is used
to
a naive
make
provide
to use a spreadsheet
he/she
one of the
update
as diverse
tasks
Korean
an intelligent
of summary.
these
to
idea
programs
customized
how
that
easy
challenging
a spreadsheet
to mention
not
for
spreadsheet
computation
a good
wants
into
sheet that can be processed by There certainly is an advan-
a computer
In addition,
TAX, GOODS,
a user
classified
be
restricted
of providing
associated, where each categorization tag is a label for one of the thirteen expenditure categories. For example, clock is a head in the expression “clock repair,” as it is tagged with the label HOUSEHOLD GOODS. We associate the tag with clock because it is important to know that the money was spent on a clock. On the other hand, repair does not have any categorization tag, because we can hardly decide which is the right category just from the fact that the money was spent on some kind of repair. In family accounting, expenses made for a certain object all belong to the same category, whether the expense was in fact a purchase, a repair, or an upgrade of the object. We call this criterion a family
CARE,
When
expenditure
idea
language
to classify
clock
HOUSEHOLD
SAVINGS1.
daily
fare
ex-
to have all
TRANSPORTA-
PERSONAL/MEDICAL
EXPENSES,
to record
UTILITY,
7.
classification
order
true:
Let us suppose that we want to classify our household expenses into the following thirteen categories: FOOD, CLOTHING,
1.
currently
the
In
ing
TION,
are
But
to simple
a noun
econ-
future
however, and categorized.
It is time-consuming,
the expenses recorded
FAA
terface
a noun
to keep the records
plan
as follows:
$2.00
definition,
to practice
the
bus
item:
linguistic
auto-
AGENT
For any economic
classification
category:
NATURAL
an ordi-
accounting.
ACCOUNTING
right
date:
“a
FAMILY
the
knowledge
is augmented
thesaurus
family
the
classifica-
lexical
thesaurus
pro-
follows
automatic our
We
with
classifica-
of information
how for
keeps
language
of which
FAA’s
our
the
expressions
(natural
kinds
provides,
different
saurus
).
regarding and
worksheet
amount:
describe
of NLP
task.
to
priate
expense
Input
on a thesaurus,
addition
at this
Agent
memo
as an input
illustration
on FAA’s
heavily
nary
an
Accounting
we first
a series
and,
We call
manner.
by FAA.
modules,
description base.
section,
user.
language
expenditure
performed
challenging
(Family
in an appropriate
task
but
4
food, clothing, housing, and miscellaneous. We believe that our 13 categories are specific enough so that some categories can later be merged.
categories:
In this paper, inputs are limited to noun phrases consisting of a sequence of nouns. For FAA, the goal of NLP is to extract a head from an input expression, which is achieved in three stages. First, a collection of low-level text specialists, designed to recognize numeric expressions, works on a user’s input to identify dates and revenue objects. Second, a morphological analyzer transforms each word into a sequence of meaning units. For instance, the Korean word for “barbershop” is transformed into a sequence of meaning units, one of which is a noun for “haircut” (the Korean text and the rest of this example are illegible in the source). Finally, a semantic analyzer tries to isolate a head from a sequence of meaning units. The semantic analyzer is closely coupled with the automatic classifier and its augmented thesaurus, so it can be viewed as a part of the
automatic
classifier.
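The three stages can be sketched roughly as follows. This is an illustrative toy rather than FAA's implementation: English words stand in for Korean, the tables `MORPHEMES` and `TAGS` and all function names are hypothetical, the category assigned to "haircut" is an assumption, and the semantic stage here simply picks the single tagged unit instead of applying the paper's conflict rules.

```python
import re
from typing import Optional

# Stage 1 helper: a tiny "text specialist" for dollar amounts.
AMOUNT_RE = re.compile(r"\$\s*(\d+(?:\.\d{1,2})?)")

# Hypothetical morphological table (English stand-ins for Korean morphemes).
MORPHEMES: dict[str, list[str]] = {"barbershop": ["haircut", "place"]}

# Hypothetical categorization tags; the category for "haircut" is assumed.
TAGS: dict[str, str] = {
    "bus": "TRANSPORTATION",
    "haircut": "PERSONAL/MEDICAL CARE",
}


def extract_amount(text: str) -> tuple[Optional[float], str]:
    """Stage 1: pull out the amount and return it with the remaining phrase."""
    m = AMOUNT_RE.search(text)
    if m is None:
        return None, text.strip(" ,")
    rest = (text[:m.start()] + text[m.end():]).strip(" ,")
    return float(m.group(1)), rest


def to_meaning_units(phrase: str) -> list[str]:
    """Stage 2: split each word into meaning units via the toy table."""
    units: list[str] = []
    for word in phrase.lower().split():
        units.extend(MORPHEMES.get(word, [word]))
    return units


def isolate_head(units: list[str]) -> Optional[str]:
    """Stage 3: return the head if exactly one unit carries a tag."""
    tagged = [u for u in units if u in TAGS]
    return tagged[0] if len(tagged) == 1 else None


def process(text: str) -> dict:
    """Run the three stages and build a worksheet-style record."""
    amount, phrase = extract_amount(text)
    head = isolate_head(to_meaning_units(phrase))
    category = TAGS.get(head) if head is not None else None
    return {"item": phrase, "amount": amount, "head": head, "category": category}
```

Calling `process("bus fare, $2.00")` yields a record with item "bus fare", amount 2.0, and head "bus"; inputs with zero or several tagged units are left unclassified by this sketch.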
we consider
But,
for the purpose
it as a separate
NLP
of illustration,
antonymy,
module.
hypernymy,
contains
only
We will The
semantic
match
first
a user’s
input
between
thesaurus.
If the search
for each head the
analyzer
meaning
noun
the
information.
conflicting times,
there
are
the
last
difficulty
exceptions
these
such noun
to
exceptions
is a potential
as “bread
rule
follow
user
the
case
knife.”
was for
We “clothes”
is not
head.
must
When
these
other
potential
there
is a potential for to
cleaning”
must
the
“carpet”
assumption type
of conflict,
have
implicit
by
targets
over
ation
Another
pattern
“glass.”
Each
these
for
paper”.
Our
“carpet
word
means
and
When
them last
like
other
it
from word
objects
“paper”
of
and which
word
in the
hierarchy
in
of
following
tomatically mentation
is made
we will
user to the
inputs
and
how
In
constructing
amount
of semantic
to be a very 6000
a system
overall
process
general
rule
words
was
four
kinds
tool.
rule
ways.
First,
as our
of relations
au-
of aug-
between
content
is not
ficulties,
we
generic
word
vehicle
tricycle,
this
tag.
we
can
use the the
For with
which
case,
than
is a hy-
simply
tag
put
associated
one with
vehicle.
classification But
for
four
thesaurus
in two
of multiple
the
different
inheritance, Another
is that
FAA.
that
and the inher-
for FAA
kinds
follows
we find
tree-structured.
sufficient made
its
diffi-
information
To overcome
these
dif-
augmentation
to
the
of
thesaurus.
first
at ion type
of augmentation and
problem.
In
glove
FAA,
results
→ handwear
clothing
with item
and
if those
vehicle,
they
we prefer
the
Other
types
generic
fol-
[1]. At contains
2X of
X.
Ofx.
synonymy,
are
level
is a holonym
of
sports
Y Y
FAA
+ tags
equipment
with
tags
a sin-
potential tagged
for
words
by hypernymy
are made
insufficiency of
glove glove
are not decide and
links,
thus
tag.)
of augmentations
is a hypernym
links
batting
batting equipment.
two
in-
hypernymy
links, then we cannot (In the case of tricycle
connected
lower
thesaurus’
X
and are two
statistical
a multiple
paths:
sports
CLOTHING
using
solve
following
in two
*
If there
lexical
to
→ clothing;
equipment
RECREATION.
involves
is introduced
a fair
pair:
the
thesaurus
conduct
is known
a word
level
and
is not
the
AUG-
research,
not
levels
lower
is a problem
USING
of about
this
or
we can
multiple
the
of automatic
connected by hypernymy which is the right tag.
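The inheritance of tags along hypernymy links, and the conflict that multiple inheritance can cause, can be sketched as below. The hypernym chains for "batting glove" and the tags CLOTHING and RECREATION follow the paper's example; the tags shown for "vehicle" and "tricycle" are assumptions chosen only to show a lower-level tag overriding a higher one, and `HYPERNYMS`, `TAGS`, and `inherited_tags` are hypothetical names. The sketch only detects a conflict; it does not implement the augmentations the paper uses to resolve it.

```python
# Hypothetical fragment of a thesaurus: word -> list of direct hypernyms.
HYPERNYMS: dict[str, list[str]] = {
    "batting glove": ["glove", "baseball equipment"],
    "glove": ["handwear"],
    "handwear": ["clothing"],
    "baseball equipment": ["sports equipment"],
    "tricycle": ["vehicle"],
}

# Categorization tags. The "vehicle" and "tricycle" tags are assumed here
# purely to illustrate that the nearest (lowest-level) tag wins.
TAGS: dict[str, str] = {
    "clothing": "CLOTHING",
    "sports equipment": "RECREATION",
    "vehicle": "TRANSPORTATION",
    "tricycle": "RECREATION",
}


def inherited_tags(word: str) -> set[str]:
    """Collect the nearest tag on every hypernym path above `word`.

    A tag on the word itself overrides anything inherited from above.
    More than one tag in the result signals a multiple-inheritance conflict.
    Assumes the hypernym graph is acyclic.
    """
    if word in TAGS:
        return {TAGS[word]}
    tags: set[str] = set()
    for parent in HYPERNYMS.get(word, []):
        tags |= inherited_tags(parent)
    return tags
```

With these tables, "glove" inherits a single tag through handwear and clothing, while "batting glove" reaches both CLOTHING and RECREATION through two paths that are not connected to each other by hypernymy, so inheritance alone cannot decide between them.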
organization in WordNet
at
are insufficient
there
thesaurus
of batting
thesaurus
as part
in its dictionary noun dictionary
is en-
HOUSEHOLD
hierarchy,
tagging
of a generic
alone
with
when
a thesaurus
A Korean
constructed
to
with
of inheritance.
hierarchy
itance
gle
has
understanding,
useful
lowing WordNet the moment, the
that
hierarchy
by
to tricycle
this
heritance
thesaurus.
AUTOMATIC CLASSIFICATION MENTED THESAURUS
we would
we do not tag chair
to tag
In
The
words
FAA
kind
in-
chairs
for furniture
inheritance
preferably
information,
case of “wall
what
are words
furniture
want but
tricycle
→ glove
describe
using that
a
in a compound.
section,
classifies
classifier
of furniture,
preferring
vehicle.
baseball In the
tag
is exbecause
telling
on semantic
any
and
tag
Augment
heads
kinds as there
exceptions
we might
The
of candidate
by the
hypernyms,
with
encode
generic
are raw
for these
some tags
only
is common
that out
are
but
in categorization,
as in the looks
the list
ex-
kind
they
thepairs.
a thesaurus
a thesaurus
information
easily
culty
last
word
classification
be exploited
in each word’s
semantic
nouns
from
is the
a conflict
analyzer
this
material
used to specify
semantic
and removes
some
causes
resolve
thesaurus.
words
generic
GOODS.
with
ob-
is one
automatic
the
in
automatic
Without
as many
a different
implicit
verbal
FAA
As the
FAA’s
between
encoded
for
can
As the
coded
how
as Inheritance
[5].
items.
for
of relation
hierarchy
important
to put
describe
classification.
this
by preferring
This
thesaurus antonymy.
thesaurus.
supplemented
kinds
bookshelves
have
first for
sufficient
have
new
semantic
ponym
its target
can
these
then
generic
our
that
that ones.
are made.
they
so that We
and
In
The
we will
is not
TRANSPORTATION,
with
carpet.
namely
to the
certain
are the
principle,
from
specified
implicit
are usually
material
HOUSING.
has to do with
noun
words
with
specifying
made
objects
compound
cleaning
with
example,
“carpet
be associated
tag
objects,
noun,
an expression
saurus
As
with
may
be overridden.
target
together
a compound
accounting its
“cleaning,” can
augment
other
categorization
and
is explicitly about
is a house,
plicit
proper
Given
family
retrieve
as
with
are used
Our except
as a generic
thesaurus
we
bookshelf,
noun such
by themselves
carpet
GOODS
it
objects
to form
conflict.
HOUSEHOLD
case,
nouns
whether nouns,
thesaurus
and
by is
target nouns
example,
According
ject
verbal
verbal
generic
heritance
pat-
case, the verbal
some
their
head
“washing”
In this
be associated
tags.
cleaning,”
from
But
do indicate
thus,
a tag
tell
or “cars.”
a potential
“parking,” and
cannot
generic
its hierarchy
but
found
One
the
tremely
Most
We
patterns.
of
tern has to do with verbal nouns. Most verbal nouns themselves do not tell us on which object the action performed.
thesaurus
Classification
for
is a head,
of thumb.
certain
an as-
of deciding
in
in a sequence
this
has
to the
lies
meronymy2.
of relations
section,
utilizes
classification,
we consider
no way
resort
this
In the following
in the
searches
If no unit
we have
The
entries
agree,
call
kinds
maximal
separate
tags
we must
categorization,
of the that
tag, thus
the
one of them
to be successful.
category,
further
it makes
If only
categorization
right
find
and the word
fails,
unit.
to
or if all categorization
classification
sociated
tries
and
three
(Y (Y
is is
because
of information a hyponym a meronym
of the content.
of
X)
if
Y
is
of
X)
if
Y
is a part
a kind
The
first
tries
in the generic
word
augmentation
senses.
tween
Secondly,
actions
type
of link
as post types
for
a new
one over lexical
the
and
context ous
in
mind.
easily
encode
that
is not
clear.
sociated
class
after
it
the hard
a spectrum
of preference,
link
a frequency
hypernymy
with
selects one link on the selected
over link.
have to take
When
contextual
ambiguation. tual
alone.
we can
the
information.
amount
of money
provides
associated
with
with
for
a single
test
words
found
for
rental,
Our
knowledge
kind
of purchase and
Similarly, the
or
the
can
help
can
help
help
pencils
of money
cer-
we have
spent
for
each
Objects
common
categorization.
For
make
it
gory.
For example,
equipment
for
but
the
us with
any
user
which
in
for
to MISCELLANEOUS
to
telephone
clue for sense
decide
could
communication amount
3 These usages of the language
FAA
words
of actions target
as their
spent
disambiguation.
word
mean
used “video”
does FAA
the were
word.
the
is used
to over-
is performed
on the
case, must
be
as links
right
must
this
paper,
input
performed location
office
$2.79.” it must
locations
We encode words
vertheir
correctly,
at these
if any.
we propose
this
and their
a lexicon-based
classification. classification While
This
as infor-
actions
tasks the
that
current
do not involve
provides
a backbone
we found
that
its information
ficient
for
achieving
our
goal,
cation
for
family
means
for that
accounting.
for
semantic
a natural
language
supports
automatic
for
classification,
current
interfaces
et.
Five
by
classifi-
four
kinds
providing By
of var-
providing
accounting
we have
in-
is insuf-
of automatic
family
rea-
on-line
content
disambiguation.
interface
an
classification
thesaurus,
for
causal
of
We propose
existing
to
is suitable
organization
thesaurus
to the
approach
approach
heritance,
1. G.
Miller
cal
Report
Reports,
omy.
user and
that
significantly
to personal
information
al.
Cognitive Princeton
Edward
papers
Univ., and
on wordnet.
Science
Techni-
Laboratory
Technical
New
Jersey,
1993.
F. Williams.
The
Family
Brothers,
Inc.,
Ann
Arbor,
Econ-
Michigan,
1973. 3. D.
Lenat
and
R.
Knowledge-Based it-
4.
provide
Edward
H. Shortliffe.
szdtat ions:
ask the
obs.erwxl ~0%
as “post
this
with
to specify
REFERENCES
be
cate-
For
such
objects,
between
simple
an electronic not
Similarly users
objects.
soning.
cate-
or telecommunication
of money he/she
the
of a
Then,
in this
for
to classify
well
a child,
attachment as a mechanism for words are genuinely ambiguous
impossible
are used
is a part
inare
must
without
FAA
enhanced the management.
about
notebooks
location
for
2. C. Fitzsimmons
FAA also uses test prompting. Certain
rean
and
is very
be informed
ious
of
gory.
self,
to Actions
augmentation
meaning
decision.
a family
belong
helpful
omit
recorder3.
and
action noun,
performed
words
noun
by
actions
CONCLUSION
statements
and the expense
For
notebooks
right
the
a head
it
with
automatic
information
pencils
children
as EDUCATION. and
the
Link
nouns,
our
the
that
word.
mation
a videotape,
the right
of personal
us make for
the
or event
examples,
amount
assumption
In order
we
and
often
us determine
has children,
to be bought
classified
usual
a verbal
non-action
to exploit
to mean
when
Hence,
In
a videotape
us make
knowledge
if a user
world
used
even
about thus
our
user
likely
In real
is often
consider
object.
Place’s
we
for dis-
It is very
users
these
noun
are
amount
words.
because
brevity.
“video”
videotape
st ante,
polysemous
are in fact
when
compound
meanings
conditional
nouns
of the
the
of contex-
object
order
the
we have
attachments
that
“video”
about
item,
to certain
to have
In
verbal
part
our
the
objects
non-action linked
occurs,
spent
multiple
to
described
that
cleaning.”
ride
other,
kinds
a certain
word.
knowledge
a lexical
attached
for when
to assume
as “carpet
criterion we as-
these
objects
is safe
linked
the
But,
It
such
and
clue
functions
noun
bal
of money
spent
a good
commonsense spent
amount
prompting
compound
to imag-
account
two
user
record.
the
word
into
we adopt
personal
often
tain
information
Currently,
information:
user’s The
a polysemous
with
on or for
input Test Attachment
are
ambiguity.
nouns.
associated
a
obvi-
item
preference
attached
it can be used for classifying
compound
In the
and
is not
this
so that
the
all,
time.
it is often
tagging
However,
the user the count
is,
be preferred, by
differ-
to prefer use for
same
however,
cases where
To encode
each
Whenever increment
glove
at the
preference
are
a specific
batting
should
problem
to two
tests
semantic
Action’s Link to Objects
Another augmentation to the generic thesaurus is linking actions to their target objects. Links are added from cleaning to house, and from subscription to newspaper and to magazine. It might seem simpler to tag these verbal nouns with appropriate categories than to add links from actions to objects. recorded But we found it useful to keep this information
such
on various
it is hard
we have
A
category. there
time,
accounting,
this
a preferred
locations
belongs
words, resolve
be-
a new
inheritance
item
equipment
super
added
description
same
unless
baseball
which
en-
among
is added
we
for specifying
lexical
at the
of family
of link
If multiple
other
item
glove
type
detailed
a single
classes
to lexical
follows.
Information
super
tests
disambiguation
Finally,
used
More
because
for
objects.
words
office.
Frequency
ine
and
of augmentation
arises ent
is attaching
thesaurus,
MYCIN.
V.
Systems.
Guha.
Building Addison-Wesley,
Computer-based American
Elsevier,
Large 1989.
Medical New
ConYork,
1976.
these 5.
Ko-
David
S. Touretzky.
Systems.
speakers.
Morgan
The Mathematics Kaufmann,
California,
of Inheritance 1986.