RESEARCH IN NATURAL LANGUAGE RETRIEVAL SYSTEMS ... - Core

1 downloads 0 Views 3MB Size Report
Edward A. Stohr. Jon A. Turner. Yannis Vassiliou. Norman H. White. Center for Research on Information Systems. Computer Applications and Information ...
RESEARCH IN NATURAL LANGUAGE RETRIEVAL SYSTEMS*

Edward A. Stohr Jon A. Turner Yannis Vassiliou Norman H. White

Center for Research on Information Systems Computer Applications and Information Systems Area Graduate School of Business Administration New York University

Working Paper Series CRIS #30 GBA # 82-10(CR)

This paper was presented at the Fifteenth Annual Hawaii International Conference on System Sciences, January 1982. *This research is supported by a grant from the International Business Machines Corporation.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 2

' L i n g u i s t s a r e no d i f f e r e n t from any o t h e r people who spend than

nineteen

hours

more

a day pondering t h e complexities of grammar and

i t s r e l a t i o n s h i p t o p r a c t i c a l l y everything e l s e i n order t o prove t h a t is

language

so

inordinately

p r i n c i p l e f o r people t o t a l k , New York:

complicated

that

it i s impossible i n

its

Lanacker, Language and

Harcourt Brace Jovanovich Inc.,

Is n a t u r a l language an a p p r o p r i a t e

structure,

1973.

interface

for

a

data

base

Is it s u p e r i o r t o formal query languages such a s SEQUEL

query system?

( 1 ) o r screen format approaches such a s Query-by-Example

( 2 3 ) ? Do

we

understand t h e complex semantic and s y n t a c t i c transformations of human communication w e l l enough t o implement such language

be

d a t a base? language

systems?

Would

natural

p r e c i s e enough t o e l i m i n a t e erroneous responses from t h e

Are t h e r e c l a s s e s of u s e r s and/or t a s k s f o r which would

be

most appropriate?

natural

To paraphrase t h e t i t l e of t h e

"Wouldn't it be n i c e i f we could

well-known paper by H i l l ( 8 ) ; a d a t a base i n o r d i n a r y English

-

query

o r would i t ? "

I n t h i s paper w e provide an overview of a r e s e a r c h s t u d y which we hope

will

shed

some

l i g h t on t h e s e i s s u e s .

Laboratory s t u d i e s and

f i e l d t e s t s w i l l be conducted using USL (User S p e c i a l t y experimental a t t h e IBM

Languages)-an

information r e t r i e v a l system c u r r e n t l y under development Heidelberg

Scientific

Research

Center

(10).

Here

we

describe t h e USL system, and o u t l i n e some major r e s e a r c h q u e s t i o n s and t h e s t r a t e g y f o r conducting

the

research.

Subsequent

papers

will

provide d e t a i l e d d e s c r i p t i o n s of each phase of t h e p r o j e c t and p r e s e n t

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 3

the r e s u l t s .

The s t u d y of computerized ' n a t u r a l ' to

the

language systems

back

e a r l y days of a r t i f i c i a l i n t e l l i g e n c e r e s e a r c h i n t h e 1950's.

After 20 o r s o y e a r s of e f f o r t we have n o t managed that

dates

can

handle

the

build

systems

f u l l complexity of n a t u r a l languages.

However

c o n s i d e r a b l e progress has been made and we

now

to

appear

to

have

the

technology t o b u i l d p r a c t i c a l (though l i m i t e d ) systems.

F i g u r e 1 shows some research

and

of

references

the

some

major

in

natural

The s u b t r e e f o r i n q u i r y systems

(or

Within t h e a r e a of information r e t r i e v a l

'semi-natural')

language

interfaces,

two

approaches have been adopted i n an attempt t o overcome t h e of

language

is

more d e t a i l than t h e o t h e r s because of i t s relevance t o t h e

p r e s e n t research. natural

of

r e p r e s e n t a t i v e systems, most of which

have been experimental i n nature. shown

areas

understanding

natural

languages.

utilizing different difficulty

The f i r s t , exemplified by such

systems a s BASEBALL ( 3 ) and LUNAR (21), is t o r e s t r i c t t h e

domain

discourse

artificial

and

to

build

i n t e l l i g e n c e techniques.

specialized

systems

using

of

This approach h a s been r e l a t i v e l y s u c c e s s f u l *

but

s u f f e r s from a high i n i t i a l c o s t and l a c k of t r a n s p o r t a b i l i t y .

second approach r e l i e s on a generalized d a t a (DBMS) which

contains

a

description

of

base the

management

concentrate

on t h e 'front-end'

DBMS has t h e f u r t h e r advantage t h a t

system

domain of d i s c o u r s e ,

performs t h e d a t a r e t r i e v a l function and allows t h e system to

developers

language t r a n s l a t i o n i n t e r f a c e . it

is

A

application

A

independent.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

f l A m LANGUAGE RESEARCH

FOR~IGN LANGUAGE TRANSLATION

~ T ~ ~ R A L LANGUAGE GENERATION 3eg. Poetry)

WRIR;U, WUJGUAGE INQUIRY

NA~RAL LANGUAGE COlr9UTER PROGRAMMING

I

I

CONVERSATIOWL PFOBLEY SOL'JING

I

ENGLISH SHRDLU ELIZA (-nett, 1 9 6 8 ) (Winograd, 1 9 7 9 )

(~eizedbaum,1 9 7 6 )

SPECEAL PURPOSE WUCGUAGE AND DATA RE'iTcEIVRL SYSTEMS

BASEBALL (Bobrow, 1 9 7 2 )

LUNAR (Woods, 1 9 7 2 )

GEMER~IZED DATA BASE WAGErZENT

SYSTE*I i

I

CLARIFICATION

DISECT QGEW INTERPRETATION

DIALOGUE

I RENDEVOUZ (Codd, 1 9 7 8 )

I PLANES {Walty, 1978)

RO~OT (Harris, 1 9 7 7 )

USL (~ehr.am, 1 9 7 8 )

INTELLECT ' ( A r t i f i c i a l I n t e l l i g e n c e Corporation)

FIGURE 1 SOME DIRECTIONS IN N A T U W LANGUAGE RESEARCH

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 4 However

this

independence does not necessarily extend

to

the

superimposed linguistic system.

The systems using a DBMS have had two varieties of interface. RENDEZVOUS

(4) and PLANES (18) 'clarification dialogue' has been used

extensively to resolve ambiguities prior request to

the DBMS.

requires that the

system

the

submission of

a

it easiertointerpret

slows down

the

interaction and

contains a language generation process as

well as an interpretation process. attempts an

to

This tactic makes

retrieval requests correctly but

The second variety

of

interface

immediate interpretation of a retrieval request and only ROBW

prompts the user for clarification when absolutely necessary. ( 7 ) , its

In

successor INTELLECT

(a product of Artificial Intelligence

Corporation) and USL all use the latter approach.

INTELLECT utilizes commercially.

The

USL

ADABAS

as

its DBMS

and

is

available

system uses a parser-generator (USAGE, (2)).

Queries to USL are translated into the SQL query

language and

then

processed by the underlying DBMS without further intervention.

Several approaches to the translation of natural languages have been attempted.

The first builds upon a formal language vocabulary to

give an English-like appearance.

However it is

still necessary

the user to think in terms of the formal language.

A second approach

is to build a model capable of representing (at least semantics of the real world problem situation.

some of)

the

This is typically the

approach adopted by artificial intelligence research employing techniques as

for

such

semantic nets, ( S ) , frames, (12) and conceptual graphs

(16). User queries are parsed and interpreted in terms of

the model

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 5 and

then

translated

to

essentially linguistic. understanding of

data base

accesses.

Here an attempt is made to gain

by

USL.

The

language.

an

in-depth

the semantic and syntactic systems underlying human

speech and to translate these directly. ,

A third approach is

original grammatical

This is the approach studies were

for

adopted

the

German

The system now has Dutch and English grammars and a Spanish

Grammar is being developed.

USL was designed to provide users having little or experience with

a means for accessing and analysing data.

however should have a knowledge of their area of achieve portability, the

dictionary

of

computer The users

appplication.

To

USL system has no knowledge of any subject

matter that might vary from application to has a

no

application.

Instead

it

general words (prepositions, conjunctions, the

verbs 'to be' and 'to have', names of months, etc) and grammatic rules that allow

it to interpret a subset of English.

Language rules and

word definitions that are application dependent must added by

the

therefore be

system analyst for each separate application.

the definitions are of course

contained

in the

data

base

Some of schema.

Others are added by expanding the set of grammatical rules that come with the system.

Figure 2 shows the major

components of

the

USL

system.

A

syntactic analysis is first performed by the parser using grammatical rules and

definitions stored

Dictionaries.

Each

in both

grammatical rule

the

USL

a

preliminary

query

Application

references an interpretation

routine which is executed if the rule is invoked. generates

and

A successful parse

string expressed

in

SQL.

The

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 6 p r e l i m i n a r y s t r i n g i s optimized t o replace v i r t u a l by r e a l where

posssible,

and

t o e l i m i n a t e unnecessary j o i n s .

relations,

The optimized

SQL query i s passed t o t h e DBMS and t h e r e s u l t r e t u r n e d t o As

an

query,

example,

for

the

user.

t h e Alumni d a t a base t o be described l a t e r , t h e

'How many alumni have no donations?', i s t r a n s l a t e d t o t h e

SQL

query :

SELECT COUNT (UNIQUE I D ) FROM ALUMNI WHERE SOURCECODE IS LIKE

*AL%*

AND I D NOT I N (SELECT I D FROM GIFTSUMMARY);

An i n t e r e s t i n g option of t h e USL system i s t h a t t h e SQL query can

be

printed

at

t h e u s e r ' s terminal p r i o r t o a c c e s s i n g t h e DBMS.

u s e r s with some knowledge of SQL t h i s mechanism clarification

dialogue.

provides

a

For

form

of

A s i m i l a r t a c t i c i s used by INTELLECT which

a l s o echoes back a formal statement of t h e u s e r ' s r e q u e s t .

The

USL

parser

works

in

a

left-to-right

bottom-up

producing

a

sentence.

The grammar i s p r i m a r i l y c o n t e x t - f r e e b u t

p a r s e t r e e r e f l e c t i n g t h e s u r f a c e s t r u c t u r e of t h e i n p u t there

c a s e s more than one v a l i d p a r s e t r e e w i l l be produced

DBMS.

in

some

which

case

SQL q u e r i e s a r e generated each producing a response from t h e

The u s e r then has t o choose t h e c o r r e c t r e s u l t .

Over 8 0 0 grammatical r u l e s expressed i n a modified the

are

I n some ( i n f r e q u e n t )

c o n t e x t - s e n s i t i v e and transformational elements.

several

fashion

BNF

comprise

application-independent English grammar s u p p l i e d with USL.

1 shows some of

the

linguistic

capabilities

of

the

system.

Table The

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 7

examples

apply to the Alumni Data Base application that will form the As is usual

basis for our field and laboratory tests (see Section 5). with

natural

language processing systems the capabilities in Table 1

are not always complete in the

sense that

variants will not be succesfully parsed.

some English

language

The USL system also has some

capabilites for:

( 1 ) Updating a data

base

(using imperative English

rather than interrogative forms)

sentences

,

( 2 ) Computations such as sum, average and count

(3) Defining and manipulating variables and functions

(4) Creating new relations.

It has no deductive capability beyond

that required

for

language

translation.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 8

-

Which/Who/What/When WH-QUESTIONS Who l i v e s i n New York? Which alumnus l i v e s i n New York? YES/NO QUESTIONS Did Smith make a pledge? COMMANDS L i s t t h e French donors. NEGATIONS Which donors d i d n o t g i v e i n 19813 RELATIVE CLAUSES L i s t t h e alumni who a r e members of t h e Tax Society. ADJECTIVES Who a r e t h e i n a c t i v e alumni? GENETIVE ATTRIBUTES What i s t h e a d d r e s s of Smith? APPOSITIONS Which alumni of t h e school GBA l i v e i n New York?

AND-COORDINATION/OR-COORDINATION What i s t h e name and a d d r e s s of Smith? Who l i v e s i n N e w York or Boston? a l l / a t least/how many... QUANTIFIERS Who gave a t l e a s t two donations? L i s t a l l alumni who a r e f i n a n c e majors. greater/more/less/how much... COMPARISON Who gave more than Smith? POSSESIVE PRONOUNS Who donates more t h a n h i s company? l i v e s in/at/from-to... LOCATIVE ADVERBS Who l i v e s a t 40 W e s t S t r e e t ? how long/when/before.. TIME ADVERBIALS When d i d Jones g i v e a donation? sum/average/largest/maximum... AGGREGATES What i s t h e average age of a c t i v e donors? What i s t h e sum of donations? SUBTOTALS AND ORDINALS What i s t h e sum of t h e pledges f o r each school? What a r e t h e 5 h i g h e s t donations? ARITHMETIC,DEFINITIONS OF VARIABLES AND FUNCTIONS What i s (2 * 5 ) 31 X = 1.5 Y = t h e Pledges of Smith Store Y * r

-

-

-

.

-

-

TABLE 1 CAPABILITES OF USL

--

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 9 APPLICATION DEVELOPMENT -IN USL The language

USL

system provides the

sentences together with

capability of parsing a

natural

limited vocabulary of

commonly used in expressing queries in English.

Application

words

specific

concepts and vocabulary must be added during the systems development process.

These in turn have to be correlated with the contents of the

relational data base.

Words may represent the names of relations, the

attributes (columns) of relations or specific values within Proper

nouns and numbers are recognized by

relations.

Common nouns, verbs

association with

and

(statement or

default as values in

adjectives are

defined

the columns of real or virtual relations.

done by establishing a 'view' or virtual assertion) necessary to

tuples.

relation

for

This is

each concept

the real application,

column in a virtual relation has a defined 'domain' and

by

'role'.

Each The

standard domains are ZAHL (Number1 , WORT (word or character string) , DATUM (Date, time of day) and CODE (numeric code). are

grammatical

concepts

such as NOM

The standard roles

(nominative case),

ACC

(accusative case), DAT (dative case) and VON (genitive attribute) as well

as time and place

roles.

The roles are associated with the

columns by prefixing the rolename to the column name in the definition of

the view.

These relationships are best explained by an example.

Suppose we have a "real" (or "base") relation:

GIFTSUMMARY (DONOR, AMOUNT

FISCALYEAR,

The verb "give" can be defined in USL by

. . )

first establishing

a

view

using the statement.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 10 DEFINE VIEW GIVE (WNOM-DONOR,

DTA-FISCALYEAR)

ZACC-AMOUNT,

AS SELECT DONOR, AMOUNT, FISCALYEAR FROM GIFTSUMMARY;

Here t h e p r e f i x WNOM d e f i n e s DONOR a s a c h a r a c t e r s t r i n g nominative

(W) i n

the

S i m i l a r l y AMOUNT i s defined i n t h e a c c u s a t i v e

c a s e (NOM).

c a s e w i t h a number domain

(Z).

Finally

FISCALYEAR

is

defined

as

denoting time ( T A ) with domain d a t e ( D ) .

Thus t h e a s s e r t i o n , "Donors g i v e donations i n a year", i s e s t a b l i s h e d .

Continuing

the

above

example,

after

this

process

has

completed USL w i l l be a b l e t o i n t e r p r e t questions such a s : what", "who gives what when", "Did Smith give 5000?", Smith

give

in

1981",

"Did

Smith

been

"who g i v e s

"How

much

did

g i v e more than Jones i n t h e year

19811"

I n addition t o developer

must

defining

expand

the

verb

views

definition

and of

rolenames the

is

system

vocabulary

for the

t e n s e s , ( 3 ) p r e p o s i t i o n s used with nouns and o t h e r

s u r f a c e s t r u c t u r e c o n t e x t u a l a s s o c i a t i o n s , ( 4 ) synonyms. program

the

( 1 ) non-standard plural-forms f o r nouns, ( 2 )

a p p l i c a t i o n by defining: non-standard

the

available

to

a s s i s t i n t h i s task.

A

prompting

Each such d e f i n i t i o n

r e s u l t s i n a grammatical r u l e i n t h e a p p l i c a t i o n d i c t i o n a r y .

This explanation o f t h e systems development p r o c e s s h a s necessity

very

requires

additional

administrators)

brief.

beyond

However, work

for

of

it can be seen t h a t t h e use of USL the

system

developers

t h a t r e q u i r e d t o e s t a b l i s h t h e DBMS.

A l u m n i a p p l i c a t i o n t o be described

been

in

Section

5

(data For t h e

approximately

150

Center for Digital Economy Research Stem School of Business Working Paper IS-82-10

Page 11 views

and

350

grammatical

primary difference between language interface and

rules have been defined.

developing

a

data base

developing one with

a

However, the

with

a

formal

'natural' language

interface lies in the need to discover how users refer

to

data

in

their written and verbal communications. While the data administrator need not (it seems) be an expert linguist, some training will probably be

required.

Finally, the linguistic aspects of the application have

to be taken into account through all stages of the data

base

design

since some relational schemas may be more suitable from the linguistic point of view than others.

Although of subsidiary importance to the major research questions to be described in subsequent sections of this paper this study should help to establish some of the important considerations

in

designing

natural language query systems.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 12 4.

MAJOR RESEARCH QUESTIONS

Research i n management

'semi-natural

systems

(SNL's)

language'

has

f e a s i b i l i t y may no longer be a

reached

major

front-ends a

to

point

issue.

data

where

technical

discussed

As

base

in

the

previous s e c t i o n , t h e r e a r e now s e v e r a l prototype experimental systems and a t l e a s t one commercially a v a i l a b l e system. be

is

resolved

'has

it been worth t h e e f f o r t ? '

gain u s e r acceptance and w i l l they e v e n t u a l l y supplement,

more

The major question t o

formal

language

- will

replace,

approaches?

such systems or

at

least

This

can

only

resolved by s u i t a b l e experiments and e v a l u a t i o n s of t h e

use

of

be such

systems i n p r a c t i c a l a p p l i c a t i o n s ( 2 2 ) .

I n t h i s s e c t i o n we l i s t some arguments and counterarguments are

commonly

made

for

and

that

a g a i n s t n a t u r a l language based systems,

review some r e l a t e d experimental reseach work and formulate

a

simple

r e p r e s e n t a t i o n of t h e r e t r i e v a l process.

Discussions of t h e arguments f o r

and

against

natural

systems appear i n , f o r example, ( 1 3 ) , ( 2 2 ) , ( 7 ) and ( 1 5 ) .

language

Briefly the

-

arguments f o r t h e use of n a t u r a l language a r e a s follows:

F1.

Humans a l r e a d y know n a t u r a l language s o t h a t many people who

would

not

invest

the

time

and

energy

to

learn

a formal

language may be w i l l i n g t o use an SNL. F2.

The use of an SNL should reduce t h e burden o f remembering o r

relearning

formal

language conventions a f t e r p e r i o d s o f d i s u s e .

Even d a t a processing p r o f e s s i o n a l s perform

some

occasionally

to

and

could

avoid

the

need

functions

only

consult reference

Center for Digital Economy Research Stem School of Business \Vorking Paper IS-82-10

Page 13

manuals i f an SNL were a v a i l a b l e . F3.

Often t h e d a t a t o

natural

language

be

form.

manipulated

by

the

is

machine

in

A n a t u r a l language understanding system

would avoid t h e painstaking and error-prone t a s k

of

translating

such d a t a i n t o t h e formats required by c u r r e n t d a t a base systems. F4.

The

conceptual

communication

has

structure

evolved

underlying

over

centuries

human of

e f f i c i e n t mechanism f o r conveying complex ideas. that

we

could

use

to

It is

and

be

an

unlikely

develop an a r t i f i c i a l s t r u c t u r e t h a t would b e a s

.

'user friendly' F5.

thought

Current SNL r e t r i e v a l systems produce a c c e p t a b l e e r r o r r a t e s

without

rephrasing

and

have f a s t enough response times t o make

them commercially f e a s i b l e .

The arguments a g a i n s t t h e use of SNL

retrieval

systems

can

be

summarized a s follows: Al.

Natural

language.

language

is

much

less

precise

than

a

formal

Using an SNL would n u l l i f y t h e major advantages (speed

and p r e c i s i o n ) afforded by unresolvable

ambiguities

computers. and/or

It

also

lead

to

errors

due

to

will

possible

misunderstanding. A2r

them

The r i g o r o f a formal n o t a t i o n system a i d s u s e r s by to

t h i n k more c l e a r l y about t h e i r problem.

t h e most d i f f i c u l t a s p e c t of

problem

solving;

forcing

Formulation i s coding

into

a

formal language i s r e l a t i v e l y simple.

A3.

Unrestricted

infeasible

and

natural

likely

language

systems

are

currently

t o remain so f o r t h e f o r e s e e a b l e f u t u r e .

The s u b s e t s o f n a t u r a l language supported by SNL's

may

have

as

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 14 many

rules

and

exceptions

to

be learned a s formal languages.

Moreover l e a r n i n g and r e t e n t i o n may be impaired

because

of

the

i n t e r f e r e n c e a f f e c t of n a t u r a l language knowledge. A4.

The a d d i t i o n a l development c o s t s , t h e need f o r c l a r i f i c a t i o n

dialogue

and

processing

overheads

will

make

less

SNL's

c o s t - e f f e c t i v e than formal languages. A5.

The use of an SNL may l e a d t o

unrealistic

expectations

of

t h e computer's power and t o q u e s t i o n s t h a t no computer system can answer. A6.

E x i s t i n g formal language systems a r e e a s y t o l e a r n

and

use

to

be

( 1 ) t h e p r e c i s i o n of SNL's f o r expressing complex r e t r i e v a l s

(F5

and adequate f o r most purposes.

Some

of

the

above

claims

and

counterclaims

appear

This i s s o f o r example with r e s p e c t t o :

contradictory.

versus A I ) ( 2 ) t h e e a s e of l e a r n i n g and r e t e n t i o n (F1 and F2 versus A3) ( 3 ) t h e b e n e f i t o r otherwise of encoding q u e r i e s

from

'natural'

t o ' a r t i f i c i a l ' formats (F4 v e r s u s A2).

The arguments Fl and F2 t o g e t h e r with A4 above have l e d a of

observers

(for

example

(15))

to

speculate

number

t h a t SNL's w i l l b e

u s e f u l , i f a t a l l , only f o r s i t u a t i o n s c h a r a c t e r i z e d by a f a i r l y l a r g e number

of c a s u a l (non-dataprocessing) u s e r s who have a good knowledge

of t h e

semantics

providing

a

of

their

application.

A

field

test

aimed

at

p a r t i a l t e s t of t h i s t h e s i s w i l l be conducted during t h e

p r e s e n t r e s e a r c h study ( s e e Section 5 ) .

Center for Digital Economy Research Stem School of Business Working Paper IS-82-10

Page 15

We now review some experimental r e s e a r c h on both formal r e t r i e v a l languages

and SNL's.

Although some of t h e r e s u l t s a r e presented i n a

t a b u l a r format i t must be emphasized comparable

that

the

experiments

of

techniques,

t h e sample a p p l i c a t i o n and ' l e v e l ' of language taught.

The experiments with formal languages have mostly involved p e n c i l paper

not

because of wide d i s p a r i t i e s i n such important experimental

v a r i a b l e s a s a b i l i t y and background of s u b j e c t s , t r a i n i n g complexity

are

exercises

using

student

subjects.

and

Although t h e experimental

designs a r e v a r i e d and have been concerned with many d i f f e r e n t a s p e c t s of

language

design

the

primary

performance

measures a r e t h e mean

percentage of c o r r e c t l y coded q u e r i e s and (sometimes) t h e mean time t o formulate

a

query.

Typical

results

for

three

such

experiments

(averaged over a l l c l a s s e s of s u b j e c t and degrees of query d i f f i c u l t y ) a r e shown i n Table 2.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 16

Instruction Time ( hours )

Query Language

Reference*

a.

Percentage of Accurate Queries

Average Formulation Time ( minutes )

SEQUEL 10:oo-12:oo SQUARE

b

1:35 1:40 2 :40

QBE

SEQUEL Algebraic

TABLE --

2

---

FORMULATION TIME AND QUERY ACCURACYEXPERIMENTS ON FORMAL LANGUAGES 7

*References a = Reisner, 1977 ( 1 4 ) b = G r e e n b l a t t and Waxman, 1978 ( 1 6 ) c = Thomas and Gould, 1975 ( 1 7 )

These r e s u l t s seem t o be both encouraging and the

one

hand

proficiency

in

the the

discouraging,

On

f i g u r e s can be taken a s showing t h a t a reasonable various

languages

r e l a t i v e l y s h o r t t r a i n i n g period,

can

be

attained

after

a

On t h e o t h e r hand t h e percentage of

c o r r e c t q u e r i e s obtained i s n o t a l l t h a t high.

Some r e s u l t s t h a t seem

t o be common a c r o s s t h e s e and o t h e r s i m i l a r experiments a r e : R1. (for

Performance d i f f e r e n c e s between i n d i v i d u a l s example

Quer y-by-Example R2.

from Study)

33%

to

93% c o r r e c t

are

very

answers

in

high the

.

Programmers perform b e t t e r than non-programmers

under

some

conditions,

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 17

R3.

The number of e r r o r s and

time

the

to

formulate

a

query

i n c r e a s e with t h e complexity of t h e r e t r i e v a l problem. R4.

Many of t h e e r r o r s t h a t were made were of a

nature

( some

50% i n

the

SEQUEL-SQUARE s t u d y

involved s p e l l i n g and/or t h e

inaccurate

minor

.

clerical

Many of t h e s e

specification

of

data

base a t t r i b u t e s .

Because of t h e i r r e c e n t advent t h e r e have been f a r fewer SNL's

with

what has been reported i s even l e s s conclusive.

and

t h e ROBOT system, H a r r i s ( 7 ) commercial queries.

application

reports

that

experienced

users

For in

a

were achieving a 90% l e v e l of accceptance f o r

Harris a l s o reports t h a t

the

time

ROBOT i s ' i n t h e o r d e r of a week'

using

results

to

build

applications

and t h a t t h e average computer

response time on a sample of q u e s t i o n s was approximately 10 seconds.

the

A s i m i l a r l y low r a t e of e r r o r s was found i n an e v a l u a t i o n b f

USL

in

system

a field setting ( 9 ) .

I n t h i s s t u d y a sample of 2,214

q u e r i e s made by a s i n g l e user was found t o have an average e r r o r of

only

6.6%.

Two o t h e r r e a l - l i f e a p p l i c a t i o n s of USL a r e described

i n ( 1 1 ) which a n a l y s e s t h e kinds of encountered.

rate

The

average

queries

and

errors

which

were

time t o develop t h e l i n g u i s t i c p o r t i o n o f

t h e a p p l i c a t i o n s was approximately two weeks.

About a l l t h a t can be s a i d about t h e s e experiments i s users

may

be

able

to

language a f t e r a s u i t a b l e error

rates

can

be

inexperienced c a s u a l

work

comfortably

period

achieved users,

is

of by

with

still

Whether

people, opened

some

a s u b s e t of n a t u r a l

practice. most

that

to

acceptable

and i n p a r t i c u l a r , doubt.

This

is

p a r t i c u l a r l y t r u e i n t h e l i g h t of some l e s s f a v o r a b l e experiences with

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 18 o t h e r (non-data base o r i e n t e d ) n a t u r a l language systems.

F i g u r e 3 p r e s e n t s a s i m p l i f i e d view recognizes

that

it

may

be

of

query

processing

which

necessary t o reformulate t h e same query

s e v e r a l times before a c o r r e c t s o l u t i o n i s obtained and

also

that

a

u s e r may sometimes give-up without o b t a i n i n g a s u c c e s s f u l answer.

The

a c c e p t a b i l i t y and e a s e of use of a r e t r i e v a l

the

efficiency figure.

system

depends

on

and psychological impact of a l l t h e processes shown i n t h e If

the

imprecision

arguments

and

ambiguity

A1 of

through natural

above

A4

(concerning

languages and t h e r e s t r i c t e d

n a t u r e o f SNL1s) a r e c o r r e c t w e should expect, on t h e average, t o more

iterations

language.

see

p e r query when using an SNL t h a n a more formal query

This would be m i t i g a t e d by t h e e x t e n t t o which argument

(concerning

the

the

difficulty

users

may

have

in

' n a t u r a l * t o a 'formal' query statement) i s t r u e .

F4

t r a n s l a t i n g from a The use of

an

SNL

may a l s o reduce t h e tendency f o r c l e r i c a l e r r o r s ( s e e R 4 above): a ) The vocabulary recognized by an SNL customarilly

used

can

be

fitted

to

that

i n t h e a p p l i c a t i o n t h u s reducing t h e need f o r

u s e r s t o memorize and understand t h e naming and o t h e r conventions r e q u i r e d by t h e d a t a base management system.

b) Natural language q u e r i e s a r e u s u a l l y s h o r t e r than t h e i r formal language e q u i v a l e n t s ( s e e f o r example t h e sample query i n Section 2).

Finally, expressing

most

importantly,

the

relative

difficulty

of

q u e r i e s i n t h e n a t u r a l and formal languages w i l l i n f l u e n c e

the error rate. expressed

and

I t seems l i k e l y

equally

well

that

very

i n e i t h e r format.

simple

queries

can

be

However some q u e r i e s ( s e e

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

STOP

START

STOP

STOP

1,

YES

FOWLATE QUEX

DIAGNOSTIC

1ENTER AT

RECOGNIZE C C P X C T SOLUTION

TERFIINAL 1 \

'

USER

GENERATE DIAGNOSTIC MESSAGE

INTEWACE

PARSE BY QUERY LANGUAGE

RETilEIT.GXL BY

YES

FIGURE 3 SIMPLIFIED VIE!? OF QUERY PROCESSING

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 19

again t h e example i n Section 2 ) can be expressed SNL,

while

others

(depending

on

the

more

in

an

power of t h e SNL) may n o t b e

While it should be f a s t e r f o r

expressable a t a l l .

simply

users

to

express

t h e i r q u e r i e s i n English r a t h e r than i n a formal language ( s e e Table 2 above) t h e o v e r a l l time per s u c c e s f u l query w i l l depend on t h e of

reformulations

number

t o o b t a i n an English-like query t h a t i s

necessary

accepted by t h e SNL.

The

Advanced

experiments, 'real'

one

language in

an

Project

actual

will

field

incorporate

two

linked

s e t t i n g with ' l i v e '

u s e r s and one i n a l a b o r a t o r y s e t t i n g .

The

major

d a t a and

purpose

of

t h e f i e l d experiment i s t o determine t h e s u i t a b i l i t y of t h e USL system f o r c a s u a l u s e r s who know t h e d e t a i l s of t h e i r discussion

in

Section 4).

application

(see

the

Data w i l l be gathered on t y p e s of q u e r i e s

and t h e i r complexity, t y p e s of e r r o r s and t h e i r frequency, c l a s s e s

of

u s e r s and t h e i r a t t i t u d e s toward t h e system and s o on.

The major purpose of t h e l a b o r a t o r y experiment relative

performance of a semi-natural

structured

language

experiment

will

(SQL)

follow

in

the

a same

is

to

However

controlled general

environment.

outline

as

This

some o f t h e in

Section

some q u a l i t a t i v e d i f f e r e n c e s must be incorporated i n t h e

design because of procedures

the

language (USL) v e r s u s a formal

experiments carried-out by o t h e r r e s e a r c h e r s a s discussed

4.

test

and

the

nature

of

an

SNL

-

for

example

training

query a n a l y s i s w i l l r e q u i r e s p e c i a l treatments.

w i l l be gathered on t h e s t u d e n t s u b j e c t s and t h e i r background, on

Data the

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 20 length

of

time required t o o b t a i n d i f f e r e n t l e v e l s of p r o f i c i e n c y i n

each language, on t h e r e l a t i v e times

to

formulate

correct

queries,

r e l a t i v e e r r o r r a t e s and s o on.

Both experiments w i l l u t i l i z e t h e same application

involves

the

fund-raising

data

base.

activities

The

chosen

f o r t h e Graduate

School of Business Administration (GBA) a t New York U n i v e r s i t y The

(NYU).

d a t a base c o n t a i n s approximately 4 0 , 0 0 0 r e c o r d s e x t r a c t e d from an

existing relations

NYU

Fund-

Raising

System.

This

is

organized

as

five

containing r e s p e c t i v e l y , information i d e n t i f y i n g alumni and

o t h e r donors, t h e i r educational background,

their

detailed

dictionary' r e l a t i o n t h a t

gift

transactions

and

a

'data

giving

histories,

e x p l a i n s a l l d a t a items and t h e i r codes.

The primary u s e r s i n t h e f i e l d experiment w i l l be o f f i c e r s o f t h e GBA

Development

Office

and

of

t h e NYU Alumni Federation.

It i s a n t i c i p a t e d

t h e s e u s e r s have had any p r i o r computer experience. that

the

None o f

d a t a base w i l l be used p r i m a r i l y i n a d e c i s i o n support mode

f o r planning fund-raising appeals.

However some

queries

related

to

controlling current operations a r e a l s o anticipated.

Since t h e ' r e a l '

u s e r population i s l i m i t e d it w i l l be

augmented

by e i g h t user a s s i s t a n t s o r ' c h a u f f e u r s ' whose background and t r a i n i n g can be c o n t r o l l e d by t h e experimenters.

of

the

field

test

the

During t h e f i r s t

two

months

chauffeurs w i l l a c t a s intermediaries.

u s e r s w i l l explain t h e i r information requirements

to

the

The

chauffeurs

who w i l l then formulate and e n t e r t h e q u e r i e s i n t o t h e computer.

Four

chauffeurs w i l l be t r a i n e d i n SQL and t h e o t h e r

They

w i l l work i n p a i r s :

four

in

SQL.

each u s e r query w i l l be answered by one chauffeur

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 21

using USL

and

chauffeurs

another be

will

using

SQL.

After

several

weeks

the

t r a i n e d i n SQL and t h e SQL group i n USL.

USL

A t the

end of a s i m i l a r period of time a l l chauffeurs w i l l be allowed t o user the

language

of

their

choice.

Finally,

withdrawn, t h e ' r e a l ' u s e r s w i l l be t r a i n e d

the in

chauffeurs USL

and

w i l l be

allowed

to

e n t e r t h e i r own r e q u e s t s a t t h e terminal.

The use of p a i d s u b j e c t s i n First,

it

increases

allows

us

to

otherwise

this

way

has

the

number

of

a

partly

controlled

perform

several

Secondly, it

subjects tested.

uncontrolled f i e l d environment.

experiment

a

real

end-user s

environment

prior

to

their

within

use

by

linguistics

t h e inexperienced

.

The l a b o r a t o r y experiments w i l l u t i l i z e p a i d s t u d e n t s u b j e c t s a

an

F i n a l l y , it w i l l enable us

t o t e s t t r a i n i n g techniques and t o tune t h e d a t a base and in

advantages.

'crossed'

design

similar

to

that

in

employed with t h e chauffeurs.

Additional memory r e t e n t i o n t e s t s w i l l be given i n both

languages

at

t h e end of t h e experiment.

Since t h e f i e l d and l a b o r a t o r y experiments using

the

same

For

be

carried-out

d a t a base during t h e same time p e r i o d (approximately

s i x months) t h e r e s u l t s of one can be other.

will

used

to

help

callibrate

the

example a c t u a l q u e r i e s from t h e chauffeur-driven p o r t i o n

of t h e f i e l d t e s t w i l l be included i n t h e t e s t given t o t h e l a b o r a t o r y subjects.

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10

Page 22

REFERENCES 1.

Astrahan, MoM., and Chamberlin, D.D., 'Implementation of a Vol 18, S t r u c t u r e d English Query Language' , Comrn. ACM, 10, October (19751, 580-588.

No

2.

B e r t r a n d , 0. and Daudenarde, J.J., 'An I n t e r a c t i v e Program Generator from Grammatical Description', IBM France S c i e n t i f i c Center, Doc F.008 (1980).

3.

Bobrow, D., 'Natural Language Input f o r a Computer Problem Solving' , i n M. Minsky (Ed. 1 , Semantic Information Processing, N I T P r e s s , Cambridge, Mass (1972).

4.

'How About Recently? (English Dialogue with Codd, E.F., R e l a t i o n a l Databases Usins RENDEZVOUS Version I . '. I n B. Shneiderman (Ed. ) , ~ a t a b a s e s : ~ k p r o v i n c j u s a b i l i t y and ~ e s ~ b n s i v e n e s Academic s, P r e s s , New York, ( 1978 1 , 3-28.

-

-

-

and

5.

F i n d l e r , N.V., ed. Associative Networks: The Representation U s e o f Knowledge i n Computers, Academic Press. New York, ( 1979).

6.

G r e e n b l a t t , D. and J. Waxman, ' A Study of Three Database Query Shneiderman (Ed.), Databases: Improvinq Languages', I n B. - - U s a b i l i i y and ~ e s ~ o n s i v e n e,s Academic s P r e s s , ( 1978 ) , 77-97. -

7.

-

--

A

.

-

H a r r i s , L.R., 'User Oriented Data Base Query with t h e ROBOT - . - . Natural Language Query System', I n t e r n a t i o n a l Jolirnal Ha-Machine s t u d i e s , 9, ( 7977), 679-713.

of -

8.

H i l l , 1.D.t 'Wouldn't I t be Nice I f We Could W r i t e Computer O r Would I t ? ' , Honeywell Programs i n Ordinary English computer J o u r n a l , 6,2, ( 1972), 76-83.

9.

Krause, Jurgen, 'Preliminary R e s u l t s o f a User Study with t h e 'User S p e c i a l t y Languages' System and Consequences f o r t h e Architecture of Natural Language I n t e r f a c e s ' , IBM Heidelberg S c i e n t i f i c Research Center, TR79.04.003, May, (1979).

10.

Lehmann, H. ' I n t e r p r e t a t i o n of Natural Language i n an Information - System', IBM Journal of Research and Development, vo1.22, no. 5, September, ( 1978).

--

*

-

.

-

Center for Digital Economy Research Stem School of Business Working Paper IS-82-10

Page 23

11.

Lehmannn, H e l N. O t t and M. Zoeppritz, 'User Experiments with Natural Language f o r Data Base Access' , - Proc ; 7th I n t e r n a t i o n a l Conference on Computational L i n g u i s t i c s , Bergen ( 1978)

-

-

12.

13.

Minsky, Marvin, 'A Framework f o r Representing Knowledge', i n Winston, P.H. Ed,, The Psychology bf computer Vision, York: McGraw H i l l (1975).

-

New

P e t r i c k , S.R., 'On Natural Language Based Computer Systems', Journal of Research and Development, 20, 4, ( J u l y 19761, 314-325.

-

-

14.

Reisner, P h y l l i s , 'Use of Psychological Experimentation a s an Aid t o Development of a Query Language', IEEE T r a n s a c t i o n s Software ~ n g i n e e r i n ~SE-3, , 3, ( 1977 ) , 218-229.

15.

Schneiderman, Ben, Software ~ s ~ c h o lf o g~uman ~ ~ a C t 6 r si n computer and Information Systems, Winthrop, Cambridge, Mass. (1980).

16.

Sowa, John F., 'Conceptual Graphs f o r a Data Base I n t e r f a c e ' , IBM Jomndl. of Research and DeveXopment, v o l 20, no.4, July ( 1976)

17.

Thomas, J.C. and J . D . Gould, ' A Psychological Study o f Query by Example ' , Proceedings of t h e ~ a t i o n a l . Computer Conference , 44, AFIPS P r e s s , Montvale, New J e r s e y , (1975).

18.

Waltz, D., 'An English Language Question Answering System f o r a Large R e l a t i o n a l Database ' , ~ & m u n i c a e i o nof t h e ACM, 21 , 7 , ( J u l y , 1978) I 526-5390

on

-

-

-

-

--

---

----

19.

Weizenbaum, J., 'ELIZA A Computer Program f o r t h e Study of Natural Language Communication Between Man and Machine', Cammunications of t h e ACM, 9, 1, (January, 19661, 36-45.

20.

Winograd, T., Understanding ~ a t u r a Language, l Academic P r e s s , New York, ( 1972).

21.

An Woods, W o A * , 'Progress i n Natural Language Understanding Application t o Lunar Geology', Proceedings National. Computer Conference, 42, AFIPS P r e s s , Montvale, New J e r s e y , (19731, 44 1-450

-

Center for Digital Economy Research Stem School of Business \Vorking Paper IS-82-10

Page 24

22.

Woods, W.A.,

' A Personal View of Natural Language Understanding', SIGART Newsletter Nail, (February, 1977).

23.

Zloof, M.M., 'Query-by-Example: A Data Base Language', Systems J o i x n a l , 16,4, (1977), 321-343.

-

IBM

Center for Digital Economy Research Stem School of Business IVorking Paper IS-82-10