On Automated Language Acquisition

Editor's note: This is the 113th in a series of review and tutorial papers on various aspects of acoustics.

On automated language acquisition
Allen Gorin a)
AT&T Bell Laboratories, Murray Hill, New Jersey 07974

(Received 18 June 1994; revised 13 December 1994; accepted 22 December 1994)

The purpose of this paper is to review our investigation into devices which automatically acquire spoken language. The principles and mechanisms underlying this research are described and then experimental evaluations for several tasks are reported, involving both spoken and keyboard input. The generic mechanism in these experiments is an information-theoretic connectionist network embedded in a feedback control system.

PACS numbers: 43.10.Ln, 43.72.Ne

INTRODUCTION

We are interested in devices which understand and act upon spoken input from people. Traditionally, in such speech understanding systems, the hierarchy of linguistic symbols and structures has been manually constructed, involving much labor and leading to fragile systems which are not robust in real environments. In human language acquisition, however, the phonemes, vocabulary, grammar, and semantics seem to emerge naturally during the course of interacting with the world. This contrast motivates us to investigate devices which automatically acquire the language for their task, during the course of interacting with a complex environment. While a long-term investigation, research in such language acquisition devices yields insights into how to construct speech understanding systems which are trainable, adaptive, and robust. The purpose of this paper is to recount our progress and ideas to date in this endeavor. In particular, we describe the principles and mechanisms underlying this research and review several experimental systems which have been constructed.

A first principle in our research is that the purpose of language is to convey meaning, so that language acquisition crucially involves learning to decode that meaning. A second principle is that language is acquired during interaction with a complex environment, wherein the device receives some input stimuli, responds to that input, then receives feedback as to the appropriateness of its response. These principles underlie our investigation into connectionist mechanisms, in which a network constructs associations between input stimuli and appropriate machine responses. We embed these networks in a control-theoretic mechanism for governing language acquisition via reinforcement learning. If the reinforcement feedback is positive, then the associations are strengthened, while negative reinforcement causes the associations to be weakened.

A system block diagram based on these principles is shown in Fig. 1. The device receives some input, comprising linguistic and possibly other stimuli. In response to this input, it performs some action, to which its environment then provides a semantic-level error signal as to the appropriateness of that response. The device then adapts its behavior based on this error feedback. Thus we assume that the system will make occasional errors, especially when encountering unfamiliar stimuli. The emphasis, however, is on its ability to detect an error via reinforcement feedback from the environment, to recover from the error via feedback control, then finally to learn from the error so that it is not repeated.

The goal of man-machine communication, in such a system, is to induce the machine to undergo some transformation. This transformation can be immediately observable in the form of some machine action, or can be an internal state change which is only observable indirectly based on some future interaction. We denote the input to the device as language and the mapping from input to transformation as understanding. We are then satisfied that the machine understands if it responds appropriately over a wide range of input scenarios, which is essentially a reformulation of the Turing Test.

It is worthwhile contrasting this paradigm with traditional communication theory, best accomplished by a quotation from Shannon's original paper (emphasis added) (Shannon, 1948):

"The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point. Frequently the messages have meaning; that is they refer to or are correlated according to some system with certain physical or conceptual entities. These semantic aspects of communication are irrelevant to the engineering problem."

In contrast, for systems which purport to understand spoken language, the semantic aspects of communication are primary. How then can we quantify such notions? In people, an input stimulus evokes memories of associated perceptions and activities. We thus propose a third principle, that meaning is grounded in a device's interaction with its environment. This principle underlies an investigation of methods to quantify the meaning of spoken language via its network associations to a device's input/output periphery, providing an acquired representation of the device's operational environment. Introducing a metric and norm on these associations forms the basis of a salience theory, which quantifies the information content of an input stimulus for a particular device.

a) E-mail: algor@research.att.com


FIG. 1. Reinforcement learning in language acquisition. [Block diagram: the user and environment supply input to the network; in response the device acts, and a semantic-level error signal is fed back to modify the network parameters.]

In the remainder of this Introduction, we motivate and outline the algorithmic methods which have proved useful in our investigation of such language acquisition devices. The body of the paper will describe in detail these mechanisms and experimental evaluations thereof.

An information-theoretic network. For the system illustrated in Fig. 1, the issue arises of how to implement the mapping from input stimuli to action. We have proposed constructing connectionist networks which build associations between input stimuli and appropriate machine responses (Gorin et al., 1991). If a machine receives positive reinforcement, then the connections are strengthened, while negative reinforcement causes the connections to be weakened. There are many methods that have been proposed in the literature for learning such connection weights. In particular, we define the connection weights of these networks via mutual information, which has a variety of theoretical and practical advantages over gradient-based training methods. The definition and properties of such information-theoretic networks will be described in Sec. I.

Our earliest experiments involved a single-layer information-theoretic network which constructs direct associations between words and meaningful machine actions. This simple architecture corresponds to a "bag-of-words" language model. It was then extended to a multi-layer network which in addition builds associations between phrases and actions, acquiring a rudimentary syntactic structure to improve understanding.

Those early experiments involved a text-based Automated Call Routing system (Gorin et al., 1991), then a spoken-input version of that same system (Gorin et al., 1994a). That scenario involved a Department Store, which receives input such as I need some paint for my redwood table, whence the appropriate response is to route the caller to the Hardware Department. More recently, these methods were applied to a database of actual customer/operator dialogs from the AT&T telephone network (Gorin et al., 1993b) (Gorin et al., 1994b) (Sankar et al., 1993). In this scenario, an input might be I want to reverse the charges, whence the appropriate response is to route the caller to an automated subsystem which handles collect calls. These experimental systems will be described in Sec. III.

Structured networks. As a device and its task become more complex, so does the mapping from input stimuli to machine action. Given some network architecture, a reasonable question is to ask whether it is capable of learning such complex mappings. A striking feature of human language acquisition is our ability to make sweeping generalizations from small numbers of observations. For example, a single observation of a new word, in the appropriate context, can suffice to acquire its pronunciation, syntactic role, and semantic associations.

We observe that the neural network in a biological organism is not homogeneous, but rather highly structured and modular. Such structure develops over evolutionary time, matching itself to a species' sensory/motor periphery and environment. One can hypothesize that the constraints provided by such network structure correspond to the innate characteristics which enable an individual organism to rapidly adapt to its environment (Bunge, 1986). This motivates a fourth principle, that in order to provide rapid learning and generalization a device must reflect the structure of its input/output periphery and environment. As observed in (Minsky and Papert, 1990),

"the marvelous powers of the brain emerge not from any single, uniformly structured connectionist network but from highly evolved arrangements of smaller, specialized networks which are interconnected in very specific ways."

Thus motivated, we have investigated structured network architectures whose constraints greatly accelerate the learning process. In particular, we developed several methods for constructing large structured networks by combining component subnetworks, then experimentally evaluated those networks in several application scenarios.

In a Call Routing task, the set of machine actions is merely a list, corresponding to a particularly simple output periphery structure. One can consider the more general situation where the machine actions comprise an n-parameter set of subroutine calls. Miller has proposed the construction of product networks for such devices, where individual subnetworks are allocated to each output parameter, thereby reflecting the output periphery structure in its network architecture (Miller and Gorin, 1993c). This product network enables improved generalization by factoring phrase/action associations through intermediate semantic primitives. These ideas will be expanded and detailed in Sec. IV.

A two-dimensional product network has been experimentally evaluated on an Almanac data retrieval task, first text-based (Miller and Gorin, 1993c), then speech-based (Gorin et al., 1993a) (Miller and Gorin, 1993b). The system responds to inputs such as What is the largest mountain in the Empire State?, to which the appropriate machine response is The highest point in New York State is Mt. Marcy (5,344 feet). This experimental system will be described in Sec. IV.


In many situations of interest, the appropriate machine response to a spoken input depends not only on the message, but also on the state of its environment. This motivates us to investigate devices with both linguistic and other input channels. Such extra-linguistic information can serve to resolve ambiguities during understanding as well as provide redundancy to accelerate language acquisition. For such devices, Sankar has proposed the construction of sensory primitive subnetworks, which learn the cross-channel associations between different input stimuli (Sankar and Gorin, 1993). These subnetworks are combined in a product architecture, reflecting a device's input periphery structure, whose output is then used to control the machine actions. This architecture provides improved generalization by factoring phrase/action associations through the sensory primitive subnetworks.

Sankar evaluated this network and control strategy in a Blocks World scenario, in which the machine has both linguistic and visual input channels (Sankar and Gorin, 1993). It is presented with a scene comprising objects of varying color and shape. The machine actions comprise focusing its attention on a particular object in response to input such as Where is the red square? Such focus of attention is a necessary prerequisite to more complex actions. More recently, Henis has extended this system, connecting it to a robotic simulator where the machine actions comprise manipulating the blocks upon which it has focused its attention (Henis et al., 1994). These experimental systems will be discussed in Sec. IV.

Symbols from signals. In order to provide rapid learning and generalization in a language acquisition device, we have explored methods for reflecting the structure of a device's input/output periphery in its network architecture. There is also structure in the environment, in that the input signals to a device can be organized into symbols, then further organized into hierarchical structures. While such structure can be imposed on all sensory inputs, in this research we focus on linguistic symbols and structures. We believe, however, that our methods are general and can be applied to all cognitive modalities.

We focus initially on words, which are the fundamental symbols of meaning in language. In traditional speech recognition systems, one specifies the vocabulary a priori then trains the recognizer by presenting it with labeled speech (Rabiner and Juang, 1993). During human language acquisition, however, words seem to emerge naturally during the course of interacting with the world. How might this be? Furthermore, how might we mimic such characteristics in our devices so as to improve their trainability, adaptability, and robustness?

This contrast motivates us to investigate methods for automated acquisition of spoken words. Webster (Webster, 1987) defines a word as

"a speech sound... that communicates meaning... without being divisible into smaller units capable of independent use."

Based on the intuition that a symbol should be a stable point of some operator, we have investigated clustering algorithms that search for speech sounds which are acoustically and semantically consistent. A prerequisite to measuring such consistency is to define acoustic and semantic feature spaces with appropriate metrics.

In people, an input stimulus evokes memories of associated perceptions and activities. This motivated us to define the meaning of a word, for a particular device, to be its network associations to the device's input/output periphery. Such a definition grounds meaning in a device's interaction with its world, being dependent on its input/output periphery, environment and experiences. In Sec. V we will describe illustrative examples of such semantic/sensory associations for several experimental devices.

We have defined a distance between these association vectors, measuring the semantic similarity of two words for a device. We furthermore define a norm, which measures the semantic significance of an individual word or phrase. This norm measures the information content of a word for the device, which we denote salience. This can be distinguished from and compared to the traditional Shannon measure of information content, which measures the uncertainty that a word will occur. These theoretical and empirical relationships will be discussed in Sec. V.

Based on these ideas, we constructed and evaluated a rudimentary spoken language understanding system (Gorin et al., 1994a). It is unique in that no text is provided to the device during either testing or training, in contrast to all other speech understanding systems. It is also unique in that the vocabulary and grammar are unconstrained, being acquired by the device during the course of performing its task. This is also in contrast to all other systems, where the salient vocabulary words and their meanings are explicitly provided to the machine. The initial application vehicle for this experiment in spoken language acquisition was the Department Store task (Gorin et al., 1994a), then the Almanac (Gorin et al., 1993a) (Miller and Gorin, 1993b). These experimental systems will be described in Sec. VI.

Grammatical inference. The above experiments focused on acquiring word symbols from the speech signal. The next level up in the linguistic hierarchy is grammar, comprising symbols and structure which govern the acceptable combinations of words into sentences. Grammar plays two important roles in speech understanding. First, it constrains the allowable word sequences, increasing the signal-to-noise ratio and thus improving our ability to recognize words in noisy or highly variable environments. Second, it modulates the meaning of a word according to its position in a sentence. Thus meaning can be viewed as an attribute of a word in a particular syntactic state, rather than of the word alone.

The automated acquisition of grammar has received much attention, intertwined with the classical debate concerning how much of linguistic structure must be innate in order to account for human behavior. We observe, however, that the acquisition of grammar from merely listening to speech is a much harder problem than people actually solve. In humans, language is acquired during the course of interacting with the world, exploiting both speech and other sensory input. A challenge, then, is to understand how to exploit such extra-linguistic information to guide grammatical inference, governed by the goal of learning to decode meaning.

As motivation, let us consider the basic parts-of-speech such as nouns and verbs. In elementary school, children are taught that a noun is a person, place, or thing, and that a verb is a word that expresses an action. It is striking that the classroom definitions of such fundamental syntactic concepts are purely semantic.

If one constructs a machine that can interact with things, for example in a Blocks World, then all phrases with high salience for such things can be clustered into a part-of-speech. Such an abstraction would correspond to the early semantic characterization of a noun. Similarly, given a machine which can sense the attributes of things (e.g., color or shape), then one could acquire a part-of-speech corresponding to the early notion of an adjective. Verbs could similarly be emergent from associations to time derivatives of such attributes.

While such definitions are a subject of much debate in linguistics, they serve as useful intuitions to motivate our investigation into exploiting semantic/sensory associations for grammatical inference. In Sec. VII, we will first describe the method of salience thresholding in an information-theoretic connectionist network. This thresholding yields a subnetwork which corresponds to a part-of-speech for each dimension of the device periphery. The resultant subnetwork is activated only by those words or phrases which are highly salient for its semantic or sensory primitive.

Once these parts-of-speech are acquired, they can be manipulated just like any other symbol. We thus propose a fifth principle, that language acquisition proceeds in developmental stages, from the concrete to the abstract, from the simple to the complex.¹ In adult language, parts-of-speech are characterized both via their meaning and within-language usage patterns. In Sec. VII, we also report on preliminary experiments which re-estimate induced parts-of-speech so that they become consistent from both these perspectives.

An application of these ideas was explored by Gertner, who constructed a hierarchical network with subnetworks corresponding to parts-of-speech in an Airline Information task (Gertner and Gorin, 1993). A query to that system might be I want to leave New York and fly to the Windy City, to which the appropriate machine response would be to display a flight table from New York to Chicago. The principle of developmental learning tells us that in order for a device to acquire the language involving pairs of places, it must first acquire the language associated with individual places. A stable subnetwork for places was embedded in a hierarchical network, with secondary subnetworks corresponding to modifier phrases. Rapid learning and generalization was achieved by factoring phrase/action associations through these modifier subnetworks. For example, an encounter with the phrase leave New York leads to the acquisition of the meaning of leave as it relates to all place names.

Summary. The principles and mechanisms presented here form the basis of a theory of syntax and semantics, where conveying meaning is primary and linguistic structure serves to make such communication robust. Although our experimental devices are thus far rudimentary, we consider them to be the early stages of a long-term investigation into machines which automatically acquire language through interaction with a complex environment.

This paper proceeds as follows. Section I defines the basic information-theoretic network, its training procedure and basic properties. The feedback control mechanism used for dialog control is described in Sec. II. In Sec. III, we describe the experimental evaluation of that basic network in Call Routing tasks. Structured networks are discussed in Sec. IV, in particular their application to the Almanac and Blocks World tasks. In Sec. V, we define salience, discussing its relationship to information theory and providing several illustrative examples of semantic/sensory association vectors. Our experiments in spoken language acquisition are summarized in Sec. VI, where the device acquires spoken words from speech with no intervening text. In Sec. VII, rudimentary experiments in salience-based grammatical inference are discussed. The application of these ideas to an Airline Information task will then be described, involving a structured hierarchical information-theoretic network.

FIG. 2. A multilayer network mapping language to semantic actions.

I. AN INFORMATION-THEORETIC CONNECTIONIST NETWORK

In this section, we describe a mechanism for learning the mapping from input stimuli to machine action. In particular, we describe a connectionist network, originally proposed in (Gorin et al., 1991), which builds associations between input stimuli and appropriate machine responses. If the machine receives positive reinforcement to a response then the connections are strengthened, while negative reinforcement causes the connections to be weakened.

The basic network architecture is illustrated in Fig. 2. A spoken or typed sentence is applied to the input layer, which comprises a collection of word-detector nodes. These nodes produce an output between zero and one, approximating the probability that a particular word is present in the input. In the simplest case for keyboard input, the output equals one if the input word exactly matches the node, else zero. The intermediate layer comprises a collection of phrase-detector nodes, in the simplest case corresponding to adjacent word pairs (bigrams). The output layer comprises nodes which correspond to the various actions that the machine can perform. In this discussion, the action space is a list of subroutine calls. The extension of these methods to more complex devices via structured networks is discussed in Sec. IV.
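To make the input and intermediate layers concrete, the following sketch computes word-detector and adjacent-pair (bigram) phrase-detector outputs for a typed sentence. It is a minimal illustration in Python; the vocabulary, the bigram inventory, and the function names are invented for the example rather than taken from the experimental systems.

```python
import re

def tokenize(sentence):
    # A word token is any character sequence delimited by blanks or punctuation.
    return [w for w in re.split(r"[^\w']+", sentence.lower()) if w]

def detector_outputs(sentence, vocabulary, bigrams):
    # Word detectors output 1.0 when their word occurs in the input, else 0.0;
    # phrase detectors do the same for adjacent word pairs.
    words = tokenize(sentence)
    pairs = set(zip(words, words[1:]))
    word_out = {v: 1.0 if v in words else 0.0 for v in vocabulary}
    phrase_out = {bg: 1.0 if bg in pairs else 0.0 for bg in bigrams}
    return word_out, phrase_out

vocabulary = ["paint", "table", "sweater", "charges"]
bigrams = [("redwood", "table"), ("reverse", "the")]
word_out, phrase_out = detector_outputs(
    "I need some paint for my redwood table", vocabulary, bigrams)
# word_out["paint"] == 1.0, phrase_out[("redwood", "table")] == 1.0
```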

The input and intermediate layers are shown partially grown. They are initialized at zero, growing over time as new words and phrases are encountered. In the case of keyboard input, a word token can be defined as any character sequence delimited by blanks or punctuation. A new word can be most simply defined as one which differs in any way from the existing vocabulary (Gorin et al., 1991). This new word criterion can be softened (Miller and Gorin, 1993c) based on string-distortion measures such as the Levenshtein distance (Levenshtein, 1966) (Sankoff and Kruskal, 1983). In Sec. VI, we address the related issues for spoken input (Gorin et al., 1994a), which involve both acoustic and semantic distortions.
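The softened new-word criterion can be sketched as follows, assuming a plain Levenshtein edit distance and an illustrative distance threshold; the acceptance rule actually used in (Miller and Gorin, 1993c) may differ in its details.

```python
def levenshtein(a, b):
    # Standard dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_new_word(token, vocabulary, max_dist=1):
    # Strict criterion: a token differing in any way from every known word is new.
    # Softened criterion: it is new only if no known word is within max_dist edits.
    return all(levenshtein(token, v) > max_dist for v in vocabulary)

vocab = ["etagere", "sweater", "paint"]
print(is_new_word("etagere", vocab))  # False: already in the vocabulary
print(is_new_word("etagre", vocab))   # False under the softened criterion (1 edit away)
print(is_new_word("mauve", vocab))    # True: would be grown as a new word node
```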

There have been a number of methods proposed in the literature for training such networks (Rumelhart and McClelland, 1988). In this research, we have defined the connection weights between words and actions to be the mutual information between those events (Cover and Thomas, 1991). This leads to several attractive properties, as discussed later in this section. If we denote the current vocabulary of N words by V = {v_1, v_2, ..., v_N} and the set of K actions by C = {c_1, c_2, ..., c_K}, then the information-theoretic connection weights are given by

w_{nk} = I(v_n, c_k) = \log_2 \frac{P(c_k \mid v_n)}{P(c_k)},   (1)

where P(c_k | v_n) is the conditional probability that a sentence containing word v_n connotes action c_k, where P(c_k) is the prior probability of that action, and I(v_n, c_k) is the mutual information between the word and action. This is intuitively satisfying as follows. If the presence of word v in a sentence makes an action c more likely, then P(c|v) > P(c), so that the connection weight is positive (excitatory). Similarly, if the word v makes an action c less likely, then the connection weight is negative (inhibitory). Finally, if the word has no effect, then the conditional and prior probabilities are equal, so that the connection weight is zero (null).

The connection weight between a phrase and action is defined via excess mutual information. While in principle scalable to any n-gram phrase or set thereof, we restrict this discussion to adjacent word pairs (v_i, v_j),

w_{ijk} = I(v_i v_j, c_k) - I(v_i, c_k) - I(v_j, c_k).   (2)

Biases for each output node (cf. Fig. 2) are given by

w_k = \log_2 P(c_k).   (3)

The activation at each output node is computed via a linear combination of those inputs,

a_k = \sum_{i,j} \hat{y}_{ij} w_{ijk} + \sum_n \hat{y}_n w_{nk} + w_k,   (4)

where \hat{y}_{ij} is the output from the phrase-detector node for v_i v_j and \hat{y}_n is the output of the word-detector node for v_n. That action c_{\hat{k}} which has maximum activation is then performed, where

\hat{k} = \arg\max_k a_k.   (5)

Theoretical properties. The information-theoretic network has several relationships to other methods, which we briefly review here. Given suitable Markovian assumptions on the language, then the above algorithm is equivalent to a maximum a posteriori (MAP) decision (Gorin et al., 1991). Given suitable independence assumptions, then a bag-of-words model applies and the connections from the intermediate layer vanish, yielding a single layer network. Under those conditions, the network is again equivalent to a MAP decision (Gorin et al., 1991). Tishby observes that the information-theoretic network is equivalent to the criterion of classification via minimum description length, where that interpretation is selected which provides for the minimum code length of the input sentence (Tishby and Gorin, 1994).

Addressing the problem of rule-inference for expert systems, Goodman shows that the strength of a candidate rule can be characterized by the mutual information between the if and then clauses (Goodman et al., 1992). He furthermore describes a method for combining the parallel firings of such rules via a connectionist network with information-theoretic weights. This result is rather satisfying, ameliorating the traditional debate between connectionist and rule-based approaches to machine intelligence.

Tishby proves a universality theorem for information-theoretic associations, showing that, under suitable hypotheses, any association measure which is functionally related to probabilities can be rescaled to mutual information (Tishby and Gorin, 1994). There is a seemingly related set of results proving that when appropriately trained via a mean-squared error (MSE) criterion, the network outputs provide estimates of a posteriori probabilities (Richard and Lippmann, 1991). We remark that, while all these relationships are quite fascinating, fully understanding and exploiting them remains an issue for future research.

Estimation. After having decided on the information-theoretic network, the issue remains of how to estimate mutual information. The probabilities in formulas (1) through (3) can be estimated via smoothed relative frequencies after encountering input sentences s_1, s_2, ..., s_l (Gorin et al., 1991). In particular, let N_l(c_k, v_n) denote the number of sentences of class c_k containing word v_n, and let N_l(c_k) denote the number of sentences in that class. We compute (Gorin et al., 1991) smoothed relative frequency estimates of P(c_k) and P(c_k | v_n) via

P_l(c_k) = (1 - \alpha_l) \frac{N_l(c_k)}{l} + \frac{\alpha_l}{K},   (6)

P_l(c_k \mid v_n) = (1 - \beta_l) P_l(c_k) + \beta_l \frac{N_l(c_k, v_n)}{N_l(v_n)},   (7)

where N_l(v_n) denotes the number of sentences containing v_n. The interpolation parameters \alpha_l and \beta_l are set to m/(m + l) for some fixed prior mass m. These estimators have a naturally incremental implementation, either via updating counters or via maintaining sufficient statistics (Duda and Hart, 1973). Furthermore, so long as the meaning of words is fixed over time, then the relative frequencies converge to the probabilities in those formulas.
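The estimation and decision procedure of formulas (1) and (3)-(7) can be sketched as follows for the single-layer (bag-of-words) case, in which the phrase connections play no role. This is an illustrative Python implementation, not the original code: the class and method names, the prior-mass value, the per-word interpolation weight, and the toy training sentences are all assumptions made for the example.

```python
from collections import defaultdict
from math import log2

class InfoTheoreticNet:
    def __init__(self, actions, prior_mass=1.0):
        self.actions = list(actions)
        self.m = prior_mass               # fixed prior mass m
        self.n_sents = 0                  # l, the number of sentences seen so far
        self.count_c = defaultdict(int)   # N_l(c_k)
        self.count_cv = defaultdict(int)  # N_l(c_k, v_n)
        self.count_v = defaultdict(int)   # N_l(v_n)

    def train(self, words, action):
        # Positively reinforced example: update the counters incrementally.
        self.n_sents += 1
        self.count_c[action] += 1
        for w in set(words):
            self.count_v[w] += 1
            self.count_cv[(action, w)] += 1

    def p_c(self, c):
        # Formula (6): smoothed relative frequency with a uniform prior over K actions.
        l, K = self.n_sents, len(self.actions)
        if l == 0:
            return 1.0 / K
        alpha = self.m / (self.m + l)
        return (1 - alpha) * self.count_c[c] / l + alpha / K

    def p_c_given_v(self, c, v):
        # Formula (7): interpolate the class prior with the per-word relative frequency.
        # (The interpolation weight below is one reasonable choice, not necessarily
        # the paper's exact schedule.)
        beta = self.count_v[v] / (self.count_v[v] + self.m)
        rel = self.count_cv[(c, v)] / self.count_v[v]
        return (1 - beta) * self.p_c(c) + beta * rel

    def decide(self, words):
        # Formulas (1), (3)-(5): activation = bias log2 P(c_k) plus the sum of
        # mutual-information weights of the words present; pick the maximum.
        scores = {}
        for c in self.actions:
            a = log2(self.p_c(c))
            for w in set(words):
                if self.count_v[w]:
                    a += log2(self.p_c_given_v(c, w) / self.p_c(c))
            scores[c] = a
        return max(scores, key=scores.get)

net = InfoTheoreticNet(["furniture", "clothing", "hardware"])
net.train("i want to buy an etagere".split(), "furniture")
net.train("i need some paint".split(), "hardware")
print(net.decide("i need paint for my table".split()))  # -> 'hardware'
```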

In contrast, network training methods based on total mean-squared error (MSE) involve batch-mode algorithms such as singular value decomposition (SVD) or multipass algorithms such as gradient search (Rumelhart and McClelland, 1988). Observe also that for MSE methods, one must define a smooth distortion measure on the output space, which is not necessary for the information-theoretic network. One can prove, however, that for an appropriate error function, the information-theoretic update vector has the same sign components as the gradient of the single-step error (Gorin and Levinson, 1989), i.e., they are moving in the same general direction. One can further prove that the information-theoretic update is guaranteed to decrease that single-step error function (Gorin and Levinson, 1989). A natural next step would be to extend this result to the global error function, but it is not clear how to do so.

Small sample artifacts are a ubiquitous issue in statistical language models. Zipf's law is a well-known empirical observation which tells us that, in general, there will be many low-frequency events and only a few high-frequency events (Pierce, 1961) (Zipf, 1949). There are methods which attempt to ameliorate this problem, such as Good-Turing estimators (Good, 1953), used by Rose to estimate mutual information in his topic spotting experiment (Rose et al., 1991).

Due to issues of small sample statistics and context dependency, we are led to investigate focused learning, where one would like to adjust the learning rate for a word based on its context. We illustrate this issue with an example from the Department Store system. Consider an input sentence I want to buy an etagere, where the word etagere is encountered for the first time. Given that the appropriate action is to connect the caller to the furniture department, one should greatly strengthen the association between etagere and that action. In contrast, consider a second input sentence I want to buy a mauve sweater, where the word mauve is encountered for the first time and the appropriate action is to connect the caller to the clothing department. In this example, however, one should learn only a mild association between mauve and that call-action. To summarize, depending on context, one would like to accelerate the learning rate for some words, decrease it for others.

We consider how to quantify and exploit this intuition, following (Tishby and Gorin, 1994). Observe that etagere is the only possible explanation for the interpretation of the first sentence, while mauve is not necessary to correctly interpret the second sentence. That is, the error in understanding is large in one case, small in the other. Thus one would like to modulate the learning rate based on the error, a well-understood principle in MSE optimization algorithms.

Tishby observes that there are both algebraic and statistical structures on the network's parameters (Tishby and Gorin, 1994). The algebraic properties can be addressed via MSE methods and the statistical properties via relative frequencies. He combines these, proposing an algebraic method for estimating statistical associations, often obtaining good estimates for words which occur only once.

Farrell investigates a gradient solution to the algebraic formulation, achieving similar performance to the information-theoretic network on the Department Store task (Farrell et al., 1993). That work addressed only the algebraic formulation, not considering the statistical structure. Geutner et al. (Geutner et al., 1993) describe a hybrid approach leading to improved results, using mutual information as an initial estimator followed by MSE optimization. We conjecture that this result can be explained via the dual structure on the parameter space. We close this discussion by observing that, while very promising, this line of thought on exploiting algebraic structure to improve statistical associations remains an open research issue both theoretically and empirically. In Sec. VII we present an alternate approach to focused learning, reporting on preliminary experiments in exploiting estimated syntactic state to adjust learning rate.

It can be shown that when words or phrases have uniformly weak associations, then the estimates of their connection weights have increased variance, thereby injecting additional noise into the understanding process. This leads us to consider clipping to zero the weights of those words with weak associations. In (Gorin et al., 1991), this was implemented by clipping the estimates of P(c|v) to P(c) if they were sufficiently close. We have recently introduced an improved method to address this problem, clipping the connections of low-salience words to zero. This salience thresholding both reduces the effective vocabulary and increases the understanding rate, as discussed later in Sec. V. Such subvocabulary selection is of great relevance to designing and evaluating the speech recognition front-end of our systems.
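A sketch of the salience-thresholding operation follows. Since salience itself is defined only later (Sec. V), the measure used here, the largest excitatory weight of each word, is merely a stand-in, and the weight matrix and threshold are invented for illustration.

```python
import numpy as np

def clip_low_salience(weights, salience, tau):
    # weights:  (N_words, K_actions) matrix of information-theoretic connections
    # salience: length-N vector measuring each word's significance for the device
    # Words whose salience falls below tau have all connections clipped to zero,
    # shrinking the effective vocabulary used by the understanding module.
    clipped = weights.copy()
    clipped[salience < tau, :] = 0.0
    return clipped

weights = np.array([[1.2, -0.8, -0.5],    # "paint":   strongly excites one action
                    [0.05, -0.02, 0.01],  # "the":     uniformly weak associations
                    [-0.9, 1.4, -0.6]])   # "sweater": strongly excites another
salience = weights.max(axis=1)            # crude stand-in for the salience norm
print(clip_low_salience(weights, salience, tau=0.5))
```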

Summary. In this section we have described the information-theoretic connectionist network,² which is the basic building block of our language acquisition systems. Several observations are in order. First, in all of our experiments, the vocabulary and parameter space grow over time as new words are encountered. Second, the networks are embedded in a dialog control system, adapting their parameters based on reinforcement feedback from the environment. Third, as described in Secs. IV and VII, this basic network is embedded in larger structured networks to enable language acquisition for more complex devices.

II. DIALOG CONTROL

We govern the behavior of a device based on feedback as to the appropriateness of its actions. Such reinforcement feedback causes an immediate modification of the device's behavior (control) and then a modification of the device's future behavior (learning).

In our communication paradigm, input is provided by a person, whose goal is to induce the machine to perform some action. The interaction between human and machine is called dialog, which serves the important role of resolving ambiguities and misunderstandings. This interaction between the machine and its environment is implemented as a feedback control system, as was illustrated in Fig. 1. In this section, we describe the basic dialog control mechanism used in our systems.

The initial input to the system is a natural language request for the machine to perform some action. Based on the machine response, the user responds in turn with a mixture of error feedback plus possibly clarifying information. Examples of such dialogs will be provided for each of our experimental systems in subsequent sections.

The initial input to the systemis a naturallanguagerequestfor the machineto performsomeaction.Basedon the machineresponse,the userrespondsin turn with a mixture of error feedbackplus possiblyclarifyinginformation.Examplesof suchdialogswill be providedfor eachof our exAllenGorin:Automatedlanguageacquisition 3446

perimentalsystemsin subsequent sections.In this section, we describe theformulas underlying •Ihemostbasicsystem, thencommentuponits propertiesandextensions. Let st' denotethe •/th userinput,a(a/) the activation vector producedby the network [cf. formula (4)], and

munications network.The goalis thatwhena persondesires someservice,he woulddial a singleuniversalnumber,which promptshim with Hello, howmayI helpyou?He responds with unconstrained fluentspeech,upon whichbasisthe call wouldbe routedto an appropriatedestination. This scenario e(s/) (for•/•2) theerrorcomponent of themessage. Let canbe contrasted with currentmethodsof providingseparate ck/denote themachine response afterthe?thinput message.telephone numbers for eachserviceor of requiringpeopleto navigatea menu-drivensystem.In this new scenario,a call Definea totalactivation arrayat eachstageof thedialogvia would insteadbe switchedon the basisof its content. A 1= a(sl) (8) The DepartmentStore.Our first experimentin this diA/= ( 1- ce/)A/ 1+ a/a(s/) + e(s•). (9) rectioninvolveda DepartmentStore scenario(Gorin et al., 1991).Therewerethreedepartments: Furniture,Clothing, In thesimplest case(Gorinet al., 1991),we setthecompoand Hardware,plus a fourth call-actionwhere the device nents ofe(s/)tozero, except fore%,(s/)which issetto "givesup" andconnectsthe call to an operator.Perhapsthe -o•. The mostbasicimplementation (Gorinet al., 1991) set best met]hodof expositionis to examineseveralillustrative

ax,= 1//, so thatA/ involvesan averageof the terms human/machinedialogs. a(st), 1•