Hope Park Square, Meadow Lane. Edinburgh EHB 9LL. (GB) ..... Roy. $9c.,. 275.b. Marslen-Wilson, W.D. & Tyler, L.K.. 1980. The. Temporal. Structure of. Spoken.
AUTOMATED
SPEECH RECOGNITION: FOR RESEARCH
A FRAMEWORK
Anne Johnstone D e p a r t m e n t of A r t i f i c i a l I n t e l l i g e n c e Edinburgh University H o p e P a r k Square, M e a d o w L a n e E d i n b u r g h EHB 9LL. (GB)
Gerry Altmann D e p a r t m e n t of L i n g u i s t i c s Edinburgh University George Square E d i n b u r g h EHB 9LL. (GB) compu-£er system is to mimic, though p r e f e r a b l y model, this strategy. In order to do so, one s h o u l d p r e s u m a b l y d r a w on both computational and psychological theories of process. Such a dual a p p r o a c h has b e e n s h o w n to be feasible, and indeed desirable, by r e s e a r c h into e a r l y v i s u a l p r o c e s s i n g (eg. M a r r 1976) w h i c h has s h o w n that there can come a point when psychological and computational descriptions become barely distinguishable. This analogy with early visual processing is s i g n i f i c a n t b e c a u s e c e n t r a l to the d e v e l o p m e n t of the vision research was the n o t i o n of 'modelling': one can argue that a significant difference between the so-called '4th Generation' and '5th Genera£ion' technologies is t h a t w i t h the f o r m e r adhoc algorithms are applied to often i n c o m p l e t e and u n r e l i a b l e data, w h i l e w i t h 5th Generation systems, algorithms are devised by f i r s t c o n s t r u c t i n g q u a l i t a t i v e m o d e l s s u i t e d to the t a s k domain.
ABSTRACT T h i s p a p e r r e f l e c t s the v i e w that the decoding of speech, either by c o m p u t e r s y s t e m s or people, m u s t to a large extent be determined by the ways in w h i c h the speaker has encoded the information necessary for its comprehension. We t h e r e f o r e p l a c e g r e a t e m p h a s i s on the use of psycholinguistics as a tool for the c o n s t r u c t i o n of m o d e l s essential to the characterisation of the speech u n d e r s t a n d i n g task. We are p r i m a r i l y c o n c e r n e d with the i n t e r a c t i o n s b e t w e e n the v a r i o u s levels at which a fragment of speech can be described (e.g. acoustic-phonetic, lexical, s y n t a c t i c , etc), and the ways in which the k n o w l e d g e bases a s s o c i a t e d w i t h e a c h of t h e s e "levels" c o n t r i b u t e towards a final i n t e r p r e t a t i o n of an u t t e r a n c e . We p r o p o s e to use the Chart Parser as a general computational framework for s i m u l a t i n g such interactions, since its flexibility allows various m o d e l s to be i m p l e m e n t e d and e v a l u a t e d .
We p r o p o s e to use psycholinguistics as a tool for the c o n s t r u c t i o n of m o d e l s e s s e n t i a l to the c h a r a c t e r i s a t i o n of the s p e e c h u n d e r s t a n d i n g task. We b e l i e v e t h a t this approach is essential to the development of automated speech r e c o g n i t i o n systems, and will also prove beneficial to psychological models of h u m a n s p e e c h p r o c e s s i n g , the majority of which are underdetermined from a c o m p u t a t i o n a l p o i n t of view. R u m e l h a r t and McClelland have recently adopted a similar a p p r o a c h to a c c o u n t for the m a j o r f i n d i n g s in the p s y c h o l o g i c a l l i t e r a t u r e on l e t t e r perception. By constructing a detailed computational model of the processes involved they were able to give an a l t e r n a t i v e d e s c r i p t i o n of the r e c o g n i t i o n of certain letter strings, which was supported by subsequent psycholinguistic experiments. Rumelhart and McClelland emphasise the point that their results w e r e not p r e d i c t a b l e 'on paper', but were the outcome of considerable experimentation with the computational model.
Within this general framework we discuss problems of i n f o r m a t i o n flow and search strategy in combining evidence across levels of d e s c r i p t i o n and a c r o s s time, during the extension of an hypothesis. We s t r e s s the i m p o r t a n c e of both psychological and computational theory in d e v e l o p i n g a p a r t i c u l a r c o n t r o l s t r a t e g y w h i c h c o u l d be i m p l e m e n t e d w i t h i n the framework.
Introduction The d e c o d i n g of speech, either by computer systems or people, must to a l a r g e e x t e n t be d e t e r m i n e d by the w a y s in which the speaker has encoded the information necessary for its c o m p r e h e n s i o n . S u c h a v i e w is s u p p o r t e d by a large body of experimental evidence concerning the ways in which various f a c t o r s (eg. p r e d i c t a b i l i t y from context) affect both the acoustic clarity with which a speaker pronounces an utterance, and the s t r a t e g y the h e a r e r a p p e a r s to use in identifying it. The task of the
239
Requirements Framework
of
the
Computational
systems, the experimental evidence suggests that the solution will not be found simply by improving techniques for low-level feature detection. The situation appears to be analogous to that of visual processing, where " s i g n i f i c a n t " features may be absent. If present, their significance may also be open to a number of interpretations.
The experience of the ARPA speech project, which resulted in the design of a number of speech r e c o g n i t i o n systems, has demonstrated that the task of c o n t r o l l i n g the interactions between the knowledge bases which make up the system is at least as p r o b l e m a t i c as that of defining the knowledge bases. Major inadequacies in the systems developed during the ARPA project can be attributed to an early commitment in one or more areas of design which were not apparent until final testing and evaluation of the complete system began. An architecture is required, therefore, which will permit the development in parallel and relatively independently of component knowledge bases and methods of deploying them computationally. It should also permit the evaluation and testing of solutions with partially specified or simulated components. This will ensure that the design of one component will not influence unduly the design of any other component, possibly to the detriment of both. In addition, we should have the ability to determine the consequences of component design decisions by testing their contributions to the overall goals of speech recognition.
C o m b i n i n g evidence across different levels of description requires the s p e c i f i c a t i o n of i n f o r m a t i o n flow between these levels. Within the p s y c h o l o g i c a l literature, there is a growing tendency away from "strong" (or "instructive") interactions towards "weak" (or "selective") interactions. With the latter, the only permissible flow of i n f o r m a t i o n involves the filtering out, by one component, of a l t e r n a t i v e s produced by other components (cf. Marslen-Wilson & Tyler, 1980; Crain & Steedman, 1982; Altmann & Steedman, forthcoming), so in h i e r a r c h i c a l terms no c o m p o n e n t d e t e r m i n e s what is produced by any other c o m p o n e n t beneath it. A strong interaction, on the other hand, allows one component to direct, or guide, @ctively a second component in the pursuit of a p a r t i c u l a r hypothesis. Within the computational literature, weak interactions are also argued for on "aesthetic" grounds such as Marr's p r i n c i p l e s of m o d u l a r i t y and least c o m m i t m e n t (Mart, 1982).
In order to fulfill these requirements we propose to use the active chart parser (e.g. Thompson & Ritchie, 1984). This was s p e c i f i c a l l y designed as a flexible framework for the implementation (both serial and p a r a l l e l ) of d i f f e r e n t rule systems, and the evaluation of strategies for using these rule systems. It is described below in more detail. The C o m p u t a t i o n a l
The strongly interactive heterarchical and blackboard models implemented in H W I M and Hearsay II respectively have been c r i t i c i s e d for the e x t r e m e l y complex control strategies w h i c h they required. Problems arise with the heterarchical model "because of the difficulties of generating each of the separate interfaces and, more importantly, because of the n e c e s s i t y of specifying the explicit control scheme." (Reddy & Erman, 1975). Similar problems arise with existing blackboard models. Their information flow allows strong top-down d i r e c t i o n of components, resulting once again in highly complex control strategies. Hierarchical models have other problems, in that they allow too little i n t e r a c t i o n between the knowledge sources: within a strictly h i e r a r c h i c a l system, one cannot "interleave" the processes associated with each d i f f e r e n t level of knowledge, and hence one cannot allow the very early filtering out by h i g h e r - l e v e l components of what might only be partial analyses at lower levels. This situation (considered d i s a d v a n t a g e o u s for reasons of speed and efficiency) arises because of the lack of any common workspace over which the separate components can operate. There is, however, much to be said for h i e r a r c h i c a l systems in terms of the relative
Model
The problem in designing optimal control or search strategies lies in combining evidence across d i f f e r e n t levels of description (e.g. acoustic-phonetic, morpho-phonemic, syntactic, etc.), and across time during the extension of a hypothesis, such that promising interpretations are given priority and the right one wins. In this section we shall consider just a few of the issues concerning this flow of information. Automated speech systems, in particular those implemented during the ARPA-SUR project, have been forced to confront the errorfull and ambiguous nature of speech, and to devise methods of controlling the very large search space of partial interpretations generated during processing. Although the problem was exacerbated by the poor p e r f o r m a n c e of the a c o u s t i c - p h o n e t i c processing used in these
240
simplicity of the control strategies n e e d e d to manage them, a consideration which is f u n d a m e n t a l to the d e s i g n of a n y s p e e c h r e c o g n i t i o n system.
a c t i v e edge). The k n o w l e d g e bases define what extensions are possible, not the parser.
The model c u r r e n t l y being developed embodies a weak hierarchical interaction, s i n c e this seems m o s t promising on both psychological and c o m p u t a t i o n a l grounds. Unlike existing hierarchical or associative models, it uses a uniform global data structure, a "chart". Associated with this structure is the a c t i v e c h a r t parser.
To s u m m a r i z e , the c h a r t is used to represent and extend pathways, through t i m e and level of a b s t r a c t i o n , through a s e a r c h space. W i t h i n the chart, t h e r e are d i f f e r e n t t y p e s of p a t h corresponding to different levels of d e s c r i p t i o n , e a c h of which is associated with a particular knowledge source. To the extent that knowledge specific rules specify what counts as constituent pathways at the different levels of abstraction, a hierarchical flow in information is maintained. The w e a k interaction arises because a l t e r n a t i v e p a t h w a y s at one level of d e s c r i p t i o n can be filtered through attempts to build pathways at the n e x t "higher" level. This m o d e l differs from straightforward hierarchical models, but resembles associative models, in that k n o w l e d g e s o u r c e s c o n t r i b u t e to p r o c e s s i n g without each source necessarily corresponding to a distinct stage of a n a l y s i s in the p r o c e s s i n g s e q u e n c e .
The a c t i v e the f o l l o w i n g : -
chart parser
consists
of
I) A u n i f o r m g l o b a l data structure (the Chart), represents competing pathways through a search space, at different levels of description, and at d i f f e r e n t stages of a n a l y s i s . Complete descriptions are marked by "inactive" paths, c a l l e d edges, spanning temporally defined portions of the u t t e r a n c e . T h e s e i n a c t i v e edges have p o i n t e r s to the lower level descriptions which s u p p o r t them. P a r t i a l d e s c r i p t i o n s are m a r k e d by "active" edges which carry representations of the d a t a n e e d e d to c o m p l e t e them. For example, a s y n t a c t i c edge, such as a n o u n phrase, may span any complete descriptions that p a r t i a l l y s u p p o r t it, such as a d e t e r m i n e r or a d j e c t i v e . In a d d i t i o n , it w i l l carry a d e s c r i p t i o n of the s y n t a c t i c p r o p e r t i e s (e.g. noun) any i n a c t i v e lexical e d g e m u s t have to c o u n t both as a d d i t i o n a l e v i d e n c e for this syntactic description and as justification for its extension or completion. The type and complexity of the descriptions are determined by the rule b a s e d k n o w l e d g e s y s t e m s used by the parser, and are not determined by the p a r s e r itself.
H a v i n g s k e t c h e d the construction of the s e a r c h s p a c e we m u s t n o w d e c i d e u p o n a s t r a t e g y for e x p l o r i n g that space. Most current psychological t h e o r i e s a p p e a r to assume strict "left-to-right" processing, although this r e q u i r e s t a c k l i n g s t r e t c h e s of s o u n d immediatedly w h i c h are of poor a c o u s t i c quality, and w h i c h are r e l a t i v e l y u n c o n s t r a i n e d by h i g h e r level knowledge. The majority of s y s t e m s d e v e l o p e d d u r i n g the A R P A p r o j e c t found it n e c e s s a r y to use later occurring information to disambiguate earlier parts of an utterance. Moreover, there is psycholinguistic evidence that the "intelligibility" of a p a r t i c u l a r s t r e t c h of sound increases with additional evidence from l a t e r " r i g h t w a r d " s t r e t c h e s of s o u n d ( P o l l a c k & Pickett, 1963; Warren & Warren, 1970). We p r o p o s e to a d o p t a system using a form of left-to-right analysis which could a p p r o x i m a t e to the power of m i d d l e - o u t a n a l y s i s (used in H W I M and H e a r s a y II) but w i t h o u t r e q u i r i n g the construction of distinct "islands" and with less computational expense. This more precise method of using "rightcontext effects" d e p e n d s on the p r i o r i t y s c o r e s a s s i g n e d to paths. Such s c o r e s can be thought of, for p r e s e n t purposes, as some m e a s u r e of " g o o d n e s s of fit". The score on a s p a n n i n g p a t h w a y (that is, a pathway which spans other pathways "beneath" it) is d e t e r m i n e d by the s c o r e s on its constituents, and so is partly determined by scores t o w a r d s its r i g h t hand end. By virtue of affecting the "spanning score", a s c o r e on one s u b - p a t h can a f f e c t the probability that another sub-path to its left (as well as to its
2) A multi-level task queueing structure (the Agenda), w h i c h is used to o r d e r the ways in w h i c h the descriptions will be extended, through time and level of a b s t r a c t i o n , and thus to c o n t r o l the size and direction of the s e a r c h space. This o r d e r i n g on the a g e n d a is controlled by specifically designed search strategies which determine the minimum amount of s e a r c h c o m p a t i b l e w i t h a low rate of e r r o r in d e s c r i p t i o n . The p o w e r and flexibility of this approach in tackling complex s y s t e m b u i l d i n g tasks is well set out in B o b r o w et al. 1976). 3) An algorithm which automatically schedules a d d i t i o n s to the C h a r t o n t o the A g e n d a for s u b s e q u e n t p r o c e s s i n g wherever such extensions are p o s s i b l e . T h a t is to say, whenever a description which is complete at some level (an i n a c t i v e edge) can be used to extend a partial description at some higher level (an
241
right) will finally be chosen as the best description for the acoustic segment it represents. We will use psycholinguistic techniques to interrogate the "expert" (i.e. s t a t i s t i c a l l y reliable experiments with human listeners), in order to d e t e r m i n e both when such leftwards flowing information is most often used for the d i s a m b i g u a t i o n of poor q u a l i t y areas, and what sets of paths it will affect. It will be extremely useful to know whether people regularly rely on i n f o r m a t i o n from the right to disambiguate preceding stretches of sound, or whether this happens only at the beginning of utterances as the HWIM strategy suggests. Pollack and Pickett claim that there is no effect on intelligibility of a word's position within a stimulus, but unfortunately they offer no i n f e r e n t i a l statistics to back this claim.
The proposals c o n t a i n e d t h e r e i n have been adopted by the E d i n b u r g h c o n t r i b u t i o n to the Plessey C o n s o r t i u m ' s Alvey Large Scale Demonstrator Project on M a c h i n e A s s i s t e d Speech T r a n s c r i p t i o n .
REFERENCES
Altmann, G.T. & Steedman, M.J. Forthcoming. The Garden Path in Context: R e f e r e n c e and the R e s o l u t i o n of Local S y n t a c t i c Ambiguity. Bard, E.G. & Anderson, A.H. 1983. The u n i n t e l l i g i b i l i t y of speech to children. Journal of Child L a n u u a q e 10, 265-292 Crain, S. & Steedman, M.J. 1982. On not being led up the g a r d e n path: the use of c o n t e x t by the p s y c h o l o g i c a l parser. In (eds.) Dowty, D., Kartunnen, L., & Zwicky, A . Natural Lanaua~e Parsinu: psycholouical, co~putational, and t h e o r e t i c a l perspectives. In press.
This is only one of the many issues in speech recognition which are e x p e r i m e n t a l l y addressable. The results of such experiments are obviously of relevance to c o m p u t a t i o n a l systems since they can indicate where and when sources of information are most likely to contribute towards identification of an utterance. Conversely, the attempt to build a working model of at least some parts of the process, will h i g h l i g h t many areas where further e x p e r i m e n t a l data is needed.
Johnstone, A.M. & Altmann, G.T. 1984. Automated Speech Recognition: A F r a m e w o r k for Research. Department of Artificial Intelligence, U n i v e r s i t y uf Edinburgh, Research Paper No. 233. Also appears as S p e e c h Input Project, U n i v e r s i t y of Edinburgh, R e s e a r c h R e p o r t No. 2.
Concluding Remark
Marr, D. Visual 275.b
We hope that this sketch of part of the proposed system has given a feel for the combined approach taken here. It developed through a r e - e x a m i n a t i o n of a number of issues which arose during the ARPA speech project, and a r e c o n s i d e r a t i o n of these issues in the light of recent computational and psycholinguistic advances. Given the success of these recent advancements in the c o n t r i b u t i n g fields of research, we feel that the time is right for the e v a l u a t i o n of a speech r e c o g n i t i o n system along the lines laid down here.
1976. Early Processing of Information. Proc. Roy. $9c.,
M a r s l e n - W i l s o n , W.D. & Tyler, L.K. 1980. The Temporal Structure of Spoken L a n g u a g e Understanding. Cognition, 8, 1-71. McClelland, J.L & R u m e l h a r t , D.E. 1981. An Interactive Activation Model of C o n t e x t Effects in Letter Perception: Part I. An A c c o u n t of Basic Findings. In P s v c h o l o u i c a l Review, 88, 375-407. Pollack, I. & Pickett, J.M. 1963. The intelligibility of excerpts from Conversation. L a n g u a g e add Speech, 6, 165-171.
ACKNOWLEDGEMENTS This is a summary of a paper (Johnstone & Altmann, 1984) written as a result of discussions held in the University of Edinburgh School of E p i s t e m i c s research workshop on Lexical Access and Speech Perception. We would like to thank the members of that workshop, in particular Dr. Ellen Bard and Dr. Henry Thompson.
Reddy, D.R. & Ermann, L.D. 1975. Tutorial on System Organisation for Speech Understanding. In D.R. Reddy (ed) Speech Recognition, A c a d e m i c Press. Rumelhart, D.E. & McClelland, J.L. 1982. An Interactive Activation Model of C o n t e x t Effects in Letter Perception: Part II. The Contextual Enhancement Effect. Some Tests and Extensions of the Model. In P s y c h o l o g i c a l Review, 89, 60-94.
242
Thompson, H.S. & Ritchie, G.D. 1984. Techniques for Parsing Natural Language: Two Examples. In M. Eisenstadt & T. O'Shea (eds.) Artificial Intelliqence Skills. Harper & Row. Warren, R.M. & Warren, R.P. 1970. Auditory Confusions and Illusions. Scientific American, 223, 30-36.
243