10 Applications of N-tuple - Pages supplied by users

17 downloads 0 Views 6MB Size Report
2.1 The \VISARD model. WISARD (WIlkie. Stonham, Aleksander Recognition Device) is an implementation in hardware of the N-tuple sampling technique first.
10 Applications of N-tuple sampling and genetic algorithms to speech recognition A. Badii Sclzlumberger Technologies, Central Research Departme/lt, Famborough,

UK

M. J. Binstead 17 Myddletoll Road, Uxbridge, Middlesex, UK

Antonia J. Jones Department ofComputillg, I mper i a l Colleye of Science alld Tecllllolooy, Unirersity of London, London, UK T. J. Stonham Department of Electrical Engineering, BrlIlle! Ullitwsity, Uxbridge,

Middlesex, UK

Christine L. Valenzuela Department of Computer Sciellce, Teesside Polytechnic, Middlesbrollglz, Cleveland, UK

Abstract

N-tuple nets a re co n ce ptual l y a highly parallel architecture for pattern recogn ition, implemented in hardware as a device called WISARD. However, high-speed serial emulations of N-tuple nets afTer considera ble advantages of flexibility and cost efficiency in applications, such as speech recognition, requiring only moderate bandwidth. In this chapter we first describe a softwa re technique for designing dynamically evolved N-tuple nets and illustrate the process w h e re by the designed structure can be progressively mapped into hardwa re to a level determined by the application requirements. Next, we summarize some simulation studies which apply N-tuple nets to isolated word recognition and vowel detection. Fo r isolated word recognition it is shown tha t wit h raw data (non-pre­ emph asized, noisy speech), N-tuple recognition yi el ds i m proveme n t ove r d y nam ic time warping, while providing substantial savings in processing time.

For vowel d e t ec t io n , two distinct, single-speaker studies are described. 172

Speech re('og 11 it ion In the first experiment we attempt to accom mod a te to var i at i on in the length of a rticula t ion of a vowel by tra i n i ng six disti nct d iscrim i n ators for each class of vowel, each of the six being trained over a d ifTerent timescale. In the secon d ex perimen t on vowel detect i o n, resu l t s are presented for a task specific o ptimization of a si ngle m a p ping WISARD pattern recognizer using Holland's gen e t ic algorithm.

1. Introduction In this chapter we pro vide a syno psis of work, carried o u t by the a u t h o rs u nder the auspices of the Pat tern Recognition La boratory, Bru neI Uni versi t y, o n the applica tion of the N-tuple sam p l i n g paradigm of Bledsoe & Browni ng! to s peech recogn i t i o n.* Networks of the type und er considera t i o n are simulations of ext remely stylized models of b i o l ogical neu ral networ k s Such systems are usua lly characterized by some very simple algorithm, frequen t l y little mo re than an inner produ ct, replicated a large n u m ber of times a s pa rallel, sometimes l oosely coupled, processes. Examples of such systems in the li terat ure include perceptrons,3 WISARD nets,4 Kohonen's topol ogizing nets,S the goal seeking components of Bar t o & S u tton6 and, more recently, the co nform on net s of Fish. 7 I n this chapter we will conce n t rate on the implementation of WISARD nets, d escri bed below, applied to speech recognition. Th e advan tages of the WISARD model for pattern recognitio n are: .

• Implementati o n as a para llel, or serial, system in cu rrently available

hard ware is i nexpen s i ve and sim ple.

• Given label led sa m ples of each recognition class, training ti mes are

very short.

• The time required by a trained system t o classify an u nknown pattern

is very small a n d, i n a parallel implemen tation, is independent of the number of classes.

The requi rement for label led samples of each class poses particular problems i n speech recognition when dealing w i th u n i ts smaller than whole words; the ext raction of samples by acoustic and visual i nspect ion i s a labour int ensive and t i m e co nsuming activity. It is here t hat paradigms such as Koho nen's topologizing network, as applied to speech by Taltershall, show particular promise. Of course, in such a p p roaches t here a re o t her compensating problems; pri nci pal ly, after the network has been tra ined and produced a dimensionally reduced and feat u re• Section 2 is based on will appear elsewhere.

[2], and more detailed reports on

the work described in Sections 3 and 4

173

A. Badii et al. clustered map of t h e pattern space, it is necessary to interpret this map in terms of o utp u t symbols useful to higher levels. One approach to this pro blem is to t rain an associative memory on the net output togethe r

with the associated symbol . Applications of N-tuple sampli ng in hardware have been rather sparse, the commerci al version of WISARD as a visual pattern recogni tion device able to operate at TV frame rates, being on e of t he few to date­ another is the o ptical character recognizer developed by Bi nstead & Ston ham. However, one can envisage a multitude of applications for such pattern recogn itio n systems a s their o p e ratio n a nd advantages become more

widely understood.

Typically the real-time system is preceded by a software simulation in which various parameters of the theo retical model are optim ized for the particular app l ic a t io n . We begin by describing a software framewo rk which is sufficiently general to cope with a la rge class of such net-systems, while at the same time preserving a high degree of com puta t ional efficiency. In add ition the structure produced has the prop erty that it is ea sily mapped into hardware to a level determi ned by the a p p l i cat i on req u i r emen ts. The rationale for believing that N-tupl e techniques might be suc ces sfully applied to speech recogn izers is briefly o ut l i ned by Tattershall & Johnson,S who dem o n strated that N-tuple recogn izers can be d esigned so that in t rai n i n g they derive an implicit map of the class conditional probabilities. Si nce the N-tuple scheme requires almost no computation it appears to be a n attractive way of implement ing a Bayesian c l a ss ifi e r In a real-time speech recogni tion sy s t em the p re­ processed input data can be slid across the ret ina and the sys tem tuned t o respond to significant peaking of a class discriminator response, see ,

.

Fig. 4.

Two types of a pp l i c a tion to speech recognitio n are d iscussed. First, comparative results for isolated word, single-speaker speech "(,("09I1ition are pres e n t ed f o r a variety of N-tuple recognizers. These resu l t s are then c ont rasted with the observed performance for the same data using a standard dynamic time warping algo rithm used as a control in this context. Next, preliminary investigations in vowel detect ion are repo rted; two d i s tinct experiments are described. These experiments were restricted to rowel detection for a silly/e speaker. Both ex peri ments used the same data. In the first experiment we a tt em p t to accomm odate to va riation in the length of articulation of a vowel by training six disti nct discriminators for each class of v ow el, each of the six being trained over a di fTerent timescale. In the seco nd expe riment one mapping is used for all vowels, each vowel h a v i n g a sing l e discriminator, and Holland's genetic 174

Speech I"eCO{ll1ilioll

retlNl

data out

Iddreaing dilCl'imlnator 2

retpOnIe 0 1'IIPOf1II1

r-.·Z:'>;�' :-!l v.rri,�.· �/r:'I"" ' i."" ,,-'.I""'i?�"/,�,":"I 'a ;, decision .



confidence Figure 1

SciJematic of N-tuple recoglli:er.

algorithm is used in an of vowel detection.

attempt to optimize this map for the specific task

2. A simulation system 2.1 The \VISARD model

WISARD (WIlkie. Stonham, Aleksander Recognition Device) is an implementation in hardware of the N-tuple sampli ng technique first described by Bledsoe & B ro w n ing . J The scheme outlined in Fig. 1 was first p roposed by Aleksander & Stonham.4 The sample data to be recognized is sto red as a two-dimensional array (the re t in a ) of binary elements with successive samples in tim e stored in '

'

175

A. Bacia et al.

su ccessive columns and t h e value of the sample represen ted by a codi ng of the bin ary elements i n each colum n. The particu lar coding used will generally depend on the application. One of several possible codings is t o represent a sam ple fea ture valu e by a 'bar' o f binary 1 s, t he l e ng t h of the bar b eing proporti o n al t o the value of the sample feature. Random connections are made o nto the elements of the array, N s uch con nections being gro u ped together to form an N-t u ple which is used to address one random access memory (RAM) per discriminator. In this way a l arge number of RA Ms a re grou ped together to form a class discriminato r whose out put or score is the sum of all its RAM's o u t puts. This co nfigura tio n i s repeated to give one discriminator fo r each class of pat tern to be recognized. The RAM's implement logic functi o n s which are set up during training; thus the method does not i nvolve any direct storage of pattern data. A random map fro m array elements to N-tuples is preferable i n t h eo ry , since a systema tic mapping is more likely to render the recognizer blind to distinct patterns having a systema tic difference. H a rd-wiri ng a random map in a totally parallel system makes fab rication infeasible at high resolut i o ns. In many applicatio ns, systematic di fTe rences in i n p u t patterns of the type l i able to pose problems wit h a n o n-ra ndom mappi ng are u n l i kely t o occur since real data tends t o be 'fuzzy' at the pixel level. However, th e issue of randomly h ard-wiring individual RAMs is somewhat academic since in m ost contexts a totally parallel sys tem is n o t needed a s i t s speed (in dependent o f the numbe r of classes a n d of the order of the access time of a memo ry element) would far exceed data inp u t rates. At 512 x 512 resol u ti o n a sem i-parallel structu re i s u sed where the mappi ng i s 'soft' (ie achieved b y pseud o-rand om add ressi ng with pa rallel shift registers) and the processing withi n discriminators is serial but the discrim i nators themselves are operatin g in parallel. Using memory elements with an access time of 10 - 7 s, th is gi ves a mini mum operating tim e of arou nd 70 ms, w h i ch once agai n is i ndependen t o f the number of classes. The system is trained u si n g sam ples o f pattern s from each class. A pattern is fed into the retina a rray and a logical 1 is written into the RAMs of the discriminator associated with the class of this trai ning pat tern at the l ocations add ressed by the N-t u ples. This is repeated many t i mes, typically 25-50 times, for each class. In recogn itio n mode, the unkn own pa ttern is stored in the array and the RAMs of every discrimi nator put into READ mode. The inpu t pattern then stimulates the logic fu nctions in the discrim i n a t o r n e t w o rk and a n overall respo nse is obtai ned by summing all the logical outputs. The pattern is then assigned to the class of the discrim i nato r prod ucing the highest score. 176

Speech recognition

Where very high resolu tion image data is presen ted, as i n visual imagi ng, this design lends itself to easy implementation in massively parallel hardware. However, even with visual images, experience tends to suggest that a very good recognition performance can often be obtai ned on relatively l ow resol u t i o n d ata. Hence in m any applications, massively parallel hardware can be replaced by a fast serial processor and associated RA M , emulating t he design in micro-cod ed software. This was the approach used by Bi nstead & Stonham in optical character recogni t ion, with notable success. Such a system has the advantage of being able to make optimal use of available memory i n applications where the N-tuple size, or the number of d iscriminators, may be requi red to vary.

2.2 The development of N-tuple systems Practical N-tuple pattern recognition systems have devel oped from the original i m plemen tation of the ha rdware WISARD, wh ich used regularly sized blocks of RAM that sto re only the discriminato r states. As memory has become cheaper and processors faster, such heavily const rai ned systems a re no longer appropriate for many applications. Algorithms can be implemented as serial emulations of parallel hardware and RAM can also be used to describe a m o re flexible structu re. In such a system we m ight require a dynamically va riable number of classes, R AMs per class or mappings. N-tu ple mappings need no longer map each retinal pixel uniquely and might be varied d u ring training and across classes accord ing to some heuristic supplied by the programmer­ for example, Holland's genetic algorithm.9 Having d i fferent mappings for each class d oes require that each class be given a separate opportunity to respond, but i n some applications this may well be worth the extra overhead in time or hardware. One might easily imagi ne t hat the price to be paid for this enhanced flexibility wou l d be excessive com plexity and slow performance. However, this t u rn s out not to be the case and we will briefly outline why this is so.

2.3 Software system for dynamic reallocation of N-tuples Conceptually i t is hel pful t o think of the en t i re experimental design process of a n N-t uple classifier as the growing and filling of a dynamic t ree. Initially this tree wil l have a root from which all else will grow. In pract ice 'root' is a pointer (down) to the first of the next level n odes, which for now we may choose to thi n k of as class zero. (However, first-level nodes could equally be 'machine types' so that decomposition at the first level would t hen be into a series of parallel machines.) At the class level, 177

A. Badii el al. each class has a p o inte r (across) to the next class and a pointer (down) to the first RAM associated with that class. We can iterate this process to create a tree-machine (ie data structure) which consists of:

(1) Classes-which in turn form collections of RAMs; (2) RAMs-which form collections of input pointers

(mappings) and pointers to the block of memory used to store the RA M state.

Fig. 2 illustrates the general structure of the tree. It is important to note that the no des can hold extra information, for example statistics of their usage, a un ique identifier and other pointers which can be used for memory control. This last feature is an essential part of a dynamically re­ allocatable system. Ultimately, memory will contain two lypes of information: the nodes

which are joined by pointers to create the tree structure, and the memory which actually holds the taught information (th e N-tuple storage). The memory requirement is s tro ngly dependent on the N-tuple address size­ adding an extra input to every RAM (although one could add an extra root

Figure 2

178

T,.ee structl/I'ed N -lliple

classifier.

Speech recognition

inpu t to just o ne RAM if desired) will linearly increase t h e num ber o f nodes used b u t d ouble the amo u nt of N-tuple storage. To access the memory it is necessary to traverse the tree t o reach the req uisi te point. Fo r example, s u p pose it was required to add a n extra class. I t becomes necessary to t ra verse the tree down to th e class level and t hen along to the las t u sed class node, where a new n ode may be reclaimed fro m the 'node pool' m a i n tained by mem o ry cont rol and added to form a new class by manipulating the necessary pointe rs. The same process may be repeated in order t o add RAMs to the newly formed cl ass. In virtually every operation involving the tree a single very simple recursive algorithm, the tr(/t'erser, is used. When calling the traverser, two parameters are passed: one i s the base of the sub-tree to be traversed and the other is a p oin t er to a t a ble of actions to be performed at each node v isite d The table itself contains lists of act ions fo r each p ossi b l e node type. At p resent only t wo actions are used; the first is called when the node is entered and the other when the node is exited for the last time in t he c u rrent t raversal. For e xa m ple, if one wanted to pe rform a classification: the first action on entering the node of type cl ass would be t o clear that class's response; upon leaving, the score (nu mbe r of addressed RAMs in the discriminator which con ta i n a logic ' 1 ') wi ll have been updated by the lower levels so that the seco nd action might be to print its value and to check if i t is large r th a n the largest class score so far encountered. Depending on the netwo rk being m odelled the node types and actions can be ch osen appropriately. Fo r instance, if K oho nen's topologizi ng ne twork were being modelled, one n ode type would be a /lode, in Kohonen's sense, w hich stores a sta te vector of the dimensionality of the data-his network is es sen tially a n array of such nodes, and one action would be to modify the states of nearby' node s acco rding to the res po nse of the cu rrent node to the data being p rese nted A C-code l is t i ng of the t raverser algorit hm is given in Appendix I. 2 In most cases it will no t be necessary t o visit all nodes of the tree. So the traverser algorithm has extra switches that allow b ra nches to be bypassed or the tra ver s al aborted . In this way, fo r example, the search can be confined to a single level of the tree and abo rted w he n a specific condition o r n ode is attained. Th us a flexible and simple experimental system, having all the proposed properties, has been created. It is now rela ti v el y straigh tfo rward for the experimenter to implement his ch osen heuristics to control the evolution of the final system design. Moreover, since the structure c o n sis t s l argel y of th readed pointers, very little calculation is req uired during the trai ning and testing phases. Consequent ly, simulation t i m es are considera bly red uced. -

.

,

.

L79

A. Badii et al. Compa risons with earl i e r sim ula tion systems, s uch as JAN, give an i m provem ent o f a fact o r between 2 and 4. Direct com parison is difficult since t h e earlier systems were so slow tha t they were m odified t o l o ok o nly at i n put data w h ich had changed, and they o n l y dealt with regula r sized discriminators, etc. If systems such as JAN had to deal with vari a ble-sized discrimi n a t o rs t hen accessing a multi-d i me n sional a rray, say (class, RAM, element), could n o longer be d o n e us ing tables and would in volve two multiplications and one addition, whereas i n the presen t system access is vi a a po i nte r and i n vol ves no calcu lation. When the fu lly t ra i ned system is com plete the netwo rk of poin ters will h ave become ra ther tangled. However, t his poses no real problem s i n ce the st ructure of m emory can be rat i o nalized i n t o appropriate blocks to facilitate implementation i n t o ha rdwa re. This process is easily accomplished by a software mod ule which reo rders t he pointers. For h istorical reasons the final system has been named NEWJAM. It pro m ises to be the vehicle for m uch of the net -sy s t ems research w ork of the a daptive systems and pattern recognition group at B runeI o ver the nex t few years.

2.4 Mapping the real time system into hardware An i m portant advan t age confe rred by NEW JAM is that sin ce t h e data

structure p roduced is t ree-like it naturally decomposes i n t o hardware at several alternat i ve levels. Thus the act ual decom position can be chosen depe nding upon the bandwidth and response t i m e requ ired for t h e rea l­ time system. In Fig. 3 we sketch one possi ble approach fo r i m plementing the rea l­ time recogni t i o n system (envisaged as a co-processor connected to a m i cro-co mpu ter host) The princi pal compo nents of this system a re: .

68000/68020 CPU

This pe r fo rms input-output functions and, i nitially, all act i ons called for via the action table memory. Every actio n is intrinsically a very simple process and consequently the most frequently called actions can be progressively replaced by special p u rpose hardware (Node type A p rocesso r, Nod e type B processor, etc., in Fig. 3).

Memor}, controller

This is the hardware which performs the traverser algori t h m recursively. It could easily be implemented as a gate a rray and requires a s m all stack and access to a small number of stat us registers. In pri nci ple the traverser accesses the system memory via a separate bus (the t ree bus) and ca n d i sable-en able the 68000 bus. [n pract ice t he traverser and the 68000 may share a common bus transparently, with the t r a verser able to con t ro l priority and refresh.

180

Speech reco{j 11 it iOIl

68000 68000 bus o o�

tree



N-tuple

store

ffio:

store

'

rI

tree

traverser

processor

processor

processor

I I I

L_.

vectored interrupt logic

y

Figure 3

Tree t"lll'("'se"-b!ock dill!Jl'lIlII.

Tree memory

The t r ave rse r locates a particular node of t h e tree by consulting a particular base add ress i n tree memory. The block of m em o ry s tart i ng at th i s address contains information describing the n ode ( t y pe, etc.). This memo ry is not particularly large a nd could be i mpl em ente d i n fast RAM.

ActiolJ table memory

Having loca ted a particular n o de and reco vered the add ress of the asso ci a ted action type from tree memory the traverser c o n sul t s this add re ss in the acti o n table mem o ry which acts e ss e n t i ally as a function lo oku p table. As the number of action types is small this m e m o ry co uld

be implem e nt ed in fast

RAM.

N-tup/e storaye memory This is the largest block of memo r y and cheaper RAM.

can

be implemented in slower,

When an action request is initiated, the corresponding module, or the m ust place an acknowledgement in the traverser status register.

68000,

18 1

A. BadU et al. Upon comple t i o n of the a ctio n a return value is placed i n the status register. Having decided u p o n the action type currently requi red the t raverser places the request o n t o the acti o n bus w here it is e i ther vecto red to the 68000, i f no s pecial pu rpose hardwa re exists to perform the a ction, o r passed t o t he appropriate act i o n m odule. Initially there woul d b e n o act i o n mod u l es and the 68000 w o uld perform all these acti ons. A s action m odules are slotted into the system they t ake o ver the correspo n d i n g role fro m the 68000. An additional advantage conferred by this design is that if an actio n m o dule fai l s, the 68000 ca n resume perfo rma nce o f the a ctio n u ntil the mod u l e can be replaced.

3. Isolated word recognition 3.1 Introduction In this secti o n com parative results for isolated word, sil1gle-speaker speech reco{jllition are presented fo r ten difTerent N-tu ple recogn izers. These results are then co ntrasted w i t h the o b served performance for t h e same data using a s t a n da rd dyna m ic time warping algo ri t h m used as a control in t h i s context. Samples o f 1 6 wo rd s from a d iagnostic rhyming test l ist were collected fro m a single speaker o n a carefully standardized data acquisition system (Sh u re S M 12A micropho ne, fla t pre-emphasis profile and a Sony model 70 I ES tape recorder) for subsequent a u t o m atic re t rieval and d igital processing using sa m p l e la belling and a modular A-D, D-A system w i t h 16-bit resolution. This d a t a was then stored o n a VAX 11-750 to enable precise com pariso n of difTerent recognitio n algori t hms. The speech data bank fo r the speech resea rch incl u d es the rhym ing set, the alpha-numerics, simple co m m and words a n d their syno nyms, and the phonotactically permissible CVC-VCV co nstructs from a large speaker popUlatio n u nder both controlled and noisy envi ronmen ts. H oweve r, fo r the p relimina ry stages of the investiga t i o n it was decided to test N-tuple recognitio n systems und e r u nfavourable signal conditions and using the m inimum of pre-processing (ie non-pre-em phasized, n on­ normalized in put speech). Th us if the perfo rmance of a sim ple system, o perating on minimally pre-processed d ata from the rhym i ng set, was acceptable, then it could reasonably be expected that for a given co rpus t he early results wo uld i m prove with a m o re ad vanced N-tu pIe recognize r u s in g optimally t uned pre-processing and n o rmaliza t i o n tec h n iques. Accordingly, t he experim e n t s described here were run on d ata from the

182

Speech recognition noisy environment samples, allowing recognition to take place on sample data having no pre-emphasis or time normalization. Pre-processing was limited to a 19-channel vocoder bank,lo simulated by fast Fourier transform (FIT), and scaling the result as input to the N-tuple recognizers. The diagnostic test set was chosen so that the acoustic dissimilarity within rhyming sets (eg one/run-short) is minimal and the range of perceived phonological length did not markedly vary among the confusable rhyming sets (eg one/run/want-short; wonder/rudder­ long). The 16-word diagnostic corpus was as follows:

Word set 0 1 2 3 4 5 6 7

one run

want begun wonder rudder win two

8 9

10 II 12 13 14 IS

shoe toot tattoo toothache cooler tee three see

Two important dimensions of assessment for a speech recogmtlOn algorithm are: robustness in the face of a large speaker popUlation and the rolIofT in recognition accuracy as the vocabulary size increases. These aspects are not investigated in the present study, primarily because of resource con straints. However, this work represents a necessary first step in the evaluation of N-tuple sampling applied to speech recognition.

3.2 Experimental procedure for speech recognition The strategy adopted for the present experiments was chosen to provide flexibility and repeatability with the same data, thus enabling

comparison of differing recognition and pre-processing techniques. For this reason, simulations of the training and recognition process for eight different designs of N-tuple recognizer were performed on previously stored data using a VAX 1 1-750 system. Real-time performance was not a factor since it is known that the systems under consideration can be implemented with a satisfactory real time response when a suitable design has been proven.

3.3 Pre-processing algorithm The raw-time domain files were subjected to a

lO-ms wide FFT

183

A. Hadii et al. producing 19 8-bit samples of each filter channel every 5 ms. In the first six experiments the 8-bit value was reduced to a 4-bit value using one of three encoding methods discussed below (encoding of data). The 4-bit intensity can be considered as a weighting of each pixel on the retina and the 19 samples as a single slice in time encoded as a vertical column on the WISARD retina. In this way each word was reduced to a 120 x 19 array of 4-bit elements. The total duration being 0. 6 s. After the first six experiments the 4-bit intensity of each filter channel was replaced by a single bit which was set if a pre-determined threshold (determined experimentally) was exceeded, thus reducing the word data to a 120 x 19 array of single bits for the final four experiments.

3.4 The \VISARD

retina

The WISARD retina was sized at 100 (horizontal) by 19 (vertical), each component consisting of four bits initially and one bit subsequently. In the recognition stage of a real system the sample data can be visualized as stepping across the retina in steps of one horizontal unit (5 ms). Precise alignment in comparison with the training data would therefore not be a problem-as the data slid across, the system would be looking for a sharp peaking of one discriminator, see Fig. 4. Of course, one discriminator could be trained on the ambient noise. Thus segmentation of speech from background becomes an implicit property of this paradigm, ___ see _---three .-__ tee ___ cooler

___--toothache

___-tattoo

._---toot __---shoe

._---two _---win _---

rudder

__---wonder

max

Figure 4

184

time-·

Plot oJ(lf/ discriminatur responses to 'coot hache'.

Speech recoOllitioll �----�----�

�� ,

-

r':'

II -­ .-. "--· · - ........ _ . ..,-.

-

.

.. ·

-. . . .. _

� .. ...... .... -

toothache

1'-'-. 1-

i l

..

:==

.

.

'-

"'-­

.

.. .

_...... , .. . � . ......,.-. .. �-..,. - -.,.� .

..

...� .. .

'"

:::i._.�-:"-

Figure 5

FFT images o/llJe Irort/'toothache'.

Because the computa t i onal cost of scanni ng the image across the reti na i n 5-ms steps is t oo high i n a simulation of this type, the start of a word in the sample frame was arbitrarily decided to occ u r when a 1 0% i ncrease in the ambient energy level (summed across all fi l ter channels) was observed. In training. each such sample was presented three tiines, represen t i ng a 'ji tter' of ± 5 ms about the determined start point. Fig. 5 shows FFT samples for t he word 'toothache', The vertical l ine ind icates the time at which the threshold was exceeded; the s ubsequent 100 columns (500ms) are taken as t h e retinal image. 3.5

Encoding and rna pping

Four different kinds of encoding of the 8-bit sa m p les produced by the FFT were employed. In t h e first six ex periments each encoding reduced the 8-bit data to four bits. In the remai ning two experiments the 8-bit

sam ple was red uced to a single bit (binary encoding),

(I) Lillear-encodiny: here the top four bits of the 8-bit sam ple were selected and their binary image slotted into the retinal column in t he position determined by which filt er the output originated. (2) Thermometer-ellcodill{j: for this encoding the i nterval [0, 255] was partitioned into five equal sub-intervals and integers in each su b-interval were mapped in t o a 4-bit value. 185

A.

Hadii et al.

(3) Gray-scale-ellcodin{}: here the i nterval [0, 255] was divided into 16 eq u a l sub-intervals. Each sub-interval is i n de x ed by a 4-bit value i n such a way t hat the Ham ming di s t an ce between the i ndices of adj acent i ntervals is always 1. Th is fo r m of indexing amounts to t ra versing all the vertices ofa hypercube. The idea being that a small change in the value of the signal b e i n g encoded w i l l produce a small change of Hamming distance in the encoded image. (4) Billaryencodill{}: finally the 8-bit sam ple was reduced to a single bit by t h reshold i ng at an experimentally determined level. In the i nitial six ex p e riments N 4 a nd so 19 x 100 x 4/4 N-tuples are chosen from the 1900 x 4 bits of the retina t o define the mapping. Two types of mapping were used, namely linear, where N-tuple add resses are taken f rom cons ecu t ive p ixels i n a co l u mn, and random, where the add resses are com posed fro m bits sampled randomly across the entire retina. =

3.6 Results and conclusions for the 4-bit-4-tuple rccognizcrs Single-speaker recognition results with the 16-word repert o i re . 4- t u p l e, 40-� s sampl i ng rate (25 kHz., BW 0-8 kHz): In t he 4-bit en cod i ng , 4-t u pl e experiments the best overall performance

was obtained wit h linear encoding and a l inea r map or, equivalently, with G ray-scale e nco ding a nd a linear map. I nitially we found t his result rather u n e xpec ted in that the li near map employed t ook 4-tuple addresses from a si n gle time s lic e, whereas the random map also looked across time. However, further co m pari s on with the I-bit encoding, 4tuple experi ments suggests that 4-bit encoding may have been presenting t he system with excessive, relatively u nrepeata ble, detail. It would appear that most l earn i n g occurs d u ring the first five training instances of a ny given class, at which point the system g ives around 85% accu racy. Subsequent t raining tends initially to red uce r ecogniti o n perfo rm ance and reco ve ry is thereafter p rogressive but slow unti l saturation beco m es a significant e ffect . We will ret u rn to the q uest i o n of how the p rogress of the system t owards saturation can be effectively monitored. However, our results suggest that with these system configurations, t raining on m o re than 25 instances from each cl ass ca uses overall recog n it io n per fo rma nce to degrade. With 4-bit e nc o d i ng , a linear mapping a nd a 25-word teach set, the average perfo rman ce of 90% looks quite promising as an initial result u nd e r the unfavou rable conditions of the experiment. But t he accu racy per wo rd over the entire t raining seq uence of 5, 10, 15,20 and 25 patterns respectively was as shown in Table 1. Each d i scriminator co n si s ted of 100 x 19 16-bit RAMs, ie a 3.8 (8-bit) Kbytes per word. Since there were 186

Speech Table I. 4-lllple-lillear map-4

5

x

19

x

IOO-lilleal" ellcodillg

10

Class one

run want begun

wonder

rudder win two shoe toot tattoo

toothache

cooler tee three see

Average

recoyniticm

Training 15

20

25

40

44

96 96

100

52 88 96

88 92

92 96

92 100

100 92 100 100 80 100 76 88

100

% Accuracy

60 80 48 100

92 100

80 88

92 92

1 00 96 1 00 80

36 16

12

96 92

92

84 84 100 88

100

80

80 84 100 92

100

tOO

96 100

92

60

96 44

80 100 60

85.00

83.15

96

84

86.50

16

96 16 84

88.00

96 76 84

92

100 96 100 80 96

92

90.00

1 6 class discriminators this comprised a total of 60.8 Kbytes of RAM u sed by the 4-bit-4-tuple recognizers. Table 1 shows that the p e r fo r m ance on the word 'one' (the worst case) was p l ain ly unsatisfactory. A graphical confusion matrix for this experiment is given in Fig. 6. The confusion between the first three utterances, which uttered with no context would be particularly confusable even to the human listener, can mainly be ascribed to the fact that both the phonological duration as well as word-final and word­ initial qualities are almost identical. In an attempt to gauge the efficiency with which the discriminator RAMs were being used, two sets of statistics were produced for the case of 4-bit-4-tuple linear m ap ping with linear encoding. The first concerned the number of bits set in each 16-bit RAM versus class. The second gave the number of identical RAMs for all classes and the number of identical RAMs in pairs of classes. We briefly summarize this information. Almost all zero-addressed locations were set, indicating that virtually every 4-tuple had seen (0,0,0,0), ie a complete absence of activity in the retinal cells sampled, during training. Typically, each discriminator had around 1000±400 RAMs, from a possible 1900, with exactly one bit set. The previous observation suggests that in most of these it will be the zero-addressed bit which is set. So that

187

r;� o �o�-O �' I 1 ','0:, 1 ,

A . BacW et £II.

�o

� 2

� t

0

0 0

set

. 0

0

I

r

0

.

.

. .

. '

.

3

.

I . � I

.

.

. ..

set 4

.

:

.

.

.

'

LSJ · · �, ::J. I � "··

I>' .

.

.

:.

Figu re 6

I I . 'I



.

0

0

0

.

0

.

:

.

.



reoog

aligned l i near lin. map

COllfusioll matrices with 511 011 5120125 trailling examples.

anywhere between 3 1 and 73% of the RA M s was each merely a ffirming the absellce of some 1 6 particular activity features as a basis u pon w hich to classify. The n umber of R A M s per d iscriminator with more than one bit set was typically arou nd 500. One might say t hat approximately 2 5% of R A M s were providing a con t ribution to classification based on between o n e and 1 5 observed activity features. There were 9 1 R A M s which were identical fo r all classes. Thus most R A M s con t ribu ting on the basis of an observed activity featu re were providing useful classification info rmation. Typically the number of identical R A M s in pairs of classes was i n the range 500- 1000, ie i n any pai rwise decision 50-75% of all releva n t R AMs made a useful contri b u tion, even if m ost of these were repo rting absences of activity feat ures. Of t h e 28 500 = 1900 x 1 5 non-zero-add ressed bits per discriminator around 3000 were normally set (about 1 0%) as com pared to a to tal number of bits set in the range 5000-7000 (max. possi ble 30 400). One can i nterpret this i n one of two ways: one can argue that 1 0% R A M utilizatio n is inefficient (in a 2-c1ass system wit h ideal preprocessing the probability of any d iscrimi nator bit being set after training should be 0.5, with no commo nality between discriminator contests); or one can say that this state of affairs reflects our ignorance of p recisely what co nstitu tes the cri tically significant fea t u res of the speech signal. (Such debates have a certai n air of ci rcularity.) 1 88

Speedr

reeog/litioll

3.7 Results a nd conclusions for the I -bi t-N-tupIc recognizers A W I S A R D net is sat u rated when all discri m inators give maximal respo nse to sample d ata. This could occur, for example, as a result of over-training. I n practice one t rains the system almost to t h e point where the dynamic range of discriminator responses becomes insufficient to give an adequate margin upon which to base a classification decision. To monitor t he effectiveness of training in the last fou r experiments we define the following parameters of the system response with respect to any particular test sample: Response M i n-response Ave-response

=

{ the discrimina tor sco re expressed

as a

maximum possi ble.

=

=

t he the

percentage of

minimum respo nse from any class. average respo nse of all classes.

Let D(i) be the response of t he ith discrimi nator. For any pa rticular class j

let

d(j)

=

max

{D(i); all i not eq ual to j}.

is t h e best response from a l l discriminators excl ud ing the jth. Suppose now the data sample belonged t o the jth class. Then D(j ) - d(j) is a measu re of the ma rgi n by which the classification was m ade. I f D(j)- d(j) i s negative then the sample was incorrectly classi fied.

Thus (/(j)

Table 2. 4·tup[e-[ill{'ar-/I/ap- J 9 x J OO-hillary ('IICI}([ill?l

Class one

run want begun wonder rudder win two shoe toot tattoo t o o t h'l che cooler tee three sec

A verage

5

10

96 56 88 1 00 1 00 96 88 92 64 92 84 1 00 1 00 76 92 92

72 64 92 1 00 96 84 88 92 96 1 00 96 96 1 00 72 96 92

8 8. 25

89.75

15 % Accuraey

20

25

52 64 96 1 00 96 88 84 92 1 00 1 00 96 96 96 80 1 00 96

76 76 96 92 96

64 76 96 92 96 88 88 96 96 1 00 1 00 1 00 1 00 84 96 1 00

89 . 75

88

88

92 1 00 1 00 1 00 1 00 1 00 84 96 96

92.50

92.00

1 89

A. Badii et al. Table 3. 4-tllple-lmear-map-19 x

Class

to

5

89.6

one

cooler

l OO-binary encoding

56.4 74.4 2.3

92.8 61.2 80.6 0.1

85.6 49.1

90.4 58.4

20

2S

94.4

95.8 14.5 81.4

96.6 75.8 88.1

94.9 68.9 8 1 .5

96.3

Response Min-response Ave-response Margin Response

82.9 61

Ave-response Margin

1 1 .2 83 9 - 0.2 .

92.3 66.0 79.0 6.2

15.7 6.8

68.8 1.9

IS

Statistics

0.3

6.0

0.4

Min-response

70.2 .

As training and testing progresses, the quantity DU) - dU) can be averaged over the test samples to provide a progressive picture of how training gradually reduces the margin of decision. Over a test set T of samples we can define for each class j:

Margin =

the average of DU) - dU) over T.

In the last four N-tuple experiments these statistics were collected to provide a running picture of the extent to which each class could benefit from further training. Table 4.

4·tuple-random-map-1 9

Class one

run

want begun wonder

rudder win

two shoe toot tattoo toothache

cooler tcc

three see Average

190

x

l OO-binary encoding

5

10

88

88

IS

20

25

64

72 72 96 96 96

68

% Accuracy

52 96 100 100

72 96 100 96

56 96 1 00

92

92 92 96 92

88 92 96 92 100 92 88 48 92 1 00

96

92

60

92

64 100 100 16

84

88

96

84

56

92

84

92 92

86.50

88.50

96 84

86.50

84

92 92 96 92

100

92

96 72

96 1 00 90.25

16 96 96 1 00

88

88 96 96 92 100 92

1 00 68 96 1 00

90 75 .

Speech recoOllitiolt Ta ble 5. 4-l liple-rlllll[olll-lIIup- 1 9 5

cooler

r oO - hillar)' I!llcot/iIlq 20

25

95.3 48.2 75.0 0.4

96.3 59.4 8 0.9 0. 3

96.9 6 1 .9 8 2.9

94.8 52. 1 75.4 5.7

96. 8 53.7

15

10

Class one

x

Statistics 90.4 3 1 .4

93.8 40.6

63.2

70.6

3.5

1.8

84.2

89.0 40. 4

91.1 49. 4 7 1 .9

7.0

6.2

32.2

68. 1

58.9 1 0.4

Rcsponse M i n - response A ve-response M a rgi n

0.3

R esponse M i n -response A ve-response M a rgin

77.0 6.5

The experiments were cond ucted fo r both 4-t u pl e and 8 - t uple mappings over a wide range of t h reshold va l u es ( 1 0- to 50-chan nel i ntensi ty). It was fo u n d t h a t the sys tems were rel a t i ve l y i nsensitive t o t he t h re s hold fo r t h e bi n a ry encod ing ove r t h i s ra nge, t here being a l m o st no detect a b l e d i fference in performa nce. We will p rese n t t h e res u l t s fo r a t h res h o l d of 20 as bei n g t yp ica l in Ta ble 2. For the l -b i t -4-t u ple recogn izers t h e RAM cost is 950 bytes per discri m i na t o r, giving a total of 1 4.84 K by tes fo r a ll 1 6 classes. H o wever the margin o f decision d ecreases very ra p i d l y as t ra i n ing p rogre s ses We give the worst and bes t case figures in Table 3. .

Ta ble 6. 8-l rtl'iC'-lill(,(lr-Ill(lI' - 1 9 x I O()-hirlllr)' I!lIcvciilllJ 5

10

one

92 64

want begu n wonder rudder win t wo

88 96 92

88 76 96 96 96

Class

run

shoe

toot t a t too toot hache cooler tee

96

88

84 1 00 68 92 80 1 00 1 00

92

three see

76 92 72

Average

87.00

1 00 88 92

96 1 00 96 60 96 76 89.75

15

20

80 84

84

% Accu racy

96

1 00 92 88 92 96 92 1 00 1 00 1 00 1 00

76 1 00 88 92.75

25

72

88 92

84

96 92 88 96

96 92 88 96 92 96 1 00

92 96

1 00

1 00

1 00 1 00 84 1 00

92

1 00 1 00 1 00

84

1 00

96

96

94.00

93.00

191

A. Bad;i et al. Table 7. 8-tuple-linear-map-1 9 x l00-binary encoding Class one

5

10

77. 1

3.8

8 2. 6 40.7 62.3 3. 2

67.0

25 . 3

30.8 54.8

cooler

45.9 1 0.9

20

25

85.0 46.6 66.4 1 .9

87.8

89.6 55.4

7 1 .2

74.5

78.5

82.7

32. 1 52.6

38.3 56.3

40.3 59.4

72.9 2.6 86.3

15

Statistics

1 2. 1

I t.I

52.9 2.4

1 3.7

Response Min-response

Ave-response Margin

Response Min-response Ave-response Margin

4 1 .3 6 1 .2 1 5.8

The result given in Table 3 is significantly better t han the corresponding results for the 4-bit encod ing experiments. at a fraction of the RAM cost. It provides evidence that the 4-bit systems were being prese n ted with excessive detail. We next compare the corresponding perfonnance with a random map (Tables 4 and 5). Once again the linear map pro vides consisten tl y better results. Turning now to the I -bit-8-tuple results we have (Tables 6 and 7). For the I -bit -8- t uple recognizers the RAM cost is 7.42 Kbytes per discrimi nator, giving a total of 1 1 8.75 Kbytes for all 16 classes. The Table 8. 8-tuple-random-map-1 9 x l00-binary encoding Class one

run

want

begun wonder rudder win

two

shoe

toot tattoo toothache

cooler

tee

three

see Average 192

5

10

15 % Accuracy

20

25

88 56 80

92

92 84

88 84 96

84

100

96 96 92 92

52 92 S6

t OO 100

76

60

72 8 1 .75

76 96

96

1 00 96

1 00

88 96 96 76 92 88

96 92 92 92

100

88

48 100

92 89.00

96 88

100 96 88

96 96 96

96 1 00

92 100 1 00

1 00 100

1 00 1 00

88 52

9 1 .50

92 80

94.00

88

96

96 96 88 96 96 96

92 1 00 1 00 1 00

80

tOO 1 00

94.25

Speech recogllit ioll Table 9. 8-luple-randolll-/llap- J 9

Class one

cooler

x

I OO - billary I'llcot/illff

5

10

15 Statist ics

20

25

7 1. 1

84.6

7.5

80. 1 5.8 39.0 7. 1

8. 2 44.9 4.3

87.3 1 9.2 5 0. 8 4. 1

22.0 5 3.5 3.3

Respo nse M i n - response A ve-respo nse M a rgi n

4.4

66. 1 7. 1

70.4 1 1 .0 35.5 1 6.0

76.2 1 2.0 38.7 1 8.6

82.4 1 2.9 40.7 22. 1

R esponse M i n-response A ve-res ponse M a rgi n

3.7 3 1 .8 55.3

23.8 1 8 .0

3 1 .6 1 6.9

89.0

results are somewhat better and, as one migh t ex pect, the m a rgin of decision d ecreases less ra pidly as training progresses (Tables 8 and 9). These final results a re margi nally better fo r the ra ndom m a p. This suggests tha t a bi l i ty to perceive the logical co nj unction of several fo r m a n t fea t u res (in this i nsta nce an 8-tu ple recogn izer) i s req u i red before the ex pected advantage results from attempting to e x t r a ct fea t u res across the time domain of a sliding F FT. 3.8 Compa rath'c results using comcnti onal time-warping

We next d escribe the results o bta ined with the o rigi n a l 1 6-wo rd set but using conventional time-warpi ng- tem plate-matching recogni tion. Comparison of t he s e results with those of the N-tuple recognition system shows that, on t he same da ta, 8-tu ple sampl ing provided significantly improved recogn ition accu racy. 3.8. 1 DTW algorithm description

Assume, fo r the moment, that word s are not finite tempora l ly ordered sequences of spectra but co ntinuously t i me-varying, vec t o r valued functions. Su ppose aCt), b(t) (0 � t � T) are two words which w e wish to com p a re. We may d efine a metric a t the level of primitive pat terns as

D(a, b)

=

foT

d(a (t), b(t» dt,

where d is some suitable metric of spect ral dilTerence. We k now that very large local variations in the ra te of articulation of a word can be tolerated without compromising i ts intelligibili ty. This suggests that a bet ter metric should be largely invariant to cha nges of timescale. One way to a c c om p l i sh t h i s is to define a fu nction q(t) which m a ps the timescale of bet) onto that of a (t }. M odifying the previous 1 93

A. Badii et al.

equation accordi ngly

we

obtain

D * (a, b)

=

min q

fT 0

d(a (t), b(q(t))d t.

Esse ntially this is an instance of a classical variational pro blem whose solution is found by sol ving the co rresponding E uler- Lagrange equa tion. However, D * must not be calculated with respect to an a rbi t ra ry change of timescale; we must place some const raints on q and these complicate the problem so as to make it, in general, analytically intractable. Fortu nately, as Bellman h as shown, 1 1 a numerical sol ution can be efficiently obt ained by means of dyna mic progra m ming. I t was t his line of reaso ning which first led Vintsyuk 1 2 to apply dynamic programming to speech recogniti on, o fte n called dynamic time warpi ng. The DTW algorithm described below is based on the work of Sakoe & Ch i ba. 1 3 Let a j ( 1 � i � tI), bj ( 1 � j � r) be sequences of spect ral vectors. I f d(aj, b) i s a suitable meas u re of d istance between 3 j a n d bj t he DTW algori t h m finds a pat h connect ing ( 1 , I ) and (II, r) such that the cu mul ative d istance is minimal, the gu id ing principle being that if a locally correct d ecision is made at every point then a globally co rrect pat h will be fou nd (this is often obscu red by specific implementatio ns). I f the cu rrent poi n t is (i,j), then we choose the next point (i/,j') by examining the t h ree possible paths as ill ustra ted below:

/ � (i + l , j + l )

(i

(i , } )

I

(i + l , j)

and choosing a path corresponding to t h e minimum value of

d (aj, bj t I }' d(aj t I ' bj t I )' d (aj t l ' b), where any point outside the rectangular reg i o n cu m ulative d istance D*(i,j) is t hen u pdated:

D*(i',j')

=

D*(i,j) + d(aj., bj.),

D *( 1 , 1)

=

IS

o m i t ted. The

d(a l , b l ).

The final value D*(u, r) provides a l i m e normalized measure of distance between a and b. When performing recogniti o n the un known a is compared agai nst every b in the vocabulary and assigned the class for which D* is minimal. 3.8.2 Results using con\"cnt ional DTW

The DTW algo ri t h m com pares two arrays ref (the tem plate- vertical axis) and unkllowll (the test sample - h o rizo n tal axis). Figs. 7 and 8 show complete cumulative d ista nce con tours and an opti mal path fo r two ru ns

1 94

Speech recogllit ion

Figure 7

D T W for t\\'o

Figure 8

D T W for

(/ifjerelll sample oj'wotlUlche',

t ime-aliYllec/ 'toothache' "YUill.51 referellce. 1 95

A . Badii

et

al.

of the pr o g ra m In Fig. 7 the wo rd 'toothache' is c ompa r ed with a d i fferent sample of t h e same word. As a test of these routines a sample o f 'toothache' was compa red with the reference 'toothache' and the r e s u l t i n g path used t o warp the sample to confo rm to the reference. In Fig. 8 a second D T W is t hen performed, com paring the t ime-aligned sample against the reference; the resulting optimal path is, as expected, a straight line; this acts as a good test of the code. In applying t h e algorithm, only one template is used for each re feren ce word, but that reference is based on 5, 1 0, 1 5, 20 o r 25 words taken from the teach sample. For exam ple i n the first experiment five samples o f the same word were selected. The first was taken as the basic reference and the remaining fou r were time normalized against the first in the usual way. In the sample vs. sample distance a rray so prod uced, ea c h dia gon a l path was used as a time-distorting function to normalize the sample against the basic reference. Having eliminated as much time variation as possible all five samples were then a vera ged to prod uce the sing l e .

,

reference.

I t seems likely that one would get better results for DTW if each word i n the teach set were used as a separat e re fere n ce rather than by combining them as described above. H owever, the com p utation a l overhead in recognition would be so h igh that it is difficult t o imagine a real-time system performing i n this way. Table 1 0. D T W results 5

10

Trai n i n g --+ 15 % Accu racy

76

80 88

84 88

84 88

92 1 00

92 1 00

Class one run want begun wonder rudder win two shoe toot t a t too too tha che cooler tce t h ree see Average

1 96

80 92 1 00

88

92 1 00 20 1 00 92

76

76

80 80

92

76 1 00

60 92 1 00 92

80 64

76

1 00

88 84 92 80

64 88

20

76 1 00

88 80 92

80 64 88

25

84 88 92 1 00

76

1 00 88 80 96

80

68 92 1 00

1 00 92 92 1 00

1 00

l Oa

80 1 00 92 92 1 00

1 00

92 1 00

86.5

84.2

88.7

88.5

89.2

92

92 92

92

Speech recognition It would appear tha t the technique of averagi ng (time normalized) templates does provide some progressive improvement in accuracy as the number of templates increases-at least within the framework of this experiment - but that this improvement is not great (Table 10). These are good results, admit t edl y at enormous computational cost, and emphasize the value of time normalization. Nevertheless, comparison with Table 8 shows that an 8-tuple WISARD recognizer (having no time normal ization and, in principle, virtually zero computational overhead) obtained significantly better results on the same data. The inference would seem to be that if it were possible to pro vide a WISARD recognizer with time normalized data, at reasonable computational cost, the resulting system should have a remarkably good performance This was confirmed by a later set of experiments. .

3.9 Summary of results and conclusions In this ini tial series of e xperiments in the application of N-tuple sampling to the problem of speech recognition some interesting lessons were learnt (Table 1 1 ). These experiments demonstrate that under the most unfavourable conditions (noisy rhyming test utterances from a naive speaker, no pre­ emphasis, no signal conditioning, no time or amplitude normalization) N-tuple sampling, applied to single-speaker isolated word recognition with a 16-word diagnos tic vocabulary. yields an improvement in accuracy of around 5% (in the range 90- 1 00(10) over conven tional DTW using the same data. -

Table I I .

Summary of results Data

Tuple

4 4 4 4 4 4

4

4 8 8

Encoding Linear Linear

Thermometer Thermometer Gray Gray

Binary Binary

channel 4 4 4 4 4 4

I I 1 1

Binary Binary

8-bit per channel

bits per

-

19 channel DTW

Mapping Unear

Random Unear Random Unear Random Unear Random Uocar

Random

RAM per word (bytes) 3.8 K 3.8 K

3.8 K 3.8 K

% Accuracy 90* 88

79.75

SO.50

3.8 K 3.8 K

90* 87.25

950 950 7.42 K 7.42 K

92 90.75

93 94.25 89.20

• identical.

197

A. Badii et al. W i t h amplitude n o rmaliza t i o n and act ive range encod ing of the pattern vectors a further improvement can be ex pected to result. M o reover, a WI SAR D implementa tion o f N- tu ple sampling h a s virtually no computatio nal overhead (as compared to the h igh computational cost of DTW, o r other recognition parad igms), and ca n, i n principle, be built so that the respo nse time is i ndependent of t h e n u m ber of classes. A fu rther advant age of t h is parad igm i s that for a real system d iscriminator responses monitored contin uously can provide whole word recogni tion of connected s peech without the necessity fo r segmentation.

4. Vowel detectors 4. 1 I ntroduction a s peech recognition system would be to identify ph onem ic segments of co n t i n uous speech acc u rately. Phonemic recognition need not be exceed i ngly accu rate; accuracies a ro u nd 80% m ight well suffice, since relat i vely s i m pl e l i nguistic k nowledge based systems ca n detect something a p proach i n g 60% of ra ndomly induced e rro rs i n a phonemic stream of English u t terances ( Da d i i , H ui & J o nes ­ in prepa ratio n). Pho nemic rule based error detect i o n can also be enhanced to provide some degree of error cor rect ion. H igher levels of syn tact ic, semantic a n d co ntextual k nowledge m i ght t he n be used i n a similar fas h i o n to p rocess the p h o nemic s t ream i n t o text. S uch a system could i n pri n c i ple cope with a n u n l i m i ted voca bu l a rly, in con t ras t to the l i m i ted v oca bula ry word recogn i t i o n systems cu rren t l y in use. Certainly the goal of speech reco gn i t i o n must be beyond isolated w o rd recogni t i o n t o wa r d s the effective recognition of con t i n u o u s speech. Syst ems such as CO H O RT and T R A C E (see [ 1 4] , Chapter 1 5, for exam ple) p oi n t the way b u t do n o t promise cheap im plementation in the m ed i u m ru n. Despite the fact that some authors 1 5 report correct segmentation of con tinuous speech into phonemes w i t h u p to 97% accuracy, R umelhart objects to segmentation befo re recognition:

A desi rable goal fo r

Because of the QL'erlap of successiL'e phollemes, it is diffi cult, Qnd we heliere coullterproductit'e, to try to diride the sp eedl stream lip into separate phonemes ill advallce of identifyill�J t h e lIllitS. A 1l1ll1lher of other researchers (eg Fowler, 1 984: Klatt, 1 98() hare made milch the same poillt. ( [ 1 4J , pp. 60�6 1 ) Rumelhart p refers t h e approach o f allowing the phoneme identificati o n

1 98

Speech recogllition

process to examine t he speech stream for cha racteristic patterns, without first segmenting the stream into separate u nits. I t is interesting that either a pproach is practical using a WI SAR D-type device. The advantage of prior segmentation is that i t perm its some degree of time n o rmal ization before presen tation to the recognizer, and work at the Pat tern Recognition Laboratory at BruneI Uni versity has shown t hat a very considerable improveme nt in recogn i t ion occurs if W I SA R D i s presen ted with t i me- n o rmal ized d a t a . We may define a static pat tern recogni tion system to be one which stores its t raining ex perien ces in memory and refers to memory in seeki ng to classify u n k nown patterns. This con t rasts wit h a dynamic: system which continually u ndergoes s tate t ra n s i t ions a n d whose ou t p u t depends on the current (and possibly previous) state(s) a nd the input rather than t h e i nput alo ne. While, dynamic pattern recognition systems a re of co nsiderable interest, the current theoretical situation is largely specu l a t i ve and it seems likely that it w i ll be so me t ime before any practical sys tem for v i s i o n o r speech will be realized along t hese l i n es. I n a static pattern recognition system the goal is to optim ize the map between input patterns and memory while preservi ng the real-time performance and keeping t raining to a m ini mum. In a ppl i cations such a s speech, the situation is rendered more difficult by the fact t hat the significant fea t u res o f the signal are n o t really well u nderstood. Without feed back, WISAR D is a static model which makes no 1I priori assumptions about the i n put patterns and is easily implemented to give a suitable real-time performance. A s we have observed, W I S A R D is very simple and fast t o t rain, provided o n e has su i tably l abel led samples of each class. Th i s last req u irement creates seri ous logis tical problems in applyi ng s t a t i c pattern recogn i t i o n models to speech a t a level below whole words. The speech signal must be exami ned visually and acoustically by a human operator who defines t he boundaries of a segment which hopefully represents an example of the particular class. This sample can then be used for t raining or testing. Since many such samples a re req uired for each class t he construction of a suitable data base is a very time consuming p rocess. However, once such a database h as been p repared it can be used for many d i fferent ex periments and can enable d i rcct co mparison of d i fferent algori t h ms o n identical data. The experi m e n t s reported h e re were restricted t o L"01rel detec t ion ror a silTgle speake/'.

4.2 Vowel detection usi ng multiple discriminators per vowel The words were pronounced in word p a i rs which instantiated the same 199

A . Badii el al.

vowel in an a t t empt to obtain the coarticulat ive eITects which would normally be present in co nti nuous speech. The sam ple speech was collected and passed through a 1 6-channe1 filter bank to p rod uce frequency doma i n d ata. The frequency information was in 5-ms steps. For both the t raining and test ph ases it was necessary to create a para l lel file containing an i ndication a t each step as t o which class the 5 ms sample corresponded (or to no class). This second file was hand crafted and iden tifica tion was acco mplished by traversing the time dom a i n data in small steps while playing back progressively nested samples through the D-t o-A. Consequently, there is an element of subjectivity inherent in this identification process. One variant of each vowel was selected, these were: A as in fAte E as in m E t

I as in bIt o as in gOat U a s in d Ue

The Concise Oxford EI1{}lisl! Dictiollary was used as a guide in defin i n g which vowels were t o b e expected in the pronunciation of each wo rd. I t s h o u l d b e noted that various dictionaries are by no m e a n s in agreement as to the precise qual ity of each vowel that occurs in a given word a nd, of course, there is considerable variation between speakers. In a n attempt to deal with the fact that samples of a given class are l iable to considerable variation of d u ration, each vowel segment i n the t raining freq uency data (once identified as above) was, in this in itial experiment, linearly scaled to a uniform d u ration in order to fit a standard 1 6 x 1 6 8-tu ple W ISA R D retina with one thresholded bit per pixel.

The variation i n the vowel lengths was typically from 45 to 250 ms. Although we actually know how long the vowel samples are in the test phase, we cann ot use this information d u ring recognition, since t he ability to c o pe with such variation is intrinsically part of the recogn i t i o n process. To deal with this we used six diITerent scale factors. The incoming sound was pl aced in a buITer long enough to accommodate at least 250 ms (the l o ngest observed vowel length). Every 5 ms this buITer was updated and six snapshots of d i ITering lengths were presented t o the WISA R D recognizer. The classifiers with thei r difTerent scale factors were t reated as though t hey were separate classes so that d u ri ng testing the highest respo nses would hopefully detect both the correct vowel and its d u ration. 200

Speech recognition Table 1 2 Six discriminators for duration per class

Pen::cntage recognition accuracy Class only

Vowel

Duration and class

A

23.1 29.2 37.0

6 1 .5 54.2

31.8

8 1 .8

25.2

6 1 .8

E

I o U

59.3

4.2

Average

54.2

4�1 Results and eooclusioDS As Table 1 2 might suggest, a confusion matrix for response against class and duration shows that correct classification of class was more relia ble than correct classification of duration within the class. This is probably explained by the fact that estimating the vowel duration while pre paring both the training and test data-classification file is a difficult and rather imprecise affair. A second confusion matrix looking only a t response against correct class is probably more significant and is given in Table 1 3. In general terms the idea is to present a sliding window of the frequency domain data from the test utterance to the WISARD net and determine whether the vowel discriminator responses are detecting the embedded vowels. Fig. 9 summarizes the result of one suc h simulation and consists of four traces. The top two traces indicate the strength of response and the confidence (the difference between the best and second-best classifica­ tions) for all window positions. The next trace details which vowel was producing the largest response as the words slid past the window. The bottom trace indicates where the vowel was found by the experimenter. From these results it can be seen that single-speaker vowel detection from within continuous speech can be performed by a WISARD net using spectral energy data with a reasonable degree of accuracy. Table 1 3.

Corifu.sion mQlrix for class resfKNIM A A

E

Actual class

I 0 U

23 3 1 20 I

E

Classified as ... I

4

6

1

16 1 7

14

4

7

0

U

I

6

15

12 1 30

2

201

A . Badii et al.

Figure 9

SlImmary of COli! iIII/o liS respoll.�e.

H o wever, i t should be em phasized that we are only a t tempting to rec o g n iz e one part icular type of each vowel quality.

It is possible to envisage a number of improvements i n the experiment described above. For e x a m p le , most of the energy i n vowels is concent rated i n t h e lower frequencies. Th ere fo r e a suitable pre-emphasis p rofile would no doubt improve the reliab i l i t y of such a system. The significance of t h e se p relimi n ary vowel d etec t i o n results is to demonst rate the feasibility of using W I SA R D nets to reco gn i ze significan t speech fra g me n ts within words of co n n e ct ed s peech, but the re sults would be more i n teresting if genera l ized to a comprehensive set of building b l oc ks s uch as phon em es or phoneme-like fragments.

4.3 Breed ing vowel detectors usin g I l olland's genetic al g orithm Given that the i nitial mapping from the retina to memo ry, that is, the assignment of N - t u p l e b i t s across t h e retina i s r and o m , the q uestion arises as to whether the m a p p ing can be im proved for a particu lar type of application. For exa m ple , if the task were face recognition, t hen a bet ter performance might be e x pe cted i f the N-tuples w e r e sa m p l i n g more densely in that area of the retina where significant fe a tu res such as the ey e s , hai rline, and mouth are pres ented. 202

Speech recognition As a vowel detector. a relatively difficult task. WISARD gives a creditable performance considering the lack of time normalization. Across the five classes. as we saw in the preceding section. typical recognition accuracies exceed SOUIo, and in particular classes are as high as 80010, as against the expected 20% . of pure chance. Of course, carefully crafted vowel detectors can do much better than this, Jassem 1 6 reports accuracies of 92-97% in his review of speech recognition work in Poland. However, WISARD is a very simple recognition paradigm and the question addressed by the present experiment is: by how much the reasonably

2

!

c • Col



• :I

i ..

3

� �

� 1 01 r,; 1141

4 6 8 7 8

�1

! i

22,1

\

,



[281

W

/ i

II

,I

1281 j

1,/

1/

16 18 1 17 1 18 . 1 9 1 20

M

l'

j



j \

10 1 1 1 1 12 11 13 1 1 4

I

1

1181 181

9

l

/

'j

i

V

�' j /

,.'�

" -,-

�'

./ . .v

1"'"

mapping

f/)j;:;//

11031 881-11

'r

1 42

.tdr_l l 1 0 1 1 1 0 1 0 1 1 1 1 1 1 i •

I



8-tuple ",

I

• •

. . .

I

... �.

I I I I I I I

I

I I I I I . . . I_8-tuple #40 _I 8-tuplt #2 -

203

A . Badii

et

al.

perform ance can be better ma ppin{js?

impl"Ot"ed

usill{}

Holland's yelletic a{oorithm

to

'breed '

Holland's algori thm 9 was chosen because it is a very powerful adapt ive search t ech nique and beca use the mappi n g from ret i na to N-tuples is easil y described as a stri ng: each position o n t he retina is n u m bered and each block of N s u ch n u m bers i n t he string describes the mapping fo r a particular N-t uple, see fig. 1 0. This is a part icula rly pleasant s i t uation, because the usual d i fficul ty with genetic a lgorithms is representing the o bjects being optimized as s t ri ngs i n such a way that after using a genet ic operator, eg mutation t o alter a n elemen t, t h e resulting string still rep rese nts a valid o bject. I n the presen t case this is not a p roblem s i nce any string of integers in the correct range ( i n this case [ 1 , 3 20] ) represents a va lid mapping.

4.3. 1 H olland's genetic algorithm

Simulated evol u t i o n h a d been tried befo re H olland wit h ex t remely poor results. All o f these were based o n t he 'mutation a nd natural selection' model of evo l ution. H o l l and's genetic algorithms are based o n a 'sexual reprod u c t i o n and selection' model: h is pri ncipal opera t o r is crossing­ over, t h a t is, t he creat ing of a new object fo r t he next genera t ion by combi n i n g parts fro m t wo i n dependent o bjects i n the current generation. M u t a t i o n plays a m i no r role i n gene t i c algorit hms. M any ex periments have been done wit h genetic algori t h ms, a n d they have proved t o be remark ably effective and robust learn ing systems. for t he most part they have been tested as fu nction optim izers, where t he objects i n a generation are 'num bers' a n d thei r s u rviva l - reproducti ve value i s given by the funct i o n wh ose maximum we wish t o fi nd. One o f the most interesting aspects o f genetic algori th m s is t h a t they n o t o nly find the optimum object, but i n doing so t hey d iscover properties that a re common to m a ny near-o ptimal o bjects (so-called higher-ord e r schemat a). In some i nstances, this info rmat i o n is at l east as valuable as t h e optimum itself. As the name 'genetic algori thm' suggests, the i ns p i ra t i o n fo r H o l l a nd's work is taken from an a nalogy with biological systems. The mathemat ics of genetic evol u t i o n is now a very sophist icated tool wh ich has changed our percept i o n of how the evol utionary p rocess works. for exam ple, i t is now known t h a t s i m ple m u tation alone is insufficient t o ex plain the rate o f biological adaptat ion. I ns tead, mutation p lays t h e role of backgro u nd 'noise' which, by occasional random perturbation, p revents a specie from becom i n g frozen a t a l ocal optim um. Other fact o rs expla i n t h e rapid rate of ada ptation. H olland const ructs adaptive plan programs based o n the fo llowing basi c ideas. We are given a set, A , of 'struct u res' wh ich we can t h i n k of in

204

Speech

,.ecog llitioll

the fi rst i nst a nce as b e i n g a set of strings of fixed length, I say . The object o f the a d a p tive sea rch is to find a structure which performs wel l in terms of a meas u re of performance: v:

A

-+

real numbers � O.

We have so far a k n owledge base of competing struct u res a n d measure v of the o b s e r v e d perfo rmance of ge ne ra t e d st ruct u res. For e xa m p l e , if the problem were o ne of function o p t i mization the st ruct u res, or s t r i n g s, could be the binary expansion of a real numbe r to some fi xed n umber of p l a ce s, and the fu nction v could be the fu ncti o n to be m a x i mized . Then v eva l uated at the real nu mber represented by a s t r i n g would be a m e a s u re of t h e string's fi t ness t o survive. Representing strings as

a( 1 ) a(2) a(3) . . . a(l)

(aU)

=

1 or 0),

we can des i g n a t e su b-sets of A which have attri b u t es in c o m m o n , these for 'don't care' in o n e or more positions.

are c a l l e d schemata, by using '.' For e xampl e ,

a( 1 ) * a(3) * * . . . * represents t he schemata of all strings with fi rs t element (/( 1 ) and t h i r d element (/(3), all other el e me n t s being a rbi tra ry. Thus any particular string of l e n g th I is a n i ns t a n c e of 2' schemata. If J is o nl y about 20 this is s t i l l o v e r a mi l l i on schemata. An eva l u a t i o n of j ust one s t r i n g therefore yields i n fo rm a ti o n a b o u t a large number of schemata. The next in g red ie n t s of H olland's m o d el are the ope r a t o rs by w h i c h stri n gs are combi ned to prod uce new stri ngs. I t is t he ch o i c e of t h e s e opera t o rs w h ich prod uces a search st rategy t h a t ex p l o i t s c o - a d a p t e d sets of s t ruct u ral components al ready d iscovered. The t h ree p ri n c ipal opera t o rs used by H olland are crosso ver, inversion, a nd mutation. Crossover

Proceeds in three steps:

( 1 ) Two structures a{ l ) . . . a(l) and b( 1 ) . . . b(l) are s el ect ed at random fro m the current populati o n. ( 2 ) A crossover p oi n t x, i n the range 1 to / - 1 is selected, again a t random. (3) T w o n e w st ructures:

a(1)a(2) . . . a(x) b(x + l ) b(x + 2) . . . b(l) b( l ) b(2) . . . b(x) a(x + l ) a(x + 2) . . . a(l) are formed.

In m odifying the

pool of schemata, c ro s si n g over con tinuall y 205

A.

Badii et ai.

int rod uces ne w schem ata for t rial whilst t es ti n g extant schemata in new co nte x t s . It can be shown that each c ro s s i n g ove r affects a great n um b er of schem ata.

Inversion

For some ran d o m l y t ransfo rmation:

selected p osi t i on s

x

Suggest Documents