An Algorithm for Spoken Sentence Recognition and Its ... - IEEE Xplore


CORRESPONDENCE

III. CONCLUSION The applicability of fuzzy algebra to the study of fuzzy chains has been introduced. Program correctness techniques were used to certify the main algorithm. The main contribution of this note consists of two parts. First, a new conceptual framework for the study of fuzzy systems is provided, facilitating the derivation and stimulating the discovery of various results in applied areas. Second, a proof of correctness of the main algorithm is given. This technique for certifying algorithms shows conclusively that no errors exist, in contrast to the usual technique of testing, which can only show that no errors have been found in a certain number of trial runs [27]. Several problem-oriented examples have been mentioned in the introduction, and it is our hope that the interested reader will be able to find many more applications in his field of interest.

ACKNOWLEDGMENT The authors express sincere thanks to the referees of this paper for their excellent remarks and criticism, the effect of which on this work has been profound.

REFERENCES

[1] L. A. Zadeh, "Fuzzy sets," Inform. Contr., vol. 8, pp. 338-353, 1965.
[2] L. A. Zadeh, "Fuzzy algorithms," Inform. Contr., vol. 12, pp. 94-102, 1968.
[3] L. A. Zadeh, "Fuzzy sets and systems," in 1965 Proc. Symp. on Systems Theory, Polytechnic Institute of Brooklyn, Brooklyn, N.Y., 1965.
[4] L. A. Zadeh, "Quantitative fuzzy semantics," Electron. Res. Lab., Univ. Calif., Berkeley, Memo no. ERL-M281, Aug. 1970.
[5] L. A. Zadeh, "Similarity relations and fuzzy ordering," Electron. Res. Lab., Univ. Calif., Berkeley, Memo no. ERL-M277, 1970.
[6] L. A. Zadeh, "Toward a theory of fuzzy systems," Electron. Res. Lab., Univ. Calif., Berkeley, Rep. no. ERL-69-2, June 1969.
[7] L. A. Zadeh, "Fuzzy languages and their relation to human and machine intelligence," in Proc. Conf. Man and Computer, 1970; also Electron. Res. Lab., Univ. Calif., Berkeley, Memo M302, 1971.
[8] L. A. Zadeh, "Probability measures of fuzzy events," J. Math. Anal. Appl., vol. 10, pp. 421-427, Aug. 1968.
[9] S. K. Chang, "On the execution of fuzzy programs using finite-state machines," IEEE Trans. Comput., vol. C-21, pp. 241-253, Mar. 1972.
[10] S. S. L. Chang and L. A. Zadeh, "On fuzzy mapping and control," IEEE Trans. Syst., Man, Cybern., vol. SMC-2, pp. 30-34, Jan. 1972.
[11] R. E. Bellman and L. A. Zadeh, "Decision-making in a fuzzy environment," Management Sci., vol. 17, pp. B-141-B-164, 1970.
[12] E. T. Lee and L. A. Zadeh, "Note on fuzzy languages," Electron. Res. Lab., Univ. Calif., Berkeley, ERL Rep. 69-7, Nov. 1969.
[13] P. N. Marinos, "Fuzzy logic and its application to switching systems," IEEE Trans. Comput., vol. C-18, pp. 343-348, Apr. 1969.
[14] R. C. T. Lee and C. L. Chang, "Some properties of fuzzy logic," Inform. Contr., vol. 19, pp. 417-431, 1971.
[15] P. Siy and C. S. Chen, "Minimization of fuzzy functions," IEEE Trans. Comput., vol. C-21, pp. 100-102, Jan. 1972.
[16] A. De Luca and S. Termini, "A definition of a non-probabilistic entropy in the setting of fuzzy sets theory," Inform. Contr., vol. 20, pp. 301-312, 1972.
[17] R. C. T. Lee, "Fuzzy logic and the resolution principle," J. Assoc. Comput. Mach., vol. 19, pp. 109-119, 1972.
[18] C. L. Chang, "Fuzzy algebras, fuzzy functions and their application to function approximation," Division of Computer Research and Technology, National Institutes of Health, Bethesda, Md., 1971.
[19] P. P. Preparata and R. T. Yeh, "Continuously valued logic," J. Comput. Syst. Sci., pp. 397-418, 1972.
[20] A. De Luca and S. Termini, "Algebraic properties of fuzzy sets," to be published.
[21] A. Kandel, "Comment on the minimization of fuzzy functions," IEEE Trans. Comput., vol. C-22, p. 217, Feb. 1973.
[22] A. Kandel, "Comment on an algorithm that generates fuzzy prime implicants, by Lee and Chang," Inform. Contr., pp. 279-282, Apr. 1973.
[23] A. Kandel, "On the analysis of fuzzy logic," in Proc. Sixth Int. Conf. on System Sciences, Honolulu, Hawaii, Jan. 1973.
[24] A. Kandel, "Application of fuzzy logic to the detection of static hazards in combinational switching systems," New Mexico Inst. of Mining & Technology, Socorro, N. M., Comput. Sci. Rep. 122, Apr. 1973.
[25] A. Kandel, "On minimization of fuzzy functions," IEEE Trans. Comput., vol. C-22, pp. 826-832, Sept. 1973.
[26] L. A. Zadeh, "Outline of a new approach to the analysis of complex systems and decision processes," IEEE Trans. Syst., Man, Cybern., vol. SMC-3, pp. 28-44, Jan. 1973.
[27] R. L. London, "Proving programs correct: Some techniques and examples," BIT, vol. 10, pp. 168-182, 1970.
[28] S. Warshall, "A theorem on Boolean matrices," J. Ass. Comput. Mach., vol. 9, pp. 11-12, Jan. 1962.

An Algorithm for Spoken Sentence Recognition and Its Application to the Speech Input-Output System

KATSUHIKO SHIRAI AND HIROMICHI FUJISAWA

Abstract-An algorithm for spoken sentence recognition is described. The problem of sentence recognition is mathematically formulated as an optimization problem with the constraint of sentence structure. It is solved by a dynamic programming technique. The algorithm presented has advantages not only in that the solution is optimal in the Bayesian sense, but also that the effective number of words that affects the recognition score is reduced, the end of a sentence is automatically detected, and a sentence that is logically invalid can be rejected. The algorithm was applied to a practical situation as a speech command recognition and vocal response system. It recognizes speech command sentences and responds in voice to the operator. The vocabulary of conversation between the operator and the machine is limited, but flexibility in the conversational style is allowed. The system that was built utilizes a minicomputer with an eight kiloword memory capacity, a hardware feature extractor for speech recognition, and a hardware speech synthesizer for vocal responses. If a larger computer is available, the system can be enlarged with only minor modifications.

I. INTRODUCTION

Many speech pattern recognition systems have been designed to classify spoken words [1]-[4], but few have been designed so as to treat spoken sentences. Strictly speaking, it is difficult to define what is recognition of a sentence or what is understanding of meaning. However, unless the system responds to a sentence or changes its internal state according to the meaning of the input sentence, it cannot be said that it recognizes the meaning. That is to say, a sentence recognition system is required to be more than a simple classification machine.

In this correspondence a method is presented for the design of a system that recognizes spoken sentences, makes vocal responses, and changes the related state. This method was applied to a conversational system, the Speech Input-Output System (SPIO) of the robot called WABOT-1 (Waseda Robot) [5]. It accepts Japanese spoken command sentences, which are strings of separately spoken words, responds to the meaning of the command in speech, and makes the robot move as commanded.

One of the most important factors in the design of such a system is that the machine and the operator have a common recognition of the situation or the scene that is talked about between them. Therefore, the concept of situation is introduced in terms of "states" as in automata. The state makes a transition after the recognition of an input sentence and simultaneously makes an output. Probable sentences that may appear under a state are limited, and thus the effective number of words (sentences) that affects the recognition score is reduced. Further, words in a sentence should be ordered in a restricted way, which is not necessarily grammatical. This is conveniently taken into account by the concept of sentence structure.

In a practical application, the purpose and the ability of the machine is always limited, and the contents of conversation can be finite. It follows that the problem of sentence recognition can be considered on the extension of a classification problem.

Another difficult problem lies in the recognition of the naturally spoken sentences [6]. This stems from the fact that they are continuous and the segmentation becomes necessary. In the

Manuscript received August 20, 1973; revised April 1, 1974. This work was conducted under the Wabot Project, Waseda University, Tokyo, Japan. The authors are with the Department of Electrical Engineering, Waseda University, Tokyo 160, Japan.

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, SEPTEMBER 1974

TABLE I
VOCABULARY OF THE SPIO SYSTEM

Symbol  Part of speech  Words w_ij
p1      VB1             tomare, hajime, mate (halt, begin, wait)
p2      VB2             susume, maware (walk, turn)
p3      AD1             migie, hidarie (to right, to left)
p4      AD2             ippo, niho, sampo (1 step, 2 steps, 3 steps)

present state of the art, it is difficult to segment a sentence into words automatically, and in this study the speaker is required to utter words in isolation. Therefore, the sentence recognition problem is essentially a concatenation of word recognitions. It is, however, not a simple concatenation, because the classification of words should not be independent of the sentence structure and the state, in order to make sense as a whole. Such a sentence recognition method will be discussed in the following section.

II. FORMULATION OF THE SENTENCE RECOGNITION PROBLEM

In this section the sentence recognition problem will be given a mathematical formulation. First, a set of parts of speech $P = \{p_i \mid i = 1, 2, \ldots, m\}$ is introduced. The concept of a part of speech is almost the same as the linguistic one, except that it is defined so as to include its meaning in addition to its role in a sentence, so that no part of speech appears more than once in the same sentence. This condition is necessary for the use of dynamic programming in the algorithm, which will be described later. As an example, the vocabulary of the SPIO system is tabulated in Table I, together with the parts of speech defined for it, where VB1 ($p_1$) is a set of verbs without adverbs or complements, VB2 ($p_2$) is a set of verbs with an adverb or a complement or both, AD1 ($p_3$) is a set of adverbs concerning direction, and AD2 ($p_4$) is a set of complements specifying a distance. The $j$th word of the $i$th part of speech $p_i$ is denoted by $w_{ij}$, and the set of words in $p_i$ is denoted by $W_i = \{w_{ij} \mid j = 1, 2, \ldots, N_i\}$. Then the vocabulary of the system is $\bigcup_{i=1}^{m} W_i$.

Second, sentence structures are defined from how the $p_i$ are arranged in a sentence. The $\alpha$th sentence structure $S_\alpha$ is defined as

$$S_\alpha = \{W_{i(\alpha,1)}, W_{i(\alpha,2)}, \ldots, W_{i(\alpha,L_\alpha)}\} \tag{1}$$

where the integer number $i(\alpha,h)$ indicates that the $h$th word of the $\alpha$th sentence structure is a member of $p_{i(\alpha,h)}$, and $L_\alpha$ is the length of a sentence in $S_\alpha$. For example, $S_4$ in the SPIO system is $\{W_3, W_4, W_2\}$, and $L_4 = 3$, as shown in Table II, in which the sentence structures and the admissible command sentences defined in the SPIO system are tabulated.

TABLE II
SENTENCE STRUCTURES DEFINED IN THE SPIO SYSTEM

Structure  Parts of speech  Sentences
S1         VB1              s11 Tomare; s12 Hajime; s13 Mate
S2         AD1-VB2          s21 Migie maware; s22 Hidarie maware
S3         AD2-VB2          s31 Ippo susume; s32 Niho susume; s33 Sampo susume
S4         AD1-AD2-VB2      s41 Migie ippo susume; s42 Hidarie ippo susume; s43 Migie niho susume; s44 Hidarie niho susume; s45 Migie sampo susume; s46 Hidarie sampo susume
S5         AD2-AD1-VB2      s51 Ippo migie susume; s52 Ippo hidarie susume; s53 Niho migie susume; s54 Niho hidarie susume; s55 Sampo migie susume; s56 Sampo hidarie susume
S6         VB2              s61 Maware; s62 Susume
S7         AD1              s71 Migie; s72 Hidarie
S8         AD2              s81 Ippo; s82 Niho; s83 Sampo

The set of sentences having sentence structure $S_\alpha$ is also written as $S_\alpha$. The $\beta$th member can be expressed as

$$s_{\alpha\beta} = \{w_{i(\alpha,1)j(\alpha,\beta,1)}, w_{i(\alpha,2)j(\alpha,\beta,2)}, \ldots, w_{i(\alpha,L_\alpha)j(\alpha,\beta,L_\alpha)}\} \tag{2}$$

where the integer number $j(\alpha,\beta,h)$ indicates that the $h$th word of the $\beta$th sentence in $S_\alpha$ is the $j(\alpha,\beta,h)$th member of $p_{i(\alpha,h)}$. For example, $s_{52}$ in the SPIO system is $\{w_{41}, w_{32}, w_{21}\}$ for "Ippo hidarie susume." The set of all the possible sentences is $S = \bigcup_{\alpha} S_\alpha$, where $S_\alpha = \{s_{\alpha\beta} \mid \beta = 1, 2, \ldots, n_\alpha\}$.

A sentence spoken to the system can be simply written as $\{w_{i(1)j(1)}, w_{i(2)j(2)}, \ldots, w_{i(L)j(L)}\}$ without the knowledge of what sentence was uttered. If a valid sentence is uttered, the following relations must hold:

$$i(h) = i(\alpha,h), \qquad j(h) = j(\alpha,\beta,h), \qquad L = L_\alpha \tag{3}$$

for some $\alpha$ and $\beta$, and for $h = 1, 2, \ldots, L$. For simplicity the following abbreviations will be used:

$$\begin{aligned}
I(L) &= \{i(h) \mid h = 1, 2, \ldots, L\} \\
J(L) &= \{j(h) \mid h = 1, 2, \ldots, L\} \\
I_\alpha(L_\alpha) &= \{i(\alpha,h) \mid h = 1, 2, \ldots, L_\alpha\} \\
J_{\alpha\beta}(L_\alpha) &= \{j(\alpha,\beta,h) \mid h = 1, 2, \ldots, L_\alpha\}.
\end{aligned} \tag{4}$$

For example, $I_5(3) = \{4, 3, 2\}$ and $J_{52}(3) = \{1, 2, 1\}$ in the SPIO system, since $S_5 = \{W_4, W_3, W_2\}$ and $s_{52} = \{w_{41}, w_{32}, w_{21}\}$.

Thus the problem of sentence recognition is reduced to the determination of the sequences of integer numbers $I(L)$ and $J(L)$ that satisfy the constraint of (3). This must be done in some optimal way, which will be described in the following section.

Third, a set of states is considered, and it is denoted by $Z = \{z_\gamma \mid \gamma = 1, 2, \ldots, \Gamma\}$. The state makes a transition according to a command sentence, simultaneously making an output.

The output made at the state $z_\gamma$ after sentence $s_{\alpha\beta}$ is accepted is written as $R_{\alpha\beta}^{\gamma}$. If an input is too ambiguous or has an invalid sentence structure such as "Ippo niho maware" (turn 1 step 2 steps), the input is rejected and the response $R_{?}^{\gamma}$ is made, for instance, saying "Wakarimasen" (I do not understand). If there is no speech for a while, e.g., for 3 s, the response $R_{*}^{\gamma}$ is made, considering the input to be a null sentence. A state transition diagram can be drawn for the conversation between the human operator and the WABOT as in Fig. 1, which shows how each state is transferred by the input and what responses are made at each state according to the input.
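The formulation above can be made concrete with a small sketch. The following encodes the vocabulary of Table I and the sentence structures of Table II as plain Python dictionaries and checks constraint (3) for a pair of index sequences $I(L)$ and $J(L)$. The names `W`, `STRUCTURES`, `SENTENCES`, `find_sentence`, and `words` are illustrative assumptions and do not come from the SPIO implementation, which was written in assembler.

```python
# Illustrative encoding of Tables I and II; indices are 1-based as in the text.

# W[i] lists the words w_i1, w_i2, ... of part of speech p_i.
W = {
    1: ["tomare", "hajime", "mate"],   # VB1
    2: ["susume", "maware"],           # VB2
    3: ["migie", "hidarie"],           # AD1
    4: ["ippo", "niho", "sampo"],      # AD2
}

# STRUCTURES[a] is I_a(L_a): the part-of-speech index at each word position.
STRUCTURES = {
    1: (1,), 2: (3, 2), 3: (4, 2), 4: (3, 4, 2),
    5: (4, 3, 2), 6: (2,), 7: (3,), 8: (4,),
}

# SENTENCES[a][b - 1] is J_ab(L_a): the word index at each word position.
SENTENCES = {
    1: [(1,), (2,), (3,)],
    2: [(1, 2), (2, 2)],
    3: [(1, 1), (2, 1), (3, 1)],
    4: [(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1), (1, 3, 1), (2, 3, 1)],
    5: [(1, 1, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1), (3, 2, 1)],
    6: [(2,), (1,)],
    7: [(1,), (2,)],
    8: [(1,), (2,), (3,)],
}


def find_sentence(I, J):
    """Return (a, b) for which constraint (3) holds, or None if there is none."""
    for a, parts in STRUCTURES.items():
        if tuple(I) != parts:
            continue
        for b, js in enumerate(SENTENCES[a], start=1):
            if tuple(J) == js:
                return a, b
    return None


def words(a, b):
    """Spell out sentence s_ab as a list of words."""
    return [W[i][j - 1] for i, j in zip(STRUCTURES[a], SENTENCES[a][b - 1])]
```

With this encoding, `find_sentence((4, 3, 2), (1, 2, 1))` identifies $s_{52}$, and an utterance such as "Ippo niho maware" has $I(L) = (4, 4, 2)$, which matches no structure and is therefore invalid.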

TABLE III
VOCAL RESPONSE SENTENCES

P1   Tomarimasu (I halt)
P2   Hajimemasu (I start)
P3   Yamemasu (I wait)
P4   Migie mawarimasu (I turn to right)
P5   Hidarie mawarimasu (I turn to left)
P6   Ippo susumimasu (I walk 1 step)
P7   Niho susumimasu (I walk 2 steps)
P8   Sampo susumimasu (I walk 3 steps)
P9   Migie ippo susumimasu (I walk 1 step to right)
P10  Hidarie ippo susumimasu (I walk 1 step to left)
P11  Migie niho susumimasu (I walk 2 steps to right)
P12  Hidarie niho susumimasu (I walk 2 steps to left)
P13  Migie sampo susumimasu (I walk 3 steps to right)
P14  Hidarie sampo susumimasu (I walk 3 steps to left)
P15  Wakarimasen (I don't understand)
P16  Mo ichido dozo (Please repeat it)
P17  Meireio dozo (What is your order?)
P18  Dochirae desuka (To which side?)
P19  Nanipo desuka (How far?)
P20  makes a command code for the WABOT control program.
P21  sends the command code to the WABOT control program.

Fig. 1. Flow diagram for conversation with WABOT.

In Tables III and IV, responses $R_{\alpha\beta}^{\gamma}$ for each state and their concrete contents are tabulated. Possible conversation can be easily seen through these tables and Fig. 1.

III. SENTENCE RECOGNITION ALGORITHM

In this study, the Bayes decision rule is adopted since the feature vectors are generally stochastic [7]. Let the speech pattern representing one word be $X$, which is a random matrix in the SPIO system. A sentence can thus be represented by a string of these $X$ as $\mathcal{X} = \{X_1, X_2, \ldots, X_h, \ldots, X_L\}$. When a sequence $\mathcal{X}$ is observed in the state $z_\gamma$, sentence recognition is to determine $s_{\alpha\beta}$, i.e., $\alpha$ and $\beta$. The optimal determination of $s_{\alpha\beta}$, in the sense of minimum error, is that of maximizing the a posteriori probability $f(s_{\alpha\beta} \mid \mathcal{X}, z_\gamma)$. The expression $f(s_{\alpha\beta} \mid \mathcal{X}, z_\gamma)$ has the same meaning as $f(I(L), J(L) \mid \mathcal{X}, z_\gamma)$, where $I(L)$ and $J(L)$ satisfy (3). Therefore, the problem of sentence recognition can be seen as the optimization with respect to $I(L)$ and $J(L)$, with the constraint of sentence structure of (3). Then we have

$$f(s_{\alpha\beta} \mid \mathcal{X}, z_\gamma) = f(\alpha, J_{\alpha\beta}(L) \mid \mathcal{X}, z_\gamma) = f(J_{\alpha\beta}(L) \mid \mathcal{X}, \alpha, z_\gamma)\, f(\alpha \mid \mathcal{X}, z_\gamma). \tag{5}$$

Consider the sentence structure $S_4$ of the SPIO system. The first word (migie or hidarie) and the second word (ippo, niho, or sampo), i.e., $j(1)$ and $j(2)$ for $S_4$, are statistically independent, and $j(3)$ for $S_4$ is always 1, or "susume." Therefore, if the categories of sentence structure are appropriately chosen, it is possible to assume that the numbers $j(h)$ $(h = 1, 2, \ldots, L_\alpha)$ for a fixed $\alpha$ are statistically independent. This fact means that the strong mutual dependence of words in a sentence can be absorbed in the sentence structure. Accordingly, the following relation can be assumed to hold:

$$f(J_{\alpha\beta}(L) \mid \alpha, \mathcal{X}, z_\gamma) = \prod_{h=1}^{L} f(j(\alpha,\beta,h) \mid \alpha, X_h, z_\gamma).$$

Furthermore, $f(j(\alpha,\beta,h) \mid \alpha, X_h, z_\gamma)$ and $f(\alpha \mid \mathcal{X}, z_\gamma)$ can be rewritten as follows by using the Bayes theorem:

$$f(j(\alpha,\beta,h) \mid \alpha, X_h, z_\gamma) = \frac{f(X_h \mid j(\alpha,\beta,h), \alpha, z_\gamma)\, f(j(\alpha,\beta,h) \mid \alpha, z_\gamma)}{f(X_h \mid \alpha, z_\gamma)}$$

$$f(\alpha \mid \mathcal{X}, z_\gamma) = \frac{\left\{\prod_{h=1}^{L} f(X_h \mid \alpha, z_\gamma)\right\} f(\alpha \mid z_\gamma)}{f(\mathcal{X} \mid z_\gamma)}.$$
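As a sketch of the intermediate step, the logarithm of (5) can be expanded by substituting the product form for $f(J_{\alpha\beta}(L) \mid \alpha, \mathcal{X}, z_\gamma)$ and the two Bayes expressions of Section III. The expansion assumes that $f(\mathcal{X} \mid \alpha, z_\gamma)$ factors into $\prod_h f(X_h \mid \alpha, z_\gamma)$, so that the per-word denominators cancel; the symbol $\mathcal{X}$ for the observed word string is a notational choice made here.

```latex
\begin{aligned}
\log f(s_{\alpha\beta} \mid \mathcal{X}, z_\gamma)
&= \sum_{h=1}^{L} \log f(j(\alpha,\beta,h) \mid \alpha, X_h, z_\gamma)
   + \log f(\alpha \mid \mathcal{X}, z_\gamma) \\
&= \sum_{h=1}^{L} \Bigl[ \log f(X_h \mid j(\alpha,\beta,h), \alpha, z_\gamma)
   + \log f(j(\alpha,\beta,h) \mid \alpha, z_\gamma)
   - \log f(X_h \mid \alpha, z_\gamma) \Bigr] \\
&\quad + \sum_{h=1}^{L} \log f(X_h \mid \alpha, z_\gamma)
   + \log f(\alpha \mid z_\gamma) - \log f(\mathcal{X} \mid z_\gamma) \\
&= \sum_{h=1}^{L} \Bigl[ \log f(X_h \mid j(\alpha,\beta,h), \alpha, z_\gamma)
   + \log f(j(\alpha,\beta,h) \mid \alpha, z_\gamma) \Bigr]
   + \log f(\alpha \mid z_\gamma) - \log f(\mathcal{X} \mid z_\gamma).
\end{aligned}
```

The last term does not depend on $\alpha$ or $\beta$, so it plays no part in the maximization.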

TABLE IV RESPONSES MADE AT EACH STATE

R221 R3211

R31 1 R R31

R331

R41 R4

R431 R441 R45 1 R4

R51 R521 R531

R54 1 R55 1

R56 R61

R62 1

P4

R 122

PS

R 132

P20 P20 P6 P20 P7 P20 P8 P20 P9 P20 PlO P20 P1 1 P20 P12 P20 P13 P20

P14 P2O

Pg9 20 Plo P20 Pl1 P20 P12 P20 P13 P20 P14 P20 P18 Pl9 P17 p15 P16

R?2

-7 R 723

R* 3

R?73

R814 R 824

R 834

R?4

R?5

R*s5

P3

P21 P17

PIS P2 P21

R *2

R*,4

P2

P5

P20 P20

pi5

p18

P4

P18 P6

P20

P7

P20

P8

P20

Pl9

PIS Pl9

Pi

P17

Pi P17 P17

.~~~~

f(j(ac,/,h) a,XhX,z) =

(Please repeat it)

Y',zY)

h=1

f(Y ~Z~)

R*j R7


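Hence, by taking the logarithm of (5) and discarding the common term $\log f(\mathcal{X} \mid z_\gamma)$, the objective function to be maximized becomes

$$D(\alpha, \beta \mid \mathcal{X}, z_\gamma) = \sum_{h=1}^{L} d_{i(\alpha,h)j(\alpha,\beta,h)}(X_h \mid z_\gamma) + \log f(\alpha \mid z_\gamma) \tag{6}$$

$$d_{i(\alpha,h)j(\alpha,\beta,h)}(X_h \mid z_\gamma) = \log f(X_h \mid w_{i(\alpha,h)j(\alpha,\beta,h)}) + \log f(w_{i(\alpha,h)j(\alpha,\beta,h)} \mid \alpha, z_\gamma). \tag{7}$$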

The first term of (7) is usually used for a simple word recognition machine as a discriminant function when the a priori probability of $w_{ij}$ is uniform. The second terms of (6) and (7) carry information on the meaning or the context and the situation. In the sequel, $D(\alpha, J(L) \mid \mathcal{X}, z_\gamma)$ is optimized instead of $D(\alpha, \beta \mid \mathcal{X}, z_\gamma)$, which is defined as follows:

$$D(\alpha, J(L) \mid \mathcal{X}, z_\gamma) = \sum_{h=1}^{L} d_{i(\alpha,h)j(h)}(X_h \mid z_\gamma) + \log f(\alpha \mid z_\gamma). \tag{8}$$

In this expression the constraint about $\beta$ is neglected, but the optimal solution can be obtained by using dynamic programming. If the optimal $\alpha^*$ and $J^*(L)$ of $D(\alpha, J(L) \mid \mathcal{X}, z_\gamma)$ satisfy (3) for some $\beta^*$, $s_{\alpha^*\beta^*}$ is the truly optimal solution that optimizes $D(\alpha, \beta \mid \mathcal{X}, z_\gamma)$. Then, if every combination of the $j(h)$ $(h = 1, 2, \ldots, L_\alpha)$ for every $\alpha$ constitutes an admissible sentence $s_{\alpha\beta}$, the optimization of $D(\alpha, \beta \mid \mathcal{X}, z_\gamma)$ is equivalent to that of $D(\alpha, J(L) \mid \mathcal{X}, z_\gamma)$. However, when $\beta^*$ cannot be found for the $J^*(L)$, the input sentence is rejected without searching the optimal result, because there are few inadmissible combinations of the $j(h)$.

Now, let the maximum value of (8) for a fixed $\alpha$ at the $h$th stage be

$$A_h(\alpha) = \max_{J(h)} D(\alpha, J(h) \mid \mathcal{X}, z_\gamma). \tag{9}$$

By using this notation the following recursive formula is derived:

$$A_h(\alpha) = \max_{j(h)} \left\{ d_{i(\alpha,h)j(h)}(X_h \mid z_\gamma) + A_{h-1}(\alpha) \right\} \tag{10}$$

where $A_0(\alpha) = \log f(\alpha \mid z_\gamma)$. Hence, beginning with $h = 1$, the optimal $J^*(L)$ for every $\alpha$ that satisfies $L_\alpha \ge L$ can be obtained sequentially by using the recursive formula, until the end of the sentence is detected; then $\alpha^*$ that maximizes $A_L(\alpha)$ is obtained. The solution $\alpha^*$ and $J^*(L)$ are represented as

$$\alpha^* = \arg \left[ \max_{\alpha} A_L(\alpha) \right] \tag{11}$$

$$J^*(L) = \arg \left[ \max_{J(L)} D(\alpha^*, J(L) \mid \mathcal{X}, z_\gamma) \right]. \tag{12}$$
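A minimal sketch of the recursion (10) followed by the selection in (11) and (12) is given below. It assumes that the discriminant values $d_{i(\alpha,h)j}(X_h \mid z_\gamma)$ have already been computed and that every candidate structure has the same length as the utterance; the names `recurse`, `recognize`, `d`, and `log_prior` are hypothetical, and the numbers are invented for illustration only.

```python
import math


def recurse(prev, d_h):
    """One step of (10): add the best word score at stage h to A_{h-1}.

    d_h is the list of discriminant values over j for the current word.
    Returns (A_h, j*) with j* 1-based.
    """
    j_star = max(range(len(d_h)), key=lambda j: d_h[j])
    return prev + d_h[j_star], j_star + 1


def recognize(d, log_prior):
    """Apply (10) stage by stage for each structure, then pick alpha* as in (11).

    d[a] is a list over h of the score lists for structure a;
    log_prior[a] is A_0(a) = log f(a | z).
    Returns (A_L(alpha*), alpha*, J*(L)).
    """
    best = None
    for a, stages in d.items():
        score, path = log_prior[a], []
        for d_h in stages:
            score, j = recurse(score, d_h)
            path.append(j)
        if best is None or score > best[0]:
            best = (score, a, tuple(path))
    return best


# Invented scores for two two-word structures with equal priors.
d = {
    2: [[-1.0, -3.0], [-2.5, -0.5]],
    3: [[-4.0, -2.0, -6.0], [-0.2, -3.0]],
}
log_prior = {2: math.log(0.5), 3: math.log(0.5)}
score, a_star, J_star = recognize(d, log_prior)
```

Because (8) is a sum of per-word terms plus a term that depends only on $\alpha$, maximizing each stage separately gives the same result as maximizing over the whole sequence $J(L)$, which is what makes the stagewise recursion valid.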

This calculation process is dynamic programming [8]. This method makes it possible to detect the end of a sentence automatically, i.e., to determine the length of a sentence $L$. The probability that the $(h+1)$th word will be spoken can be evaluated from the values of (8) at the $h$th stage. If $L_{\alpha^*} \ge h + 1$ for the optimal $\alpha^*$ at the $h$th stage, it is probable that the $(h+1)$th word will be spoken. Therefore, provided that there is no inclusive relation such as $S_1 = \{W_1, W_2\}$ and $S_2 = \{W_1, W_2, W_3\}$, it is quite natural to conclude that the sentence has ended if at the $h$th stage the optimal $\alpha^*$ gives the relation $h = L_{\alpha^*}$. This function is very advantageous because it does not require the speaker to send a signal to the system showing the end; therefore, it does not degrade the good points of using the speech input and the vocal response. The sentence recognition algorithm can be summarized as follows.

Sentence Recognition Algorithm

Let us assume that the $h$th word in a sentence has been spoken, and the speech pattern $X_h$ of it has been observed. Then proceed as follows.

1) Compute the discriminant function $d_{i(\alpha,h)j(h)}(X_h \mid z_\gamma)$ for all sentence structures $S_\alpha$ that satisfy $L_\alpha \ge h$ and for $j(h) = 1, 2, \ldots, N_{i(\alpha,h)}$ for each $S_\alpha$.

2) For the same $S_\alpha$ as in 1), compute $A_h(\alpha)$ from the old $A_{h-1}(\alpha)$ by using (10), and find $\alpha^*$ that gives the maximum of $A_h(\alpha)$.

3) Check whether $h$ is equal to $L_{\alpha^*}$. When $h \ne L_{\alpha^*}$, memorize the new $j^*(h)$ for the $S_\alpha$ that satisfy $L_\alpha \ge h + 1$, and continue the procedure expecting the new word $X_{h+1}$ to be spoken. When $h = L_{\alpha^*}$, decide that the input sentence structure is $S_{\alpha^*}$, and the sentence is $s_{\alpha^*\beta^*}$, if $\beta^*$ can be found that satisfies $J^*(L_{\alpha^*}) = J_{\alpha^*\beta^*}(L_{\alpha^*})$; otherwise the input sentence is rejected.
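The three steps above can be sketched as a word-by-word recognizer. This is an illustration under stated assumptions rather than the SPIO code: the class name `SentenceRecognizer`, its `step` method, and the per-word score dictionaries in `UTTERANCE` are hypothetical, the scores are invented, and only structures $S_3$, $S_4$, and $S_5$ with equal priors are taken to be active in the current state.

```python
import math


class SentenceRecognizer:
    def __init__(self, structures, sentences, log_prior):
        self.structures = structures           # a -> part-of-speech index per position
        self.sentences = sentences             # a -> list of admissible J tuples
        self.score = dict(log_prior)           # A_0(a) = log f(a | z)
        self.path = {a: [] for a in structures}
        self.h = 0

    def step(self, d_h):
        """Process word h. d_h[i] lists the discriminant values over j for
        part of speech i. Returns None while more words are expected,
        otherwise ('accept', a, b) or ('reject', a, J)."""
        self.h += 1
        h = self.h
        # Step 1: keep only structures long enough to contain word h.
        self.score = {a: s for a, s in self.score.items()
                      if len(self.structures[a]) >= h}
        # Step 2: recursion (10) for each surviving structure.
        for a in self.score:
            scores = d_h[self.structures[a][h - 1]]
            j = max(range(len(scores)), key=scores.__getitem__)
            self.score[a] += scores[j]
            self.path[a].append(j + 1)
        a_star = max(self.score, key=self.score.get)
        # Step 3: the sentence ends when h equals the length of the best structure.
        if h < len(self.structures[a_star]):
            return None
        J = tuple(self.path[a_star])
        if J in self.sentences[a_star]:
            return ("accept", a_star, self.sentences[a_star].index(J) + 1)
        return ("reject", a_star, J)


STRUCTURES = {3: (4, 2), 4: (3, 4, 2), 5: (4, 3, 2)}
SENTENCES = {
    3: [(1, 1), (2, 1), (3, 1)],
    4: [(1, 1, 1), (2, 1, 1), (1, 2, 1), (2, 2, 1), (1, 3, 1), (2, 3, 1)],
    5: [(1, 1, 1), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 1), (3, 2, 1)],
}
LOG_PRIOR = {a: math.log(1 / 3) for a in STRUCTURES}

# Invented scores for the utterance "niho hidarie susume".
UTTERANCE = [
    {2: [-6.0, -6.0], 3: [-5.0, -6.0], 4: [-3.0, -0.2, -4.0]},
    {2: [-5.0, -5.5], 3: [-4.0, -0.3], 4: [-6.0, -6.0, -6.0]},
    {2: [-0.1, -4.0], 3: [-6.0, -6.0], 4: [-6.0, -6.0, -6.0]},
]

recognizer = SentenceRecognizer(STRUCTURES, SENTENCES, LOG_PRIOR)
result = None
for d_h in UTTERANCE:
    result = recognizer.step(d_h)
    if result is not None:
        break
```

In this run the recognizer keeps listening after the first two words because the best structure is longer than the number of words heard so far, and it stops by itself after the third word without any explicit end-of-sentence signal.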

IV. THE SPIO SYSTEM

The algorithm was applied to the SPIO system of the WABOT-1 [5]. The most important points in such a system, which serves as a vocal command input system to an automatic machine, are that it should not make a mistake as a whole, and that it should have as much flexibility as possible in conversational style.

The first point is resolved by arranging the system flow such that, after the system understands a command sentence, the operator can cancel it, if necessary, on hearing the vocal response telling him how it was recognized. In the state $z_2$ in Fig. 1, the only probable sentences are "hajime" (begin) and "mate" (wait), so that an erroneous decision can hardly occur. In this example, the auditory feedback by the vocal response has only the meaning of confirmation, but it has great utility in making possible an actual conversation with a machine.

The plan for the second point is to provide enough sentence structures so that the operator is able to speak almost freely to command the system. In Japanese, AD1-AD2-VB2 and AD2-AD1-VB2 are both admissible, so both are made admissible by providing the sentence structures $S_4$ and $S_5$ to maintain flexibility in the word order. It is probable that the operator himself makes a mistake, for example, saying only "susume" (walk). In this case the command is incomplete, and the system is programmed so as to ask the question "How far?" As for flexibility, sentences without a verb, e.g., "to right 2 steps," are considered to make sense, and such commands can be correctly accepted, although they are exceptions.

The system consists of a minicomputer with a memory capacity of 8 kilowords and a cycle time of 1.4 μs, the speech processor with the hardware feature extractor for speech recognition, and the terminal-analog speech synthesizer for the vocal response. The speech recognizer has three parts: feature extraction, phoneme recognition, and word recognition [9]. The formant information to discriminate vowels is extracted from zero-crossing intervals of the prefiltered speech waves. Two learning functions are allowed to adapt the system to different speakers; one is for vowel parameters in the phoneme recognition program and the other is for words. The phoneme recognition program generates a string of phonemes from the features sampled by the computer every 10 ms. The system can recognize 10 phonemes: five Japanese vowels, /r,n,m,N/ as one group, the unvoiced fricative consonant, the voiced fricative consonant, /p,t,k/ as one group, and a silence. The string of the phonemes is fed into the word recognition program and the phoneme transition pattern matrix $X$ (10 × 10) is generated from the string. The discriminant function for words corresponding to the first term of (7) can be derived as (13) by assuming that the string constitutes a Markov chain:

$$\log f(X \mid w_{ij}) = \operatorname{trace} C_{ij}^{T} X \tag{13}$$

where the coefficient matrix $C_{ij}$ for the word $w_{ij}$ can be estimated through a statistical method [9]. Thus the vocabulary can be easily changed by learning.

A functional block diagram of the SPIO system is shown in Fig. 2. The vocal response is produced by controlling the synthesizer by the computer, in which the necessary tables for the response sentences, words for those sentences, and the Japanese syllables are prepared [5].

Fig. 2. Functional block diagram for SPIO system.

The software of the SPIO system is programmed in an assembler language and requires 8 kilowords: 1.25 kilowords for basic routines, 4 kilowords for speech recognition (1 kiloword for the word parameters, 0.5 kiloword for a temporary memory, and 2.5 kilowords for instructions), 2.25 kilowords for speech synthesis, and 0.5 kiloword for leg control. Every function is written in the form of a subroutine and only about 300 steps are necessary for the main program that realizes the conversation shown in Fig. 1. Silence required between words is less than 1 s including computation time for sentence recognition. Since a 350 ms silence is required to detect the word end, the computation time of sentence recognition for one stage is shorter than 650 ms. The vocal response is put out within about 1 s after the end of the utterance of a command sentence.

V. CONCLUSION

The problem of spoken sentence recognition in a limited situation was discussed. The problem was mathematically formulated as an optimization with the constraint of sentence structure. The solution, which is optimal in the Bayesian sense, can be obtained by the algorithm using dynamic programming. The algorithm has the advantage that the end of a sentence can be automatically detected. The algorithm was applied to the SPIO system for the WABOT-1. The system has moved in real time and in a reasonable manner. Although this example is rather small, the method is general and directly applicable to larger systems.

ACKNOWLEDGMENT

The authors are very grateful to Professor Ichiro Kato and to the members of the Bio-Engineering Group of Waseda University, Tokyo, Japan.

REFERENCES

[1] R. W. A. Scarr, "Word recognition machine," Proc. Inst. Elec. Eng., vol. 117, no. 1, pp. 203-212, Jan. 1970.
[2] T. G. Keller, "An on-line recognition system for spoken digits," J. Acoust. Soc. Amer., vol. 49, no. 4, pp. 1288-1296, 1971.
[3] L. C. Pols, "Real-time recognition of spoken words," IEEE Trans. Comput., vol. C-20, pp. 972-978, Sept. 1971.
[4] L. Clapper, "Automatic word recognition," IEEE Spectrum, vol. 8, pp. 57-69, Aug. 1972.
[5] K. Shirai, H. Fujisawa, and Y. Sakai, "Ear and voice of the WABOT-The speech input/output system," Bull. Sci. and Eng. Res. Lab., Waseda Univ., Tokyo, Japan, Special issue on WABOT, pp. 39-85, July 1973.
[6] D. H. Klatt and K. N. Stevens, "On the automatic recognition of continuous speech: Implications from a spectrogram-reading experiment," IEEE Trans. Audio Electroacoust., vol. AU-21, pp. 210-217, June 1973.
[7] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1972.
[8] G. Hadley, Nonlinear and Dynamic Programming. Reading, Mass.: Addison-Wesley, 1964.
[9] K. Shirai and H. Fujisawa, "Spoken digit recognition through phoneme recognition" (in Japanese), J. Inst. Electron. Commun. Eng. Jap., vol. 57-D, no. 3, Mar. 1974.
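The word discriminant (13) of Section IV can be illustrated with a short sketch. It rests on two assumptions that the text does not spell out: that the transition pattern matrix $X$ holds counts of phoneme-to-phoneme transitions, and that $C_{ij}$ holds log transition probabilities of a first-order Markov chain for the word $w_{ij}$. The function names `transition_matrix` and `discriminant` are hypothetical, and the example uses three phoneme classes rather than ten to stay small.

```python
def transition_matrix(phonemes, n=10):
    """Count matrix X: X[a][b] is the number of transitions a -> b in the string.

    phonemes is a sequence of integer phoneme labels in range(n).
    """
    X = [[0] * n for _ in range(n)]
    for a, b in zip(phonemes, phonemes[1:]):
        X[a][b] += 1
    return X


def discriminant(C, X):
    """trace(C^T X), which equals the sum over a, b of C[a][b] * X[a][b]."""
    return sum(c * x for row_c, row_x in zip(C, X) for c, x in zip(row_c, row_x))


# Toy example with invented coefficients (assumed to be log probabilities).
X = transition_matrix([0, 1, 1, 2], n=3)
C = [
    [0.0, -1.0, 0.0],
    [0.0, -2.0, -3.0],
    [0.0, 0.0, 0.0],
]
value = discriminant(C, X)
```

Under these assumptions the trace is the log-likelihood of the observed transitions given the word model, which is why a single matrix product per word suffices.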

Weighted Adaptive Algorithms for Estimation of Gaussian Distribution Parameters

C. Y. CHANG, MEMBER, IEEE

Abstract-Two weighted adaptive algorithms are proposed for updating the estimates of the mean vector and the covariance matrix, respectively, in a multispectral pattern recognition system. To achieve computational efficiency, the auxiliary matrices have been utilized in the algorithm for covariance matrix updating. Enhancements in the performance accuracy of a multispectral processing system and extensions of the Gaussian maximum likelihood classification capabilities to larger scale surveys are the motivations in developing the algorithms presented herein.

INTRODUCTION

There are two parameters, the mean vector $\mu$ and the covariance matrix $\Sigma$, that completely specify the multivariate Gaussian distribution. However, in actual practice these two parameters are usually unknown and must be estimated from a sample of observations. In fact, statistical estimations of the mean vector and the covariance matrix play an important role in the Gaussian maximum likelihood classification for processing of remotely sensed multispectral scanner data [1]. More often than not, the statistical characteristics of data will undergo gradual changes due to the variations of spectral, spatial, and temporal conditions. This is particularly true in the context of large area and/or long duration surveys in earth resources remote sensing applications. As a result, it is often necessary to have the distribution parameters periodically updated during the processing of multispectral remote sensing data. Clearly such an adaptation of variabilities of data will enhance the performance accuracy of a processing system and

Manuscript received November 28, 1972; revised March 20, 1974. This work was supported by NASA/JSC under Contract No. NAS 9-12200. The author is with the Aerospace Systems Division, Lockheed Electronics Company, Inc., Houston, Tex. 77058.