Chapter 11: Speech Recognition by Computer
Jared Bernstein, Horacio Franco

Chapter Outline

Introduction Current Applications Future Applications Technical Overview Functional Description and Relation to Speech Synthesis Internal Structure of Hidden Markov Model Speech Recognition Potential Intermediate Representations Task Types History from 1970 to 1990 Performance Specification Recognition System Design Signal Analysis Searching the Compiled Language Model The Compiled Language Model The Search Methods and Materials for Training Vector Codebook Selection Acoustic Model Training Characteristics of HMM Training for Continuous Speech Evaluation Applications Commercial Educational Components of Foreign Language Competence Live Instruction Language Education

Pronunciation Scoring Alignment Training Prosodic and Segmental Scorers Authoring Lessons and Tests Future Research Missing Resources Missing Science Fast Speaker Adaptation Source Separation Noise Immunity Language Modeling Detail Acuity Summary Key Terms

adaptation 414, analog signal 416, continuous speech 423, delta spectrum 419, digital signal 416, discrete Fourier transform 417, error rate 414, finite state machine 411, frame 420, hidden Markov model 411, Markov model 419, perplexity 414, speaker adaptation 431, speaker dependent system 413, speaker independent system 413, spectrum 411, syntactic parser 412, training data 423, user interface 412, vector codebook 417

INTRODUCTION

When someone says a machine can recognize speech, it means the machine can take in a speech signal and produce a sequence of words that fairly represents what was said. In other words, a speech recognizer is a device that converts an acoustic signal into another form, like writing, to be stored or used in some way. This ability seems basic to a person who is highly literate, but you can get an appreciation for the difficulty of the task by attempting to transcribe or take dictation in a language that you do not know. Taking dictation in a new language requires that you learn the vocabulary and the pronunciation of the language. Getting good at this new skill requires a lot of practice hearing the language spoken. It helps to know what people are likely to talk about, how they express themselves, and what are the common constructions and idioms. This same information must be learned by a machine if you expect it to recognize the words in a stream of spoken English. The machine must know some English words and how they are pronounced. Moreover, it will perform much more accurately if it has some information about common word sequences in English and how they are typically used in some domain of discourse. However, although computational methods for recognizing speech may now be sufficient to accomplish this task, the methods discussed in this chapter are probably not the best model for understanding speech perception or word recognition as performed by human listeners. Many human processes are demonstrably different; the human skills develop differently, they operate in a different context, they succeed in a different range of circumstances, and they fail in a different way.

The design and construction of systems that recognize speech are most often carried out by people trained in computer science or electrical engineering. Work is going on at many universities and at various commercial laboratories. In universities basic development is likely to take place in the department of computer science or electrical engineering, while new applications of speech recognition are developed in any department where someone sees a potential benefit. At one university or another, linguists, psychologists, speech scientists, aerospace engineers, biologists, and medical specialists are all seeking new ways to apply speech recognition. The basic development efforts may involve the cooperative efforts of people trained in computers, mathematics, linguistics, speech science, and psychology. The commercial efforts at speech recognition are generally found in small companies that specialize in speech technology and in the research and development laboratories of large companies involved with computers or telecommunication. For example, major efforts in speech recognition have been under way for more than 20 years at both IBM and AT&T. Until recently most of the work in these large laboratories has been focused on the core problem of getting a machine to identify the words in an acoustic stream. That capability improved significantly during the 1980s, and now these companies and others are building systems for the public that use speech recognition.

What are the current applications of speech recognition and what upcoming applications will justify the research investment by governments and industrial firms? We can briefly describe two current applications that suggest the limits and the promise of the technology: emergency medical reports and the handling of collect telephone calls. Future applications in education and in speech-language pathology and audiology will also be mentioned here and will be treated more fully at the end of the chapter.

CURRENT APPLICATIONS

An emergency room physician may see 36 patients in a 12-hour shift. Each patient's visit requires a report for the hospital and the insurer of the patient, and each report must include certain information. Many emergency room physicians dictate this information to a medical report system that uses speech recognition. The physician only speaks various phrases that fill in the blanks in a preset report template, and the process is faster and cheaper than conventional alternatives. Furthermore, the resulting reports are more satisfactory because they always include all the information that the various administrative channels need. The physician says the date and time of admission, date and time of injury, the nature of the injury, and what treatment, if any, was administered or recommended. Of the many features of this scenario, three are particularly significant. First, the speech to be recognized is not arbitrary: at certain points in the procedure the machine expects a date or an injury description or a treatment, and these utterances will come in a fixed format. Second, the users of these machines are few, and they soon gain experience with them. They have been trained in medical school to describe injuries and treatments in a particular stylized format. Finally, the medical vocabulary itself is mostly polysyllabic and distinct and is therefore relatively easy to recognize. There are many well-formatted applications, like emergency room reports, whose vocabulary is known and whose user population is small and controlled.

A more recent successful application accepts speech over the telephone. Handling collect telephone calls has been a major activity for thousands of telephone operators. New systems at AT&T and other telephone companies handle the collect call transaction by computer. Callers are asked to say their names; the machine records the name as it is spoken and then puts the call through to the intended recipient. The person who answers the call at the receiving end hears the recording of the caller's name and is asked by the machine to say yes to accept the charges. A speech recognition system decides whether the recipient has said yes and then either puts the call through or rejects it.

FUTURE APPLICATIONS

There are many future applications of speech recognition in education, speech-language pathology, and audiology. For example, imagine a machine that can teach spoken English to a recent immigrant who has only limited proficiency in English. Imagine a system that can automatically diagnose articulation disorders and measure progress during treatment. Imagine a system that can monitor and help a person who is learning to read or that engages students in a spoken dialogue in Spanish. Such a system is not like that found in a traditional language laboratory, where a student speaks onto tape; instead, the system hears whether the student's response is correct and judges how fluent the answer is and how well it was pronounced. All of these systems are now in operation in one laboratory or another.

Now, let us try to understand in more detail what speech recognition is and how it works. The potential and the limits of speech recognition for speech, hearing, and language research and treatment should be clearer if the nature of the technology is understood.

TECHNICAL OVERVIEW

Functional Description and Relation to Speech Synthesis

FIGURE 11-1 Text-to-speech synthesis and speech recognition shown as inverse processes.

Take a closer look at the functional (external) behavior of a speech recognition device. It accepts acoustic signals as input and produces sequences of words as output. This is the inverse function of a text-to-speech synthesizer as described in Chapter 7. Figure 11-1 shows a text-to-speech synthesizer and a speech recognizer in their simplest functional representation. The synthesis function is like a person reading aloud from a written text, and the recognition function is like a person taking dictation or transcribing a spoken message. If both systems worked perfectly, the operation of one could undo the operation of the other. That is, if a particular text were fed to a text-to-speech synthesis device and an ideal speech recognizer listened to the spoken output of the synthesizer, the recognizer would reproduce the original text.

FIGURE 11-2 Component processes in text-to-speech synthesis (text normalizer, text-to-phone conversion).

The internal structures of the synthesis device and the recognition device may be very different. The typical text-to-speech synthesizer implements a series of processes that show some similarity to the processes discussed by linguists and psychologists when they are describing the structure and use of natural spoken languages. One simplified design for text-to-speech conversion is shown in Figure 11-2, with the text coming in at the top and a speech signal generated out the bottom. The first process normalizes the incoming text by expanding forms such as abbreviations and monetary amounts.

Then a phonological process takes the letter string that makes up each word and analyzes it into a sequence of phonemes. The phonological process of translating the spelled form of the word into a phoneme sequence is usually done by first looking for the whole word in a lexicon. If the whole word is not found, the process works by analyzing the word into morphological elements like a stem and affixes, which can be matched and transformed into phonemes and stress information. This phonological process works with reference to a lexicon and a set of rules that look more or less like the kinds of rules and lexical forms proposed by linguists. After the prosody module constructs a rhythm and melody to impose on the phonemes in the sentence, the actual construction of the acoustic signal proceeds inside a speech synthesizer. The signal construction, however, is radically different from the coordinated muscular and aerodynamic processes in the vocal tract.

Internal Structure of Hidden Markov Model Speech Recognition

Figure 11-3 shows a schematic design of a speech recognition device. The acoustic signal is transduced by the microphone on the left and processed by the signal processor and search modules to the right until the best matching sequence of words is produced. Some stages of speech recognition may also be similar to simple models of human performance in similar tasks. For example, the operation of a microphone and a frequency analyzer are somewhat similar to processes that occur in the auditory periphery. However, the methods used in speech recognizers to identify words and hypothesize sentences are probably fundamentally different from the corresponding processes in human listeners. The recognition system schematized in Figure 11-3 has two models, one of speech and the other of language. In the simple hidden Markov model (HMM) system that we will discuss below, the model of human speech acoustics is just the spectrum codebook—nothing more than a list of spectral shapes that are commonly found in speech signals. The language model in our example system is a small finite state machine—that is, a graph that describes exactly which words can start an utterance and which words can follow those words.


FIGURE 11-3 Component processes and static structures in speech recognition (acoustic signal, signal analysis, sequence of spectra and energies, word sequence).

For each word the phones are given, and for each phone there is a list of spectral shapes that are likely to be realizations of it. The signal processor and search modules find the elements in these models that are the best match for the incoming acoustic signal. These operations are explained in the next section, which describes the design of a typical HMM-based speech recognition system.

Potential Intermediate Representations

In the schematic representation of speech synthesis shown in Figure 11-2 the text goes through many intermediate representations as it is converted to speech. As text is processed inside a text-to-speech converter, it takes several understandable forms that may be accessed for one use or another. For example, letters enter the text normalizer and words are produced. From these words morphemes and phonemes are produced by the text-to-phoneme conversion. The prosody module produces a specific intonation melody and an explicit rhythm to align with the words and phonemes.

The amount and kind of intermediate information normally calculated in the process of speech recognition is very different from the intermediate representations of a text-to-speech synthesis system. The only standard intermediate representation is the sequence of spectra produced by the signal processor, and then there is the final product, the best-matching word sequence. It takes significant engineering to allow most current recognition systems to produce a string of phonemes or a fundamental frequency pattern that is time-aligned with the words in a sentence. The search process, as explained below, is designed to find the optimum word sequence. Representations at other linguistic levels (e.g., the best matching sequence of phones or morphemes) are not calculated as such. Thus, although recognition systems in principle can produce phonemic transcriptions, most systems are not designed so that this can be done easily or with optimum accuracy in isolation from the task of finding the best word sequence.

Task Types

Many types of tasks could be accomplished with a speech recognition device. From the acoustic signal, one could identify the speaker or the speaker's gender or the language being spoken or the speaker's physical condition or the speaker's skill in producing the language. In this chapter we focus on speech recognition, by which we mean the recognition of the words in the signal. Bear in mind that various systems may operate at different levels of complexity to accomplish tasks that could be identified as speech recognition. A system that accepts and answers spoken queries about commercial airline schedules and fares will have in it a speech recognition system as we have been using the term here, but it probably will also have many other component systems:

• A syntactic parser that assigns syntactic labels to words or phrases in the utterances.
• A semantic model of requests, times, flights, and destinations.
• A database of airlines, flights, times, aircraft, fares, and connections.
• A user interface that guides the user of the system and responds by graphic displays or by voice or both.

The focus in this chapter is on understanding just those processes involved in the extraction of word sequences from the acoustic signal.


This single speech recognition function is essential in many larger system applications, and so far, speech recognition can be treated in isolation from the many other issues that affect its functioning in more complex systems. Furthermore, limiting the task domain should make the exposition of the technology simpler.

History from 1970 to 1990

You can better appreciate the strengths and limits of current speech recognition design if you know a bit of the history of speech recognition since 1970. There were some attempts at automatic speech recognition before 1970; the 12-page review of the field in Flanagan's comprehensive 1972 book, Speech Analysis, Synthesis, and Perception, covers most of the significant work up to that time. Some systems used analog electronic circuits to recognize small vocabularies (like the English digits zero through nine in real time) and a few limited digital systems offered similar function (e.g., Reddy, 1967) but used digital computers. Since 1970, speech recognition has advanced in two waves, both based on general pattern matching techniques and implemented in digital computers. First, there was the rise and decline of systems based on dynamic time warping (DTW); and second, there was the emergence and dominance of systems based on the HMM. A third major thread in the development of speech recognition during the 1970s and 1980s was the expert system based on speech and language insights. Although some systems based on expert knowledge of phonetics and linguistics could accurately recognize speech (Weinstein et al., 1975), most have not been as accurate, as fast, or as extensible as contemporaneous systems based on DTW or HMM algorithms. That is, systems based on more general approaches to optimizing the match between a signal and a preexisting pattern have consistently outperformed systems that seem more intuitive to the expert trained in linguistics or acoustic phonetics.

Typically, a DTW recognition system is designed to recognize isolated words spoken one at a time by a particular talker who has trained this DTW system by repeatedly speaking a fixed set of words to the system during a training session. DTW technology was developed and extended at AT&T Bell Labs during the 1970s (Itakura, 1975). DTW was the basis of most of the commercial speech recognition systems that were available during the late 1970s and early 1980s.

A clear explanation of the internal operation of a DTW system can be found in Levinson and Liberman (1981). Basically, a DTW system stores a separate template of each word in its vocabulary. The templates are based on example pronunciations that the system records during training sessions. The templates are sequences of acoustic spectra that represent the word. When a DTW system is recognizing, it captures each speechlike event in the incoming acoustic stream, reduces it to a sequence of spectra, and compares the sequence of incoming spectra with each word template by stretching and/or compressing the incoming spectral sequence to form a best match with each of the stored word templates. The incoming signal is identified with the word template that best matches the incoming signal. The dynamic programming technique used to map the incoming spectral sequence to the stored template is called time warping.

The limits of the classic DTW technique are that it is a speaker dependent system (each talker has to train the system), it is most suitable for recognizing isolated words or phrases, and it operates on each word as a unique template, so larger vocabularies have to be trained by having users say each word. Some of the limits of classic DTW recognition were overcome by the early 1980s. DTW techniques were extended to handle connected speech, and clustering methods were developed to allow accurate recognition in a speaker independent system for limited vocabularies. Furthermore, special-purpose computing elements supported DTW recognition of vocabularies of up to 1000 words by the mid-1980s. However, the treatment of each word as a unique template makes the training of a DTW system extremely inefficient. Since the DTW technique recognizes no subword units like syllables or phonemes, word sets like {cable, table, fable, able, sable, label...} cannot share training material. Each word has to be trained separately by a population of speakers. Fortunately, an alternative method for representing words, HMM, has very efficient training methods and operates on words in terms of phoneme-sized units. Thus, with an HMM system, a fairly accurate speaker-independent model of a new word (for example, "chronic" [kranik]) can be constructed from material that was part of other words (like "crop", "crock", "Ronnie", "sonic", "nickel", and many others) that may be part of a large speaker-independent training set covering the phonemes and common phoneme sequences of English.
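The stretch-and-compress template matching described above can be made concrete with a short sketch. The code below is only a minimal illustration of dynamic time warping, not the implementation used in the commercial systems discussed here; the frame representation (one spectrum per row) and the Euclidean frame distance are assumptions made for the example.

```python
import numpy as np

def dtw_distance(template, incoming):
    """Minimal dynamic time warping: align two sequences of spectra
    (each row is one spectral frame) and return the cost of the best
    stretching/compressing alignment between them."""
    T, N = len(template), len(incoming)
    # Local frame-to-frame distances (Euclidean distance between spectra).
    local = np.array([[np.linalg.norm(t - x) for x in incoming] for t in template])
    # Cumulative cost table; each cell extends its cheapest allowed predecessor.
    D = np.full((T, N), np.inf)
    D[0, 0] = local[0, 0]
    for i in range(T):
        for j in range(N):
            if i == 0 and j == 0:
                continue
            prev = min(
                D[i - 1, j] if i > 0 else np.inf,                 # stretch the template frame
                D[i, j - 1] if j > 0 else np.inf,                 # compress the incoming frames
                D[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # advance both
            )
            D[i, j] = local[i, j] + prev
    return D[-1, -1]

def recognize(word_templates, incoming):
    """Label an unknown utterance with the best-matching stored template."""
    return min(word_templates, key=lambda w: dtw_distance(word_templates[w], incoming))
```

A small isolated-word recognizer in this style simply stores one template per vocabulary word and calls something like `recognize({"yes": yes_template, "no": no_template}, frames)` on each incoming utterance.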

FIGURE 11-4 Dimensions of speech recognition performance: uncertainty (perplexity), mode of speaking, adaptation, environment/channel, and error rate.

Speech recognition systems based on hidden Markov models of words were first devised at IBM and Carnegie-Mellon University in the mid-1970s (Baker, 1975; Jelinek, 1976). By the early 1980s, it was shown that they could perform as well as DTW systems on those very tasks for which DTW had been optimized, and the long-term advantages of HMM techniques in large-vocabulary, speaker-independent systems were becoming clear. Thus, by the end of the 1980s, DTW systems had been generally supplanted by HMM systems, although DTW systems can operate very successfully in tasks for which a limited vocabulary like the digits is spoken by a limited number of speakers. In a later section of this chapter, we describe an HMM-based speech recognition system without reference either to alternative techniques such as neural networks (Cohen et al., 1993) or to the possible integration of semantic and task-domain information into the recognition system design. As of 1995, almost all the high-performance experimental and commercial speech recognition systems share some of the major elements of the design that we describe in this chapter.

Performance Specification

Consider the concept of performance as it relates to speech recognizers.

Just as with the performance of a car, the performance of a speech recognizer has many aspects. Recognition systems are sometimes touted in terms of their accuracy. For example, many systems at various levels of complexity have been described as "99% accurate" in some task. This is vaguely similar to comparing automobiles in terms of top speed. The cars that have the highest top speed are often not the cars that most people want or need. A useful car typically offers some agreeable array of qualities including ease of operation, reliability, fuel economy, smooth ride, a quiet interior that seats five, and an affordable purchase price. The fastest racing cars are not suitable for general street use.

There are many dimensions along which speech recognizers vary. Five of these dimensions—perplexity, accuracy, adaptation, mode, and noise—are shown in Figure 11-4. Perplexity relates to the uncertainty of the task. If the possible distinct outcomes of the recognition are many, the perplexity is high, and if the recognition is only deciding between two possible options, the perplexity is low. The perplexity is like the effective or expected vocabulary size. Error rate in Figure 11-4 means raw accuracy, shown in the figure as average errors per number of words. Adaptation means how long the speaker has to use the system before it reaches its peak performance for that speaker—does it take 10 minutes or is it immediate? The mode of speaking is the manner in which the user is expected to speak. Is the form of input a code like {alpha, bravo, Charlie, delta...}, or is there a requirement to speak in an isolated word-by-word mode, or can the system recognize casual, continuous speech?


The last dimension shown is environment/channel. Can the system operate in normal office noise, or does it require a very quiet acoustic environment? Is it immune to adverse noise conditions, and does it operate well in the loud and variable noise of a factory floor? Can it operate on speech signals over radio or telephone channels? Notice that the right-hand ends of the arrows represent the desirable high-performance goal for speech recognition but that no recognition system, even a human listener, can operate at the right end of all the dimensions at once. People can simultaneously reach the right end of most of these dimensions, except for loud and variable noise. A current speech recognition system can achieve the right end of any one of the dimensions in Figure 11-4 by limiting the requirement on the other dimensions. Thus, for example, a system can achieve accuracies of just 1 error in 10,000 words if the active vocabulary is a set of three or four distinct words and if the system can adapt to (or train itself on) the speaker repeating those few words many times over several hours and the acoustic conditions are good. Similarly, we can design systems that will recognize any one of 20,000 or even 50,000 words if an error rate of 1 in 10 is acceptable and again if the speaker trains the system for several hours and the acoustic conditions are excellent.

We discussed two commercially viable applications of speech recognition, emergency room reports and collect call handling, at the beginning of the chapter. Where do these applications sit on the dimensions of Figure 11-4? The collect call handler requires the recognition of "yes" and "no" over the telephone from any person at any location. Phone company engineers say that a rejection rate of 1 in 10 is acceptable. Thus, the collect call handler is operating at the difficult end of the adaptation and mode dimensions; the machine has to deal with any unfamiliar person, and most people speak fairly casually in most circumstances. The noise and channel conditions are a challenge too, but the very low perplexity of the task—yes or no—and clever user interface design around the 1-in-10 rejection (the person just gets switched to a human operator) make the service a major commercial success.

The emergency room reporting system has quite a large overall vocabulary, but the speech recognition task is always constrained to be rather low perplexity in the way the system is used.

Accuracies seem high (the emergency room physicians are satisfied with the accuracy), but only a small number of physicians use each system, and each is required to train the system for several hours before the system reaches its best performance. The physicians speak carefully and the noise environment is usually benign. In addition, the medical vocabulary is composed of fairly distinct polysyllabic words that also help the recognizer maintain adequate accuracy.

The Advanced Research Projects Agency (ARPA) has been the single largest U.S. government supporter of speech recognition development in the period between 1970 and 1990. Since about 1985 ARPA has been setting up a series of technology challenges that can be understood with reference to the dimensions of Figure 11-4. ARPA has focused mainly on speaker-independent recognition of sentence material read aloud in a normal fashion. This sets the adaptation to the right-hand end of the arrow (no adaptation) and the speaking mode somewhere between careful and casual but certainly continuous. The sentence material has been taken from several sources (e.g., The Wall Street Journal), and the perplexity of the tasks has been held in the range between 50 and 100. For materials with an average sentence length of 10 words, a word perplexity of 50 implies that the recognition task is equivalent to a selection from among an equally probable set of 50^10 different sentences; 50^10 is about 100 quadrillion. For readings recorded in a reasonable office environment, by 1993 the best laboratory systems had achieved speaker-independent accuracies on The Wall Street Journal of about 1 error in 20 words. The ARPA challenge is to push upward on the error-rate and environment/channel dimensions, increasing accuracy in the presence of increasing levels of noise.
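The arithmetic behind that figure is easy to check; this is a minimal sketch using only the numbers quoted in the paragraph above.

```python
# Per-word perplexity of 50 over a 10-word sentence is equivalent to
# choosing among 50**10 equally probable sentences.
perplexity = 50
sentence_length = 10
print(f"{perplexity ** sentence_length:.2e}")  # 9.77e+16, roughly 100 quadrillion
```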

RECOGNITION SYSTEM DESIGN

This section discusses the operation of a complete but scaled-down version of an HMM speech recognition system, which we will call Simple-HMM. The system is missing several internal refinements that are needed for high accuracy; for example, it uses the technique of vector quantization as an approximation of the spectrum estimation that is the first step in recognition.


Furthermore, the system does not reject ill-formed input, nor does it "understand" the words that it recognizes, nor is there any provision for how the system may respond after it identifies what words have been said. Assume that the Simple-HMM system has been trained, the spectrum codebook and language model as seen in Figure 11-3 are set, and the system is ready to process and recognize an incoming signal. The construction of the spectrum codebook and the language model from which the system works will be discussed in a later section. This system can recognize which one of 81 possible sentences was spoken. More specifically, for each of the 81 possible word sequences specified in the grammar, the system will calculate how likely it is that the input signal is a realization of that word sequence. The word sequence that can account for the observed utterance with the greatest likelihood score is "recognized" to be the one spoken. Relevant parts of the system are illustrated and explained in turn as we progress through the example.

As shown in Figure 11-3, the Simple-HMM recognition system has two processes, the signal analysis and the search. The signal analysis takes an input signal and returns a sequence of spectra that more or less closely approximates the spectral shape of the input signal. The sequence of spectra is a greatly simplified representation of the original signal, yet if designed properly, this simplified representation is sufficient to identify words. The search process takes the sequence of spectra produced by the signal analysis and finds the path through the compiled language model that yields the word sequence with maximum probability for that spectrum sequence. The words traversed in that maximum-probability path are taken to be the recognized word sequence. The next sections describe these processes and the related data structures in more detail. Much of the art of making an HMM system accurate, fast, and extensible is not included in the Simple-HMM, but the logic is essentially the same as in the best current recognition systems.

Signal Analysis

In the Simple-HMM system the first of the two recognition processes is signal analysis. In the first step of signal analysis an acoustic signal is

converted to an electrical signal by a microphone. The microphone produces a voltage signal that varies as a function of time in a form that follows the acoustic pressure signal at the microphone. The next step is to convert this electrical signal into a digital signal so that it can be processed by a digital computer. A digital signal is just a series of binary numbers. A small device called an analog-to-digital converter (ADC) replaces the electrical signal, which varies continuously over time, with a series of numbers that represent the amplitude of the original signal at fixed intervals. In a typical speech recognition system, the ADC samples the electrical signal 16,000 times per second, and the amplitude of the signal is represented by a 16-bit (or 16-place binary) number at each sample. Thus, the signal enters the computer in the form of numbers that require 256,000 (16 x 16,000) bits to represent each second of signal. These numbers are a very precise description of the speech, and if they are reconverted into an analog signal, amplified, and played back through a loudspeaker, they can sound nearly indistinguishable from the original acoustic signal.

Following the operation of the ADC, the signal is available for processing inside the computer. Inside the computer, the first process is signal analysis. The main purpose of the signal analysis is to reduce this flood of information (256,000 bits per second) into a more manageable form that still retains most of the important linguistic information. In the Simple-HMM system, the output of the signal analysis process is a much simpler stream of information: just two 3-bit numbers, 50 times per second. This is only 300 bits per second. A 3-bit number can have any of eight values, in this case zero through seven.

Figure 11-5 shows a waveform of the word [san] pronounced like the first three segments in the English word "sonic." The 1-second waveform displayed at the top of the figure is represented in the computer as 16,000 16-bit samples. A spectrogram of the same signal is displayed below it for reference, and the output of the coarse-grain signal analysis in the Simple-HMM is shown at the bottom. The key operation in the Simple-HMM signal analysis is vector quantization. Vector quantization in this context is the process that approximates the continuously changing spectral shape of the signal by a sequence of spectral types (Gray, 1984; Furui, 1989).
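The data-rate reduction described above can be verified directly; this minimal sketch just restates the figures from the text, with variable names chosen for illustration.

```python
# Data rates for the Simple-HMM front end, as described in the text.
sample_rate_hz = 16_000          # ADC samples per second
bits_per_sample = 16
adc_bits_per_sec = sample_rate_hz * bits_per_sample       # 256,000 bits per second

frames_per_sec = 50              # one analysis frame every 20 ms
bits_per_frame = 3 + 3           # a 3-bit energy level plus a 3-bit spectrum type
analysis_bits_per_sec = frames_per_sec * bits_per_frame   # 300 bits per second

print(adc_bits_per_sec, analysis_bits_per_sec)  # 256000 300
```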

FIGURE 11-5 Waveform, spectrogram, and 20-msec frame parameters (energy level and spectrum type) of a 1-second signal of [san].

In the Simple-HMM system, the signal is approximated by one of eight spectrum types every 20 ms, or 50 times per second. In operation, the signal analysis process produces a spectral type for every 20 ms of incoming signal. First, the actual spectral envelope of the 20-ms speech signal is determined by calculating a discrete Fourier transform of the digital signal in that region. The signal analysis process then finds the spectrum type from among the set in the spectrum codebook that is most similar to the actual calculated spectrum. From this point forward everything in the Simple-HMM recognition process is done with reference to the energy level and spectrum type, as quantized by the signal analysis process. The signal analysis has reduced the intricate detail in the 1-second signal that entered the system to 50 pairs of 3-bit numbers per second. Thus, the search process of the Simple-HMM ignores everything about the speech signal except this sequence of 50 pairs of derived 3-bit numbers. The energy level is a number that corresponds to a coarse linear quantization of the energy level in the digital signal.

The spectrum type, however, is a vector quantization of the original spectrum. The Simple-HMM reduces all the infinitely subtle differences in sound quality and timbre into exactly eight different types. No matter what the spectrum of the incoming sound, the signal analysis classifies it as one of eight distinct types and sends that type identifier across to the search process. The subsequent search through the compiled language model operates with reference solely to this string of number-pairs (an energy level and a spectrum type), which encodes all the information about the original speech that the system will use. The spectrum types encode the actual spectra found in the incoming signal, and the set of available types is a codebook. The spectrum types are sometimes called code words (Figure 11-6). Before running the system, we establish a set of spectra generally representative of the kinds of spectra observed in acoustic signals containing speech. (An algorithm for deriving a vector codebook is presented in a later section.) Although a typical speech recognition system might use a set of 256 spectra to approximate the speech signal, for the Simple-HMM system we derived a set of eight spectrum types, or vectors. These spectrum types are shown in Figure 11-6.
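A minimal sketch of the frame-by-frame quantization just described follows: take each 20-ms slice, estimate its spectral envelope with a discrete Fourier transform, and label it with the index of the closest codebook spectrum. The codebook contents, the use of a Hamming window, and the energy scaling are assumptions made for illustration, not the chapter's exact procedure.

```python
import numpy as np

FRAME_LEN = 320  # 20 ms at 16,000 samples per second

def quantize_frames(signal, codebook):
    """Label each 20-ms frame with a coarse 3-bit energy level and the index
    of the closest codebook spectrum. `codebook` is assumed to be an
    (8, n_bins) array of reference log spectra with n_bins <= 161."""
    labels = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_LEN):
        frame = np.asarray(signal[start:start + FRAME_LEN], dtype=float)
        # Spectral envelope estimate: log magnitude of the discrete Fourier transform.
        spectrum = np.log(np.abs(np.fft.rfft(frame * np.hamming(FRAME_LEN))) + 1e-10)
        spectrum = spectrum[:codebook.shape[1]]
        # Coarse, arbitrarily scaled energy level, clipped to the range 0-7.
        energy = int(np.clip(np.log10(np.sum(frame ** 2) + 1e-10) + 4, 0, 7))
        # Vector quantization: nearest codebook entry by Euclidean distance.
        code = int(np.argmin(np.linalg.norm(codebook - spectrum, axis=1)))
        labels.append((energy, code))
    return labels
```

For a 1-second signal this returns 50 (energy, spectrum-type) pairs, the stream of number-pairs that the search process consumes.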

FIGURE 11-6 The set of eight different spectrum types used in Simple-HMM.

Take time to verify that the spectrum types displayed in Figure 11-6 match the visible spectral properties of the speech signal displayed over time in Figure 11-5. In the middle of the [s] (the region around time 0.3 seconds), the best matching spectrum type is spectrum 3, which indeed has its predominant energy in the high frequencies above 3000 Hz. The middle of the [a] (the region around time 0.55 seconds) yields spectrum types 2 and 4, which have energy concentrations (resonances) in the frequency range between 500 and 1500 Hz. The signal analysis matches the middle of the nasal (the region around time 0.75 seconds) to spectrum types 6 and 0, which have energy concentrated below 500 Hz and above 2000 Hz.

In high-performance HMM-based speech recognition systems developed through 1990, the signal analysis process generally operated in the manner outlined above, except that the signal analysis produced four or six numbers 100 times per second (a 10-ms frame update), and the numbers represented a finer quantization of the incoming signal than those used in the Simple-HMM. The codebook of spectrum types typically had 256 members, and the energy levels were quantized to 32 levels. In recent high-performance systems, the signal analysis modules also produce a delta energy and a delta spectrum for each frame of speech data. These delta parameters are derived by calculating the difference in energy or the difference in spectral shape between the current frame of the speech signal and a previous frame, usually two or three frames back. The delta parameters represent the rate and direction of change in the signal. For example, the delta parameters may indicate that the energy in the signal is increasing rapidly, although the spectral shape is changing only slightly toward a spectrum with more energy in the high frequencies. Lee (1989), Cohen et al. (1990), and Woodland et al. (1994) provide more details on the parameters of more recent high-performance systems.

Searching the Compiled Language Model

Signal analysis, the first major module of the recognition system, reduces the incoming signal to manageable dimensions, but the essential elements of HMM will become apparent only when the search through the compiled language model is understood. To understand the search, it is necessary to understand the compiled language model and the way a sequence of number pairs produced by the signal analysis process yields a probability score for each path in the language model. The path with the best score is taken to be the system's best guess as to what words were in the signal. Some very efficient computational methods, like the Viterbi algorithm (Rabiner, 1989), find the best-scoring path.
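The sketch below illustrates the kind of best-path computation just mentioned. It is a minimal Viterbi search over a hand-built model, not the implementation of any particular system; the state names, transition table, and emission table are assumptions standing in for the transition and emission probabilities that the compiled language model (described in the next section) supplies.

```python
import math

def viterbi(states, start, trans, emit, observations):
    """Find the most probable state path for a sequence of observed
    spectrum types. start[s], trans[s][s2], and emit[s][obs] are
    probabilities; missing entries are treated as zero."""
    def lp(p):  # log probability, with log(0) treated as minus infinity
        return math.log(p) if p > 0 else float("-inf")

    # best[s] = (log score of the best path ending in state s, that path)
    best = {s: (lp(start.get(s, 0.0)) + lp(emit[s].get(observations[0], 0.0)), [s])
            for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Extend the best-scoring predecessor path into state s.
            score, path = max(
                ((best[p][0] + lp(trans[p].get(s, 0.0)), best[p][1]) for p in states),
                key=lambda t: t[0],
            )
            new_best[s] = (score + lp(emit[s].get(obs, 0.0)), path + [s])
        best = new_best
    return max(best.values(), key=lambda t: t[0])

# Toy example: two subphone-like states and observations drawn from the
# eight spectrum types (0-7). All numbers are invented for illustration.
states = ["s_mid", "a_mid"]
score, path = viterbi(
    states,
    {"s_mid": 1.0},
    {"s_mid": {"s_mid": 0.7, "a_mid": 0.3}, "a_mid": {"a_mid": 1.0}},
    {"s_mid": {3: 0.8, 2: 0.2}, "a_mid": {2: 0.5, 4: 0.5}},
    [3, 3, 2, 4],
)
print(path)  # ['s_mid', 's_mid', 'a_mid', 'a_mid']
```

The word sequence reported by a recognizer is then read off the winning state path, since every state belongs to a phone of a particular word in the compiled model.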
The Compiled Language Model

A compiled language model, the central element in an HMM speech recognizer, contains the definition of the task that the recognizer is trying to solve. It specifies the grammar of the utterances that can be recognized as well as the words and where they can fit in the grammar. It also specifies the pronunciation of the words and the acoustic events that are likely to be observed when these pronunciations are produced. That is, the language model is built as a series of structures embedded within larger structures. Figures 11-7 and 11-8 illustrate this concept. Figure 11-7 displays the outer four layers of the structure of the language model of the Simple-HMM recognizer, and Figure 11-8 shows the innermost three layers of a model for the syllable [san] as in "sonic."

The compiled language model is the hidden Markov model. But what is hidden about a hidden Markov model? In concise form, the hidden Markov model is a multiply embedded probabilistic model of the language that is to be recognized by the device. The language model is a network of states and transitions between states in which each transition between states has a probability of occurring; thus, it is a Markov model. Simple-HMM has a sentence network that expands as the word-class models, word models, and phone models (as in Figure 11-7) are embedded. The innermost states of the Markov language model are subphone units (for example, the beginning of [a] or the middle state of [s]) that are not directly observable in the signal but have only a probabilistic relation to the observed signal parameters that are produced by the signal analysis. Thus, it is a hidden Markov model; the states are not directly observed but are inferred with more or less likelihood, depending on the sequence of signal analysis parameters that are directly observed.

At its heart, Simple-HMM assumes that words are composed of sequences of linguistic segments (like phonemes) and that each segment can be represented by a simple statistical model. Thus, the sentence "Sam went in" has three words, and the words have three, four, and two segments, respectively. Each segment, for example the /m/ in "Sam", is represented as a probability distribution reflecting the likelihood that various spectral shapes will be observed when the particular segment is pronounced. The frequency spectra typical of various speech sounds are different: an [s] usually has much more energy higher in the frequency spectrum, and an [m] usually has little high-frequency energy but more low-frequency energy. Thus, the HMM system represents the [m] as very likely to be observed with frequency spectra that have predominantly low-frequency energy, while the HMM system will expect to observe spectra with more high-frequency energy when an [s] is spoken.
FIGURE 11-7 Embedded network elements for Simple-HMM: the grammar chain SILENCE, NAME, VERB, DET, OBJECT, SILENCE, with its embedded word class, word model, and phone model layers.

The compiled language model of Figure 11-3 is disassembled in Figures 11-7 and 11-8. Consider it closely. The model has a grammar (shown at the bottom of Figure 11-7) that expects and accepts exactly one type of sentence: a name, followed by a verb, followed by a determiner, followed by an object. The whole sentence starts and ends in silence. The name in the sentence can be any one of the three forms "John", "Bill", or "Diego". The verb must be one of the forms "painted", "lifted", or "made". Each word expands into a sequence of phones, as suggested by the word model for "John". "John" is exactly a directional network of three phones, and each phone (e.g., [a]) is further represented as a sequence of subphone states. If we make some assumptions about how the words are pronounced and assign one phone per phoneme and three subphone states per phone, we can see that this simple grammar has about 126 states when fully expanded to its implicit subphone states.

What is a state? A state represents a phonetic subsegment like the beginning of [k] or the middle of [s]. In the computer the states are nothing more than tables that say how likely we are to observe a particular spectrum type or energy level when the frame in question is counted as being in that state. Figure 11-8 shows the probability distributions associated with the states in the phones [s], [a], and [n]. Only the probability distributions of the spectrum types
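The layered expansion described above can be sketched in a few lines. The grammar, the name words, and the verb words below come from the text; the determiner and object words, the pronunciations, and the treatment of silence are assumptions made so the example runs, so the printed state count is only of the same order as the 126 states mentioned above.

```python
# Hypothetical expansion of the Figure 11-7 grammar into subphone states.
# Pronunciations are rough one-phone-per-phoneme lists; the DET and OBJECT
# entries are assumed, since the chapter excerpt does not list them.
WORDS = {
    "NAME": {"John": ["j", "a", "n"], "Bill": ["b", "i", "l"],
             "Diego": ["d", "i", "e", "g", "o"]},
    "VERB": {"painted": ["p", "ei", "n", "t", "i", "d"],
             "lifted": ["l", "i", "f", "t", "i", "d"],
             "made": ["m", "ei", "d"]},
    "DET": {"the": ["dh", "e"]},                 # assumed
    "OBJECT": {"table": ["t", "ei", "b", "l"]},  # assumed
}
STATES_PER_PHONE = 3   # beginning, middle, and end of each phone
SILENCE_STATES = 2     # one state per sentence-edge silence (an assumption)

def count_states(words):
    """Count the subphone states implied by expanding every word of every
    word class into phones, and every phone into three states."""
    phones = sum(len(pron) for word_class in words.values()
                 for pron in word_class.values())
    return phones * STATES_PER_PHONE + SILENCE_STATES

print(count_states(WORDS))  # on the order of 100 states for this small grammar
```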