HIDDEN MARKOV MODELS
John Fry
San José State University
Linguistics 124: Computers and Spoken Language, Fall 2003, SJSU
Andrei A. Markov (1856-1922)
• A.A. Markov introduced the underlying mathematics of the n-gram in a 1913 paper
• Markov used bigrams and trigrams to predict whether an upcoming letter in Pushkin’s Eugene Onegin would be a vowel or consonant
Markov models
• Markov models are used to model sequences of events (or observations) that occur one after another
• The easiest sequences to model are deterministic, where one specific observation always follows another
  Example: changes in traffic lights (green to yellow to red)
• In a nondeterministic Markov model, an event might be followed by one of several subsequent events, each with a different probability
  – Daily changes in the weather (sunny, cloudy, rainy)
  – Sequences of vowels and consonants in Eugene Onegin
  – Sequences of words in sentences
  – Sequences of phonemes in spoken words
Markov models
• A Markov model consists of a finite set of states together with probabilities for transitioning from state to state
• Consider a Markov model of the various pronunciations of “tomato”:
• The probability of a path is the product of the probabilities on the arcs that make up the path
  P([towmeytow]) = P([towmaatow]) = 0.1
  P([tahmeytow]) = P([tahmaatow]) = 0.4
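The product rule can be checked with a short Python sketch. The arc probabilities below are assumptions: the figure is not reproduced here, so the values were chosen only so that the path products agree with the 0.1 and 0.4 quoted above. The names `ARCS` and `path_probability`, and the state labels such as `t1` and `ow2`, are hypothetical.

```python
from math import prod

# Hypothetical arc probabilities for the "tomato" model (assumed values,
# picked to be consistent with the path probabilities on the slide).
ARCS = {
    ("t1", "ow1"): 0.2, ("t1", "ah1"): 0.8,
    ("ow1", "m"): 1.0, ("ah1", "m"): 1.0,
    ("m", "ey"): 0.5, ("m", "aa"): 0.5,
    ("ey", "t2"): 1.0, ("aa", "t2"): 1.0,
    ("t2", "ow2"): 1.0,
}


def path_probability(path):
    """Multiply the probabilities on the consecutive arcs of a path."""
    return prod(ARCS[(src, dst)] for src, dst in zip(path, path[1:]))


towmeytow = ["t1", "ow1", "m", "ey", "t2", "ow2"]
tahmaatow = ["t1", "ah1", "m", "aa", "t2", "ow2"]
print(path_probability(towmeytow), path_probability(tahmaatow))
```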
Markov models
• Transition from state $i$ to $j$ is governed by the discrete probability $a_{ij} = P(s_j \mid s_i)$
• The state transition probabilities for a model with $n$ states can be gathered into an $n \times n$ matrix
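As a rough illustration of such a matrix, the sketch below uses the three weather states mentioned earlier. The probabilities are made-up assumptions, not estimates from data, and the helper names `check_row_stochastic` and `sequence_probability` are hypothetical.

```python
STATES = ["sunny", "cloudy", "rainy"]

# A[i][j] = P(next state is STATES[j] | current state is STATES[i]).
# Illustrative values only (assumed for this sketch).
A = [
    [0.6, 0.3, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.4, 0.4],
]


def check_row_stochastic(matrix, tol=1e-9):
    """Each row is a probability distribution, so it must sum to 1."""
    return all(abs(sum(row) - 1.0) < tol and min(row) >= 0.0 for row in matrix)


def sequence_probability(states, matrix, sequence):
    """Probability of the transitions in `sequence`, given its first state."""
    idx = [states.index(name) for name in sequence]
    p = 1.0
    for i, j in zip(idx, idx[1:]):
        p *= matrix[i][j]
    return p


print(check_row_stochastic(A))
print(sequence_probability(STATES, A, ["sunny", "sunny", "rainy"]))
```

Storing one row per current state keeps the constraint easy to check: the entries in row $i$ cover every possible successor of state $i$.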
Hidden Markov models
• In an ordinary Markov model the output (sequence of observations) is simply the sequence of states visited: [towmeytow]
• There are also Hidden Markov models (HMMs), where the notions of observation and state are separated
  – States do not represent observations directly
  – Different states produce different outputs
  – The output is not the sequence of states visited
Hidden Markov models (HMMs)
• It’s “Markov” because the probability of the next state depends solely on the current state
• It’s “Hidden” because the actual state sequences are concealed from us
  – We only know the output, not which sequence of states led to that output
  – The state sequences are hidden; only the output observations are visible
Hidden Markov model (HMM)
• An HMM is specified by a set of states $s$, a set of transition probabilities $a$, and a set of observation likelihoods $b$
• $b_j(o_t)$ is the probability of emitting symbol $o_t$ when state $s_j$ is entered at time $t$
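One possible way to hold these three ingredients in code is sketched below. It is a minimal sketch, not a reference implementation: the class name `HMM`, the toy states `s1` and `s2`, the symbols `x` and `y`, and all the numbers are hypothetical. The `initial` distribution is an extra assumption, since the specification above does not say how the first state is chosen.

```python
import random
from dataclasses import dataclass


@dataclass
class HMM:
    states: list   # the set of states s
    initial: dict  # initial[j] = P(first state is j); assumed, not on the slide
    a: dict        # a[i][j] = P(next state is j | current state is i)
    b: dict        # b[j][o] = P(emitting symbol o | state j is entered)

    def generate(self, length, rng):
        """Sample a hidden state sequence and the visible observations."""
        hidden, visible = [], []
        dist = self.initial
        for _ in range(length):
            state = rng.choices(list(dist), weights=list(dist.values()))[0]
            symbols = self.b[state]
            visible.append(
                rng.choices(list(symbols), weights=list(symbols.values()))[0]
            )
            hidden.append(state)
            dist = self.a[state]  # the next state depends only on this one
        return hidden, visible


# Toy model with made-up probabilities.
toy = HMM(
    states=["s1", "s2"],
    initial={"s1": 0.6, "s2": 0.4},
    a={"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}},
    b={"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}},
)
hidden, visible = toy.generate(6, random.Random(0))
print(visible)  # an observer sees only this, not `hidden`
```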
Example of HMM generating observations
• At each time $t$ a new state $j$ is entered and an observation $o_t$ is generated from the probability $b_j(o_t)$
• Here we move through the state sequence 1, 2, 2, 3, 4, 4, 5, 6 in order to generate the sequence $o_1$ to $o_6$
[Figure: Markov model M with states 1 to 6 and transitions $a_{12}$, $a_{22}$, $a_{23}$, $a_{24}$, $a_{33}$, $a_{34}$, $a_{35}$, $a_{44}$, $a_{45}$, $a_{55}$, $a_{56}$. The observation sequence $o_1, \ldots, o_6$ is generated with output probabilities $b_2(o_1)$, $b_2(o_2)$, $b_3(o_3)$, $b_4(o_4)$, $b_4(o_5)$, $b_5(o_6)$.]
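For the state sequence 1, 2, 2, 3, 4, 4, 5, 6, the joint probability of the path and the observations is the product of the transition and output probabilities met along the way. The worked equation below is a sketch under one assumption suggested by the labels in the figure: states 1 and 6 are entry and exit states that emit nothing. The symbol $X$ is introduced here as a name for the state sequence.

$$P(o, X \mid M) = a_{12}\, b_2(o_1)\, a_{22}\, b_2(o_2)\, a_{23}\, b_3(o_3)\, a_{34}\, b_4(o_4)\, a_{44}\, b_4(o_5)\, a_{45}\, b_5(o_6)\, a_{56}$$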
Steps in speech recognition
[Figure: Concept: a sequence of symbols (s1, s2, s3, etc.) → Speech Waveform → Parameterise → Speech Vectors → Recognise → s1, s2, s3]
1. In the parameterization stage, the waveform is sliced up into frames of 10 ms or so, and each frame’s spectral features (energies at different frequencies) are stored as vectors
2. These vectors are then mapped onto linguistic symbols
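The parameterization stage can be sketched in a few lines of Python. This is a simplified illustration, not how a production front end works: it uses non-overlapping frames and a naive discrete Fourier transform, whereas real systems typically use overlapping, windowed frames and more compact features. The function names `frames` and `spectral_energies`, the 8 kHz sample rate, and the test tone are assumptions made for the example.

```python
import cmath
import math


def frames(samples, sample_rate, frame_ms=10):
    """Slice the waveform into consecutive non-overlapping frames."""
    size = int(sample_rate * frame_ms / 1000)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def spectral_energies(frame):
    """Energy at each frequency bin of one frame, via a naive DFT."""
    n = len(frame)
    energies = []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame)
        )
        energies.append(abs(coeff) ** 2)
    return energies


# A 1 kHz tone sampled at 8 kHz: each 10 ms frame holds 80 samples, so the
# frequency bins are 100 Hz apart and the energy peaks in bin 10.
rate = 8000
wave = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(800)]
vectors = [spectral_energies(frame) for frame in frames(wave, rate)]
print(len(vectors), len(vectors[0]))
```

Each entry of `vectors` plays the role of one observation vector: one list of energies per 10 ms frame.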
HMMs for speech recognition
[Figure: Concept: a sequence of symbols (s1, s2, s3, etc.) → Speech Waveform → Parameterise → Speech Vectors → Recognise → s1, s2, s3]
1. In other words, the waveform is parameterized into equally spaced observations $o = o_1, o_2, \ldots, o_T$, where $o_t$ is the vector observed at time $t$
2. Observations (vectors) are mapped onto linguistic symbols
HMMs for speech recognition
[Figure: Markov model M with states 1 to 6, generating the observation sequence $o_1, \ldots, o_6$ with output probabilities $b_2(o_1)$, $b_2(o_2)$, $b_3(o_3)$, $b_4(o_4)$, $b_4(o_5)$, $b_5(o_6)$]
• State transition probabilities $a_{ij}$ model the duration and ellipsis of phones
• Output probabilities $b_j(o_t)$ model spectral variability in the phones
Decoding
• HMMs only produce output observations $o = (o_1, o_2, o_3, \ldots, o_T)$
• The precise sequence of states $s = (s_1, s_2, s_3, \ldots, s_T)$ that led to those observations is hidden
• However, we can estimate the most probable state sequence $s$ given the sequence of observations $o$
• This process is called decoding
• The Viterbi algorithm is a simple and efficient decoding technique
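Written as a formula, decoding looks for the state sequence that maximizes the probability of the states given the observations. The symbol $\hat{s}$ is introduced here for that best sequence. Because $P(o)$ is the same for every candidate $s$, maximizing the conditional probability is equivalent to maximizing the joint probability:

$$\hat{s} = \arg\max_{s} P(s \mid o) = \arg\max_{s} \frac{P(o, s)}{P(o)} = \arg\max_{s} P(o, s)$$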
Viterbi algorithm in pseudocode
/* Given an observation sequence o[t], a transition matrix a[i,j], and an observation likelihood b[i,o[t]], create a path probability matrix v[i,t]. */
/* All entries of v[] other than v[0,0] start at 0 */
v[0,0] = 1.0
for t = 0 to T do
    for s = 0 to num_states do
        for each transition i from s do
            new_score = v[s,t] * a[s,i] * b[i,o[t]]
            if (new_score > v[i,t+1]) then
                v[i,t+1] = new_score
                back_pointer[i,t+1] = s
/* To find best path, choose the highest probability state in the final column of v[] and backtrack */
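A runnable version of the same idea might look like the Python sketch below. It is not a line-for-line translation of the pseudocode: instead of a dedicated start state with v[0,0] = 1.0 it assumes an `initial` distribution over the first state, and it stores `v` and `back_pointer` as one dictionary per time step. The toy model (states `s1` and `s2`, symbols `x` and `y`, and all probabilities) is hypothetical.

```python
def viterbi(observations, states, initial, a, b):
    """Return (best_path, probability) for an observation sequence.

    initial[s] is P(first state is s), a[s][i] is the transition
    probability from s to i, and b[i][o] is the likelihood of o in state i.
    """
    # v[t][i]: probability of the best path that ends in state i at time t.
    v = [{s: initial[s] * b[s][observations[0]] for s in states}]
    back_pointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back_pointer.append({})
        for i in states:
            # Best predecessor s for state i at time t.
            best_prev = max(states, key=lambda s: v[t - 1][s] * a[s][i])
            v[t][i] = v[t - 1][best_prev] * a[best_prev][i] * b[i][observations[t]]
            back_pointer[t][i] = best_prev
    # Choose the best final state, then follow the back pointers.
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back_pointer[t][path[-1]])
    path.reverse()
    return path, v[-1][last]


# Toy model with made-up probabilities.
states = ["s1", "s2"]
initial = {"s1": 0.6, "s2": 0.4}
a = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
b = {"s1": {"x": 0.9, "y": 0.1}, "s2": {"x": 0.2, "y": 0.8}}
best_path, best_prob = viterbi(["x", "x", "y"], states, initial, a, b)
print(best_path, best_prob)
```

For long observation sequences, practical implementations usually add log probabilities rather than multiplying raw probabilities, which avoids numerical underflow.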
Viterbi algorithm
• Each time v[i,t+1] gets a higher score from state s, back_pointer[i,t+1] is reset to s
• back_pointer[i,t] therefore always records the predecessor on the best path into state i at time t, so only optimal paths are stored
Viterbi algorithm with beam search
• In practice, the beam search technique is used to avoid having to consider all possible transitions at each time step
• Only transitions whose path probabilities are within some percentage (“beam width”) of the most probable path are considered
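The pruning step can be sketched as a small Python function applied to one column of path probabilities before the next time step is expanded. This sketch assumes the beam width is given as a fraction of the best score. The function name `prune` and the example scores are hypothetical.

```python
def prune(scores, beam_width):
    """Keep only states whose path probability is within the beam.

    scores maps each state to its best path probability at the current
    time step. beam_width is a fraction in (0, 1]: a state survives if
    its score is at least beam_width times the best score.
    """
    best = max(scores.values())
    threshold = best * beam_width
    return {state: p for state, p in scores.items() if p >= threshold}


column = {"s1": 0.3402, "s2": 0.0324, "s3": 0.001}
survivors = prune(column, beam_width=0.05)
print(sorted(survivors))
```

Only the surviving states are expanded at the next time step. The trade-off is that a pruned path can never be recovered, so a narrow beam may discard what would have been the best path.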
Simplifying assumption of the Viterbi algorithm
• If the ultimate best path happens to go through state $s_i$, this path must include the best path up to and including $s_i$
• This assumption does not always hold: when the score of a path depends on more than the immediately preceding state (for example, with a trigram language model), a path might look bad at the beginning but turn out to be the best in the end
• A more complex technique called A* (stack) decoding avoids this problem
Summary
• Markov models use state transitions to model sequences of events (or observations)
• Hidden Markov Models (HMMs) separate the observations from the states; the observations (outputs) are visible, but the state sequences that led to them are hidden
• In HMM-based speech recognizers
  – Observation $o_t$ is the acoustic feature vector at frame $t$
  – State transition probabilities model the duration and ellipsis of phones
  – Output probabilities model spectral variability in the phones
• Decoding is the process of estimating the state sequence that is most likely to have produced an observation sequence
• The Viterbi algorithm is a simple and efficient decoding technique