HIDDEN MARKOV MODELS
John Fry
San José State University
Linguistics 124: Computers and Spoken Language, Fall 2003, SJSU
Andrei A. Markov (1856-1922)
• A.A. Markov introduced the underlying mathematics of the n-gram in a 1913 paper
• Markov used bigrams and trigrams to predict whether an upcoming letter in Pushkin’s Eugene Onegin would be a vowel or consonant
Markov models
• Markov models are used to model sequences of events (or observations) that occur one after another
• The easiest sequences to model are deterministic, where one specific observation always follows another
  – Example: changes in traffic lights (green to yellow to red)
• In a nondeterministic Markov model, an event might be followed by one of several subsequent events, each with a different probability
  – Daily changes in the weather (sunny, cloudy, rainy)
  – Sequences of vowels and consonants in Eugene Onegin
  – Sequences of words in sentences
  – Sequences of phonemes in spoken words
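The difference between the two kinds of model can be sketched in a few lines of Python. This is only an illustration: the step from red back to green is assumed in order to close the traffic-light cycle, and the weather probabilities are invented.

```python
import random

# Deterministic: every state has exactly one successor.
# (red -> green is an assumption added to close the cycle.)
TRAFFIC_LIGHT = {"green": "yellow", "yellow": "red", "red": "green"}

# Nondeterministic: every state has several possible successors.
# These probabilities are made up for illustration.
WEATHER = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.4, "rainy": 0.4},
}


def next_light(state):
    """Deterministic step: the successor is fixed."""
    return TRAFFIC_LIGHT[state]


def next_weather(state, rng):
    """Nondeterministic step: sample the successor from its distribution."""
    successors = list(WEATHER[state])
    weights = [WEATHER[state][s] for s in successors]
    return rng.choices(successors, weights=weights, k=1)[0]


light_sequence = ["green"]
for _ in range(4):
    light_sequence.append(next_light(light_sequence[-1]))
print(light_sequence)

rng = random.Random(0)  # fixed seed so that runs are repeatable
weather_sequence = ["sunny"]
for _ in range(4):
    weather_sequence.append(next_weather(weather_sequence[-1], rng))
print(weather_sequence)
```

The traffic-light sequence is the same on every run, whereas the weather sequence depends on the random draws.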
Markov models
• A Markov model consists of a finite set of states together with probabilities for transitioning from state to state
• Consider a Markov model of the various pronunciations of “tomato”:
[Figure: state diagram of the pronunciations of “tomato”]
• The probability of a path is the product of the probabilities on the arcs that make up the path
  P([towmeytow]) = P([towmaatow]) = 0.1
  P([tahmeytow]) = P([tahmaatow]) = 0.4
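The product rule can be checked with a short Python sketch. The arc probabilities below are hypothetical: the diagram is not reproduced here, so the values were chosen only because they yield the path probabilities 0.1 and 0.4 quoted above. The state names (t1, ow1, and so on) are likewise invented to keep repeated phones apart.

```python
from math import prod

# Hypothetical arc probabilities (assumption, not taken from the slide).
ARCS = {
    ("start", "t1"): 1.0,
    ("t1", "ow1"): 0.2, ("t1", "ah"): 0.8,
    ("ow1", "m"): 1.0, ("ah", "m"): 1.0,
    ("m", "ey"): 0.5, ("m", "aa"): 0.5,
    ("ey", "t2"): 1.0, ("aa", "t2"): 1.0,
    ("t2", "ow2"): 1.0,
}


def path_probability(path):
    """Multiply the probabilities on the arcs that make up the path."""
    return prod(ARCS[(a, b)] for a, b in zip(path, path[1:]))


towmeytow = ["start", "t1", "ow1", "m", "ey", "t2", "ow2"]
tahmaatow = ["start", "t1", "ah", "m", "aa", "t2", "ow2"]
print(path_probability(towmeytow))
print(path_probability(tahmaatow))
```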
Markov models
• The transition from state s_i to state s_j is governed by the discrete probability a_ij = P(s_j | s_i)
• The state transition probabilities for a model with n states can be gathered into an n × n matrix
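As a rough illustration, a transition matrix can be stored as a list of rows, with row i holding the probabilities of leaving state s_i. The three weather states and all numbers below are assumptions made for this sketch.

```python
# Illustrative states and probabilities (assumptions, not from the slides).
STATES = ["sunny", "cloudy", "rainy"]

# A[i][j] = a_ij = P(s_j | s_i); row i describes the transitions out of s_i.
A = [
    [0.6, 0.3, 0.1],  # from sunny
    [0.3, 0.4, 0.3],  # from cloudy
    [0.2, 0.4, 0.4],  # from rainy
]


def transition_probability(from_state, to_state):
    """Look up a_ij for two named states."""
    return A[STATES.index(from_state)][STATES.index(to_state)]


def rows_sum_to_one(matrix, tol=1e-9):
    """Each row is a probability distribution over next states."""
    return all(abs(sum(row) - 1.0) <= tol for row in matrix)


def sequence_probability(sequence):
    """Probability of the rest of the sequence, given its first state."""
    p = 1.0
    for prev, cur in zip(sequence, sequence[1:]):
        p *= transition_probability(prev, cur)
    return p


print(rows_sum_to_one(A))
print(sequence_probability(["sunny", "sunny", "cloudy", "rainy"]))
```

Because some state must follow s_i, each row of the matrix sums to 1.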
Hidden Markov models
• In an ordinary Markov model the output (sequence of observations) is simply the sequence of states visited: [towmeytow]
• There are also Hidden Markov models (HMMs), where the notions of observation and state are separated
  – States do not represent observations directly
  – Different states produce different outputs
  – The output is not the sequence of states visited
Hidden Markov models (HMMs)
• It’s “Markov” because the probability of the next state depends only on the current state
• It’s “Hidden” because the actual state sequences are concealed from us
  – We only know the output, not which sequence of states led to that output
  – The state sequences are hidden; only the output observations are visible
Hidden Markov model (HMM)
• An HMM is specified by a set of states s, a set of transition probabilities a, and a set of observation likelihoods b
• b_j(o_t) is the probability of emitting symbol o_t when state s_j is entered at time t
Example of HMM generating observations
• At each time t a new state j is entered and an observation o_t is generated from the probability b_j(o_t)
• Here we move through the state sequence 1, 2, 2, 3, 4, 4, 5, 6 in order to generate the sequence o_1 to o_6
[Figure: Markov model M with states 1 to 6 and transitions a_12, a_22, a_23, a_24, a_33, a_34, a_35, a_44, a_45, a_55, a_56; the observation sequence o_1 ... o_6 is generated with likelihoods b_2(o_1), b_2(o_2), b_3(o_3), b_4(o_4), b_4(o_5), b_5(o_6)]
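The generation process can be sketched in Python as follows. The topology (which transitions exist, and states 1 and 6 producing no output) follows the example above; every probability and the output symbols "x", "y", "z" are assumptions made purely for illustration.

```python
import random

# Transition probabilities a_ij (values are illustrative assumptions).
A = {
    1: {2: 1.0},
    2: {2: 0.5, 3: 0.3, 4: 0.2},
    3: {3: 0.4, 4: 0.4, 5: 0.2},
    4: {4: 0.5, 5: 0.5},
    5: {5: 0.3, 6: 0.7},
}

# Observation likelihoods b_j(o) for the emitting states 2..5 (assumed).
B = {
    2: {"x": 0.8, "y": 0.2},
    3: {"x": 0.3, "y": 0.7},
    4: {"y": 0.5, "z": 0.5},
    5: {"z": 1.0},
}


def sample(distribution, rng):
    """Draw one outcome from a {outcome: probability} dictionary."""
    outcomes = list(distribution)
    weights = [distribution[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(rng):
    """Walk from entry state 1 to exit state 6, emitting on the way."""
    state = 1
    states, observations = [state], []
    while True:
        state = sample(A[state], rng)
        states.append(state)
        if state == 6:  # exit state: stop without emitting
            return states, observations
        observations.append(sample(B[state], rng))


states, observations = generate(random.Random(0))
print(states)        # visible only to the model builder
print(observations)  # the only thing an observer sees
```

An observer who sees only `observations` cannot read the state sequence off directly, which is what makes the model "hidden".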
Steps in speech recognition
[Figure: a concept (a sequence of symbols) is realised as a speech waveform, which is parameterised into speech vectors]
HMMs for speech recognition
[Figure: a concept (a sequence of symbols) is realised as a speech waveform, which is parameterised into speech vectors]
[Figure: Markov model M generating the observation sequence o_1 ... o_6]
Decoding
• HMMs only produce output observations o = (o_1, o_2, o_3, ..., o_t)
• The precise sequence of states s = (s_1, s_2, s_3, ..., s_t) that led to those observations is hidden
• However, we can estimate the most probable state sequence s = (s_1, s_2, s_3, ..., s_t) given the sequence of observations o = (o_1, o_2, o_3, ..., o_t)
• This process is called decoding
• The Viterbi algorithm is a simple and efficient decoding technique
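To make the decoding problem concrete, the sketch below scores every possible state sequence and keeps the most probable one. This exhaustive search is not the Viterbi algorithm; its cost grows exponentially with the length of the observation sequence, so it is shown only to illustrate what is being computed. The two-state model, its numbers, and the explicit start distribution `START` are assumptions for the example.

```python
from itertools import product

# Illustrative two-state HMM (all values are assumptions).
STATES = ["rainy", "sunny"]
START = {"rainy": 0.6, "sunny": 0.4}
A = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
B = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}


def joint_probability(states, observations):
    """P(states, observations): start, transition and emission terms multiplied."""
    p = START[states[0]] * B[states[0]][observations[0]]
    for prev, cur, obs in zip(states, states[1:], observations[1:]):
        p *= A[prev][cur] * B[cur][obs]
    return p


def decode_exhaustively(observations):
    """Try every state sequence of the right length and keep the best."""
    candidates = product(STATES, repeat=len(observations))
    return max(candidates, key=lambda s: joint_probability(s, observations))


observations = ["walk", "shop", "clean"]
best = decode_exhaustively(observations)
print(best, joint_probability(best, observations))
```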
Viterbi algorithm in pseudocode
/* Given an observation sequence o[t], a transition matrix a[i,j],
   and an observation likelihood b[i,o[t]], create a path
   probability matrix v[i,t]. */
v[0,0] = 1.0    /* all other entries of v[] start at 0 */
for t = 0 to T do
    for s = 0 to num_states do
        for each transition i from s do
            new_score = v[s,t] * a[s,i] * b[i,o[t]]
            if (new_score > v[i,t+1]) then
                v[i,t+1] = new_score
                back_pointer[i,t+1] = s
/* To find best path, choose the highest probability state in the
   final column of v[] and backtrack */
Viterbi algorithm
• Each time v[i,t+1] gets a higher score from state s, back_pointer[i,t+1] is reset to s
• back_pointer[i,t+1] therefore only stores optimal paths
• See animation
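A minimal runnable Python sketch of the same idea is given below. It departs from the pseudocode in one respect: instead of a single start state with v[0,0] = 1.0, it assumes an explicit start distribution `start`. The two-state model and its numbers are likewise invented for the example.

```python
def viterbi(observations, states, start, trans, emit):
    """Return the most probable state path and its probability."""
    # v[t][s]: probability of the best path that ends in state s at time t
    v = [{s: start[s] * emit[s][observations[0]] for s in states}]
    back_pointer = [{}]
    for t in range(1, len(observations)):
        v.append({})
        back_pointer.append({})
        for j in states:
            # Best predecessor of j, judged by path score times transition
            best_prev = max(states, key=lambda i: v[t - 1][i] * trans[i][j])
            v[t][j] = (v[t - 1][best_prev] * trans[best_prev][j]
                       * emit[j][observations[t]])
            back_pointer[t][j] = best_prev
    # Choose the best state in the final column and backtrack
    last = max(states, key=lambda s: v[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back_pointer[t][path[-1]])
    path.reverse()
    return path, v[-1][last]


# Illustrative model (all values are assumptions).
STATES = ["rainy", "sunny"]
START = {"rainy": 0.6, "sunny": 0.4}
TRANS = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
EMIT = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

path, probability = viterbi(["walk", "shop", "clean"], STATES, START, TRANS, EMIT)
print(path, probability)
```

Each column of `v` is computed from the previous column alone, so the work grows linearly with the number of observations rather than exponentially.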
Viterbi algorithm with beam search
• In practice, the beam search technique is used to avoid having to consider all possible transitions at each time step
• Only transitions whose path probabilities are within some percentage (“beam width”) of the most probable path are considered
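One possible way to add a beam to Viterbi decoding is sketched below in Python. The parameter `beam_width` is a hypothetical name, interpreted here as a fraction of the best score at each time step; real recognizers may define the beam differently. The explicit start distribution and the two-state model are also assumptions.

```python
def viterbi_beam(observations, states, start, trans, emit, beam_width=0.5):
    """Viterbi decoding that extends only paths scoring within the beam."""

    def prune(column):
        # Keep states whose score is at least beam_width times the best score
        threshold = max(column.values()) * beam_width
        return {s: p for s, p in column.items() if p >= threshold}

    active = prune({s: start[s] * emit[s][observations[0]] for s in states})
    back_pointers = []
    for obs in observations[1:]:
        column, pointers = {}, {}
        for i, score in active.items():  # pruned states are never extended
            for j in states:
                new_score = score * trans[i][j] * emit[j][obs]
                if new_score > column.get(j, 0.0):
                    column[j] = new_score
                    pointers[j] = i
        back_pointers.append(pointers)
        active = prune(column)  # assumes some path has nonzero probability
    last = max(active, key=active.get)
    path = [last]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, active[last]


# Illustrative model (all values are assumptions).
STATES = ["rainy", "sunny"]
START = {"rainy": 0.6, "sunny": 0.4}
TRANS = {
    "rainy": {"rainy": 0.7, "sunny": 0.3},
    "sunny": {"rainy": 0.4, "sunny": 0.6},
}
EMIT = {
    "rainy": {"walk": 0.1, "shop": 0.4, "clean": 0.5},
    "sunny": {"walk": 0.6, "shop": 0.3, "clean": 0.1},
}

observations = ["walk", "shop", "clean"]
beam_path, beam_prob = viterbi_beam(observations, STATES, START, TRANS, EMIT, 0.5)
full_path, full_prob = viterbi_beam(observations, STATES, START, TRANS, EMIT, 0.0)
print(beam_path, beam_prob)
print(full_path, full_prob)
```

With `beam_width=0.0` nothing is pruned and the search is the full Viterbi search. A narrower beam does less work, but it can discard a path that would have turned out best later, so the result is no longer guaranteed to be optimal.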
Simplifying assumption of the Viterbi algorithm
Summary
• Markov models use state transitions to model sequences of events (or observations)
• Hidden Markov Models (HMMs) separate the observations from the states; the observations (outputs) are visible, but the state sequences that led to them are hidden
• In HMM-based speech recognizers
  – Observation o_t is the acoustic feature vector at frame t
  – State transitions model phones
  – Output probabilities model spectral variability in the phones
• Decoding is the process of estimating the state sequence that is most likely to have produced an observation sequence
• The Viterbi algorithm is a simple and efficient decoding technique