FPGA Architecture of HMM-based Decoder Module in Speech Recognizer
Trang Hoang, Viet Vo Quoc, Truong Nguyen Ly Thien
Ho Chi Minh City University of Technology, Vietnam
E-mail: [email protected]
Abstract – This paper presents the reconfigurable architecture and FPGA implementation of the HMM-based decoder module of a speech recognizer. The architecture is parameterized so that the different parameters of the speech recognition system can be easily reconfigured. The design and its FPGA implementation have been verified with 800 test utterances. The implementation results, together with recognition accuracies of up to 98%, are also presented.
Keywords – Speech recognition, Viterbi algorithm, Hidden Markov Model, number of mixtures, number of states.

I. INTRODUCTION

Real-time continuous speech recognition is a complicated task that requires a large amount of computation, especially in hardware designs. A typical speech recognition process includes two stages. The first stage takes the speech waveform, pre-processes it, and extracts feature vectors from it. The second stage is the decoding stage, which applies discrete or continuous HMMs to find the model that best matches the speech. Many studies have applied decoding or recognition algorithms to isolated-word recognition systems using hardware approaches such as VLSI or FPGA [1-2]. Most of these studies aim to improve recognition accuracy and optimize the hardware design [3-5]. The continuous Hidden Markov Model (HMM) is a good algorithm for improving recognition accuracy, but it demands a large number of operations and complex computations, and therefore needs more hardware resources. Consequently, the main goal of this study is to optimize the design of the decoding stage of an automatic Vietnamese speech recognition system. The design is implemented on FPGA and uses the continuous-HMM algorithm.

This paper is organized as follows. Section II introduces the decoder module of the speech recognizer; it has two main parts, the computation of observation probabilities and the Viterbi algorithm. The structure of our speech recognition system is described in Section III. Implementation results on FPGA, recognition accuracies of different system models, and a discussion are presented in Section IV. The last section gives the conclusion.

II. DECODER MODULE IN SPEECH RECOGNIZER

A. Computation of observation probabilities

Let O1, O2, ..., OT be a sequence of n-dimensional input feature vectors (observation vectors), where T is the number of feature vectors and n is the dimension of each feature vector. This study uses continuous HMMs, so the computation is based on uncorrelated multivariate Gaussian distributions, where \mu_{jk} and \sigma_{jk} (diagonal covariance matrix U_{jk}) are the mean and variance of the kth mixture at state j. For an input frame o_t, the output probability of an N-state left-to-right continuous HMM at time t and state j is given by

b_j(o_t) = \sum_{k=1}^{M} c_{jk} \, N(o_t, \mu_{jk}, U_{jk}), \quad 1 \le j \le N   (1)

where M is the number of mixtures and c_{jk} are the mixture weights. N(o, \mu, U) has the form

N(o, \mu, U) = \frac{1}{\sqrt{(2\pi)^n |U|}} \, e^{-\frac{1}{2}(o-\mu)^T U^{-1} (o-\mu)}   (2)
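As an illustration (not taken from the paper), equations (1)-(2) with diagonal covariance matrices can be sketched in Python; the dimensions, weights, means, and variances below are made-up toy values:

```python
import math

def gaussian_pdf(o, mu, var):
    """N(o, mu, U) of eq. (2) for a diagonal covariance matrix U
    whose diagonal entries are given in `var`."""
    n = len(o)
    det_u = math.prod(var)                      # |U| for a diagonal U
    norm = 1.0 / math.sqrt((2.0 * math.pi) ** n * det_u)
    # (o - mu)^T U^-1 (o - mu) reduces to a weighted sum of squares
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
    return norm * math.exp(-0.5 * quad)

def output_probability(o, weights, means, variances):
    """b_j(o_t) of eq. (1): a weighted sum over the M mixtures of state j."""
    return sum(c * gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

# Toy 2-dimensional frame with M = 2 mixtures (illustrative values only)
o_t = [0.5, -0.2]
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
b = output_probability(o_t, weights, means, variances)
```

Because the covariances are assumed uncorrelated (diagonal), the quadratic form collapses to a per-dimension sum, which is exactly what makes the hardware realization with adders and multipliers practical.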
However, the result of the multiplication easily overflows, a multiplier needs more hardware resources than an adder, and the hardware design of the exponential function is very complicated, so equation (1) is implemented in the logarithmic domain [6]:

\tilde{b}_j(o_t) = \ln \sum_{k=1}^{M} \frac{c_{jk}}{\sqrt{(2\pi)^{26} |U_{jk}|}} \, e^{-\frac{1}{2}(o_t-\mu_{jk})^T U_{jk}^{-1} (o_t-\mu_{jk})}
= \ln \sum_{k=1}^{M} C_k e^{X_k}
= \ln C_{max} + X_{max} + \ln\Big(1 + \sum_{k=1, k \ne max}^{M} \frac{C_k e^{X_k}}{C_{max} e^{X_{max}}}\Big)   (3)

\approx \ln C_{max} + X_{max} + \ln\big(1 + e^{\ln C_{min} + X_{min} - \ln C_{max} - X_{max} + \ln\frac{M}{2}}\big) = \ln C_{max} + X_{max} + B, \quad B = \ln(1 + e^{A})   (4)

with A = \ln C_{min} + X_{min} - \ln C_{max} - X_{max} + \ln\frac{M}{2}.

Here \ln C_{min} + X_{min} and \ln C_{max} + X_{max} are the two largest values of the terms \ln C_k + X_k, M is the number of mixtures, and k is the mixture index, 1 \le k \le M. In order to simplify the hardware design of the computation of B, a look-up table (LUT) is implemented; A is the index into the LUT and can be computed easily. With A \le \ln(M/2), the value of B in equation (4) ranges from 0 to \ln(1 + M/2). In this study, the maximum number of mixtures M is 6, with an accuracy of two places after the decimal point. The design uses 110 registers to store the contents of the look-up table, which starts at 0.01 and ends at 1.10 (\approx \ln 3) with a resolution of 0.01. In order to store the values of A and B easily, all values are multiplied by 2^15.

B. Viterbi algorithm

The goal of the decoder is to find the maximum probability of an HMM given the input feature vectors (observation vectors). In order to optimize the hardware design, we use an alternative Viterbi algorithm that performs the calculations of the Viterbi algorithm in the log domain, converting the multiplications into additions, because an adder needs fewer hardware resources than a multiplier. In addition, in practice \pi, a_{ij}, and b_j(O_t) are decimal fractions between 0 and 1, which are not convenient to implement on an FPGA, because repeated fractional multiplication causes underflow when T is larger than a threshold. Therefore, this paper uses an improved Viterbi algorithm, based on the original one, to deal with this problem. This algorithm transforms \pi, A, and B into the logarithmic domain as \tilde{\pi}_i, \tilde{a}_{ij}, \tilde{b}_i(O_t).
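The two-largest-terms approximation of equations (3)-(4) can be sketched numerically (my own sketch, with invented values): the exact ln Σ C_k e^{X_k} is compared against ln C_max + X_max + B, where B = ln(1 + e^A) would come from the LUT in hardware:

```python
import math

def log_sum_exact(log_terms):
    """Exact ln(sum_k e^{ln C_k + X_k}), computed stably (logsumexp)."""
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def log_sum_approx(log_terms):
    """Approximation of eq. (4): keep only the two largest terms and fold
    the remaining mixtures into the ln(M/2) correction."""
    M = len(log_terms)
    first, second = sorted(log_terms, reverse=True)[:2]
    A = second - first + math.log(M / 2.0)   # A <= ln(M/2)
    B = math.log(1.0 + math.exp(A))          # in hardware, B is read from the LUT
    return first + B

# M = 6 mixtures: log_terms[k] plays the role of ln C_k + X_k (toy values)
log_terms = [-3.1, -2.0, -7.5, -2.4, -9.0, -5.2]
exact = log_sum_exact(log_terms)
approx = log_sum_approx(log_terms)
```

With these toy values the approximation stays within a few tenths of a nat of the exact log-sum, which is the accuracy/resource trade-off the LUT scheme exploits.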
The value \tilde{\delta}_t(j) is defined as the maximum logarithmic probability over all partial state sequences ending in state j at time t, where i is a possible previous state. The details of this algorithm are defined as follows [6]:

1. Initialization:

\tilde{\pi}_i = \ln \pi_i, \quad 1 \le i \le N
\tilde{a}_{ij} = \ln a_{ij}, \quad 1 \le i \le N, \ 1 \le j \le N
\tilde{b}_i(o_t) = \ln b_i(o_t), \quad 1 \le i \le N, \ 1 \le t \le T
\tilde{\delta}_1(i) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N   (5)

2. Recursion:

\tilde{\delta}_t(j) = \ln \delta_t(j) = \max_{1 \le i \le N} \big[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\big] + \tilde{b}_j(o_t), \quad 2 \le t \le T, \ 1 \le j \le N   (6)

3. Termination:

\tilde{P}_{final} = \max_{1 \le i \le N} \tilde{\delta}_T(i)   (7)

where a_{ij} is the state-transition probability from state i to state j, and b_j(O_t) is the output probability calculated from the probability density function at time t and state j. In the hardware design, each state can be reached only from itself or from the previous state, instead of from all states of the model, and the state sequence must begin in state 1 and end in state N. The Viterbi algorithm computes the probabilities of two paths, one calculated from the value of the previous state and the other from the state's own value at the previous time step, and keeps the higher of the two. The search continues until it reaches the last point at time t and state N, then repeats for the observation vector at time t+1 from state 1 to state N, and so on until the last point at time T and state N. At the end of the observation sequence, the whole calculation terminates.

III. DESIGN OF ARCHITECTURE

In the design, the model parameters of all words in the library of the recognizer, comprising the means \mu_{jk}, the variances \sigma_{jk} (U_{jk}), \ln C + \ln a_{ii}, and \ln a_{ij} - \ln a_{ii}, are calculated in advance and stored in external memory, where C = c_{jk} / \sqrt{(2\pi)^{26} |U_{jk}|} and the a_{ii} are the state-transition coefficients. After the MFCC feature-extraction process finishes, all feature vectors are calculated and stored in internal memory. Fig. 1 illustrates the block diagram of the Decoder, in which the Adder, Multiplier, and Core Adder blocks calculate X_k for every frame and every kth mixture at each state.

Figure 1. The block diagram of Decoder (Feature vector and Parameter model inputs; Adder, Multiplier, Core Adder, X Register, Subtractor, Shifter, Look-up table, Score register, Final score register, and Result blocks; a Controller with Start, Frame_num, Result_ack, and control signals, plus Wr_en to the MFCC RAM and to the Flash RAM)

The adder is designed as a carry-look-ahead adder, as in Fig. 2, in which the adder is divided into four 4-bit sub-adders to compute the sum of two 16-bit numbers. A 4-bit sub-adder is described as follows:

g_i = x_i y_i
p_i = x_i \oplus y_i
c_i = g_{i-1} + \sum_{k=0}^{i-2} g_k \prod_{j=k+1}^{i-1} p_j + c_0 \prod_{j=0}^{i-1} p_j

In the carry-look-ahead block, the signals are:

G[i..i+3] = g_{i+3} + g_{i+2} p_{i+3} + g_{i+1} p_{i+2} p_{i+3} + g_i p_{i+1} p_{i+2} p_{i+3}
P[i..i+3] = p_i p_{i+1} p_{i+2} p_{i+3}
c_{i+4} = G[i..i+3] + c_i P[i..i+3]
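The log-domain recursion of equations (5)-(7), restricted to the two-predecessor left-to-right topology described above, can be sketched as a behavioral model (my own Python sketch with invented toy values, not the RTL):

```python
import math

def viterbi_left_to_right(log_pi, log_a, log_b):
    """Log-domain Viterbi (eqs. (5)-(7)) where state j is reachable only
    from itself (a[j][j]) or from the previous state (a[j-1][j]).
    log_b[t][j] is the log output probability of state j for frame t."""
    T, N = len(log_b), len(log_b[0])
    # Initialization (5): delta_1(i) = pi_i + b_i(o_1)
    delta = [log_pi[j] + log_b[0][j] for j in range(N)]
    # Recursion (6): best of "stay in j" and "advance from j-1"
    for t in range(1, T):
        new = [0.0] * N
        for j in range(N):
            best = delta[j] + log_a[j][j]
            if j > 0:
                best = max(best, delta[j - 1] + log_a[j - 1][j])
            new[j] = best + log_b[t][j]
        delta = new
    # Termination (7); a strict left-to-right model must end in state N,
    # so delta[N-1] is the final score
    return delta[N - 1]

LOG0 = -1e9                                  # stand-in for ln(0)
log_pi = [0.0, LOG0, LOG0]                   # must start in state 1
log_a = [[math.log(0.6), math.log(0.4), LOG0],
         [LOG0, math.log(0.7), math.log(0.3)],
         [LOG0, LOG0, 0.0]]                  # last state self-loops
log_b = [[-1.0, -2.0, -3.0],
         [-2.0, -1.0, -3.0],
         [-3.0, -2.0, -1.0]]
score = viterbi_left_to_right(log_pi, log_a, log_b)
```

With T = N = 3 the only path that ends in the last state is 1 -> 2 -> 3, so the score is simply the sum of that path's log terms, which makes the model easy to check by hand.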
Figure 2. 16-Bit Carry-Look-Ahead Adder (four 4-bit groups, each with a CLA GEN block producing the group signals G[3..1]/P[3..1] through G[15..13]/P[15..13] and the carries C0, C4, C8, C12)
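The generate/propagate equations above can be checked with a small behavioral model (a Python functional sketch of my own, not the RTL):

```python
def cla16_add(x, y, c0=0):
    """16-bit carry-look-ahead addition built from four 4-bit groups,
    following g_i = x_i y_i, p_i = x_i XOR y_i and the group signals
    G[i..i+3], P[i..i+3]."""
    xb = [(x >> i) & 1 for i in range(16)]
    yb = [(y >> i) & 1 for i in range(16)]
    g = [xb[i] & yb[i] for i in range(16)]   # generate: g_i = x_i y_i
    p = [xb[i] ^ yb[i] for i in range(16)]   # propagate: p_i = x_i xor y_i
    c = [0] * 17
    c[0] = c0
    for base in (0, 4, 8, 12):               # the four 4-bit sub-adders
        # group look-ahead signals G[base..base+3], P[base..base+3]
        G = (g[base + 3]
             | (g[base + 2] & p[base + 3])
             | (g[base + 1] & p[base + 2] & p[base + 3])
             | (g[base] & p[base + 1] & p[base + 2] & p[base + 3]))
        P = p[base] & p[base + 1] & p[base + 2] & p[base + 3]
        # carries inside the group
        for i in range(base, base + 4):
            c[i + 1] = g[i] | (p[i] & c[i])
        # group carry-out from the look-ahead equation
        c[base + 4] = G | (P & c[base])
    s = sum((p[i] ^ c[i]) << i for i in range(16))
    return s, c[16]
```

Each group's carry-out is produced directly from G and P rather than waiting for the ripple through the four bit positions, which is the speed advantage of the carry-look-ahead structure.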
The Adder and Core Adder blocks in Fig. 1 are used to calculate the following component of (3):

X_k = -\frac{1}{2}(o_t - \mu_{jk})^T U_{jk}^{-1} (o_t - \mu_{jk}) = -\sum_{n=0}^{25} (x_n - \bar{x}_n)^2 \frac{1}{2u_n}, \quad 0 \le n \le 25

where O_t = (x_0, x_1, ..., x_{25})^T, \mu_{jk} = (\bar{x}_0, \bar{x}_1, ..., \bar{x}_{25})^T, and U_{jk} = diag(u_0, u_1, ..., u_{25}).

The Adder block is used to compute (O_t - \mu_{jk}), i.e. the differences (x_n - \bar{x}_n). The Multiplier block is designed as a modified Booth multiplier. At first it is used to calculate (x_n - \bar{x}_n)^2, where (x_n - \bar{x}_n) is the output of the Adder block; afterwards, the Multiplier block multiplies (x_n - \bar{x}_n)^2 and \frac{1}{2u_n} together.

The Core_Adder block is used to accumulate (x_n - \bar{x}_n)^2 \frac{1}{2u_n}. After X_k is computed, this block adds p_1 = \ln a_{jj} + \ln C_{max} or p_2 = \ln a_{(j-1)j} - \ln a_{jj} to X_k. In addition, it is used to add (\ln C + X)_{max} = \ln C_{max} + X_{max} to B = \ln(1 + e^{A}). Moreover, it is used to add the previous partial probability \tilde{\delta}_{t-1}(i) to \tilde{b}_j(o_t) = (\ln C + X)_{max} + B. In general, the state machine of the Core_Adder block includes four states, as in Fig. 3, and each state computes one equation as follows:

State 00: \tilde{\delta}_{t-1}(i) + \tilde{b}_j(o_t), with 1 \le i \le N
State 01: X_k = -\sum_{n=0}^{25} (x_n - \bar{x}_n)^2 \frac{1}{2u_n}, with 0 \le n \le 25
State 10: X_k + p_1 or X_k + p_2
State 11: \tilde{b}_j(o_t) = (\ln C + X)_{max} + B

Figure 3. State Machine of Core_Adder block (states Initial, 00, 01, 10, 11; Reset = 1 returns to Initial)

The X Register block stores the two largest values of the terms \ln C_k + X_k (1 \le k \le M) and then passes them to the Subtractor. The Subtractor block is used to calculate (\ln C_{min} + X_{min}) - (\ln C_{max} + X_{max}). The output of the Subtractor block is shifted to the right scale, and \ln(M/2) is added to the shifted value to compute the index of the LUT in the Shifter block. A MUX is used to choose between the sum of the shifted output and \ln(M/2), and the shifted output of the LUT. These operations of the Shifter block are described in Fig. 4.

Figure 4. Block diagram of the shifter (inputs: output of the LUT, output of the Subtractor, mixture number, \ln(M/2); enl, enr, and shift-number signals drive a shifter whose result feeds a MUX producing the output)

Afterwards, the output of the LUT block is shifted left to the correct scale to get the exact result of B. Table I describes the relationship between the index A and the result B. Two functions are used to calculate A and B:

A = \ln(e^{B - 0.005} - 1) \cdot 2^{15}   (8)
B = \ln(1 + e^{A})   (9)
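Equations (8)-(9) can be used to regenerate the LUT contents offline. The sketch below (my own, not from the paper) builds the 110 fixed-point entries for B = 0.01 ... 1.10 scaled by 2^15, and the lower index boundary A for each register:

```python
import math

SCALE = 1 << 15  # all values of A and B are stored multiplied by 2^15

def A_of_B(b):
    """Eq. (8): the fixed-point LUT index boundary A for a stored value B."""
    return round(math.log(math.exp(b - 0.005) - 1.0) * SCALE)

def B_of_A(a):
    """Eq. (9): B = ln(1 + e^A), with A given in the 2^15 fixed-point scale."""
    return round(math.log(1.0 + math.exp(a / SCALE)) * SCALE)

# 110 registers Reg0..Reg109 hold B = 0.01, 0.02, ..., 1.10 (resolution 0.01)
lut_B = [round(0.01 * (i + 1) * SCALE) for i in range(110)]
# Lower boundary of A for each register, from eq. (8)
lut_A = [A_of_B(0.01 * (i + 1)) for i in range(110)]
```

The -0.005 offset in eq. (8) places each boundary half a resolution step below the stored B, so an incoming A is mapped to the nearest 0.01-spaced value of B; the first boundary reproduces the -173533 value listed in Table I.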
TABLE I. CONCEPT OF LUT

Reg_num  | Range of A
Reg0     | -173533 ≤ A
Reg1     | …
…        | …
Reg108   | …
Reg109   | …