FPGA Architecture of HMM-based Decoder Module in Speech Recognizer
Trang Hoang, Viet Vo Quoc, Truong Nguyen Ly Thien
Ho Chi Minh City University of Technology, Vietnam
E-mail: [email protected]
Abstract – This paper presents the reconfigurable architecture and FPGA implementation of the HMM-based decoder module of a speech recognizer. The architecture is parameterized so that the different parameters of the speech recognition system can be easily reconfigured. The design and its FPGA implementation have been verified with 800 test utterances. The implementation results, together with recognition accuracies of up to 98%, are also presented.
Keywords – Speech recognition, Viterbi algorithm, Hidden Markov Model, number of mixtures, number of states.

I. INTRODUCTION

Real-time continuous speech recognition is a complicated task that requires a large amount of computation, especially in hardware designs. A typical speech recognition process includes two stages. The first stage takes the speech waveform, pre-processes it, and extracts feature vectors from it. The second stage is the decoding stage, which applies discrete or continuous HMMs to find the model that best matches the speech. Many studies have applied decoding or recognition algorithms to isolated-word recognition systems using hardware approaches such as VLSI or FPGA [1-2]. Most of these studies aim to improve recognition accuracy and optimize the hardware design [3-5]. The continuous Hidden Markov Model (HMM) is a good algorithm for improving recognition accuracy, but it demands a large number of operations and complex computations, and therefore needs more hardware resources. Consequently, the main goal of this study is to optimize the design of the decoding stage of an automatic Vietnamese speech recognition system. The design is implemented on FPGA and uses the continuous-HMM algorithm.

This paper is organized as follows. Section II introduces the decoder module of the speech recognizer; it has two main parts, the computation of observation probabilities and the Viterbi algorithm. The structure of our speech recognition system is described in Section III. Implementation results on FPGA, recognition accuracies of different system models, and a discussion are presented in Section IV. The last section gives the conclusion.

II. DECODER MODULE IN SPEECH RECOGNIZER

A. Computation of observation probabilities

Let O1, O2, ..., OT be a sequence of n-dimensional input feature vectors (observation vectors), where T is the number of feature vectors and n is the dimension of each feature vector. This study uses continuous HMMs, so the computation is based on uncorrelated multivariate Gaussian distributions, where \mu_{jk} and \sigma_{jk} (diagonal covariance matrix U_{jk}) are the mean and variance of the kth mixture at state j. For an input frame o_t, the output probability of an N-state left-to-right continuous HMM at time t and state j is given by

b_j(o_t) = \sum_{k=1}^{M} c_{jk} \, N(o_t, \mu_{jk}, U_{jk}), \quad 1 \le j \le N   (1)

where M is the number of mixtures and c_{jk} are the mixture weights. N(o, \mu, U) has the form

N(o, \mu, U) = \frac{1}{\sqrt{(2\pi)^n |U|}} \, e^{-\frac{1}{2}(o-\mu)^T U^{-1} (o-\mu)}   (2)
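As an illustration (not taken from the paper), equations (1)-(2) with diagonal covariance matrices can be sketched in Python; the dimensions, weights, means, and variances below are made-up toy values:

```python
import math

def gaussian_pdf(o, mu, var):
    """N(o, mu, U) of eq. (2) for a diagonal covariance matrix U
    whose diagonal entries are given in `var`."""
    n = len(o)
    det_u = math.prod(var)                      # |U| for a diagonal U
    norm = 1.0 / math.sqrt((2.0 * math.pi) ** n * det_u)
    # (o - mu)^T U^-1 (o - mu) reduces to a weighted sum of squares
    quad = sum((x - m) ** 2 / v for x, m, v in zip(o, mu, var))
    return norm * math.exp(-0.5 * quad)

def output_probability(o, weights, means, variances):
    """b_j(o_t) of eq. (1): a weighted sum over the M mixtures of state j."""
    return sum(c * gaussian_pdf(o, mu, var)
               for c, mu, var in zip(weights, means, variances))

# Toy 2-dimensional frame with M = 2 mixtures (illustrative values only)
o_t = [0.5, -0.2]
weights = [0.6, 0.4]
means = [[0.0, 0.0], [1.0, -1.0]]
variances = [[1.0, 1.0], [0.5, 0.5]]
b = output_probability(o_t, weights, means, variances)
```

Because the covariances are assumed uncorrelated (diagonal), the quadratic form collapses to a per-dimension sum, which is exactly what makes the hardware realization with adders and multipliers practical.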
However, the result of the multiplication easily overflows, a multiplier needs more hardware resources than an adder, and the hardware design of the exponential function is very complicated, so equation (1) is implemented in the logarithmic domain [6]:

\tilde{b}_j(o_t) = \ln \sum_{k=1}^{M} \frac{c_{jk}}{\sqrt{(2\pi)^{26} |U_{jk}|}} \, e^{-\frac{1}{2}(o_t-\mu_{jk})^T U_{jk}^{-1} (o_t-\mu_{jk})}
= \ln \sum_{k=1}^{M} C_k e^{X_k}
= \ln C_{max} + X_{max} + \ln\Big(1 + \sum_{k=1, k \ne max}^{M} \frac{C_k e^{X_k}}{C_{max} e^{X_{max}}}\Big)   (3)

\approx \ln C_{max} + X_{max} + \ln\big(1 + e^{\ln C_{min} + X_{min} - \ln C_{max} - X_{max} + \ln\frac{M}{2}}\big) = \ln C_{max} + X_{max} + B, \quad B = \ln(1 + e^{A})   (4)

with A = \ln C_{min} + X_{min} - \ln C_{max} - X_{max} + \ln\frac{M}{2}.

Here \ln C_{min} + X_{min} and \ln C_{max} + X_{max} are the two largest values of the terms \ln C_k + X_k, M is the number of mixtures, and k is the mixture index, 1 \le k \le M. In order to simplify the hardware design of the computation of B, a look-up table (LUT) is implemented; A is the index into the LUT and can be computed easily. With A \le \ln(M/2), the value of B in equation (4) ranges from 0 to \ln(1 + M/2). In this study, the maximum number of mixtures M is 6, with an accuracy of two places after the decimal point. The design uses 110 registers to store the contents of the look-up table, which starts at 0.01 and ends at 1.10 (\approx \ln 3) with a resolution of 0.01. In order to store the values of A and B easily, all values are multiplied by 2^15.

B. Viterbi algorithm

The goal of the decoder is to find the maximum probability of an HMM given the input feature vectors (observation vectors). In order to optimize the hardware design, we use an alternative Viterbi algorithm that performs the calculations of the Viterbi algorithm in the log domain, converting the multiplications into additions, because an adder needs fewer hardware resources than a multiplier. In addition, in practice \pi, a_{ij}, and b_j(O_t) are decimal fractions between 0 and 1, which are not convenient to implement on an FPGA, because repeated fractional multiplication causes underflow when T is larger than a threshold. Therefore, this paper uses an improved Viterbi algorithm, based on the original one, to deal with this problem. This algorithm transforms \pi, A, and B into the logarithmic domain as \tilde{\pi}_i, \tilde{a}_{ij}, \tilde{b}_i(O_t).
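The two-largest-terms approximation of equations (3)-(4) can be sketched numerically (my own sketch, with invented values): the exact ln Σ C_k e^{X_k} is compared against ln C_max + X_max + B, where B = ln(1 + e^A) would come from the LUT in hardware:

```python
import math

def log_sum_exact(log_terms):
    """Exact ln(sum_k e^{ln C_k + X_k}), computed stably (logsumexp)."""
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))

def log_sum_approx(log_terms):
    """Approximation of eq. (4): keep only the two largest terms and fold
    the remaining mixtures into the ln(M/2) correction."""
    M = len(log_terms)
    first, second = sorted(log_terms, reverse=True)[:2]
    A = second - first + math.log(M / 2.0)   # A <= ln(M/2)
    B = math.log(1.0 + math.exp(A))          # in hardware, B is read from the LUT
    return first + B

# M = 6 mixtures: log_terms[k] plays the role of ln C_k + X_k (toy values)
log_terms = [-3.1, -2.0, -7.5, -2.4, -9.0, -5.2]
exact = log_sum_exact(log_terms)
approx = log_sum_approx(log_terms)
```

With these toy values the approximation stays within a few tenths of a nat of the exact log-sum, which is the accuracy/resource trade-off the LUT scheme exploits.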
The value \tilde{\delta}_t(j) is defined as the maximum logarithmic probability over all partial state sequences ending in state j at time t, where i is a possible previous state. The details of this algorithm are defined as follows [6]:

1. Initialization:

\tilde{\pi}_i = \ln \pi_i, \quad 1 \le i \le N
\tilde{a}_{ij} = \ln a_{ij}, \quad 1 \le i \le N, \ 1 \le j \le N
\tilde{b}_i(o_t) = \ln b_i(o_t), \quad 1 \le i \le N, \ 1 \le t \le T
\tilde{\delta}_1(i) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N   (5)

2. Recursion:

\tilde{\delta}_t(j) = \ln \delta_t(j) = \max_{1 \le i \le N} \big[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\big] + \tilde{b}_j(o_t), \quad 2 \le t \le T, \ 1 \le j \le N   (6)

3. Termination:

\tilde{P}_{final} = \max_{1 \le i \le N} \tilde{\delta}_T(i)   (7)

where a_{ij} is the state-transition probability from state i to state j, and b_j(O_t) is the output probability calculated from the probability density function at time t and state j. In the hardware design, each state can be reached only from itself or from the previous state, instead of from all states of the model, and the state sequence must begin in state 1 and end in state N. The Viterbi algorithm computes the probabilities of two paths, one calculated from the value of the previous state and the other from the state's own value at the previous time step, and keeps the higher of the two. The search continues until it reaches the last point at time t and state N, then repeats for the observation vector at time t+1 from state 1 to state N, and so on until the last point at time T and state N. At the end of the observation sequence, the whole calculation terminates.

III. DESIGN OF ARCHITECTURE

In the design, the model parameters of all words in the library of the recognizer, comprising the means \mu_{jk}, the variances \sigma_{jk} (U_{jk}), \ln C + \ln a_{ii}, and \ln a_{ij} - \ln a_{ii}, are calculated in advance and stored in external memory, where C = c_{jk} / \sqrt{(2\pi)^{26} |U_{jk}|} and the a_{ii} are the state-transition coefficients. After the MFCC feature-extraction process finishes, all feature vectors are calculated and stored in internal memory. Fig. 1 illustrates the block diagram of the Decoder, in which the Adder, Multiplier, and Core Adder blocks calculate X_k for every frame and every kth mixture at each state.

Figure 1. The block diagram of Decoder (Feature vector and Parameter model inputs; Adder, Multiplier, Core Adder, X Register, Subtractor, Shifter, Look-up table, Score register, Final score register, and Result blocks; a Controller with Start, Frame_num, Result_ack, and control signals, plus Wr_en to the MFCC RAM and to the Flash RAM)

The adder is designed as a carry-look-ahead adder, as in Fig. 2, in which the adder is divided into four 4-bit sub-adders to compute the sum of two 16-bit numbers. A 4-bit sub-adder is described as follows:

g_i = x_i y_i
p_i = x_i \oplus y_i
c_i = g_{i-1} + \sum_{k=0}^{i-2} g_k \prod_{j=k+1}^{i-1} p_j + c_0 \prod_{j=0}^{i-1} p_j

In the carry-look-ahead block, the signals are:

G[i..i+3] = g_{i+3} + g_{i+2} p_{i+3} + g_{i+1} p_{i+2} p_{i+3} + g_i p_{i+1} p_{i+2} p_{i+3}
P[i..i+3] = p_i p_{i+1} p_{i+2} p_{i+3}
c_{i+4} = G[i..i+3] + c_i P[i..i+3]
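The log-domain recursion of equations (5)-(7), restricted to the two-predecessor left-to-right topology described above, can be sketched as a behavioral model (my own Python sketch with invented toy values, not the RTL):

```python
import math

def viterbi_left_to_right(log_pi, log_a, log_b):
    """Log-domain Viterbi (eqs. (5)-(7)) where state j is reachable only
    from itself (a[j][j]) or from the previous state (a[j-1][j]).
    log_b[t][j] is the log output probability of state j for frame t."""
    T, N = len(log_b), len(log_b[0])
    # Initialization (5): delta_1(i) = pi_i + b_i(o_1)
    delta = [log_pi[j] + log_b[0][j] for j in range(N)]
    # Recursion (6): best of "stay in j" and "advance from j-1"
    for t in range(1, T):
        new = [0.0] * N
        for j in range(N):
            best = delta[j] + log_a[j][j]
            if j > 0:
                best = max(best, delta[j - 1] + log_a[j - 1][j])
            new[j] = best + log_b[t][j]
        delta = new
    # Termination (7); a strict left-to-right model must end in state N,
    # so delta[N-1] is the final score
    return delta[N - 1]

LOG0 = -1e9                                  # stand-in for ln(0)
log_pi = [0.0, LOG0, LOG0]                   # must start in state 1
log_a = [[math.log(0.6), math.log(0.4), LOG0],
         [LOG0, math.log(0.7), math.log(0.3)],
         [LOG0, LOG0, 0.0]]                  # last state self-loops
log_b = [[-1.0, -2.0, -3.0],
         [-2.0, -1.0, -3.0],
         [-3.0, -2.0, -1.0]]
score = viterbi_left_to_right(log_pi, log_a, log_b)
```

With T = N = 3 the only path that ends in the last state is 1 -> 2 -> 3, so the score is simply the sum of that path's log terms, which makes the model easy to check by hand.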
Figure 2. 16-Bit Carry-Look-Ahead Adder (four 4-bit groups, each with a CLA GEN block producing the group signals G[3..1]/P[3..1] through G[15..13]/P[15..13] and the carries C0, C4, C8, C12)
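The generate/propagate equations above can be checked with a small behavioral model (a Python functional sketch of my own, not the RTL):

```python
def cla16_add(x, y, c0=0):
    """16-bit carry-look-ahead addition built from four 4-bit groups,
    following g_i = x_i y_i, p_i = x_i XOR y_i and the group signals
    G[i..i+3], P[i..i+3]."""
    xb = [(x >> i) & 1 for i in range(16)]
    yb = [(y >> i) & 1 for i in range(16)]
    g = [xb[i] & yb[i] for i in range(16)]   # generate: g_i = x_i y_i
    p = [xb[i] ^ yb[i] for i in range(16)]   # propagate: p_i = x_i xor y_i
    c = [0] * 17
    c[0] = c0
    for base in (0, 4, 8, 12):               # the four 4-bit sub-adders
        # group look-ahead signals G[base..base+3], P[base..base+3]
        G = (g[base + 3]
             | (g[base + 2] & p[base + 3])
             | (g[base + 1] & p[base + 2] & p[base + 3])
             | (g[base] & p[base + 1] & p[base + 2] & p[base + 3]))
        P = p[base] & p[base + 1] & p[base + 2] & p[base + 3]
        # carries inside the group
        for i in range(base, base + 4):
            c[i + 1] = g[i] | (p[i] & c[i])
        # group carry-out from the look-ahead equation
        c[base + 4] = G | (P & c[base])
    s = sum((p[i] ^ c[i]) << i for i in range(16))
    return s, c[16]
```

Each group's carry-out is produced directly from G and P rather than waiting for the ripple through the four bit positions, which is the speed advantage of the carry-look-ahead structure.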
The Adder and Core Adder blocks in Fig. 1 are used to calculate the following component of (3):

X_k = -\frac{1}{2}(o_t - \mu_{jk})^T U_{jk}^{-1} (o_t - \mu_{jk}) = -\sum_{n=0}^{25} (x_n - \bar{x}_n)^2 \frac{1}{2u_n}, \quad 0 \le n \le 25

where O_t = (x_0, x_1, ..., x_{25})^T, \mu_{jk} = (\bar{x}_0, \bar{x}_1, ..., \bar{x}_{25})^T, and U_{jk} = diag(u_0, u_1, ..., u_{25}).

The Adder block is used to compute (O_t - \mu_{jk}), i.e. the differences (x_n - \bar{x}_n). The Multiplier block is designed as a modified Booth multiplier. At first it is used to calculate (x_n - \bar{x}_n)^2, where (x_n - \bar{x}_n) is the output of the Adder block; afterwards, the Multiplier block multiplies (x_n - \bar{x}_n)^2 and \frac{1}{2u_n} together.

The Core_Adder block is used to accumulate (x_n - \bar{x}_n)^2 \frac{1}{2u_n}. After X_k is computed, this block adds p_1 = \ln a_{jj} + \ln C_{max} or p_2 = \ln a_{(j-1)j} - \ln a_{jj} to X_k. In addition, it is used to add (\ln C + X)_{max} = \ln C_{max} + X_{max} to B = \ln(1 + e^{A}). Moreover, it is used to add the previous partial probability \tilde{\delta}_{t-1}(i) to \tilde{b}_j(o_t) = (\ln C + X)_{max} + B. In general, the state machine of the Core_Adder block includes four states, as in Fig. 3, and each state computes one equation as follows:

State 00: \tilde{\delta}_{t-1}(i) + \tilde{b}_j(o_t), with 1 \le i \le N
State 01: X_k = -\sum_{n=0}^{25} (x_n - \bar{x}_n)^2 \frac{1}{2u_n}, with 0 \le n \le 25
State 10: X_k + p_1 or X_k + p_2
State 11: \tilde{b}_j(o_t) = (\ln C + X)_{max} + B

Figure 3. State Machine of Core_Adder block (states Initial, 00, 01, 10, 11; Reset = 1 returns to Initial)

The X Register block stores the two largest values of the terms \ln C_k + X_k (1 \le k \le M) and then passes them to the Subtractor. The Subtractor block is used to calculate (\ln C_{min} + X_{min}) - (\ln C_{max} + X_{max}). The output of the Subtractor block is shifted to the right scale, and \ln(M/2) is added to the shifted value to compute the index of the LUT in the Shifter block. A MUX is used to choose between the sum of the shifted output and \ln(M/2), and the shifted output of the LUT. These operations of the Shifter block are described in Fig. 4.

Figure 4. Block diagram of the shifter (inputs: output of the LUT, output of the Subtractor, mixture number, \ln(M/2); enl, enr, and shift-number signals drive a shifter whose result feeds a MUX producing the output)

Afterwards, the output of the LUT block is shifted left to the correct scale to get the exact result of B. Table I describes the relationship between the index A and the result B. Two functions are used to calculate A and B:

A = \ln(e^{B - 0.005} - 1) \cdot 2^{15}   (8)
B = \ln(1 + e^{A})   (9)
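Equations (8)-(9) can be used to regenerate the LUT contents offline. The sketch below (my own, not from the paper) builds the 110 fixed-point entries for B = 0.01 ... 1.10 scaled by 2^15, and the lower index boundary A for each register:

```python
import math

SCALE = 1 << 15  # all values of A and B are stored multiplied by 2^15

def A_of_B(b):
    """Eq. (8): the fixed-point LUT index boundary A for a stored value B."""
    return round(math.log(math.exp(b - 0.005) - 1.0) * SCALE)

def B_of_A(a):
    """Eq. (9): B = ln(1 + e^A), with A given in the 2^15 fixed-point scale."""
    return round(math.log(1.0 + math.exp(a / SCALE)) * SCALE)

# 110 registers Reg0..Reg109 hold B = 0.01, 0.02, ..., 1.10 (resolution 0.01)
lut_B = [round(0.01 * (i + 1) * SCALE) for i in range(110)]
# Lower boundary of A for each register, from eq. (8)
lut_A = [A_of_B(0.01 * (i + 1)) for i in range(110)]
```

The -0.005 offset in eq. (8) places each boundary half a resolution step below the stored B, so an incoming A is mapped to the nearest 0.01-spaced value of B; the first boundary reproduces the -173533 value listed in Table I.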
TABLE I. CONCEPT OF LUT

Reg_num  | Range of A
Reg0     | -173533 ≤ A
Reg1     | …
…        | …
Reg108   | …
Reg109   | …