Development of Polish Large Vocabulary Continuous Speech Recognition acoustic models

Grażyna Demenko1,2, Stefan Grocholewski1,3, Katarzyna Klessa2, Marek Lange1, Bartosz Rapp1, Marcin Szymański1

1 Poznań Supercomputing and Networking Center
2 Adam Mickiewicz University, Institute of Linguistics
3 Poznań University of Technology, Institute of Computing Science

i3 Conference, Nov. 5, 2009

Szymański et al. (PSNC)

Polish LVCSR acoustic models


LVCSR

LVCSR (Large Vocabulary Continuous Speech Recognition) systems:
- typically able to use vocabularies of 20k–100k words
- mostly developed for non-inflectional languages
- acoustic models trained on large corpora (over 100h of speech)
- speakers selected to represent typical distribution of age, sex and dialect


Hidden Markov Model

Context-dependent phoneme realizations are represented by HMMs.

[Figure: Phone-level hidden Markov model with three emitting states (S1, S2, S3) and no skip-transitions. Self-loops a11, a22, a33; forward transitions a12, a23; emissions b1, b2, b3.]

Legend:
- aij: transition probabilities
- bj: observation emission distribution
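The topology in the figure can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the one-dimensional Gaussian emissions and the numeric values are all assumptions made for the example, and real acoustic models use multivariate feature vectors.

```python
import numpy as np

def left_to_right_transitions(self_loops):
    """Build a transition matrix with self-loops a_ii and forward steps a_i,i+1 only."""
    n = len(self_loops)
    A = np.zeros((n, n))
    for i, a_ii in enumerate(self_loops):
        A[i, i] = a_ii
        if i + 1 < n:
            A[i, i + 1] = 1.0 - a_ii  # no skip-transitions
    # The last row sums to a_nn only; the remainder is the exit probability.
    return A

def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian (assumed emission model for this sketch)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def forward_likelihood(obs, A, means, variances):
    """Forward algorithm: likelihood of obs when starting in S1 and ending in S3."""
    alpha = np.zeros(A.shape[0])
    alpha[0] = gaussian_pdf(obs[0], means[0], variances[0])
    for x in obs[1:]:
        alpha = (alpha @ A) * gaussian_pdf(x, means, variances)
    return alpha[-1]

# Hypothetical parameters, chosen only for illustration.
A = left_to_right_transitions([0.6, 0.7, 0.8])
means = np.array([0.0, 1.0, 2.0])
variances = np.array([1.0, 1.0, 1.0])
likelihood = forward_likelihood([0.1, 0.9, 2.1, 1.8], A, means, variances)
```

Because there are no skip-transitions, a sequence shorter than three frames cannot reach S3, so its likelihood under this sketch is zero.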


Speech parametrization

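[Figure: overlapping analysis windows along the signal, annotated with window size and frame period; consecutive windows yield Observation m-1, Observation m and Observation m+1.]

Short-time signal analysis by overlapping Hamming windows.
MFCC parameters are commonly used as feature vectors.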
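The windowing step can be sketched as follows. The slides do not give the window size or frame period, so the values below (25 ms windows every 10 ms at a 16 kHz sampling rate) are assumptions, as is the function name; the MFCC computation that would follow is omitted.

```python
import numpy as np

def frame_signal(signal, window_size, frame_period):
    """Cut a 1-D signal into overlapping frames and apply a Hamming window to each."""
    if len(signal) < window_size:
        return np.empty((0, window_size))
    n_frames = 1 + (len(signal) - window_size) // frame_period
    window = np.hamming(window_size)
    frames = np.stack([
        signal[m * frame_period : m * frame_period + window_size]
        for m in range(n_frames)
    ])
    return frames * window  # row m corresponds to Observation m

# Assumed setup: 1 s of audio at 16 kHz, 400-sample windows, 160-sample period.
signal = np.random.default_rng(0).standard_normal(16000)
frames = frame_signal(signal, window_size=400, frame_period=160)
```

Each row of `frames` would then be passed to a feature extractor to produce one observation vector.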







Training procedure

- Initial estimates for monophones (context-independent HMMs) based on reference segmentation
- Conversion to plain triphones (context-dependent models)
    /o/ /s’/ /e/ /m/  ->  /o+s’/ /o-s’+e/ /s’-e+m/ /e-m/
- Context-dependent clustering of triphones
    Expert-prepared contextual questions are used for clustering
- Splitting of single Gaussians into multi-modal distributions

The Baum-Welch reestimation procedure is routinely run after each step.
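The conversion from monophones to plain triphones can be illustrated with a short sketch. The function name is hypothetical; the label format (left context before `-`, right context after `+`) follows the example on the slide, with an ASCII apostrophe standing in for the palatalization mark.

```python
def to_triphones(phones):
    """Expand a monophone sequence into context-dependent labels (left-phone+right)."""
    triphones = []
    for i, phone in enumerate(phones):
        label = phone
        if i > 0:
            label = phones[i - 1] + "-" + label   # add left context
        if i + 1 < len(phones):
            label = label + "+" + phones[i + 1]   # add right context
        triphones.append(label)
    return triphones

# The slide's example sequence.
labels = to_triphones(["o", "s'", "e", "m"])
```

For the slide's sequence this gives the same four labels as shown there; phones at the edges receive only the context that exists.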




Experiment: stressed vowels as separate phonemes

- The Polish phonetic alphabet consists of 39 phonemes, including 6 vowel sounds: /i/, /y/, /e/, /a/, /o/, /u/
- A distinction between lexically stressed and unstressed instances is introduced
- Modification made on dictionary level only
    no acoustic analysis of stress on training set or during recognition
    accentuation rules are used
- Further extension: sonorants (/m/, /n/, /n’/, /N/, /j/, /l/, /w/) placed within stressed syllables are also distinguished from those inside unstressed syllables
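A dictionary-level modification of this kind might look like the sketch below. The slides do not list the accentuation rules that were used, so the code assumes only the default Polish pattern of penultimate-syllable stress and ignores exceptions; the `_s` suffix and the function name are likewise hypothetical. Marking sonorants would additionally require syllable boundaries and is not attempted here.

```python
VOWELS = {"i", "y", "e", "a", "o", "u"}
SONORANTS = {"m", "n", "n'", "N", "j", "l", "w"}

def mark_stressed_vowel(phones, suffix="_s"):
    """Tag the vowel of the penultimate syllable as stressed (simplified rule)."""
    vowel_positions = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_positions:
        return list(phones)
    # Assumption: a one-syllable word carries stress on its only vowel.
    target = vowel_positions[-2] if len(vowel_positions) >= 2 else vowel_positions[-1]
    marked = list(phones)
    marked[target] += suffix
    return marked

# Pronunciation entry for the phone sequence /o/ /s'/ /e/ /m/.
entry = mark_stressed_vowel(["o", "s'", "e", "m"])

# Inventory sizes implied by adding stressed variants to the 39 base phonemes.
n_stressed_vowels = 39 + len(VOWELS)
n_vowels_and_sonorants = n_stressed_vowels + len(SONORANTS)
```

Since only the dictionary changes, the acoustic models learn the stressed and unstressed variants from whichever training frames the modified pronunciations align to.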


Experimental results

- ca. 50h used for experiments
- ca. 36k word-list
- 5-fold cross-validation

Setup                Num. phonemes   %Acc [1]   ±
Standard             39              45.82      1.27
Stressed vowels      45              49.10      1.18
Vowels & sonorants   52              49.62      1.52

[1] Accuracy is $A = (H - I)/N$, where H is the number of correctly recognized words, I is the number of inserted words, and N is the total number of words in the reference.
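The accuracy measure from the footnote is simple to compute once the counts are known. The sketch below expresses it as a percentage, matching the %Acc column; the function name and the example counts are made up for illustration and are not taken from the experiments.

```python
def word_accuracy(hits, insertions, n_reference):
    """Percent accuracy: 100 * (H - I) / N."""
    if n_reference <= 0:
        raise ValueError("n_reference must be positive")
    return 100.0 * (hits - insertions) / n_reference

# Hypothetical counts: 80 correct words, 5 insertions, 100 reference words.
acc = word_accuracy(hits=80, insertions=5, n_reference=100)
```

Because insertions are subtracted, this measure can fall below the plain percentage of correct words and can even become negative when the recognizer inserts many spurious words.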



Speaker adaptation experiment

- Variability of speaker characteristics has a negative impact on the performance of speaker-independent recognition systems
- Speaker adaptation is recommended
- Maximum-likelihood linear regression (MLLR) yields an accuracy boost of ca. 12.5%
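In its basic form, MLLR adapts the Gaussian means of the acoustic model with a shared affine transform, mu' = A mu + b. The sketch below shows only how such a transform is applied; estimating A and b from adaptation data by maximum likelihood is omitted, and the slides do not say how many transforms were used or whether variances were adapted too. All names and numbers are assumptions.

```python
import numpy as np

def adapt_means(means, A, b):
    """Apply one shared affine transform to every Gaussian mean: mu' = A mu + b."""
    return means @ A.T + b

# Hypothetical 2-D means and transform, for illustration only.
means = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
A = 2.0 * np.eye(2)
b = np.array([1.0, 1.0])
adapted = adapt_means(means, A, b)
```

Sharing one transform across many Gaussians is what lets the method work with a small amount of speech from the new speaker.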


Conclusions

- Significant accuracy boost introduced by vowels’ stress “distinction”
- Overall accuracy still below levels acceptable for real applications


Future work / challenges

- More advanced speech parametrization (LDA)
- Environmental noise and out-of-vocabulary words
- Dictionary of over 100k words (possibly over 1M words)
    Polish has complex inflection rules
- High-order language modeling (trigrams or higher)
    Polish has comparatively flexible word order
- Decoder speed optimization


Thank you.
