Language-Universal Speech Modeling

Language-Universal Speech Modeling: What, Why, When and How
Chin-Hui Lee, School of ECE, Georgia Tech, [email protected]
Joint work with Marco Siniscalchi and Torbjorn Svendsen

Outline and Talk Agenda

• Language-universal speech modeling
  - Universal acoustic modeling (UAM)
  - Phonetic versus attribute definitions
  - Sharable knowledge-based fundamental speech units

• Overview of the ASAT paradigm
  - Bottom-up versus top-down
  - Speech attribute detection
  - Knowledge integration

• Two new multilingual applications
  - Language-universal attribute and phone recognition
  - Attribute-based spoken language recognition

• Summary


Language Universal Speech Modeling

[Block diagram, three panels:
 (a) Training language-universal acoustic models (AMs) from the overall speech databases (languages 1 ... l ... L);
 (b) Converting utterances into spoken document vectors: each individual linguistic-event-labeled database is passed through voice tokenization with the universal AMs and feature extraction to produce spoken documents, collected into overall vector collections;
 (c) Building event-specific classifiers: each utterance vector collection feeds classifier design to produce an event classifier.]


Frame versus Segment Modeling

• Language-universal speech characterization
  - Overall acoustic and speech variations
  - Acoustic vs. linguistic unit definitions
  - Sharable knowledge-based fundamental speech units

• Acoustic units: no direct linguistic definition (a minimal VQ sketch follows this list)
  - Frame-based vector quantization
  - Segment-based matrix quantization
  - Acoustic segment models (ASMs)

• Language-specific linguistic units
  - Phones, syllables, phrases, words, etc.

• Language-universal linguistic units
  - Attributes, e.g., manner and place of articulation
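Purely as illustration, here is a minimal Python sketch of the frame-based vector quantization idea above: acoustic frames are pooled across training utterances and clustered into a codebook whose indices act as acoustic units with no direct linguistic definition. The feature choice (MFCC), codebook size, and function names are assumptions for the example, not the configuration used in this work.

```python
# Minimal sketch: learning a frame-level acoustic codebook by vector quantization.
# Feature choice (MFCC), sampling rate and codebook size are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans
import librosa

def train_frame_codebook(wav_paths, n_codewords=256, sr=8000):
    """Pool MFCC frames from all training utterances and run k-means."""
    frames = []
    for path in wav_paths:
        y, _ = librosa.load(path, sr=sr)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T  # (T, 13)
        frames.append(mfcc)
    X = np.vstack(frames)
    return KMeans(n_clusters=n_codewords, n_init=4, random_state=0).fit(X)

def quantize(utterance_mfcc, codebook):
    """Map each frame to its nearest codeword index (a discrete 'acoustic unit')."""
    return codebook.predict(utterance_mfcc)
```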


Learning from Human-Based Recognition: Knowledge-Rich ASR (ICSLP 2004 paper)

• Human speech recognition (HSR) outperforms ASR
• Acoustic landmarks and invariance: Klatt (ARPA)
• Distinctive features: Fant, Stevens
• Phonological parsing: Church
• Speech production, perception, analysis: plenty
• Learning from reading spectrograms and HSR
  - Explore the knowledge-source (KS) hierarchy, from acoustics to pragmatics
  - Detect acoustic and auditory evidence in speech
  - Weigh and combine them to form cognitive hypotheses
  - Validate them until consistent decisions can be reached
  - Perform partial understanding only on islands of reliability

• Strong message: bottom-up KS integration, while also taking advantage of data-driven modeling


Automatic Speech Attribute Transcription

• Project period: 10/01/04-02/28/11 (an ITR project)
• The ASAT team
  - Mark Clements ([email protected])
  - Sorin Dusan ([email protected])
  - Eric Fosler-Lussier ([email protected])
  - Keith Johnson ([email protected])
  - Biing-Hwang Juang ([email protected])
  - Larry Rabiner ([email protected])
  - Chin-Hui Lee (Coordinator, [email protected])

• NSF Human Language Communication Program Director: Mary Harper initially, now Tanya Korelsky


ASAT Paradigm and Statement of Work

[Figure: the project's statement of work organized into five numbered thrusts, culminating in 5. Overall System Prototypes and Common Platform.]


ASAT: Asynchronous Attribute Detection & Bottom-Up Knowledge Integration

[Block diagram: speech s(t) -> speech analysis -> parameter ensemble F(t) -> speech attribute detection -> attribute lattice P(Attr. | F(t)) -> attribute-to-phone lattice -> probabilistic phone lattice P1(Phone | F(t)) -> phone-to-word lattice -> probabilistic word lattice P2(Word | F(t)) -> theory verification -> recognized sentence.]

Speech knowledge hierarchy, memory organization and unit description:
• Parameters F(t): zero crossing, spectrum/cepstrum, autocorrelation/LPC, energy ratios, ...
• Attributes: place and manner, word, speaker, accent, SNR, distinctive features, articulator trajectories, ...
• Phone and word lattices: intermediate phones, syllables, words, events, speech cues, ...
• Additional outputs: P(Phone | F(t)), P(Word | F(t))


1. Bank of Speech Attribute Detectors

• Each detected attribute is represented by a time series
  - e.g., a frame-based detector outputs values in [0, 1], simulating a posterior probability
• Frame-based and segment-based attribute detectors
  - ANN-based detectors of nasals and stops
  - HMM-based detectors for manner and place of articulation
• Sound-specific parameters and attribute detectors
  - e.g., VOT for voiced/unvoiced stop discrimination, LF zero for nasals
• Physiologically-motivated processors and detectors
  - Analog detectors, short-term and long-term detectors
  - 3D tonotopic spectral analysis and cortical response models
• Perceptually-motivated processors and detectors
  - Converting speech into neural activity level functions
• Others? (a minimal detector sketch follows this list)
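As a concrete illustration of a frame-based attribute detector, the sketch below trains a small MLP on context-spliced acoustic frames and outputs a 0-1 score per frame, i.e., a detection curve for one attribute such as nasal. The context width, hidden layer size, and helper names are illustrative assumptions, not the detectors built in the ASAT project.

```python
# Minimal sketch of one frame-based attribute detector (e.g. "nasal"):
# an MLP over a context window of frames outputs a posterior-like time series.
import numpy as np
from sklearn.neural_network import MLPClassifier

def stack_context(frames, left=4, right=4):
    """Splice each frame with +/- context so the detector sees local dynamics."""
    T, D = frames.shape
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(T)])

def train_attribute_detector(frame_feats, frame_labels):
    """frame_labels: 1 where the attribute (e.g. nasal) is present, else 0."""
    X = np.vstack([stack_context(f) for f in frame_feats])
    y = np.concatenate(frame_labels)
    detector = MLPClassifier(hidden_layer_sizes=(256,), max_iter=50)
    return detector.fit(X, y)

def detection_curve(detector, frames):
    """Return the per-frame score P(attribute | o(t)) as a time series."""
    return detector.predict_proba(stack_context(frames))[:, 1]
```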


Detection Curves for Manner Attributes


Detector Performance (Interspeech05)

Error Rate (%)   Baseline (ANN*)   Baseline (HMM)   SEG_MCE
Vowel                 9.0               1.7            1.8
Fricative            11.3               6.4            3.6
Stop                 14.5               9.9            5.4
Nasal                12.2              11.2            5.4
Approximant          15.9               8.0            6.1
Silence               3.7               2.1            0.8

* Frame-based ANN with a detection threshold of 0.5


Language-Universal Detectors?

[Figure: detection curves for stop, nasal, vowel, ... aligned against the phone string "h aw m eh n iy sh ih p s aa r g ae s p aw er d", with two detection errors marked.]

The two errors were caused mainly by mispronunciation by a non-native speaker.


2. Event Merger

• Merge multiple time series into another time series
  - Maintaining the same detector output characteristics
• Combine temporal events
  - e.g., combining phones into words (word detectors)
• Combine spatial events
  - e.g., combining vowel and nasal into nasalized vowels
• An extreme case: build 20K keyword detectors for WSJ
  - A key: building a word detector for "any word"
• Others: OOV detection, partial recognition


Event Merging Under the Speech KS Hierarchy

[Figure: the speech signal x(t) feeds an event lattice L1(t) of attribute posteriors, P([nasal] | x(t)), P([dental] | x(t)), P([fricative] | x(t)), ..., P([labial] | x(t)); mergers A1, A2, ... combine these into a meta-event lattice L2(t) of P([m1(t)] | x(t)), P([m2(t)] | x(t)), P([m3(t)] | x(t)), ..., P([mq(t)] | x(t)).]

• Each event is modeled by a time series of activity levels, mimicking neural perception (a confidence measure is available at every stage)
• Detection of higher-level meta events comes from coordinating lower-level events, integrating over space and time (perception?)


MASR: Resource-Rich vs. Resource-Limited Multilingual Automatic Speech Recognition

[Figure: resource-rich languages, e.g. an English ASR system built from an English corpus (1K hours + 1B words) and a Mandarin ASR system built from a Mandarin corpus (1K hours + 1B words), contrasted with a resource-limited language, e.g. a Vietnamese ASR system with only a 50-hour corpus; high cost and high performance for the former, questionable performance and robustness for the latter.]

1. Can we build systems with little or no language-specific training data?
2. How can we build systems for all languages, with no need for extensive training?

Motivation for Universal Modeling

• Current state-of-the-art HMM-based LVCSR systems require massive resources in data collection for application-specific conditions in order to achieve high accuracy (only a few groups can do it)
• Developing and deploying top-performance systems is becoming prohibitively costly, if not impossible, even for large corporations like Microsoft and IBM
• Little study has been done for resource-limited languages
• Developing an LVCSR framework that requires no extensive retraining is both intellectually challenging (human language acquisition) and practically important (for conversational user interfaces and translation)


Multilingual ASR Studies

• Past state of the art: Schultz et al. (CMU)
  - Language-specific (LS): WER 19.0% with 16 hours of training data
  - Language-adaptive (LA): WER 19.6% with 1.5 hours of adaptation data
  - Language-universal (LU): WER 72% with 200 hours of training data

• Now, ICASSP 2009: Lin et al. (MSR)
  1. LU (180 hours) plus LA (24 hours) modeling can outperform LS (24 hours) modeling
  2. Care is needed in defining and modeling the set of universal phones

• Future: ASAT extension
  1. ASAT is a recent framework beyond HMM-based LVCSR, aiming at ASR based on detecting basic speech cues and integrating knowledge sources
  2. Can some speech attributes be more universal, fundamental and easier to share and model than phones across languages?
  3. How do we define this universal set for all spoken languages?
  4. Is it possible to have a collaborative effort between the acoustic-phonetic and ASR communities? IPA or similar sets seem not to be enough


Building Universal Phone Models

• Use existing or new universal phone definitions, e.g., IPA
• Expand current ASR capabilities with the UPM description
• Examine language independence and language dependency
• Share and explore acoustic-phonetic properties (attributes) among all languages for universal phone definitions

[Figure: the overall speech databases and transcriptions, together with an IPA-based lexicon, feed universal phone modeling to produce UPMs (universal phone models) and a universal phone description.]


Research Issues in UPM

• Why were the CMU and MSR conclusions different?
• How do we define and model a common set of phone units across all spoken languages? IPA?
• Whither phonetic and/or acoustic definitions?
• How do we select a set of languages with enough training data for high-performance UPM?
• Can we achieve high performance for languages used in training and generalize to new languages with no training data, i.e., train once and for all?
• Are attributes more fundamental than phones? Can we design attribute detectors once and for all?


A Grand View on UPM for MASR

• Universal phone modeling: train once and for all
  - Speech attributes, such as place and manner of articulation, are clearly defined and more fundamental than phones, so they can be shared across languages
  - Cross-language phone and attribute recognition is highly possible without using language-specific training speech (Interspeech 2008 paper)
  - Attribute-based spoken language identification gives very good performance without direct phone modeling (Interspeech 2009 paper)
  - Easy extension to resource-limited languages
  - Unit definition beyond IPA, with both acoustic and phonetic similarities taken into account


Speech Attribute Detectors

[Figure: speech s(t) is processed by attribute-specific front-ends, each feeding its own detector, which can be either frame- or segment-based: an anterior front-end and detector produce P(anterior | o(t)), a back front-end and detector produce P(back | o(t)), a vowel front-end and detector produce P(vowel | o(t)), and so on. Each output indicates the presence of an attribute in the speech.]

Asynchronous attribute detection with mixed signal processing is allowed.


Attribute-Based Phone Modeling

[Figure: speech data passes through attribute detectors (place and manner of articulation) and an attribute-to-phone mapping, each of which can be language-specific (LS) or language-universal (LU), to produce LS or LU phone models from a unit description.]

• Attribute detectors can be language-specific or, more interestingly, language-universal (ASRU 2007)
• The attribute-to-phone mapping is usually LS (ICASSP 2008)
• The attribute-to-phone mapping can be LU (ITRW 2008); however, phone sharing needs to be defined
• Lattice parsing is preferred, but not yet fully developed


Event Merger & Phone Recognition

• In this study: ANN-based (data-driven) phone mergers
• Combine the outputs of the attribute detectors to generate phone posterior probabilities (ASRU 2007)
• Implement the attribute-to-phone mapping with data-driven artificial neural networks or other mechanisms
• Can be language-universal with shared definitions
• Produce phone sequences using Viterbi decoding (a minimal sketch follows this list)

[Figure: attribute posteriors P(anterior | o(t)), P(back | o(t)), ..., P(vowel | o(t)) feed an event merger that outputs phone posteriors P(/aa/), P(/b/), ..., P(/y/); the Viterbi algorithm then produces the phone string, e.g. "/ ow h a iy /".]
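The sketch below illustrates this merger-plus-decoder idea in the simplest possible form: an MLP maps stacked attribute posteriors to phone posteriors, and a plain Viterbi pass over a phone transition matrix converts the posterior stream into a phone index sequence. The transition matrix, layer sizes, and function names are illustrative assumptions rather than the actual ASAT configuration.

```python
# Minimal sketch of the event merger stage: an MLP maps the per-frame attribute
# posteriors to phone posteriors; a simple Viterbi pass with a phone transition
# matrix turns the posterior stream into a phone index sequence.
import numpy as np
from sklearn.neural_network import MLPClassifier

def train_merger(attribute_posteriors, phone_labels):
    """attribute_posteriors: list of (T, n_attributes); phone_labels: list of (T,) ints."""
    X = np.vstack(attribute_posteriors)
    y = np.concatenate(phone_labels)
    merger = MLPClassifier(hidden_layer_sizes=(512,), max_iter=50)
    return merger.fit(X, y)

def viterbi_decode(phone_post, log_trans):
    """phone_post: (T, n_phones) posteriors; log_trans[i, j] = log P(phone j | phone i)."""
    T, N = phone_post.shape
    log_post = np.log(phone_post + 1e-10)
    delta = log_post[0].copy()
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans        # (N, N): previous -> current phone
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_post[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):                  # backtrace
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```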

Attribute-to-Phone Mapping Merger

• The merger is implemented using an MLP (ANN)
• Trained using the outputs of the speech attribute detectors as inputs
• Either phone posteriors or state posteriors can be generated

An Illustration: Phone Recognition

[Figure: input speech acoustic features for the Mandarin utterance "lyu dau" (green island) are converted by the attribute detectors into probabilistic features (coronal, low, dental, sil, voiced, back, anterior, vowel, ...), from which the phone sequence is recognized: sil sil sil l yu yu yu yu yu yu yu yu sil sil sil sil d a u sil sil sil sil sil.]

OGI Multilingual Stories Corpus

• Labeled material is available for six languages: English, German, Mandarin, Spanish, Hindi and Japanese
• Telephony data: the acoustic quality is not good, and the amount of training speech is small compared to other cases
• A better multilingual speech corpus should be used later
• The phone set size differs across languages
• Manner and place sets of 21 dedicated detectors, one for each speech attribute (either "present" or "absent"):
  - Anterior, back, continuant, coronal, dental, fricative, glottal, approximant, high, labial, low, mid, nasal, retroflex, round, stop, tense, velar, voiced, vowel, and silence


Multilingual Phone Recognition

• Based on existing acoustic modeling techniques
  - HMM and ANN
  - Long-term temporal energy-based features
• The amount of training data is small: telephony corpus
• LS and LU attribute detectors for phone recognition (ICASSP 2008 and ITRW 2008): better than the best

Language (phone set size)   ENG (39)   GER (43)   HIN (46)   MAN (45)   SPA (38)   JAP (29)
Training [hours]              1.71       0.97       0.71       0.43       1.10       0.65
Test [hours]                  0.42       0.24       0.17       0.11       0.26       0.15
LS Phone Error (%)           46.68      50.84      45.55      49.93      41.19      39.95
LU Phone Error (%)           45.24      49.53      43.03      47.42      38.75      38.22


Cross-Language Attribute Detection


Language-Universal Attribute Detection


Cross-Language Phone Recognition (1)

• Detectors and merger are implemented using language-specific training data and attribute-to-phone mapping rules
• Evaluation is performed on a never-seen language: Japanese
• SPA-SPA performs best in phone error rate (PER)
• Some Japanese phones are not defined in all mergers

Detector-Merger   GER-GER   MAN-MAN   ENG-ENG   HIN-HIN   SPA-SPA
PER (%)            65.76     63.58     59.25     51.24     49.20

• LS detectors can be replaced by LU attribute detectors


Cross-Language Phone Recognition (2)

• Detectors were trained with
  - LS or LU data (LU trained on the other 5 languages)
  - Selection of better language-specific detectors, e.g. [retroflex: Mandarin], [voiced & vowel: Spanish]
• Mergers were trained with only Spanish data

Detector-Merger   SPA-SPA   LU5-SPA
PER (%)            49.20     47.51

• Selecting better attribute detectors reduced the PER
• Using an LS merger (LU5-JAP) achieved the best PER: 38.2%


Spoken Language Recognition

• Language identification (LID) or verification (LV)
• NIST annual language recognition evaluation (LRE)
• Front-end utterance tokenizers
  - Multiple phone sets
  - Acoustic segment models (ASMs): no phonetic definition
  - Universal acoustic recognizers (UARs): speech attributes
• Back-end vector space model (VSM) based recognizer
  - SVM with super-vector language feature representation
  - Multiple system combination
• Universal acoustic characterization for all languages
  - Acoustic segment model units (previous studies)
  - Manner and place of articulation (this study)


Overall LID System: UAR Tokenizer Followed by VSM Recognizer

UAR and VSM: any universal tokenizer and vector-based recognizer


Study on Entropy of English Letters

Model                       Cross Entropy (bits)   Comments
Zeroth order                       4.76            uniform letters, log2(27)
First order                        4.03            unigram
Second order                       2.8             bigram
Shannon's 2nd experiment           1.34            human prediction

1. Students' in-class computations verify these results; a trigram model gives roughly 2 bits (a minimal sketch of the computation follows)
2. LID on encrypted documents can be accomplished

C. E. Shannon, "Prediction and Entropy of Printed English," Bell System Technical Journal, Vol. 30, pp. 50-64, 1951.
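For readers who want to reproduce the in-class computation, a minimal sketch of the unigram and bigram letter-entropy estimates is given below. The corpus, and therefore the exact numbers, are left open; the values in the table are Shannon's.

```python
# Minimal sketch: per-letter entropy of English text with unigram and bigram
# models over a 27-symbol alphabet (a-z plus space).
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def normalize(text):
    return "".join(c for c in text.lower() if c in ALPHABET)

def unigram_entropy(text):
    text = normalize(text)
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def bigram_entropy(text):
    """Conditional entropy H(X_t | X_{t-1}) under a bigram model."""
    text = normalize(text)
    uni = Counter(text[:-1])
    bi = Counter(zip(text, text[1:]))
    n = len(text) - 1
    h = 0.0
    for (a, b), c in bi.items():
        h -= (c / n) * math.log2(c / uni[a])
    return h

# Zeroth-order reference: log2(27) is about 4.75 bits per letter.
```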


Finding Acoustic Alphabets

• VQ first: find a set of J codewords, $\{\lambda_j\}$, such that the total distortion over all training utterances is minimized; too many codewords may be needed for good acoustic coverage
• ASM: find a set of units of size J, with a corresponding segment model set, such that the total likelihood of all training utterances is maximized; how many are needed?
• Segmentation, clustering, iteration, modeling, EM iteration (ICASSP 1988, for IWR with an acoustic lexicon); a minimal sketch follows the equations

Utterance segmentation (DP): $D(O_n, \Lambda) = \sum_{i=1}^{I_n} \sum_{t=b_{i-1}+1}^{b_i} d(o_t^n, \lambda_i)$

Segment clustering: $\hat{\Lambda} = \arg\max_{\Lambda} \sum_{n=1}^{N} D(O_n, \Lambda)$

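A minimal sketch of the ASM idea follows. To stay short it substitutes fixed-length segments and Euclidean clustering of segment means for the DP segmentation and maximum-likelihood segment (HMM) models of the full recipe; all names and parameter values are illustrative.

```python
# Minimal sketch of the ASM-style "acoustic alphabet": segment utterances, cluster
# the segments into J units, and tokenize new utterances with those units.
import numpy as np
from sklearn.cluster import KMeans

def cut_segments(frames, seg_len=10):
    """Chop a (T, D) utterance into contiguous fixed-length segments (stand-in for DP)."""
    T = frames.shape[0]
    return [frames[s:s + seg_len] for s in range(0, T - seg_len + 1, seg_len)]

def train_asm_units(utterances, n_units=64, seg_len=10):
    """utterances: list of (T, D) feature matrices. Returns unit centroids (n_units, D)."""
    segs = [seg.mean(axis=0) for u in utterances for seg in cut_segments(u, seg_len)]
    km = KMeans(n_clusters=n_units, n_init=4, random_state=0).fit(np.vstack(segs))
    # The full recipe would now re-segment the data with these units and iterate
    # (segmental k-means / EM); that loop is omitted here.
    return km.cluster_centers_

def tokenize(utterance, centroids, seg_len=10):
    """Map an utterance to a sequence of ASM unit indices (its acoustic 'spelling')."""
    segs = np.vstack([seg.mean(axis=0) for seg in cut_segments(utterance, seg_len)])
    dists = ((segs[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)
```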

LID Results on 12 Languages

• ASM acoustic and linguistic coverage (T-ASLP 2007)

Error Rate (%)    32-ASM   64-ASM   128-ASM
ASM Unigrams       40.1     26.7      22.3
ASM Bigrams        32.6     18.6      13.9
ASM Trigrams       27.9      NA        NA

• Performance comparison (Interspeech 2005)

P-PRLM *                    22.0
P-PRLM & Score Fusion *     17.0
P-ASM SVM (259 units)       12.6

LID Based on Universal Models

[Block diagram, three panels:
 (a) Training universal speech attribute models (USAMs) from the overall language databases (languages 1 ... l ... L);
 (b) Converting utterances into spoken document vectors: each individual language database is passed through voice tokenization with the universal USAMs and feature extraction to produce spoken documents, collected into overall vector collections;
 (c) Building the language classifier: the utterance vector collections feed language classifier design to produce the LID system.]


Utterance-Based Document Vectors

Key idea: information-retrieval-based text categorization, with terms given by n-gram co-occurrence statistics of manner and place attributes (a minimal sketch follows).
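A minimal sketch of this vector space treatment, under the assumption that each utterance has already been tokenized into a string of attribute labels: n-gram counts form the document vector and a linear SVM acts as the language classifier. The token names, n-gram order, and SVM choice are illustrative, not the exact LRE system.

```python
# Minimal sketch: each utterance's attribute token string (e.g. "vowel nasal stop
# stop vowel ...") becomes a vector of n-gram co-occurrence counts, and a linear
# SVM separates languages in that space.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

def build_language_classifier(token_strings, language_labels, max_ngram=3):
    """token_strings: one whitespace-joined attribute sequence per utterance."""
    vectorizer = CountVectorizer(analyzer="word", ngram_range=(1, max_ngram))
    X = vectorizer.fit_transform(token_strings)
    clf = LinearSVC().fit(X, language_labels)
    return vectorizer, clf

def identify_language(utterance_tokens, vectorizer, clf):
    """utterance_tokens: one whitespace-joined attribute sequence."""
    return clf.predict(vectorizer.transform([utterance_tokens]))[0]

# Hypothetical usage:
# vec, clf = build_language_classifier(["vowel nasal stop vowel", ...], ["eng", ...])
# identify_language("stop vowel nasal", vec, clf)
```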


Attribute Transcription Performance

• Detectors were trained with:
  - Data from all six OGI Stories languages
  - Both HMM- and ANN-based configurations
  - Both context-independent (CI) and right-context-dependent (RC) units, with 5 and 9 manner attributes
  - Evaluation in terms of attribute error rate


SVD-Based Dimension Reduction

UPR and UMR: universal place and manner representations (a minimal sketch of the SVD step follows)
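A minimal sketch of this reduction step, assuming the n-gram count matrix from the previous stage: term weighting followed by a truncated SVD projects each spoken document onto a low-rank latent space before the language classifier is trained. The rank and weighting scheme are illustrative choices.

```python
# Minimal sketch of LSA-style dimension reduction on the n-gram count matrix.
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline

def make_lsa_projector(ngram_count_matrix, rank=200):
    """ngram_count_matrix: (n_utterances, n_terms) sparse counts from the tokenizer."""
    lsa = make_pipeline(TfidfTransformer(), TruncatedSVD(n_components=rank))
    reduced = lsa.fit_transform(ngram_count_matrix)  # (n_utterances, rank)
    return lsa, reduced
```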


Performance with Detailed Models

Up to 4-gram statistics are used for building the LSA document vectors


Comparison in DET Curves


Summary

• Universal speech attribute modeling (USAM)
  - A new tool for speech modeling: an ASAT extension
• Language-universal speech attribute detection
  - Outperforms language-specific detector designs
• Cross-language phone recognition
  - Possible with no training speech from the target languages
  - Attribute-to-phone mapping is still very challenging
• Attribute-based spoken language recognition
  - Uses simple attributes: manner and place of articulation
  - Utilizes n-gram attribute co-occurrence statistics
  - Achieves top performance
  - Possible with no training speech from the target languages
  - Extends to resource-limited languages
• Recent effort: ASAT-based LVCSR with WFSMs


References

C.-H. Lee, "On Automatic Speech Recognition at the Dawn of the 21st Century," IEICE Trans. on Information and Systems, Special Issue on Speech Information Processing, Vol. E86-D, No. 3, pp. 377-396, March 2003.

C.-H. Lee, "Back to Speech Science - Towards a Collaborative ASR Community of the 21st Century," in Dynamics in Speech Production and Perception, P. Divenyi, S. Greenberg, and G. Meyer (eds.), NATO Science Series, IOS Press, 2006.

S. M. Siniscalchi and C.-H. Lee, "A Study on Integrating Acoustic-Phonetic Information into Lattice Rescoring for Automatic Speech Recognition," accepted for publication in Speech Communication, 2009.

C.-H. Lee, "From Knowledge-Ignorant to Knowledge-Rich Modeling: A New Speech Research Paradigm for Next Generation Automatic Speech Recognition," Proc. ICSLP, Jeju, South Korea, 2004.

J. Li and C.-H. Lee, "On Designing and Evaluating Speech Event Detectors," Proc. Interspeech, Lisbon, September 2005.

C.-H. Lee, "An Overview on Automatic Speech Attribute Transcription (ASAT)," Proc. Interspeech, Antwerp, August 2007.

S. M. Siniscalchi, T. Svendsen and C.-H. Lee, "Towards Bottom-Up Continuous Phone Recognizer," Proc. ASRU, Kyoto, Japan, December 2007.

S. M. Siniscalchi, T. Svendsen and C.-H. Lee, "Toward a Detector-Based Universal Phone Recognizer," Proc. ICASSP, Las Vegas, Nevada, March 2008.

D. Lyu, S. M. Siniscalchi, T.-Y. Kim and C.-H. Lee, "An Experimental Study on Continuous Phone Recognition with Little or No Language-Specific Training Data," Proc. ITRW, Aalborg, Denmark, June 2008.

D. Lyu, S. M. Siniscalchi, T.-Y. Kim and C.-H. Lee, "Continuous Phone Recognition without Target Language Training Data," Proc. Interspeech, Brisbane, Australia, September 2008.

S. M. Siniscalchi, J. Reed, T. Svendsen, and C.-H. Lee, "Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition," Proc. Interspeech, Brighton, UK, September 2009.

C.-H. Lee, NSF Symposium on Next Generation ASR, http://users.ece.gatech.edu/~chl/ngasr0/3, Atlanta, October 2003.


Acknowledgment

• NSF Grant: ITR IIS-0427413 (M. Harper & T. Korelsky)
• ASAT team members and collaborators: J. Li, C. Ma, Y. Tsao, J. Reed, M. Siniscalchi, D. Lyu, E. Fosler-Lussier (OSU), T. Svendsen (NTNU), A. Potamianos (TU-Crete)
