Rapid Language Portability of Speech Processing Systems

Tanja Schultz Carnegie Mellon University InterACT: Center for Advanced Communication Technologies CMU, 11-733, February 13 2007

Motivation

- Computerization: Speech is a key technology
  - Mobile devices, ubiquitous information access
- Globalization: Multilinguality
  - More than 6900 languages in the world
  - Multiple official languages
    - Europe has 20+ official languages
    - South Africa has 11 official languages
- Cross-cultural human-human interaction
- Human-machine interfaces in the mother tongue

⇒ Speech processing in multiple languages

Why Bother? Do we really need ASR in many languages?

Myth: "Everyone speaks English, why bother?"
- NO: About 6500 different languages in the world
- Increasing number of languages on the Internet
- Humanitarian and military needs
  - Rural areas, uneducated people, illiteracy

Myth: It's just retraining on foreign data – no science
- NO: Other languages bring unseen challenges, for example:
  - different scripts, no vowelization, no writing system
  - no word segmentation, rich morphology
  - tonality, click sounds
  - social factors: trust, access, exposure, cultural background

Language Characteristics

- Writing systems:
  - Logographic: >10,000 Chinese hanzi, ~7,000 Japanese kanji (plus 3 other scripts)
  - Phonographic: ~5,600 Korean gulja; Arabic, Devanagari, Cyrillic, Roman: on the order of ~100 symbols
- G-2-P (grapheme-to-phoneme mapping):
  - Logographic: no direct mapping
  - Phonographic: ranges from close to far to complicated
    - e.g. Spanish, Roman: more or less 1:1
    - e.g. Thai, Devanagari: C-V flips (written order of consonant and vowel differs from the spoken order)
    - e.g. Arabic: no vowels written
    - e.g. English: try "Phydough"

Language Characteristics

- Prosody: tonal languages like Mandarin
- Sound system: simple vs. very complex systems
- Phonotactics: simple syllable structure vs. complex consonant clusters
- Morphology: short units, compounds, agglutination
  - English: natural segmentation into short units – great!
  - German: compounds, e.g. Donau-dampf-schiffahrts-gesellschafts-kapitäns-mütze …
  - Turkish: agglutination – very long words, e.g. Osman-lı-laş-tır-ama-yabil-ecek-ler-imiz-den-miş-siniz
    ("behaving as if you were of those whom we might consider not converting into Ottoman")
- Segmentation: some scripts do not provide word segmentation in written form: Chinese, Japanese, Thai

Morphology & OOV
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 88]

Language & Compactness
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 90]

Challenges

- Algorithms are language independent but require data:
  - Dozens of hours of audio recordings and corresponding transcriptions
  - Pronunciation dictionaries for large vocabularies (>100,000 words)
  - Millions of words of written text corpora in the domains in question
  - Bilingual aligned text corpora
- BUT: such data are available for only very few languages:
  - Audio data: ≤ 40 languages; transcription takes up to 40x real time
  - Large-vocabulary pronunciation dictionaries: ≤ 20 languages
  - Small text corpora: ≤ 100 languages; large corpora: ≤ 30 languages
  - Bilingual corpora for very few language pairs, with English as the usual pivot
- Additional complications:
  - Combinatorial explosion (domain, speaking style, accent, dialect, ...)
  - Few native speakers at hand for minority (endangered) languages
  - Language peculiarities, social factors

Solution: Learning Systems

⇒ Intelligent systems that learn a language from the user
- Efficient learning algorithms for speech processing
- Learning:
  - Interactive learning with the user in the loop
  - Statistical modeling approaches
- Efficiency:
  - Reduce the amount of data (save time and costs) by a factor of 10
  - Speed up development cycles: days rather than months

⇒ Rapid language adaptation from universal models
- Bridge the gap between language and technology experts
  - Technology experts do not speak all languages in question
  - Native users are not in control of the technology

SPICE: Speech Processing – Interactive Creation and Evaluation toolkit
- National Science Foundation grant, 10/2004, 3 years
- Principal Investigator: Tanja Schultz
- Bridge the gap between technology experts and language experts
  - Automatic Speech Recognition (ASR)
  - Machine Translation (MT)
  - Text-to-Speech (TTS)
- Develop web-based intelligent systems
  - Interactive learning with the user in the loop
  - Rapid adaptation of universal models to unseen languages
- SPICE webpage: http://cmuspice.org


Speech Processing Systems
[Diagram: the end-to-end pipeline. Input speech is decoded using an acoustic model (AM, built from a phone set and speech data), a pronunciation lexicon (Lex, built from pronunciation rules, e.g. hi /h/ /ai/, you /j/ /u/, we /w/ /i/), and a language model (LM, built from text data); the hypothesis is passed on to NLP / MT and TTS to produce output speech and text.]

Rapid Portability: Data
[Diagram: the same pipeline, marking the data that must be collected for a new language, starting with the phone set and speech data for the acoustic model.]

GlobalPhone Multilingual Database
- Widespread languages, native speakers, uniform data, broad domain
- Large text resources: Internet, newspapers
- Corpus: Arabic, Chinese-Mandarin, Chinese-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech, + Turkish + Thai + Creole + Polish + Bulgarian + ...
- 19 languages ... and counting
- ≥ 1800 native speakers
- ≥ 400 hrs audio data
- Read speech, filled pauses annotated

Speech Recognition in 17 Languages
[Bar chart: word error rate (%) of the monolingual GlobalPhone recognizers per language; WERs range from roughly 10% to 34%.]

Phones and Languages
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 86]

Rapid Portability: Acoustic Models
[Diagram: the pipeline again, highlighting the acoustic model and the phone set & speech data from which it is built.]

Acoustic Model Combination
Sound production is human, not language specific: International Phonetic Alphabet (IPA)
1) Build a universal sound inventory based on the IPA: 485 sounds are reduced to 162 IPA sound classes
2) Each sound class is represented by one "phoneme" that is trained through data sharing across languages
- m, n, s, l occur in all languages
- p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages
- no sharing of triphthongs and palatal consonants
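As a rough illustration of the data-sharing idea, the sketch below pools training samples from several languages into shared IPA sound classes. The phone-to-IPA table and the data layout are invented for the example and are not taken from GlobalPhone.

```python
from collections import defaultdict

# Hypothetical language-specific phone -> IPA mapping (illustrative only).
PHONE_TO_IPA = {
    ("german", "sch"): "ʃ", ("english", "sh"): "ʃ",
    ("german", "n"):   "n", ("english", "n"):  "n", ("spanish", "n"): "n",
}

def build_universal_inventory(samples):
    """samples: iterable of (language, phone, feature_vector).
    Returns a dict: IPA sound class -> pooled feature vectors."""
    inventory = defaultdict(list)
    for lang, phone, feats in samples:
        ipa = PHONE_TO_IPA.get((lang, phone))
        if ipa is None:
            # Sounds with no shared class (e.g. clicks, triphthongs) keep
            # a language-specific model instead of being shared.
            ipa = f"{lang}:{phone}"
        inventory[ipa].append(feats)
    return inventory

# Each shared class ("phoneme") can now be trained on data from every language
# in which it occurs, e.g. /n/ from German, English, and Spanish together.
```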

Acoustic Model Combination
[Diagram: three multilingual acoustic model combination schemes – ML-Mix, ML-Sep, and ML-Tag.]

Acoustic Model Combination
[Bar chart: word error rate (%) of monolingual models vs. ML-Tag7500, ML-Tag3000, and ML-Mix3000 for Croatian, Japanese, Spanish, and Turkish; WERs range roughly from 13% to 37%.]

Combine Monolingual ASR Systems?
Can we build a language-independent ASR system?
- Language-independent acoustic modeling
  - International Phonetic Alphabet (IPA): simple to implement, easy to port to other languages
  - Fully data-driven procedure: considers spectral properties and statistical similarities; applies to context-independent and context-dependent models
- Universal language model
  - Combine the LMs of the languages to allow code switching
- Approach:
  - Train language-dependent and language-independent AM + LM
  - Evaluate in monolingual and multilingual mode

Language-Independent ASR
[Bar chart: word error rate (%) on English (BN) and German (SST) for the baseline, data-driven context-dependent (with and without LM), IPA (with and without LM), and data-driven context-independent (with and without LM) systems.]

Multilingual Speech Interfaces
- Large need for multilingual (mobile) devices
  - System handles speech from multiple languages
  - System allows seamless language switches
  - Fast and small-footprint engine
- Multilingual speech interface:
  - Multilingual acoustic models
  - Multilingual dictionary
  - Multilingual language model / grammar
- Integration into one truly multilingual system

Code Switching
- 'Code switching' is the phenomenon of changing the language while speaking
- Language model approaches can be divided by the degree to which they allow code switching:
  - Reference (monolingual): The meeting that was held in a castle lasted two days. Everybody agreed that it was successful.
  - At any time in an utterance: The meeting that was held in a castle dauerte zwei Tage. Everybody agreed that it was successful.
  - At certain positions within an utterance, e.g. phrase boundaries: The meeting das in einem Schloss stattfand lasted two days. Everybody agreed that it was successful.
  - At the sentence level, i.e. between utterances: The meeting that was held in a castle lasted two days. Alle waren sich einig, dass es ein Erfolg war.
- We focus on sentence-level code switching

Multilingual Language Models (ML-LM)
- Simple merging of the n-gram LM corpora
  - Pros: simple to implement, allows data sharing
  - Cons: requires balanced corpora; back-off to unigrams causes unwanted language switches; requires recalculation of the n-gram probabilities
- Meta LM: parallel combination of monolingual LMs through a central scoring function (see the sketch below)
  - Pros: control over language switches via a penalty function, simple user interface
  - Cons: overhead due to score combination
- ML grammars: combination of language-dependent rules into one grammar
  - Pros: sharing of classes across languages, efficient
  - Cons: hand-crafted, a bilingual expert is required
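A minimal sketch of the meta-LM idea, assuming the sentence-level code switching described above. The NGramLM stand-in, the scoring interface, and the penalty value are invented for illustration; this is not the system evaluated in the talk.

```python
# Meta LM sketch: monolingual LMs are scored in parallel and a central scoring
# function adds a penalty whenever the hypothesis switches language between
# sentences, which discourages unwanted language switches.

class NGramLM:
    """Stand-in for a monolingual LM; returns the log-probability of a sentence."""
    def __init__(self, unigram_logprobs, oov_logprob=-12.0):
        self.lp = unigram_logprobs
        self.oov = oov_logprob

    def score(self, words):
        return sum(self.lp.get(w, self.oov) for w in words)

def meta_lm_score(sentences, lms, switch_penalty=-5.0):
    """Pick the best language per sentence; penalize sentence-level switches.
    sentences: list of word lists; lms: dict language -> NGramLM."""
    total, prev_lang = 0.0, None
    for sent in sentences:
        scores = {lang: lm.score(sent) for lang, lm in lms.items()}
        lang = max(scores, key=scores.get)
        total += scores[lang]
        if prev_lang is not None and lang != prev_lang:
            total += switch_penalty  # central control over language switches
        prev_lang = lang
    return total
```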

Experiments and Results
[Bar charts: word error rate (WER), sentence error rate (SER), and language identification error rate (LID-ER) in % for English and German, plus real-time factors (RTF), comparing monolingual n-gram, monolingual grammar, multilingual n-gram (MIX), multilingual n-gram (Meta), and multilingual grammar setups.]

Universal Sound Inventory
Speech production is independent of language ⇒ IPA
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing
- Reduction from 485 to 162 sound classes
- m, n, s, l appear in all 12 languages
- p, b, t, d, k, g, f and i, u, e, a, o in almost all

[Diagram: context decision tree for /k/, illustrated with the German words Blaukraut, Brautkleid, Brotkorb, Weinkarte; questions such as "-1 = plosive?" and "+2 = vowel?" split the contexts into the models k(0), k(1), k(2).]

Problem: The contexts of sounds are language specific. How can we train context-dependent models for new languages?
Solution:
1) Multilingual decision context trees
2) Specialize the decision tree by adaptation

Sharing Across Languages
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 98]

Polyphones & Languages
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 87]

Polyphone Coverage
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 101]

Context Decision Trees
[Diagram: the /k/ decision tree example again (Blaukraut, Brautkleid, Brotkorb, Weinkarte), splitting the contexts with the questions "-1 = plosive?" and "+2 = vowel?" into the models k(0), k(1), k(2).]

Context Decision Tree
- Context-dependent phones (± n = polyphone)
- Trainability vs. granularity
- Divisive clustering based on linguistically motivated questions
- One model is assigned to each context cluster
- Multilingual case: should we ignore language information?
  - Depends on the application
  - Yes, in the case of adaptation to new target languages
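To make the divisive clustering step concrete, here is a minimal sketch with a simplified single-Gaussian likelihood-gain criterion. The question predicates and the count threshold are placeholders; real systems split on the full HMM-state statistics rather than raw feature lists.

```python
# Divisive context clustering sketch: at each node, pick the linguistically
# motivated question whose split yields the largest likelihood gain.
import math

def gauss_loglik(vectors):
    """Log-likelihood of the vectors under one diagonal Gaussian fit to them."""
    n, dim = len(vectors), len(vectors[0])
    total = 0.0
    for d in range(dim):
        xs = [v[d] for v in vectors]
        mean = sum(xs) / n
        var = max(sum((x - mean) ** 2 for x in xs) / n, 1e-6)
        total += sum(-0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)
                     for x in xs)
    return total

def best_question(samples, questions, min_count=100):
    """samples: list of (context, feature_vector); questions: predicates on the
    context, e.g. lambda ctx: ctx.left in PLOSIVES. Returns (question, gain)."""
    parent = gauss_loglik([f for _, f in samples])
    best_gain, best_q = 0.0, None
    for q in questions:
        yes = [f for c, f in samples if q(c)]
        no = [f for c, f in samples if not q(c)]
        if len(yes) < min_count or len(no) < min_count:
            continue  # keep both clusters trainable
        gain = gauss_loglik(yes) + gauss_loglik(no) - parent
        if gain > best_gain:
            best_gain, best_q = gain, q
    return best_q, best_gain
```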

Polyphone Decision Tree Specialization (PDTS)
[Diagram sequence: an English polyphone tree is extended with trees from other languages into a multilingual polyphone tree, in which the polyphones found in Portuguese (the new target language) are then identified.]
1. Tree pruning: select from all polyphones only those that are relevant for the particular language
2. Tree regrowing: further specialize the tree according to the adaptation data
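A minimal sketch of the two PDTS steps, using simplified data structures and a toy splitting rule rather than the original implementation; the question predicates and the minimum-count threshold are illustrative assumptions.

```python
class Node:
    """Decision tree node; `question` is a predicate over a polyphone context."""
    def __init__(self, contexts, question=None, yes=None, no=None):
        self.contexts = contexts          # polyphone contexts reaching this node
        self.question, self.yes, self.no = question, yes, no

    def is_leaf(self):
        return self.question is None

def prune(node, target_contexts):
    """Step 1: Tree pruning - keep only the parts of the multilingual tree
    whose polyphone contexts also occur in the target language."""
    kept = [c for c in node.contexts if c in target_contexts]
    if not kept:
        return None
    if node.is_leaf():
        return Node(kept)
    yes, no = prune(node.yes, target_contexts), prune(node.no, target_contexts)
    if yes is None or no is None:         # collapse one-armed splits
        return yes or no or Node(kept)
    return Node(kept, node.question, yes, no)

def regrow(leaf, samples, questions, min_count=50):
    """Step 2: Tree regrowing - split a leaf further once enough
    target-language adaptation samples (context -> count) fall into it."""
    for q in questions:
        yes = {c: n for c, n in samples.items() if q(c)}
        no = {c: n for c, n in samples.items() if not q(c)}
        if sum(yes.values()) >= min_count and sum(no.values()) >= min_count:
            return Node(list(samples), question=q,
                        yes=Node(list(yes)), no=Node(list(no)))
    return leaf
```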

Rapid Portability: Acoustic Model
[Bar chart: word error rate (%) for the new target language as a function of the amount of adaptation data (0 up to 16:30 hours), comparing an unadapted tree (Ø Tree), the multilingual tree (ML-Tree), the Portuguese tree (Po-Tree), and PDTS; WER values range from about 69% down to about 19% as the amount of adaptation data increases.]

Rapid Portability: Pronunciation Dictionary
[Diagram: the pipeline again, highlighting the pronunciation lexicon built from pronunciation rules and text data, e.g. "adios" → /a/ /d/ /i/ /o/ /s/, "Hallo" → /h/ /a/ /l/ /o/, "Phydough" → ???]

Phoneme vs. Grapheme-based ASR
[Bar chart: word error rate (%) of phoneme-based, grapheme-based, and grapheme-based FTT systems on English, Spanish, German, Russian, and Thai; WERs range from roughly 11% to 36%.]

Problem: 1 grapheme ≠ 1 phoneme

Flexible Tree Tying (FTT): one decision tree
- Improved parameter tying
- Less over-specification
- Fewer inconsistencies

[Diagram: a single decision tree tying models across graphemes and states, with questions such as "0 = vowel?", "0 = obstruent?", "0 = begin-state?", "-1 = syllabic?"]

Dictionary: Interactive Learning (following the work of Davel & Barnard)
[Flow chart: from a word list W (extracted from text), select the best next word wi and generate a pronunciation P(wi) with the current G2P model; synthesize it with TTS and ask the user whether P(wi) is okay. If yes, the entry goes into the lexicon, wi is deleted from the word list, and the G2P model is updated; if not, the user improves P(wi) before the update; a word may also be skipped.]
- G2P model options: explicit mapping rules, neural networks, decision trees, instance-based learning (grapheme context)
- Update after each word wi → effective training
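A minimal sketch of this bootstrapping loop, with hypothetical helper interfaces for the G2P model, the TTS playback, and the user prompt (these names are not the SPICE code); word selection is simplified to taking the next word in the list.

```python
def build_lexicon(word_list, g2p, tts, ask_user):
    """Interactive lexicon bootstrapping.
    Assumed interfaces: g2p.predict(word) -> pronunciation,
    g2p.update(word, pron) retrains incrementally, tts.speak(pron) plays the
    hypothesis, ask_user(word, pron) -> ('ok', None), ('fix', corrected_pron),
    or ('skip', None)."""
    lexicon = {}
    while word_list:
        word = word_list.pop(0)          # "best select" simplified: next word
        pron = g2p.predict(word)
        tts.speak(pron)                  # let the user listen to the hypothesis
        verdict, corrected = ask_user(word, pron)
        if verdict == "skip":
            continue
        if verdict == "fix":
            pron = corrected             # user improves P(wi)
        lexicon[word] = pron
        g2p.update(word, pron)           # update after each word -> effective training
    return lexicon
```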


Issues and Challenges
- How to make best use of the human?
  - Definition of successful completion
  - Which words to present, and in what order
  - How to be robust against mistakes
  - Feedback that keeps users motivated to continue
- How many words need to be solicited?
  - G2P complexity depends on the language (Spanish easy, English hard)
  - 80% coverage: hundreds of words (Spanish) to thousands (English)
  - G2P rule system perplexity:

    Language     Perplexity
    English       50.11
    Dutch         16.80
    German        16.70
    Afrikaans     11.48
    Italian        3.52
    Spanish        1.21
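One plausible way to compute such a G2P perplexity from an aligned lexicon is sketched below, assuming the perplexity of the phoneme distribution given the grapheme, weighted by grapheme frequency; the exact definition behind the numbers in the table may differ (e.g. it may condition on grapheme context).

```python
import math
from collections import Counter, defaultdict

def g2p_perplexity(aligned_pairs):
    """aligned_pairs: iterable of (grapheme, phoneme) alignment pairs,
    e.g. from an aligned pronunciation lexicon."""
    by_grapheme = defaultdict(Counter)
    total = 0
    for g, p in aligned_pairs:
        by_grapheme[g][p] += 1
        total += 1
    entropy = 0.0
    for g, phones in by_grapheme.items():
        n = sum(phones.values())
        h = -sum(c / n * math.log2(c / n) for c in phones.values())
        entropy += (n / total) * h     # weight by grapheme frequency
    return 2 ** entropy

# Near-1:1 mappings (Spanish-like) give perplexities close to 1, while
# graphemes that map to many phonemes (English-like) push the value up.
```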

Rapid Portability: LM
[Diagram: text data for the language model of a resource-low language is gathered via bridge (resource-rich) languages and by automatic extraction from the Internet and TV; the resulting text data and LM plug into the same pipeline (AM, Lex, LM, NLP/MT, TTS).]


Rapid Portability: TTS
[Diagram: the pipeline again, now highlighting the TTS component; the phone set and speech data appear as its input resource.]

Parametric TTS
- Text-to-speech for G2P learning:
  - Technique: phoneme-by-phoneme concatenation; speech is not natural but understandable (Marelie Davel)
  - Units are based on IPA phoneme examples
  - PRO: covers languages through simple adaptation
  - CONS: not good enough for speech applications
- Text-to-speech for applications:
  - Common technologies:
    - Diphone synthesis: too hard to record and label
    - Unit selection: too much to record and label
  - New technology: CLUSTERGEN trajectory synthesis
    - Clusters represent context-dependent allophones
    - PRO: can work with little speech (10 minutes)
    - CONS: speech sounds buzzy, lacks natural prosody

Mono vs. Multilingual Models
[Bar chart: MCD score for monolingual (Mono-M), multilingual (Multi-M), and language-preserving multilingual (Multi+-M) voices on CH, DE, EN, JA, CR, PO, RU, SP, SW, TU and over all languages; manual speaker selection.]

⇒ For all languages, monolingual TTS performs best
⇒ Multilingual models perform well, but only if knowledge about the language is preserved (Multi+); only a small amount of sharing actually happens

SPICE: Afrikaans – English
- Goal: build an Afrikaans – English speech translation system with SPICE
- Cooperation with Stellenbosch University and ARMSCOR
- A bilingual PhD student visited CMU for 3 months
- Afrikaans: related to Dutch and English; G2P very close; regular grammar; simple morphology
- SPICE: all components apply the statistical modeling paradigm
  - ASR: HMMs, n-gram LM (JRTk-ISL)
  - MT: statistical MT (SMT-ISL)
  - TTS: unit selection (Festival)
  - Dictionary: G2P rules using CART decision trees
- Text: 39 hansards; 680k words; 43k bilingual aligned sentence pairs
- Audio: 6 hours of read speech; 10k utterances; telephone speech (AST)

Time Effort
- Good results: ASR 20% WER; MT A-E (E-A) BLEU 34.1 (34.7), NIST 7.6 (7.9)
- Shared pronunciation dictionary (for ASR+TTS) and LM (for ASR+MT)
- Most time-consuming process: data preparation → reduce the amount of data!
- Still too much expert knowledge required (e.g. ASR parameter tuning!)

[Bar chart: days spent per component – AM (ASR), Lex, LM (ASR, MT), TM (MT), TTS, S-2-S – across the phases data preparation, training, tuning, evaluation, and prototype; individual efforts range from about 3 to 11 days.]

Conclusion
- Intelligent systems that learn a language
  - SPICE: learning by interaction with the (naive) user
  - Rapid portability to unseen languages
- Multilingual systems
  - Systems and data in multiple languages
  - Universal language-independent models
- Other applications and issues in multilinguality
  - Speech translation (earlier talk)
  - Multilingual interfaces
  - Code switching, non-native / accented speech
  - Non-verbal cue identification