Rapid Language Portability of Speech Processing Systems
Tanja Schultz, Carnegie Mellon University
InterACT: Center for Advanced Communication Technologies
CMU 11-733, February 13, 2007
Motivation
· Computerization: speech is a key technology
  · Mobile devices, ubiquitous information access
· Globalization: multilinguality
  · More than 6,900 languages in the world
  · Multiple official languages:
    · Europe has 20+ official languages
    · South Africa has 11 official languages
· Cross-cultural human-human interaction
· Human-machine interfaces in the mother tongue
⇒ Speech processing in multiple languages
Why Bother? Do we really need ASR in many languages?
Myth: "Everyone speaks English, why bother?"
· NO: About 6,500 different languages in the world
· Increasing number of languages on the Internet
· Humanitarian and military needs
· Rural areas, uneducated people, illiteracy
Myth: "It's just retraining on foreign data – no science"
· NO: Other languages bring unseen challenges, for example:
  · different scripts, no vowelization, no writing system
  · no word segmentation, rich morphology
  · tonality, click sounds
  · social factors: trust, access, exposure, cultural background
Language Characteristics
· Writing systems:
  Logographic: >10,000 Chinese hanzi, ~7,000 Japanese kanji + 3 other scripts
  Phonographic: ~5,600 Korean gulja; Arabic, Devanagari, Cyrillic, Roman: ~100 symbols
· G-2-P (grapheme-to-phoneme relation):
  Logographic: none
  Phonographic: ranges from close to far to complicated
    e.g. Spanish, Roman script: more or less 1:1
    e.g. Thai, Devanagari: consonant-vowel flips
    e.g. Arabic: vowels not written
    e.g. English: try "Phydough" (see the G-2-P sketch below)
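To make the G-2-P spectrum concrete, here is a minimal sketch of a rule-based grapheme-to-phoneme mapper. The rule table is a toy, incomplete subset invented for illustration (not the actual SPICE rules); the point is that a near-1:1 script like Spanish needs only a small table, while English exceptions like "Phydough" defeat this approach entirely.

```python
# Minimal illustrative G-2-P sketch: a near-1:1 script like Spanish can be
# handled by a small rule table; English cannot. Rules below are a toy subset.
SPANISH_RULES = {"a": "a", "d": "d", "i": "i", "o": "o", "s": "s",
                 "h": "",            # silent in Spanish
                 "ll": "ʝ", "ñ": "ɲ"}

def spanish_g2p(word):
    phones, i = [], 0
    while i < len(word):
        # try two-character graphemes (e.g. "ll") before single characters
        for size in (2, 1):
            g = word[i:i + size]
            if g in SPANISH_RULES:
                if SPANISH_RULES[g]:
                    phones.append(SPANISH_RULES[g])
                i += size
                break
        else:
            raise ValueError(f"no rule for {word[i]!r}")  # English would fail here often
    return phones

print(spanish_g2p("adios"))  # ['a', 'd', 'i', 'o', 's']
```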
Language Characteristics
· Prosody: tonal languages like Mandarin
· Sound system: simple vs. very complex systems
· Phonotactics: simple syllable structure vs. complex consonant clusters
· Morphology: short units, compounds, agglutination
  English: natural segmentation into short units – great!
  German: compounds, e.g. Donau-dampf-schiffahrts-gesellschafts-kapitäns-mütze ...
  Turkish: agglutination – looooong words, e.g.
    Osman-lı-laş-tır-ama-yabil-ecek-ler-imiz-den-miş-siniz
    "behaving as if you were of those whom we might consider not converting into Ottoman"
· Segmentation: some scripts do not provide segmentation in written form: Chinese, Japanese, Thai
Morphology & OOV
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 88]
Language & Compactness
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 90]
Challenges
o Algorithms are language independent but require data:
  o Dozens of hours of audio recordings and corresponding transcriptions
  o Pronunciation dictionaries for large vocabularies (>100,000 words)
  o Millions of words of written text corpora in the domains in question
  o Bilingual aligned text corpora
o BUT: such data are only available for very few languages:
  o Audio data: ≤ 40 languages; transcription takes up to 40x real time
  o Large-vocabulary pronunciation dictionaries: ≤ 20 languages
  o Small text corpora: ≤ 100 languages; large corpora: ≤ 30 languages
  o Bilingual corpora for very few language pairs, with English as the usual pivot
o Additional complications:
  o Combinatorial explosion (domain, speaking style, accent, dialect, ...)
  o Few native speakers at hand for minority (endangered) languages
  o Language peculiarities, social factors
Solution: Learning Systems
⇒ Intelligent systems that learn a language from the user
o Learning:
  o Interactive learning with the user in the loop
  o Statistical modeling approaches
o Efficiency:
  o Efficient learning algorithms for speech processing
  o Reduce the amount of data needed (saving time and costs) by a factor of 10
  o Speed up development cycles: days rather than months
⇒ Rapid language adaptation from universal models
o Bridge the gap between language and technology experts:
  o Technology experts do not speak all languages in question
  o Native users are not in control of the technology
SPICE: Speech Processing – Interactive Creation and Evaluation Toolkit
• National Science Foundation grant, 10/2004, 3 years
• Principal Investigator: Tanja Schultz
• Bridge the gap between technology experts and language experts:
  • Automatic Speech Recognition (ASR)
  • Machine Translation (MT)
  • Text-to-Speech (TTS)
• Develop web-based intelligent systems:
  • Interactive learning with the user in the loop
  • Rapid adaptation of universal models to unseen languages
• SPICE webpage: http://cmuspice.org
Speech Processing Systems
[Architecture diagram: spoken input ("Hello") feeds an ASR front end built from an acoustic model (AM, trained on a phone set and speech data), a pronunciation lexicon (Lex, e.g. hi /h//ai/, you /j//u/, we /w//i/, built from pronunciation rules), and a language model (LM, e.g. "hi you", "you are", "I am", built from text data); the recognized text passes through NLP/MT and TTS to produce speech and text output.]
Rapid Portability: Data
[Same architecture diagram, highlighting the phone set and speech data needed to build the acoustic model for a new language.]
GlobalPhone: Multilingual Database
O Widespread languages
O Native speakers
O Uniform data
O Broad domain
O Large text resources: Internet, newspapers
Corpus: Arabic, Ch-Mandarin, Ch-Shanghai, German, French, Japanese, Korean, Croatian, Portuguese, Russian, Spanish, Swedish, Tamil, Czech; in progress: Turkish, Thai, Creole, Polish, Bulgarian, ...
O 19 languages ... and counting
O ≥ 1,800 native speakers
O ≥ 400 hours of audio data
O Read speech
O Filled pauses annotated
Speech Recognition in 17 Languages
[Bar chart: word error rates (%) for recognizers in 17 languages, ranging from about 10% to 33.5%; among them English, Thai, Korean, Mandarin, Turkish, French, Japanese, Croatian, Spanish, Swedish, Bulgarian, Russian, Portuguese, German, Chinese, and Iraqi Arabic.]
Phones and Languages
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 86]
Rapid Portability: Acoustic Models
[Same architecture diagram, highlighting the acoustic model together with the phone set and speech data it is trained on.]
Acoustic Model Combination
Sound production is human, not language specific:
· International Phonetic Alphabet (IPA)
1) Build a universal sound inventory based on the IPA: 485 sounds are reduced to 162 IPA sound classes
2) Each sound class is represented by one "phoneme" trained through data sharing across languages (see the pooling sketch below):
  O m, n, s, l occur in all languages
  O p, b, t, d, k, g, f and i, u, e, a, o occur in almost all languages
  O no sharing of triphthongs and palatal consonants
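A minimal sketch of the data-sharing idea, assuming a toy phone-to-IPA mapping table (the real GlobalPhone inventory maps 485 language-specific phones onto 162 classes): training samples from different languages that map to the same IPA class are pooled into one training set per universal "phoneme".

```python
# Illustrative sketch: pool acoustic training data across languages by mapping
# language-specific phones to shared IPA sound classes. Mapping is a toy table.
from collections import defaultdict

PHONE_TO_IPA = {                      # assumed, illustrative mapping
    ("german", "sch"): "ʃ",
    ("english", "sh"): "ʃ",
    ("german", "n"): "n",
    ("spanish", "n"): "n",
}

def pool_by_ipa(samples):
    """samples: iterable of (language, phone, feature_vector) tuples."""
    pooled = defaultdict(list)
    for lang, phone, feats in samples:
        ipa = PHONE_TO_IPA.get((lang, phone))
        if ipa is not None:           # shared class exists -> pool across languages
            pooled[ipa].append(feats)
    return pooled                     # one training set per universal "phoneme"
```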
Acoustic Model Combination
[Diagram: three model-combination strategies — ML-Mix, ML-Sep, and ML-Tag.]
Acoustic Model Combination
[Bar chart: word error rates (%) for monolingual (Mono), ML-Tag7500, ML-Tag3000, and ML-Mix3000 acoustic models on Croatian, Japanese, Spanish, and Turkish; values range from about 13% to 37%.]
Combine Monolingual ASR Systems?
Can we build a language-independent ASR system?
o Language-independent acoustic modeling:
  · International Phonetic Alphabet (IPA): simple to implement, easy to port to other languages
  · Fully data-driven procedure: considers spectral properties and statistical similarities; applies to context-independent and context-dependent models
o Universal language model:
  · Combine the LMs of all languages to allow code switching
o Approach:
  · Train language-dependent and language-independent AM+LM
  · Evaluate in monolingual and multilingual mode
Language Independent ASR
[Bar chart: word error rates (%) up to about 30% for English (BN) and German (SST) under the configurations Baseline, Data-driven CD, Data-driven CD + LM, IPA, IPA + LM, Data-driven CI, and Data-driven CI + LM.]
Multilingual Speech Interfaces
o Large need for multilingual (mobile) devices:
  o System handles speech from multiple languages
  o System allows seamless language switches
  o Fast and small-footprint engine
o Multilingual speech interface:
  o Multilingual acoustic models
  o Multilingual dictionary
  o Multilingual language model / grammar
o Integration into one truly multilingual system
Code Switching
o "Code switching" is the phenomenon of changing the language while speaking
o Language model approaches can be divided by the degree to which they allow code switching (reference sentence: "The meeting that was held in a castle lasted two days. Everybody agreed that it was successful."):
  o At any time in an utterance:
    The meeting that was held in a castle dauerte zwei Tage. Everybody agreed that it was successful.
  o At certain positions within an utterance, e.g. phrase boundaries:
    The meeting das in einem Schloss stattfand lasted two days. Everybody agreed that it was successful.
  o At the sentence level, i.e. between utterances:
    The meeting that was held in a castle lasted two days. Alle waren sich einig dass es ein Erfolg war.
o We focus on sentence-level code switching
Multilingual Language Models (ML-LM)
o Simple merging of N-gram LM corpora
  o Pros: simple to implement, allows data sharing
  o Cons: requires balanced corpora; back-off to unigrams results in unwanted language switches; requires recalculation of the N-gram probabilities
o Meta LM: parallel combination of monolingual LMs through a central scoring function (see the sketch below)
  o Pros: control over language switches via a penalty function, simple user interface
  o Cons: overhead due to score combination
o ML grammars: combination of language-dependent rules into one grammar
  o Pros: sharing of classes across languages, efficient
  o Cons: hand-crafting, bilingual expert required
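A minimal sketch of the Meta-LM idea under stated assumptions: each monolingual LM is treated as a black-box scoring function, and a central combiner picks the best-scoring language per word while a penalty term discourages switching languages mid-utterance. The function names and the penalty value are illustrative, not the SPICE implementation.

```python
# Hypothetical Meta-LM combiner: monolingual LMs scored in parallel,
# with a penalty that discourages code switching within an utterance.
import math

def meta_lm_score(word, history, lms, switch_penalty=-5.0):
    """history: list of (word, language) pairs already decoded.
    lms: dict mapping language id -> callable(word, word_history) -> log-prob."""
    prev_lang = history[-1][1] if history else None
    best_score, best_lang = -math.inf, None
    for lang, lm in lms.items():
        score = lm(word, [w for w, _ in history])
        if prev_lang is not None and lang != prev_lang:
            score += switch_penalty          # penalize changing language
        if score > best_score:
            best_score, best_lang = score, lang
    return best_score, best_lang
```

A sentence-level variant would apply the penalty only within an utterance and reset it at sentence boundaries, matching the code-switching granularity chosen above.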
Experiments and Results
[Bar charts: word error rate (WER), sentence error rate (SER), and language-identification error rate (LID-ER) for English and German under the configurations mono n-gram, mono grammar, multi n-gram MIX, multi n-gram Meta, and multi grammar; error rates span roughly 0.2% (LID-ER) to 56% (SER). Companion charts show real-time factors (RTF) between about 0.48 and 2.0.]
Universal Sound Inventory
Speech production is independent of language ⇒ IPA
1) IPA-based universal sound inventory
2) Each sound class is trained by data sharing:
  O Reduction from 485 to 162 sound classes
  O m, n, s, l appear in all 12 languages
  O p, b, t, d, k, g, f and i, u, e, a, o in almost all
[Decision-tree diagram: the contexts of /k/ in the German words Blaukraut, Brautkleid, Brotkorb, Weinkarte are split by questions such as "-1 = plosive?" and "+2 = vowel?" into context-dependent models k(0), k(1), k(2).]
Problem: the contexts of sounds are language specific. How do we train context-dependent models for new languages?
Solution: 1) multilingual context decision trees, 2) specialization of the decision tree by adaptation
Sharing Across Languages
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 98]
Polyphones & Languages
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 87]
Polyphone Coverage
[Figure: Multilingual Speech Processing, Schultz & Kirchhoff (eds.), Chapter 4, p. 101]
Context Decision Trees
[Decision-tree diagram repeated: /k/ contexts from Blaukraut, Brautkleid, Brotkorb, Weinkarte split by "-1 = plosive?" and "+2 = vowel?" into k(0), k(1), k(2).]
O Context-dependent phones (± n context = polyphone)
O Trainability vs. granularity
O Divisive clustering based on linguistically motivated questions (see the clustering sketch below)
O One model is assigned to each context cluster
O Multilingual case: should we ignore language information? Depends on the application; yes in case of adaptation to new target languages
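A minimal sketch of divisive, question-based clustering under simplifying assumptions: contexts are split greedily by whichever yes/no question yields the largest drop in label entropy. The question set and data layout are invented for illustration; real systems split shared Gaussian statistics by likelihood gain rather than label entropy.

```python
# Illustrative divisive clustering of phone contexts by linguistic questions.
# Each sample: (left_phone, right_phone, acoustic_label). Toy entropy criterion.
import math
from collections import Counter

def entropy(samples):
    counts = Counter(label for _, _, label in samples)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

QUESTIONS = {                      # hypothetical question set
    "-1=plosive?": lambda s: s[0] in {"p", "b", "t", "d", "k", "g"},
    "+1=vowel?":   lambda s: s[1] in {"a", "e", "i", "o", "u"},
}

def best_split(samples):
    base = entropy(samples)
    best = (None, 0.0)
    for name, question in QUESTIONS.items():
        yes = [s for s in samples if question(s)]
        no = [s for s in samples if not question(s)]
        if not yes or not no:
            continue                # question does not divide this cluster
        gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(samples)
        if gain > best[1]:
            best = (name, gain)
    return best                     # question with the largest entropy reduction
```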
Polyphone Decision Tree Specialization (PDTS)
[Diagram sequence: an English polyphone tree is combined with trees from other languages into one multilingual polyphone tree; the polyphones found in Portuguese are then located within it.]
1. Tree pruning: select from all polyphones only those that are relevant for the particular target language (a sketch of this step follows below)
2. Tree regrowing: further specialize the tree according to the adaptation data
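A minimal sketch of the pruning step under simplifying assumptions: the multilingual tree is modeled as a binary tree whose leaves carry polyphone clusters, and a leaf survives only if some of its polyphones occur in the target language. The Node layout and prune routine are illustrative, not the actual PDTS code.

```python
# Hypothetical sketch: prune a multilingual polyphone tree to a target language.
from dataclasses import dataclass, field
from typing import Optional, Set

@dataclass
class Node:
    polyphones: Set[str] = field(default_factory=set)  # leaf: covered contexts
    yes: Optional["Node"] = None
    no: Optional["Node"] = None

def prune(node: Optional[Node], target: Set[str]) -> Optional[Node]:
    """Keep only subtrees whose leaf clusters contain target-language polyphones."""
    if node is None:
        return None
    if node.yes is None and node.no is None:           # leaf cluster
        node.polyphones &= target                      # drop unseen contexts
        return node if node.polyphones else None
    node.yes = prune(node.yes, target)
    node.no = prune(node.no, target)
    if node.yes is None and node.no is None:
        return None                                    # whole subtree irrelevant
    return node
```

Regrowing would then continue the divisive clustering from the surviving leaves, using only the target-language adaptation data.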
Rapid Portability: Acoustic Model
[Bar chart: word error rate (%) as a function of the amount of target-language adaptation data (0:15, 0:25, 1:30, and 16:30 hours) for four context trees — Ø Tree, ML-Tree, Po-Tree, and PDTS; WER falls from 69.1% with minimal data down to about 19%.]
Rapid Portability: Pronunciation Dictionary
[Same architecture diagram, highlighting the pronunciation lexicon built from pronunciation rules and text data: "adios" → /a/ /d/ /i/ /o/ /s/, "Hallo" → /h/ /a/ /l/ /o/, "Phydough" → ???]
Phoneme vs. Grapheme-based ASR
[Bar chart: word error rates (%) for phoneme-based, grapheme-based, and grapheme-based FTT recognizers on English, Spanish, German, Russian, and Thai; values range from about 11.5% to 36.4%.]
Problem: 1 grapheme ≠ 1 phoneme
[Decision-tree diagram: grapheme models such as AX-m, AX-b, and IX-m are clustered with questions like "0 = vowel?", "0 = obstruent?", "-1 = syllabic?", "-1 = obstruent?", and state-position questions ("0 = begin/mid/end-state?").]
Flexible Tree Tying (FTT): one decision tree
• Improved parameter tying
• Less over-specification
• Fewer inconsistencies
Dictionary: Interactive Learning (following the work of Davel & Barnard)
[Flowchart: a word list is extracted from text; the system selects the best word wi, generates a pronunciation P(wi) with its current G-2-P model, and synthesizes it via TTS. The user judges P(wi): if it is okay, the entry goes into the lexicon and the G-2-P model is updated; otherwise the user improves P(wi) or skips the word; wi is then deleted from the list.]
* G-2-P model options: explicit mapping rules, neural networks, decision trees, instance-based learning (grapheme context)
* Updating after each word wi makes training effective (a sketch of the loop follows below)
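A minimal sketch of this interactive loop, with assumed interfaces for the G-2-P model, the TTS engine, and the user prompt; select_most_informative is a hypothetical stand-in for the word-ordering strategy discussed on the next slide.

```python
# Hypothetical sketch of the interactive lexicon-building loop (after
# Davel & Barnard). All interfaces (g2p, tts, ask_user, selector) are assumed.
def build_lexicon(word_list, g2p, tts, ask_user, select_most_informative):
    lexicon = {}
    while word_list:
        word = select_most_informative(word_list, g2p)  # which word helps most?
        pron = g2p.predict(word)
        tts.speak(pron)                     # let the user hear the hypothesis
        verdict = ask_user(word, pron)      # "ok", "skip", or a corrected pron
        if verdict == "ok":
            lexicon[word] = pron
            g2p.update(word, pron)          # online update after every word
        elif verdict != "skip":
            lexicon[word] = verdict         # user-corrected pronunciation
            g2p.update(word, verdict)
        word_list.remove(word)
    return lexicon
```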
Issues and Challenges
o How to make the best use of the human?
  o Definition of successful completion
  o Which words to present, and in what order
  o How to be robust against user mistakes
  o Feedback that keeps users motivated to continue
o How many words need to be solicited?
  o G-2-P complexity depends on the language (Spanish easy, English hard)
  o 80% coverage takes hundreds of words (Spanish) to thousands (English)
  o G-2-P rule system perplexity:
Language     Perplexity
English      50.11
Dutch        16.80
German       16.70
Afrikaans    11.48
Italian       3.52
Spanish       1.21
Rapid Portability: LM
[Diagram: for resource-low languages, text data is gathered by inquiry through bridge languages from resource-rich sources such as the Internet and TV, followed by automatic extraction; the harvested text feeds the LM in the architecture diagram. A sketch of the LM estimation step follows below.]
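A minimal sketch of turning harvested text into an N-gram LM, assuming the text has already been crawled and cleaned; the maximum-likelihood bigram estimate stands in for the smoothed, backed-off models a real system would use.

```python
# Toy bigram LM estimated from harvested text (assumed already cleaned).
from collections import Counter

def train_bigram_lm(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens[:-1])
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    # maximum-likelihood estimate; a real LM would add smoothing and back-off
    return {bg: count / unigrams[bg[0]] for bg, count in bigrams.items()}

lm = train_bigram_lm(["hi you", "you are", "I am"])  # text data from the diagram
print(lm[("<s>", "you")])   # 1/3
```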
Rapid Portability: TTS
[Same architecture diagram, highlighting the TTS component and the phone set and speech data it shares with the acoustic model.]
Parametric TTS
o Text-to-speech for G-2-P learning:
  o Technique: phoneme-by-phoneme concatenation; speech is not natural but understandable (Marelie Davel)
  o Units are based on IPA phoneme examples
  o Pro: covers new languages through simple adaptation
  o Con: not good enough for speech applications
o Text-to-speech for applications:
  o Common technologies:
    o Diphone synthesis: too hard to record and label
    o Unit selection: too much to record and label
  o New technology: clustergen trajectory synthesis
    o Clusters represent context-dependent allophones
    o Pro: can work with little speech (10 minutes)
    o Con: speech sounds buzzy, lacks natural prosody
Mono vs. Multilingual Models
[Bar chart: MCD scores (0–10 scale) for monolingual (Mono), multilingual (Multi), and language-preserving multilingual (Multi+) voices across CH, DE, EN, JA, CR, PO, RU, SP, SW, TU, and all languages combined; manual speaker selection.]
⇒ For all languages, monolingual TTS performs best
⇒ Multilingual models perform well, but only if knowledge about the language is preserved (Multi+); only a small amount of sharing actually happens
SPICE: Afrikaans – English
o Goal: build an Afrikaans–English speech translation system with SPICE
o Cooperation with Stellenbosch University and ARMSCOR
o A bilingual PhD student visited CMU for 3 months
o Afrikaans: related to Dutch and English; G-2-P very close; regular grammar; simple morphology
o SPICE: all components apply the statistical modeling paradigm:
  o ASR: HMMs, N-gram LM (JRTk-ISL)
  o MT: statistical MT (SMT-ISL)
  o TTS: unit selection (Festival)
  o Dictionary: G-2-P rules using CART decision trees
o Text: 39 Hansards; 680k words; 43k bilingual aligned sentence pairs
o Audio: 6 hours of read telephone speech; 10k utterances (AST)
Time Effort
o Good results: ASR 20% WER; MT A→E (E→A): BLEU 34.1 (34.7), NIST 7.6 (7.9)
o Shared pronunciation dictionary (for ASR+TTS) and LM (for ASR+MT)
o Most time-consuming process: data preparation → reduce the amount of data!
o Still too much expert knowledge required (e.g. ASR parameter tuning!)
[Bar chart: development time in days, roughly 3 to 11 per stage, for the components AM (ASR), Lex, LM (ASR, MT), TM (MT), TTS, and the end-to-end S-2-S system, across the stages data preparation, training, tuning, evaluation, and prototype.]
Conclusion
o Intelligent systems that learn a language:
  o SPICE: learning by interaction with the (naive) user
  o Rapid portability to unseen languages
o Multilingual systems:
  o Systems and data in multiple languages
  o Universal language-independent models
o Other applications and issues in multilinguality:
  o Speech translation (earlier talk)
  o Multilingual interfaces
  o Code switching, non-native / accented speech
  o Non-verbal cue identification