Speech Processing for Audio Indexing
Lori Lamel, with J.L. Gauvain and colleagues from the Spoken Language Processing Group, CNRS-LIMSI
Introduction
• Huge amounts of audio data produced (TV, radio, Internet, telephone)
• Automatic structuring of multimedia, multilingual data
• Much accessible content in the audio and text streams
• Speech and language processing technologies are key components for indexing
• Applications: digital multimedia libraries, media monitoring, News on Demand, Internet watch services, speech analytics, ...
Processing Audiovisual Data
[Diagram: the audio signal feeds audio segmentation and speech transcription; language recognition, speaker recognition, emotion detection, named entities, translation, Q&A, audio analysis, dialog, generation, and synthesis operate on the results, which are stored as XML; the components build on acoustic-phonetic modeling, corpus-based studies, and perceptual tests]
Some Recent Projects
• Conversational interfaces & intelligent assistants: Computers in the Human Interaction Loop (chil.server.de), human-human communication mediated by the machine
  – multimodality (speech, image, video)
  – “who & where”: speaker identification and tracking
  – “what”: linguistic content (far-field ASR)
  – “why & how”: style, attitude, emotion
• Speech translation: Technology and Corpora for Speech to Speech Translation (http://www.tc-star.org); Global Autonomous Language Exploitation (DARPA GALE)
• Developing multimedia and multilingual indexing and management tools: QUAERO (www.quaero.org)
• Speech analytics: Infom@gic/CallSurf
Why Is Speech Recognition Difficult?
Automatic conversion of spoken words into text
Text: I do not know why speech recognition is so difficult
Continuous: Idonotknowwhyspeechrecognitionissodifficult
Spontaneous: Idunnowhyspeechrecnitionsodifficult
Pronunciation: YdonatnowYspiCrEkxgnISxnIzsodIfIk∧lt
YdonowYspiCrEknISNsodIfxk∧l
YdontnowYspiCrEkxnISNsodIfIk∧lt
YdxnowYspiCrEknISNsodIfxk∧lt
Important variability factors:
Speaker: physical characteristics (gender, age, ...), accent, emotional state, situation (lecture, conversation, meeting, ...)
Acoustic environment: background noise (cocktail party, ...), room acoustics, signal capture (microphone, channel, ...)
[after lecture ’Speech Recognition and Understanding’ by Tanja Schultz and John Kominek]
Audio Samples (1) • WSJ Read
PREVIOUSLY HE WAS PRESIDENT OF CIGNA’S PROPERTY CASUALTY GROUP FOR FIVE YEARS AND SERVED AS CHIEF FINANCIAL OFFICER
• WSJ Spontaneous ACCORDING TO KOSTAS STOUPIS AN ECONOMIST WITH THE GREEK GOVERNMENT IT IS IT IS NOT YET POSSIBLE FOR GREECE TO TO MAINTAIN AN EXPORT LEVEL THAT IS EQUIVALENT TO EVEN THE THE SMALLEST LEVEL OF THE OTHER WESTERN NATIONS
• Broadcast News ON WORLD NEWS TONIGHT THIS WEDNESDAY TERRY NICHOLS AVOIDS THE DEATH SENTENCE FOR HIS ROLE IN THE OKLAHOMA CITY BOMBING. ONE OF THE WORLD’S MOST POPULAR AIRPLANES THE GOVERNMENT ORDERS AN IMMEDIATE SAFETY INSPECTION. AND OUR REPORT ON THE CLONING CONTROVERSY BECAUSE ONE AMERICAN SCIENTIST SAYS HE’S READY TO GO.
Audio Samples (2) • Conversational Telephone Speech
Hello. Yes, This is Bill. Hi Bill. This is Hillary. So this is my first call so I haven’t a clue as to - - what we’re supposed to do yeah. It’s pretty funny that we’re Bill and Hillary talking on the phone though isn’t it? Ah I didn’t even think about that. {laughter} uhm we’re supposed to talk about did you get the topic. Yes. So we’re supposed to talk about what changes if any you have made since ...
• CHIL, spontaneous, non-native
I just give you a brief overview of [noise] what’s going on in uh audio and why we bother with all these microphones in the the materials of the walls , the carpet and all that ...
Audio Samples (3) • EPPS
Mister President, in these questions we see the European social model in flashing lights. The European social model is a mishmash which pleases no one: a bit of the free market here and a bit of welfare state there, mixed with a little green posturing.
• EPPS, spontaneous Thank you very much, Mister Chairman. Hum When I came to this room I I have seen and heard homophobia but sort of vaguely ; through T. V. and the rest of it. But I mean listening to some of the Polish colleagues speak here today , especially Roszkowski , Pek , Giertych and Krupa,...
• EPPS, non native My main my main problem with the environmental proposals of the Commission is that they do not correspond to the objectives of the Sixth Environmental Action Plan. As far as the traffic is concerned, the sixth E. A. P. stressed the decoupling of transport growth and G. D. P. growth.
Indicative ASR Performance

Task          Condition                          Word Error
Dictation     read speech, close-talking mic.    3-4% (humans 1%)
              read speech, noisy (SNR 15dB)      10%
              read speech, telephone             20%
              spontaneous dictation              14%
              read speech, non-native            20%
Found audio   TV & radio news broadcasts         10-15% (humans 4%)
              TV documentaries                   20-30%
              Telephone conversations            20-30% (humans 4%)
              Lectures (close mic)               20%
              Lectures (distant mic)             50%
              EPPS                               8%
[Figure: NIST STT benchmark test history (from NIST): word error rate on a log scale (roughly 1%-100%) versus year, 1988-2011, for read speech (1k, 5k, 20k vocabularies; noisy; non-English; varied microphones), air travel planning kiosk speech, broadcast speech (news English 1X/10X, Mandarin, Arabic), Switchboard (II, cellular), CTS Fisher/Arabic/Mandarin (UL), and meeting speech (IHM, MDM, SDM)]
Some Research Topics
• Development of generic technology
• Speaker independent, task independent (large vocabulary > 200k words)
• Robust (noise, microphone, ...)
• Improving model accuracy
• Processing found audio: varied speaking styles, uncontrolled conditions
• Multilinguality: low e-resourced languages
• Developing systems with less supervision
• Model adaptation: keeping models up-to-date
• Annotation of metadata (speaker, language, topic, emotion, style, ...)
Internet Languages
[Charts: Internet users in millions per language, and increase (%) in Internet users per language from 2000 to 2007; source: internetworldstats.com, Nov 30, 2007]
ASR Systems
[Diagram: system architecture. Training: a text corpus undergoes normalization and N-gram estimation to produce the language model P(W); a speech corpus is manually transcribed, features are extracted, and HMM training with a training lexicon yields the acoustic models f(X|H). Decoding: a speech sample Y passes through the acoustic front-end to give features X, and the decoder combines the language model P(W), the recognizer lexicon P(H|W), and the acoustic models f(X|H) to output the transcription W*]
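Read as a formula, the diagram corresponds to the standard MAP decoding criterion; a minimal statement in the diagram's own symbols (the sum over pronunciation/state hypotheses H is in practice often approximated by its best path):

```latex
W^{*} = \operatorname*{arg\,max}_{W} \; P(W) \sum_{H} P(H \mid W)\, f(X \mid H)
```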
System Components
• Acoustic models: HMMs (10-20M parameters), allophones (triphones), discriminative features, discriminative training
• Pronunciation dictionary with probabilities
• Language models: statistical N-gram models (10-20M N-grams), model interpolation (see the sketch below), connectionist LMs, text normalization
• Decoder: multipass decoding, unsupervised acoustic model adaptation, system combination (ROVER, cross-system adaptation)
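One concrete reading of the model-interpolation item above: linearly mix the probabilities of two component LMs, choosing the weight on held-out data. A minimal Python sketch, not LIMSI's implementation; the inputs p_in and p_bg are hypothetical per-word probability lists produced by any two n-gram models.

```python
import math

def best_lambda(p_in, p_bg, grid=(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)):
    """Pick the linear-interpolation weight that minimizes held-out perplexity."""
    def ppl(lam):
        # log-probability of the held-out text under the mixed model
        logp = sum(math.log(lam * a + (1.0 - lam) * b) for a, b in zip(p_in, p_bg))
        return math.exp(-logp / len(p_in))
    return min(grid, key=ppl)

print(best_lambda([0.20, 0.05, 0.10], [0.10, 0.20, 0.02]))
```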
Multi Layer Perceptron Front-End
• Features incorporate longer temporal information than cepstral features
• MLP NN projects the high-dimensional space onto phone state targets
• LP-TRAP features (500 ms), 4-layer bottleneck architecture (HMM state targets, 3rd layer size is 39), PCA [Hermansky & Sharma, ICSLP’98; Fousek, 2007]; a sketch follows the results below
• Studied various configurations & ways to combine with cepstral features
• Main results (bnat06):
WER (%)          PLP    MLPwLP   PLP + MLPwLP
No adaptation    21.8   20.7     19.2
SAT+CMLLR+MLLR   19.0   18.9     17.8
– Significant WER reduction without acoustic model adaptation
– Best combination via feature vector concatenation (78 features)
– Acoustic model adaptation is less effective for MLP features
– ROVER (PLP, PLP+MLP) further reduces the WER
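A minimal sketch of the bottleneck front-end described above, assuming illustrative layer sizes: only the 39-unit third layer, the HMM-state targets, and the 500 ms input context come from the slide; the PCA step is omitted and the class and remaining sizes are hypothetical, not the LIMSI configuration.

```python
import torch
import torch.nn as nn

class BottleneckMLP(nn.Module):
    """4-layer MLP over a long temporal context; the 39-dim third layer
    provides the features that are concatenated with 39 cepstral
    coefficients to give the 78-feature vector mentioned above."""
    def __init__(self, context_dim=500, hidden=1500, n_states=135):
        super().__init__()
        self.hidden1 = nn.Sequential(nn.Linear(context_dim, hidden), nn.Sigmoid())
        self.bottleneck = nn.Linear(hidden, 39)      # feature layer
        self.classify = nn.Sequential(nn.Sigmoid(), nn.Linear(39, n_states))

    def forward(self, x):                            # logits over HMM state targets
        return self.classify(self.bottleneck(self.hidden1(x)))

    def features(self, x):                           # 39-dim bottleneck features
        return self.bottleneck(self.hidden1(x))

# feats = torch.cat([plp_39, mlp.features(traps_500)], dim=-1)  # 78 features
```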
Pronunciation Modeling
• Pronunciation dictionary is an important system component
• Usually represented with phones + non-speech units (silence, breath, filler)
• Well-suited to recognition of read and prepared speech
• Modeling variants is particularly important for spontaneous speech: multiwords (“I don’t know”)
• Explore (semi-)automatic methods to generate pronunciation variants
• Explicit use of phonetic knowledge to hypothesize rules (a toy sketch follows):
  – anti, multi: /i/ - /ay/ variation
  – inter: /t/ vs nasal flap, retroflex schwa vs schwa
  – stop deletion/insertion
  – y-insertion (coupon, news, super)
• Data driven: confront pronunciations with very large amounts of data
• Estimate probabilities of pronunciation variants (very important for Arabic)
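As an illustration of the knowledge-based rules, here is a toy generator covering just two of them; the rules and phone strings are deliberately simplified and follow the slide's informal notation rather than a real phone set.

```python
def variants(word, base_pron):
    """Toy rule-based pronunciation-variant generation (illustration only)."""
    prons = {base_pron}
    if word.startswith(("anti", "multi")):       # /i/ - /ay/ variation
        prons |= {p.replace("i", "ay", 1) for p in list(prons)}
    if "u" in base_pron:                         # y-insertion (coupon, news, super)
        prons |= {p.replace("u", "yu", 1) for p in list(prons)}
    return prons

print(variants("coupon", "kupan"))   # {'kupan', 'kyupan'}
```

Run on "coupon", this reproduces the two variants shown on the next slide.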
Example: coupon
Pronunciation variants: kyupan, kupan
HMM
[Diagram: three-state left-to-right HMM with self-loop probabilities a11, a22, a33 and forward transition probabilities a12, a23; each state i emits observations with output density f(.|si), here over the observation sequence x1, ..., x8]
Example: interest
Pronunciation variants: InXIst, IntrIst
Pronunciation Variants - BN vs CTS (100 hours of data)

Word          Pronunciation    BN    CTS
interest      IntrIst         238    488
              IntXIst           3     33
              InXIst            0     11
interested    IntrIstxd       126    386
              IntXIstxd         3     80
              InXIstxd         18    146
interests     IntrIss          52     53
              IntrIsts         19     30
              IntXIsts          3      2
              IntXIss           3      1
interesting   IntrIst|G       193   1399
              IntXIst|G         8    314
              InXIst|G         21    463

• Similar but less frequent words: interestingly, disinterest, disinterested, intercom, interconnected, interfere, ...
• Estimate pronunciation probabilities (as below), but no generalization
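Turning such alignment counts into variant probabilities is a relative-frequency estimate; a minimal sketch using the pooled BN+CTS counts for "interest" from the table (illustration only):

```python
from collections import Counter

counts = Counter({("interest", "IntrIst"): 238 + 488,
                  ("interest", "IntXIst"): 3 + 33,
                  ("interest", "InXIst"): 0 + 11})

def pron_probs(word):
    """Relative-frequency estimate of P(variant | word) from alignment counts."""
    total = sum(c for (w, _), c in counts.items() if w == word)
    return {v: c / total for (w, v), c in counts.items() if w == word}

print(pron_probs("interest"))   # IntrIst dominates at about 0.94
```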
Morphological Decomposition
• Morphological decomposition proposed to address the problem of lexical variety (Arabic, German, Turkish, Finnish, Amharic, ...)
• Rule-based vs data-driven approaches
• Reduces the out-of-vocabulary rate
• Improves n-gram estimation of infrequent words
• Tendency to create errors due to acoustic confusions of small units
• Some proposed solutions (see the sketch below):
  – inhibit decomposition of frequent words
  – incorporate linguistic knowledge (recognizer confusions, assimilation)
• Recognition performance of optimized systems is typically comparable to, and can be combined with, word-based systems
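A toy sketch of the "inhibit decomposition of frequent words" idea; the suffix list, frequency threshold, and marker convention are invented for illustration (real systems are rule-based or data-driven):

```python
def decompose(word, freq, suffixes=("lAr", "In", "dA"), min_freq=1000):
    """Split a known suffix off infrequent words; keep frequent words whole."""
    if freq.get(word, 0) >= min_freq:            # frequent words stay intact
        return [word]
    for suf in suffixes:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return [word[:-len(suf)], "+" + suf]
    return [word]

print(decompose("kitaplar", {"kitaplar": 12}, suffixes=("lar",)))  # ['kitap', '+lar']
```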
Continuous Space Language Models
• Project words onto a continuous space (about 300 dimensions)
• Estimate projection and n-gram probabilities with a neural network (see the sketch below)
• Better generalization for unseen n-grams (compared to general back-off methods)
• Efficient training with a resampling algorithm
• Only model the N most common words (10-12k)
• Interpolated with a standard 4-gram backoff LM
• 20% perplexity reduction, 0.5-0.8% reduction in WER across a variety of languages and tasks [Schwenk, 2007]
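A structural sketch of such a model after [Schwenk, 2007]; all sizes are illustrative placeholders except the ~300-dim projection and the 10-12k shortlist mentioned above.

```python
import torch
import torch.nn as nn

class ContinuousSpaceLM(nn.Module):
    """Continuous-space n-gram LM: project the n-1 context words into a
    continuous space, then predict the next word over a shortlist of the
    most frequent words."""
    def __init__(self, vocab=65000, shortlist=12000, dim=300, order=4, hidden=500):
        super().__init__()
        self.proj = nn.Embedding(vocab, dim)               # shared word projection
        self.hid = nn.Sequential(nn.Linear((order - 1) * dim, hidden), nn.Tanh())
        self.out = nn.Linear(hidden, shortlist)            # shortlist softmax

    def forward(self, context):                            # (batch, order-1) word ids
        e = self.proj(context).flatten(start_dim=1)
        return torch.log_softmax(self.out(self.hid(e)), dim=-1)
```

In use, these log-probabilities are interpolated with the standard backoff 4-gram LM, which also covers words outside the shortlist.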
Reducing Training Costs - Motivation
• State-of-the-art speech recognizers are trained on
  – hundreds to thousands of hours of transcribed audio data
  – hundreds of millions to several billion words of text
  – large recognition lexicons
• Less e-represented languages
  – Over 6000 languages, about 800 written
  – Poor representation in accessible form
  – Lack of economic and/or political incentives
  – PhD theses: Vietnamese, Khmer [Le, 2006], Somali [Nimaan, 2007], Amharic, Turkish [Pellegrini, 2008]
  – SPICE: Afrikaans, Bulgarian, Vietnamese, Hindi, Konkani, Telugu, Turkish, but also English, German, French [Schultz, 2007]
Questions
• How much data is needed to train the models?
• What performance can be expected with a given amount of resources?
• Are audio or textual training data more important for ASR in less e-represented languages (languages with texts available)?
• Assess the impact on WER of the quantity of:
  – speech material (acoustic modeling)
  – texts (transcriptions)
  – texts (newspapers, newswires available on the Web)
• Some studies on Amharic (a low e-resourced language) [Pellegrini, 2008]
OOV and WER Rates (Amharic, 2-hour development set)
[Plots: OOV (%) and WER (%, log scale) versus text corpus size (10k to 4.6M words), for 2h, 5h, 10h, and 35h of transcribed audio]
• Audio transcriptions are closer to the development data
• Even with 4.6M words of text, more audio transcripts help
Influence of Transcriptions in the LM (Amharic, 2-hour development set)
[Plots: WER (%, log scale) versus text corpus size (10k to 4.6M words) for LMs built from 2h, 10h, and 35h of manual transcripts (MA), alone and augmented with lightly supervised transcripts (ML): MA2h-ML10h, MA2h-ML35h, MA10h-ML35h]
• 2h of audio training data is not sufficient
• Good operating point: 10h audio, 1M words of text
• Collecting texts is more efficient than transcribing audio at this operating point
Reducing Training Supervision
• Use of quick transcriptions (less detailed, less costly to produce)
  – low-cost approximate annotations (commercial transcripts, closed captions)
  – raw data with closely related contemporary texts
  – raw data with complementary texts (only newspaper, newswire, ...)
• Lightly supervised and unsupervised training using the speech recognizer [Zavaliagkos & Colthurst 1998; Kemp & Waibel 1999; Jang & Hauptmann 1999; Lamel, Gauvain & Adda 2000]
• Standard MLE training assumes that transcriptions are perfect and aligned with the audio
• Flexible alignment using a full N-gram decoder and a garbage model
• Can skip over unknown words, transcription errors
• Implicit modeling of short vowels in Arabic
Reducing Training Supervision - Non-speech
• Use existing models (trained on manual annotations) to automatically detect fillers and breath noises
• Add these to the reference transcriptions

               breath   filler
Manual (50h)   8.1%     0.6%
BN (967h)      4.9%     3.4%
BC (162h)      5.4%     6.8%

• Small reduction in word error rate
Reducing Supervision: Pronunciation Generation
• Pronunciation generation in Arabic is relatively straightforward from a vocalized word form
• Most sites working on Arabic use the Buckwalter morphological analyzer to generate all possible vocalizations
• But many words are not handled
• Introduced a generic vowel (@) model as a placeholder
• Developed rules to automatically generate pronunciations with a generic vowel from the non-vocalized word form (sketched below)
• Enables training with partial information, e.g. ktb:
  k@t@b@  k@t@b@n  kt@b@  kt@b@n  k@tb@  k@tb@n  k@t@b
• Reduces the OOV rate of the recognition lexicon
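A simplified sketch of the generic-vowel idea: insert an optional @ after each consonant of the non-vocalized form. This toy version does not reproduce all the forms above (e.g. the +n endings) nor LIMSI's actual rules.

```python
from itertools import product

def generic_vowel_forms(root):
    """All ways of optionally inserting the generic vowel '@' after each consonant."""
    options = [(c, c + "@") for c in root]
    return {"".join(choice) for choice in product(*options)}

print(sorted(generic_vowel_forms("ktb")))
# ['k@t@b', 'k@t@b@', 'k@tb', 'k@tb@', 'kt@b', 'kt@b@', 'ktb', 'ktb@']
```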
Alignment Example: li AntilyAs
Aligned phone sequence: l @ n t @ l y A s
(LBC NEWS 20060601 195801-2063.73+0.49)
Towards Applications
• Transcription for Translation (TC-STAR): www.tc-star.org
• Using the WEB as a Source of Training Data
• Indexing Internet audio-video data: Audiosurf, www.audiosurf.org
• Open Domain Question/Answering: www.lsi.upc.edu/˜qast/2008
• Open domain Interactive Information Retrieval by Telephone (RITel): ritel.limsi.fr
• Speech Analytics (Infom@gic/CallSurf)
ASR Transcripts Versus Texts
– Spoken language, disfluencies (hesitations, fillers, ...)
– No document/sentence boundaries
– Lack of capitalization, punctuation
– Transcription errors (substitutions, deletions, insertions)
  Strong correlation between WER and IE metrics (NIST)
  Slightly more errors on NE-tagged words
  Vocabulary limitation essentially affects person names
+ Known vocabulary, i.e. no orthographic errors, no need for normalization
+ Time codes
+ Confidence measures
Transcription for Translation (TC-STAR)
• Three languages (EN, ES, MAN), 3 tasks (European Parliament, Spanish Parliament, BN)
• Improving modeling and decoding techniques
  – audio segmentation, discriminative features, duration models,
  – automatic learning, discriminative training,
  – pronunciation models, NN language model, decoding strategies, ...
• Model adaptation methods
  – speaker adaptive training, acoustic and language model adaptation,
  – cross-system adaptation, adaptation to noisy environments, ...
• Integration of ASR and Translation
  – transcription errors, confidence scores, word graph interface
  – punctuation for MT
  – transcription normalization (98 vs. ninety eight, hesitations, ...)
TC-Star ASR Progress
[Plot: WER (log scale, roughly 5-30%) by evaluation year, 2004-2007, for the best TC-STAR system and the LIMSI system, with the OCT04 BN system as reference]
• More task-specific data
• More complex models
• Improved modeling and decoding techniques: automatic learning, discriminative training, pronunciation modeling
• Demo
Using the WEB as a Source of Training Data (work done in collaboration with Vecsys Research)
• How to build a system without speaking the language (e.g. Russian) and without transcribed data?
• Main steps (the alignment filter is sketched below)
  – Get texts from the WEB, and basic linguistic knowledge for letter-to-sound rules
  – Get BN audio data with incomplete and approximate transcripts
  – Normalize texts/transcripts and build an n-gram LM
  – Start with seed AMs from other languages
  – Align incomplete transcripts and audio, discard badly aligned data
  – Train AMs
  – Iterate the last 2 steps to minimize the (overestimated) WER
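The "discard badly aligned data" step can be pictured as an agreement filter between the recognizer output and the approximate transcript; a minimal sketch with an invented threshold, not the actual LIMSI criterion:

```python
import difflib

def agreement(hyp_words, ref_words):
    """Fraction of words on which hypothesis and approximate transcript agree."""
    m = difflib.SequenceMatcher(a=hyp_words, b=ref_words)
    matched = sum(block.size for block in m.get_matching_blocks())
    return matched / max(len(hyp_words), len(ref_words), 1)

def keep_segment(hyp_words, ref_words, threshold=0.7):
    return agreement(hyp_words, ref_words) >= threshold

print(keep_segment("a b c d".split(), "a b x d".split()))   # True (0.75)
```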
Results: Approximate WER
[Plot: estimated WER (%, 0-60 scale) versus training iteration (2-10) for Russian, with all data ("Russian WER") and with data selection ("Russian WER select")]
• Estimated WER using approximate transcripts
• Approximate WER reduced from 30% to 10%
• Similar experiments underway with Spanish and Portuguese
AUDIOSURF System: Indexing Internet audio/video
• Processed data: 259 sources, 80h/day, 33 languages
• Transcribing and indexing: 203 sources, 6 languages

Year   English  French   German  Arabic  Spanish  Mandarin  All
2005   3889h    10735h   964h    1756h   60h      -         17404h
2006   5794h    14950h   1639h   3383h   1369h    -         27135h
2007   7441h    18675h   1970h   4634h   2337h    2473h     37530h
2008   8173h    19943h   1972h   4736h   2401h    2511h     39736h
Open Domain Question/Answering in Audio Data
• Automatic structuring of audio data
• Detect semantically meaningful units
• Specific entity taggers (named, task, linguistic entities)
  – rewrite rules (work like local grammars with specific dictionaries)
  – named and linguistic entity tagging are language dependent but task independent
  – task entity tagging is language independent but task dependent
• Projects: RITel (ritel.limsi.fr), CHIL (chil.server.de), Quaero (www.quaero.org)
• Participation in the CLEF tracks QAst07 and QAst08 (www.lsi.upc.edu/˜qast/2008) [Rosset et al, ASRU 2007]
  – Compare QA performance on manual and ASR transcripts of varied audio
  – Best accuracy about 0.4-0.5, generally degrading with increasing WER
RITel System: Information Retrieval by Telephone [Rosset et al, 2005, 2006] ritel.limsi.fr
• Fully integrated open domain Interactive Information Retrieval system
• Speech detection and recognition
• Non-contextual analysis (for texts and speech transcripts)
• General information retrieval
• IR on databases and dialogic interactions
• Management of dialog history
• Natural language generation
• Data collection platform (natural, real-time)
Question: What is the Vlaams Blok?
Manual transcript: the Belgian Supreme Court has upheld a previous ruling that declares the Vlaams Blok a criminal organization and effectively bans it .
Answer: criminal organisation
Extracted portion of an automatic transcript (CTM file format):
(...)
20041115 1705 1735 EN SAT 1 1018.408 0.440 Vlaams 0.9779
20041115 1705 1735 EN SAT 1 1018.848 0.300 Blok 0.8305
20041115 1705 1735 EN SAT 1 1019.168 0.060 a 0.4176
20041115 1705 1735 EN SAT 1 1019.228 0.470 criminal 0.9131
20041115 1705 1735 EN SAT 1 1019.858 0.840 organisation 0.5847
20041115 1705 1735 EN SAT 1 1020.938 0.100 and 0.9747
(...)
Answer: 1019.228 1019.858
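Because every word carries a time code and a confidence score, a time-coded answer such as "1019.228 1019.858" maps back to words with a few lines of parsing. A sketch assuming the field layout shown above (the last four fields are start, duration, word, confidence):

```python
def parse_ctm(lines):
    """Yield (start, duration, word, confidence) from CTM-style records."""
    for line in lines:
        fields = line.split()
        if len(fields) < 4 or line.startswith("("):    # skip the (...) markers
            continue
        start, dur, word, conf = fields[-4], fields[-3], fields[-2], fields[-1]
        yield float(start), float(dur), word, float(conf)

def answer_words(lines, t0, t1):
    """Words whose start time falls inside the answer span [t0, t1]."""
    return " ".join(w for s, d, w, c in parse_ctm(lines) if t0 <= s <= t1)

# answer_words(ctm_lines, 1019.228, 1019.858) -> 'criminal organisation'
```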
Speech Analytics
• Statistical analyses of huge speech corpora [Infom@gic/CallSurf]
• Client satisfaction, service efficiency, ...
• Extraction of metadata (content, topics, quality of interaction, ...)
• Confidentiality and privacy issues
• Anonymized transcriptions
Speech Analytics Sample Extract (CallSurf)
A: oui bonjour Monsieur, EDF pro à l’appareil.
C: oui. ah oui oui oui.
A: ouais je je vous appelle Monsieur, je vous avais eu il y a quelques jours au téléphone,
C: tout à fait.
A: et je vous avais transféré vers euh des conseillers euh nouvelle offre qui m’ont renvoyé un mail ce matin, en me disant de vous recontacter puisque la demande de mise en service par le premier conseiller que vous aviez eu à l’origine n’avait pas été faite.
C: ah bon
A: ouais. donc euh bon c’est pas coupé sur place. c’est pas le problème ...
Detailed Transcription
edfp h5 fre 00000051 spk1 16.510 17.144 allô oui
edfp h5 fre 00000051 spk2 17.144 18.140 {fw} Monsieur Dupont
edfp h5 fre 00000051 spk1 18.140 18.563 oui
edfp h5 fre 00000051 spk2 18.563 20.827 oui Ma(demoiselle) Mademoiselle Brun EDF pro
edfp h5 fre 00000051 spk1 21.159 21.612 oui
edfp h5 fre 00000051 spk2 21.612 27.769 {breath} {fw} je vous rappelle je viens d’avoir le service technique {fw} concernant {fw} notre discussion {fw} de ce matin
edfp h5 fre 00000051 1 spk2 28.403 37.035 donc {fw} après vérification en fait le prédécesseur qui était donc un client {fw} particulier en référence particulier {breath} a été ...
Quick Anonymized Transcription
A/ Sylviane XX EDF Pro bonjour .
C/ allô bonjour
A/ oui
C/ je vous donne ma référence client ?
A/ s’ il vous plait
C/ rrrrr
A/ rr rrr
C/ rrr
A/ oui
C/ rrr tiret r r
Training with Reduced Information
• Transcriptions assumed to be imperfect
• Alignment allows all reference words to be optionally skipped
• Filler words and breath can be inserted between words
• Anonymized forms are absorbed by a generic model
• Training uses a word lattice generated by the decoder
• Word error rate currently about 30%
Some Perspectives
• State-of-the-art transcription systems have WERs ranging from 10-30% depending upon the task
• Comparable results obtained in several languages (English, French, German, Mandarin, Spanish, Dutch, ...)
• Performance is sufficient for some tasks (indexing huge amounts of audio/video, spoken document retrieval, speech-to-speech translation)
• Requires substantial training resources, but recent research has demonstrated the potential of semi- and unsupervised methods
• Processing of more varied data (podcasts, voice mail, meetings)
• Dealing with different speaker accents (regional and non-native)
• Keeping models up-to-date
• Developing systems for oral languages (without a written form)
Over 25 Years of Progress
• 1980: voice commands and isolated words, single speaker, 2-30 words
• 1990: connected words, speaker independent, 10 words; controlled dialog, speaker independent, 10-100 words; isolated word dictation, single speaker, 20k words
• 2000: continuous dictation, single speaker, 60k words; conversational telephone speech, speaker independent; transcription for indexation, TV & radio
• 2008: speech analytics, audio mining; unlimited domain transcription, Internet audio; unlimited domain speech-to-speech translation