Speech Processing for Audio Indexing

Lori Lamel, with J.L. Gauvain and colleagues from the Spoken Language Processing Group, CNRS-LIMSI

GoTAL, Gothenburg, Sweden, August 25-27, 2008

Introduction
• Huge amounts of audio data produced (TV, radio, Internet, telephone)
• Automatic structuring of multimedia, multilingual data
• Much accessible content in the audio and text streams
• Speech and language processing technologies are key components for indexing
• Applications: digital multimedia libraries, media monitoring, News on Demand, Internet watch services, speech analytics, ...


Processing Audiovisual Data

[Diagram: processing chain for audiovisual data. The audio signal passes through audio segmentation; the segments feed speech transcription, named entity detection and translation, as well as language recognition, speaker recognition and emotion detection, producing XML annotations. Related activities: acoustic-phonetic modeling, corpus-based studies, perceptual tests, Q&A, audio analysis, dialog, generation and synthesis.]

Some Recent Projects
• Conversational interfaces & intelligent assistants: Computers in the Human Interaction Loop (chil.server.de), human-human communication mediated by the machine:
  – multimodality (speech, image, video)
  – “who & where”: speaker identification and tracking
  – “what”: linguistic content (far-field ASR)
  – “why & how”: style, attitude, emotion
• Speech translation: Technology and Corpora for Speech to Speech Translation (http://www.tc-star.org); Global Autonomous Language Exploitation (DARPA GALE)
• Developing multimedia and multilingual indexing and management tools: QUAERO (www.quaero.org)
• Speech analytics: Infom@gic/Callsurf


Why Is Speech Recognition Difficult?
Automatic conversion of spoken words into text:
Text: I do not know why speech recognition is so difficult
Continuous: Idonotknowwhyspeechrecognitionissodifficult
Spontaneous: Idunnowhyspeechrecnitionsodifficult
Pronunciation: YdonatnowYspiCrEkxgnISxnIzsodIfIk∧lt
               YdonowYspiCrEknISNsodIfxk∧l
               YdontnowYspiCrEkxnISNsodIfIk∧lt
               YdxnowYspiCrEknISNsodIfxk∧lt
Important variability factors:
• Speaker: physical characteristics (gender, age, ...), accent, emotional state, situation (lecture, conversation, meeting, ...)
• Acoustic environment: background noise (cocktail party, ...), room acoustics, signal capture (microphone, channel, ...)

[after the lecture “Speech Recognition and Understanding” by Tanja Schultz and John Kominek]


Audio Samples (1)
• WSJ Read
PREVIOUSLY HE WAS PRESIDENT OF CIGNA'S PROPERTY CASUALTY GROUP FOR FIVE YEARS AND SERVED AS CHIEF FINANCIAL OFFICER
• WSJ Spontaneous
ACCORDING TO KOSTAS STOUPIS AN ECONOMIST WITH THE GREEK GOVERNMENT IT IS IT IS NOT YET POSSIBLE FOR GREECE TO TO MAINTAIN AN EXPORT LEVEL THAT IS EQUIVALENT TO EVEN THE THE SMALLEST LEVEL OF THE OTHER WESTERN NATIONS
• Broadcast News
ON WORLD NEWS TONIGHT THIS WEDNESDAY TERRY NICHOLS AVOIDS THE DEATH SENTENCE FOR HIS ROLE IN THE OKLAHOMA CITY BOMBING. ONE OF THE WORLD'S MOST POPULAR AIRPLANES THE GOVERNMENT ORDERS AN IMMEDIATE SAFETY INSPECTION. AND OUR REPORT ON THE CLONING CONTROVERSY BECAUSE ONE AMERICAN SCIENTIST SAYS HE'S READY TO GO.


Audio Samples (2)
• Conversational Telephone Speech
Hello. Yes, This is Bill. Hi Bill. This is Hillary. So this is my first call so I haven't a clue as to - - what we're supposed to do yeah. It's pretty funny that we're Bill and Hillary talking on the phone though isn't it? Ah I didn't even think about that. {laughter} uhm we're supposed to talk about did you get the topic. Yes. So we're supposed to talk about what changes if any you have made since ...
• CHIL, spontaneous, non-native
I just give you a brief overview of [noise] what's going on in uh audio and why we bother with all these microphones in the the materials of the walls , the carpet and all that ...


Audio Samples (3)
• EPPS
Mister President, in these questions we see the European social model in flashing lights. The European social model is a mishmash which pleases no one: a bit of the free market here and a bit of welfare state there, mixed with a little green posturing.
• EPPS, spontaneous
Thank you very much, Mister Chairman. Hum When I came to this room I I have seen and heard homophobia but sort of vaguely; through T. V. and the rest of it. But I mean listening to some of the Polish colleagues speak here today, especially Roszkowski, Pek, Giertych and Krupa, ...
• EPPS, non-native
My main my main problem with the environmental proposals of the Commission is that they do not correspond to the objectives of the Sixth Environmental Action Plan. As far as the traffic is concerned, the sixth E. A. P. stressed the decoupling of transport growth and G. D. P. growth.


Indicative ASR Performance

Task          Condition                         Word Error
Dictation     read speech, close-talking mic.   3-4% (humans 1%)
              read speech, noisy (SNR 15dB)     10%
              read speech, telephone            20%
              spontaneous dictation             14%
              read speech, non-native           20%
Found audio   TV & radio news broadcasts        10-15% (humans 4%)
              TV documentaries                  20-30%
              Telephone conversations           20-30% (humans 4%)
              Lectures (close mic)              20%
              Lectures (distant mic)            50%
              EPPS                              8%


[Figure: NIST STT Benchmark Test History. Word error rate (log scale, 100% down to 1%) plotted by evaluation year (1988-2011) for successive tasks: read speech (1k, 5k and 20k vocabularies, noisy, non-English), Broadcast News (English, Mandarin, Arabic), Switchboard and conversational telephone speech (CTS), meeting speech (close and distant microphones), and varied news data. From NIST, May '07.]

Some Research Topics
• Development of generic technology
• Speaker independent, task independent (large vocabulary > 200k words)
• Robust (noise, microphone, ...)
• Improving model accuracy
• Processing found audio: varied speaking styles, uncontrolled conditions
• Multilinguality: low e-resourced languages
• Developing systems with less supervision
• Model adaptation: keeping models up-to-date
• Annotation of metadata (speaker, language, topic, emotion, style, ...)


Internet Languages

[Charts: Internet users in millions per language, and increase (%) in Internet users per language from 2000 to 2007. Source: internetworldstats.com, Nov 30, 2007.]


ASR Systems

[Diagram: ASR system architecture. Training side: a text corpus passes through normalization and N-gram estimation to produce the language model; a speech corpus passes through manual transcription and feature extraction, and HMM training with the training lexicon produces the acoustic models. Decoding side: a speech sample Y goes through the acoustic front-end to give features X, and the decoder combines the language model P(W), the recognizer lexicon P(H|W) and the acoustic models f(X|H) to output the transcription W*.]
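The decoder block combines the three knowledge sources according to the standard MAP decoding rule; stated in the diagram's notation (X: acoustic features, W: word sequence, H: HMM state/pronunciation sequence), this is the usual formulation rather than anything specific to the LIMSI system:

```latex
W^{*} = \arg\max_{W} P(W) \sum_{H} P(H \mid W)\, f(X \mid H)
      \;\approx\; \arg\max_{W} \max_{H} P(W)\, P(H \mid W)\, f(X \mid H)
```

The language model supplies P(W), the lexicon maps words to pronunciations via P(H|W), and the acoustic models score the features through f(X|H); in practice the sum over state sequences is approximated by the best (Viterbi) path.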


System Components
• Acoustic models: HMMs (10-20M parameters), allophones (triphones), discriminative features, discriminative training
• Pronunciation dictionary with probabilities
• Language models: statistical N-gram models (10-20M N-grams), model interpolation, connectionist LMs, text normalization
• Decoder: multipass decoding, unsupervised acoustic model adaptation, system combination (Rover, cross-system adaptation)
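For the model interpolation mentioned under language models, a common scheme (a generic formulation, not necessarily the exact one used here) combines component N-gram models trained on different corpora, with weights tuned on held-out data, typically by EM:

```latex
P(w \mid h) = \sum_{i} \lambda_{i}\, P_{i}(w \mid h), \qquad \lambda_{i} \ge 0,\; \sum_{i} \lambda_{i} = 1
```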


Multi Layer Perceptron Front-End
• Features incorporate longer temporal information than cepstral features
• MLP NN projects the high-dimensional input space onto phone state targets
• LP-TRAP features (500ms), 4-layer bottle-neck architecture (HMM state targets, 3rd layer size is 39), PCA [Hermansky & Sharma, ICSLP'98; Fousek, 2007]
• Studied various configurations & ways to combine with cepstral features
• Main results: bnat06

WER (%)
                  PLP    MLPwLP   PLP + MLPwLP
No adaptation     21.8   20.7     19.2
SAT+CMLLR+MLLR    19.0   18.9     17.8

– Significant WER reduction without acoustic model adaptation
– Best combination via feature vector concatenation (78 features)
– Acoustic model adaptation is less effective for MLP features
– Rover (PLP, PLP+MLP) further reduces the WER
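As an illustration of the bottleneck idea, the sketch below passes a long-temporal-context input through a small MLP and keeps the 39-dimensional third-layer activations as features. All sizes except the 39-unit bottleneck are invented, and the random weights stand in for parameters that would be trained on HMM state targets; it is a toy sketch, not the LIMSI front-end.

```python
# Toy sketch of bottleneck MLP feature extraction. Sizes other than the
# 39-unit bottleneck are invented; random weights stand in for parameters
# that would be trained to classify HMM phone-state targets.
import numpy as np

rng = np.random.default_rng(0)

def layer(n_in, n_out):
    return rng.standard_normal((n_in, n_out)) * 0.01, np.zeros(n_out)

W1, b1 = layer(500, 1500)  # long temporal input (e.g., ~500 ms of band energies)
W2, b2 = layer(1500, 39)   # 3rd (bottleneck) layer: its activations are the features
W3, b3 = layer(39, 3000)   # 4th layer toward state posteriors, discarded at run time

def bottleneck_features(x):
    """Return the 39-dim bottleneck activations for one input vector."""
    h1 = np.tanh(x @ W1 + b1)
    return np.tanh(h1 @ W2 + b2)

frames = rng.standard_normal((100, 500))   # 100 frames of TRAP-like input
feats = np.stack([bottleneck_features(f) for f in frames])
print(feats.shape)                         # (100, 39)
# Decorrelated with PCA and concatenated with 39 PLP features -> 78-dim vectors.
```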


Pronunciation Modeling
• Pronunciation dictionary is an important system component
• Usually represented with phones + non-speech units (silence, breath, filler)
• Well-suited to recognition of read and prepared speech
• Modeling variants is particularly important for spontaneous speech: multiwords (“I don't know”)
• Explore (semi-)automatic methods to generate pronunciation variants
• Explicit use of phonetic knowledge to hypothesize rules:
  anti, multi: /i/ - /ay/ variation
  inter: /t/ vs nasal flap, retroflex schwa vs schwa
  stop deletion/insertion
  y-insertion (coupon, news, super)
• Data driven: confront pronunciations with very large amounts of data
• Estimate probabilities of pronunciation variants (very important for Arabic)
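A minimal sketch of the rule-based side: two of the variation rules above expressed as string rewrites over a toy phone set. The phone symbols and rule conditions are simplified stand-ins, not LIMSI's actual rule set.

```python
# Toy sketch of rule-based variant generation; phone symbols and rule
# conditions are simplified stand-ins, not the actual LIMSI rule set.
def variants(word, pron):
    """Return the base pronunciation plus simple rule-generated alternatives."""
    out = {pron}
    phones = pron.split()
    if word.startswith(("anti", "multi")) and "i" in phones:
        k = phones.index("i")                       # /i/ ~ /ay/ in anti-, multi-
        out.add(" ".join(phones[:k] + ["ay"] + phones[k + 1:]))
    if "u" in phones:                               # optional y-insertion
        k = phones.index("u")                       # (coupon, news, super)
        out.add(" ".join(phones[:k] + ["y", "u"] + phones[k + 1:]))
    return sorted(out)

print(variants("coupon", "k u p a n"))   # ['k u p a n', 'k y u p a n']
print(variants("anti", "ae n t i"))      # ['ae n t ay', 'ae n t i']
```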


Example: coupon

kyupan kupan


HMM

[Diagram: three-state left-to-right HMM with self-loop transition probabilities a11, a22, a33, forward transitions a12, a23, and observation densities f(.|s1), f(.|s2), f(.|s3); a sequence of feature vectors x1 ... x8 is emitted along a state path through the model.]
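To make the picture concrete, here is a minimal forward-algorithm sketch for such a 3-state left-to-right model; the transition values and the 1-D Gaussian emissions are made up for illustration.

```python
# Minimal forward-algorithm sketch for the 3-state left-to-right HMM above;
# transition values and 1-D Gaussian emissions are made up for illustration.
import numpy as np

A = np.array([[0.6, 0.4, 0.0],     # a11 a12  .
              [0.0, 0.7, 0.3],     #  .  a22 a23
              [0.0, 0.0, 1.0]])    #  .   .  a33
means, var = np.array([-1.0, 0.0, 1.5]), 1.0

def emit(x):
    """f(x|s_i): Gaussian stand-ins for the state observation densities."""
    return np.exp(-0.5 * (x - means) ** 2 / var) / np.sqrt(2 * np.pi * var)

def forward(obs):
    """P(x1..xT | model), entering in state 1 and ending in state 3."""
    alpha = np.array([1.0, 0.0, 0.0]) * emit(obs[0])
    for x in obs[1:]:
        alpha = (alpha @ A) * emit(x)
    return alpha[-1]

x = np.array([-0.9, -1.1, 0.2, 0.1, 0.3, 1.4, 1.6, 1.5])   # x1 .. x8
print(forward(x))
```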


Example: interest

InXIst IntrIst


Pronunciation Variants - BN vs CTS (100 hours of data)

Word          Pronunciation    BN     CTS
interest      IntrIst          238    488
              IntXIst            3     33
              InXIst             0     11
interests     IntrIss           52     53
              IntrIsts          19     30
              IntXIsts           3      2
              IntXIss            3      1
interested    IntrIstxd        126    386
              IntXIstxd          3     80
              InXIstxd          18    146
interesting   IntrIst|G        193   1399
              IntXIst|G          8    314
              InXIst|G          21    463

• Similar but less frequent words: interestingly, disinterest, disinterested, intercom, interconnected, interfere...
• Estimate pronunciation probabilities, but no generalization
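The last bullet amounts to turning alignment counts into variant probabilities; a trivial sketch using the CTS counts for “interesting” from the table (a real lexicon would smooth these estimates):

```python
# Turning alignment counts into pronunciation variant probabilities,
# using the CTS counts for "interesting" from the table above.
counts = {"IntrIst|G": 1399, "IntXIst|G": 314, "InXIst|G": 463}
total = sum(counts.values())
probs = {pron: n / total for pron, n in counts.items()}
print(probs)   # IntrIst|G ~0.64, InXIst|G ~0.21, IntXIst|G ~0.14
# Real lexica would smooth these estimates and floor rare variants.
```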


Morphological Decomposition
• Morphological decomposition has been proposed to address the problem of lexical variety (Arabic, German, Turkish, Finnish, Amharic, ...)
• Rule-based vs data-driven approaches
• Reduces the out-of-vocabulary rate
• Improves n-gram estimation for infrequent words
• Tendency to create errors due to acoustic confusions between small units
• Some proposed solutions:
  – inhibit decomposition of frequent words
  – incorporate linguistic knowledge (recognizer confusions, assimilation)
• Recognition performance of optimized systems is typically comparable to word-based systems, and the two can be combined
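A toy illustration of the OOV-reduction point, using an invented German-like compound and a hard-coded split standing in for a learned decomposition:

```python
# Toy illustration of OOV reduction via decomposition, with an invented
# German-like compound and a hard-coded split standing in for a learned one.
lexicon = {"sprach", "erkennung", "system", "modell"}
decompose = lambda w: ["sprach", "erkennung"] if w == "spracherkennung" else [w]

test = ["spracherkennung", "modell", "akustik"]
oov_words = [w for w in test if w not in lexicon]
oov_morphs = [m for w in test for m in decompose(w) if m not in lexicon]
print(oov_words)    # ['spracherkennung', 'akustik'] -> 2 OOVs as full words
print(oov_morphs)   # ['akustik']                    -> 1 OOV after splitting
```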


Continuous Space Language Models
• Project words onto a continuous space (about 300 dimensions)
• Estimate the projection and n-gram probabilities with a neural network
• Better generalization for unseen n-grams (compared to general back-off methods)
• Efficient training with a resampling algorithm
• Only model the N most common words (10-12k)
• Interpolated with a standard 4-gram backoff LM
• 20% perplexity reduction, 0.5-0.8% reduction in WER across a variety of languages and tasks [Schwenk, 2007]
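A minimal sketch of such a model in the spirit of [Schwenk, 2007], written here with PyTorch for concreteness (the original work predates it): history words are projected to a continuous space, concatenated, and mapped through a hidden layer to posteriors over a shortlist of frequent words. The hidden size is invented; the 300-dimensional projection and ~12k shortlist follow the slide.

```python
# Sketch of a continuous-space 4-gram LM in the spirit of [Schwenk, 2007],
# written with PyTorch for concreteness (the original predates it). Hidden
# size is invented; projection dim (300) and shortlist (~12k) follow the text.
import torch
import torch.nn as nn

class CSLM(nn.Module):
    def __init__(self, vocab=12000, dim=300, hidden=500, order=4):
        super().__init__()
        self.proj = nn.Embedding(vocab, dim)        # shared continuous projection
        self.hid = nn.Linear((order - 1) * dim, hidden)
        self.out = nn.Linear(hidden, vocab)         # posteriors over the shortlist

    def forward(self, context):                     # context: (B, order-1) word ids
        e = self.proj(context).flatten(1)           # concatenate projected history
        return torch.log_softmax(self.out(torch.tanh(self.hid(e))), dim=-1)

lm = CSLM()
logp = lm(torch.randint(0, 12000, (8, 3)))          # 8 three-word histories
print(logp.shape)                                   # torch.Size([8, 12000])
# In use, these probabilities are interpolated with a standard back-off 4-gram
# LM, and words outside the shortlist fall back to the back-off model alone.
```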


Reducing Training Costs - Motivation
• State-of-the-art speech recognizers are trained on:
  – hundreds to thousands of hours of transcribed audio data
  – hundreds of millions to several billions of words of texts
  – large recognition lexicons
• Less e-represented languages:
  – over 6000 languages, about 800 with a written form
  – poor representation in accessible form
  – lack of economic and/or political incentives
  – PhD theses: Vietnamese, Khmer [Le, 2006], Somali [Nimaan, 2007], Amharic, Turkish [Pellegrini, 2008]
  – SPICE: Afrikaans, Bulgarian, Vietnamese, Hindi, Konkani, Telugu, Turkish, but also English, German, French [Schultz, 2007]


Questions
• How much data is needed to train the models?
• What performance can be expected with a given amount of resources?
• Are audio or textual training data more important for ASR in less e-represented languages (languages with texts available)?
• Assess the impact on WER of the quantity of:
  – speech material (acoustic modeling)
  – texts (transcriptions)
  – texts (newspapers, newswires available on the Web)
• Some studies on Amharic (a low e-resourced language) [Pellegrini, 2008]


OOV and WER Rates (Amharic, 2 hour development set)

[Plots: OOV rate (%) and word error rate (%, log scale) as a function of text corpus size (10k to 4.6M words), for acoustic training sets of 2h, 5h, 10h and 35h.]

• Audio transcriptions closer to development data
• Even with 4.6M words of texts, more audio transcripts help


Influence of Transcriptions in LM (Amharic, 2 hour development set)

[Plots: word error rate (%, log scale) as a function of text corpus size (10k to 4.6M words), for LMs built with different amounts of transcribed audio (2h, 10h, 35h) and mixed conditions (MA2h-ML10h, MA2h-ML35h, MA10h-ML35h).]

• 2h of audio training data is not sufficient
• Good operating point: 10h of audio, 1M words of texts
• Collecting texts is more efficient than transcribing audio at this operating point


Reducing Training Supervision
• Use of quick transcriptions (less detailed, less costly to produce):
  – low-cost approximate annotations (commercial transcripts, closed-captions)
  – raw data with closely related contemporary texts
  – raw data with complementary texts (only newspaper, newswire, ...)
• Lightly supervised and unsupervised training using a speech recognizer [Zavaliagkos & Colthurst, 1998; Kemp & Waibel, 1999; Jang & Hauptmann, 1999; Lamel, Gauvain & Adda, 2000]
• Standard MLE training assumes that transcriptions are perfect and aligned with the audio
• Flexible alignment using a full N-gram decoder and a garbage model
• Can skip over unknown words and transcription errors
• Implicit modeling of short vowels in Arabic
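A rough sketch of the filtering idea behind lightly supervised training: align the recognizer hypothesis with the approximate transcript and keep only the stretches where they agree. Here difflib stands in for the flexible N-gram/garbage-model alignment, and the example strings are invented.

```python
# Sketch of keeping only "matched islands" between a recognizer hypothesis
# and an approximate transcript; difflib stands in for the flexible
# N-gram/garbage-model alignment, and the example strings are invented.
import difflib

def matched_islands(hyp, approx, min_len=2):
    """Return spans of hyp words that also occur, in order, in approx."""
    sm = difflib.SequenceMatcher(a=hyp, b=approx, autojunk=False)
    return [hyp[m.a:m.a + m.size]
            for m in sm.get_matching_blocks() if m.size >= min_len]

hyp = "the government orders an immediate uh safety inspection today".split()
approx = "government orders an immediate safety inspection".split()
for island in matched_islands(hyp, approx):
    print(" ".join(island))
# -> "government orders an immediate" and "safety inspection": kept for
#    training; the disagreeing stretches are discarded.
```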


Reducing Training Supervision - Non-speech
• Use existing models (trained on manual annotations) to automatically detect fillers and breath noises
• Add these to the reference transcriptions

               breath   filler
Manual (50h)   8.1%     0.6%
BN (967h)      4.9%     3.4%
BC (162h)      5.4%     6.8%

• Small reduction in word error rate


Reducing Supervision: Pronunciation Generation
• Pronunciation generation in Arabic is relatively straightforward from a vocalized word form
• Most sites working on Arabic use the Buckwalter morphological analyzer to generate all possible vocalizations
• But many words are not handled
• Introduced a generic vowel (@) model as a placeholder
• Developed rules to automatically generate pronunciations with a generic vowel from the non-vocalized word form
• Enables training with partial information, e.g. for ktb:
  k@t@b@, k@t@b@n, kt@b@, kt@b@n, k@tb@, k@tb@n, k@t@b
• Reduces the OOV rate of the recognition lexicon
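A simplistic sketch of generic-vowel pronunciation generation from a consonant skeleton: optionally insert @ in each slot between consonants and allow a small set of endings. This overgenerates relative to the seven forms listed above; a real rule set would constrain the output with morphological knowledge.

```python
# Toy generic-vowel (@) pronunciation generator for a non-vocalized form.
from itertools import product

def generic_vowel_prons(skeleton, endings=("", "@", "@n")):
    cons = list(skeleton)
    prons = set()
    # optionally insert @ between consecutive consonants, then add an ending
    for gaps in product(("", "@"), repeat=len(cons) - 1):
        body = cons[0] + "".join(g + c for g, c in zip(gaps, cons[1:]))
        prons.update(body + end for end in endings)
    return sorted(prons)

print(generic_vowel_prons("ktb"))
# 12 forms, including the seven above: k@t@b@, k@t@b@n, kt@b@, kt@b@n,
# k@tb@, k@tb@n, k@t@b (plus bare forms a real rule set might prune)
```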


Alignment Example: li AntilyAs

[Figure: alignment of the phone sequence l @ n t @ l y A s (with generic vowels @) against the audio of segment LBC NEWS 20060601 195801, 2063.73s + 0.49s]


Towards Applications
• Transcription for Translation (TC-STAR): www.tc-star.org
• Using the Web as a Source of Training Data
• Indexing Internet audio-video data: Audiosurf (www.audiosurf.org)
• Open Domain Question/Answering: www.lsi.upc.edu/~qast/2008
• Open Domain Interactive Information Retrieval by Telephone (RITel): ritel.limsi.fr
• Speech Analytics (Infom@gic/Callsurf)


ASR Transcripts Versus Texts
– Spoken language, disfluencies (hesitations, fillers, ...)
– No document/sentence boundaries
– Lack of capitalization, punctuation
– Transcription errors (substitutions, deletions, insertions):
  strong correlation between WER and IE metrics (NIST);
  slightly more errors on NE-tagged words;
  vocabulary limitation essentially affects person names
+ Known vocabulary, i.e. no orthographic errors, no need for normalization
+ Time codes
+ Confidence measures


Transcription for Translation (TC-STAR)
• Three languages (EN, ES, MAN), 3 tasks (European Parliament, Spanish Parliament, BN)
• Improving modeling and decoding techniques:
  – audio segmentation, discriminative features, duration models
  – automatic learning, discriminative training
  – pronunciation models, NN language model, decoding strategies, ...
• Model adaptation methods:
  – speaker adaptive training, acoustic and language model adaptation
  – cross-system adaptation, adaptation to noisy environments, ...
• Integration of ASR and translation:
  – transcription errors, confidence scores, word graph interface
  – punctuation for MT
  – transcription normalization (98 vs. ninety eight, hesitations, ...)


TC-Star ASR Progress

[Plot: word error rate (log scale, roughly 30% down to 5%) versus evaluation year (2004-2007) for the best TC-STAR system and the LIMSI system, starting from the OCT04 BN baseline.]

• More task-specific data
• More complex models
• Improved modeling and decoding techniques: automatic learning, discriminative training, pronunciation modeling
• Demo


Using the Web as a Source of Training Data
(Work done in collaboration with Vecsys Research)

• How to build a system without speaking the language (e.g. Russian) and without transcribed data?
• Main steps:
  – Get texts from the Web, and basic linguistic knowledge for letter-to-sound rules
  – Get BN audio data with incomplete and approximate transcripts
  – Normalize texts/transcripts and build an n-gram LM
  – Start with seed acoustic models from other languages
  – Align incomplete transcripts with the audio, discard badly aligned data
  – Train acoustic models
  – Iterate the last two steps to minimize the (overestimated) WER


Results: Approximate WER

[Plot: approximate WER (%) for Russian over successive training iterations, with and without data selection.]

• Estimated WER using approximate transcripts
• Approximate WER reduced from 30% to 10%
• Similar experiments underway with Spanish and Portuguese


AUDIOSURF System: Indexing Internet Audio/Video
• Processed data: 259 sources, 80h/day, 33 languages
• Transcribing and indexing: 203 sources, 6 languages

Year   English   French    German   Arabic   Spanish   Mandarin   All
2005   3889h     10735h    964h     1756h    60h       -          17404h
2006   5794h     14950h    1639h    3383h    1369h     -          27135h
2007   7441h     18675h    1970h    4634h    2337h     2473h      37530h
2008   8173h     19943h    1972h    4736h    2401h     2511h      39736h



Open Domain Question/Answering in Audio Data
• Automatic structuring of audio data
• Detect semantically meaningful units
• Specific entity taggers (named, task, linguistic entities):
  – rewrite rules (work like local grammars with specific dictionaries)
  – named and linguistic entity tagging is language dependent but task independent
  – task entity tagging is language independent but task dependent
• Projects: RITel (ritel.limsi.fr), CHIL (chil.server.de), Quaero (www.quaero.org)
• Participation in CLEF tracks QAst07, QAst08 (www.lsi.upc.edu/~qast/2008) [Rosset et al, ASRU 2007]:
  – Compare QA performance on manual and ASR transcripts of varied audio
  – Best accuracy about 0.4-0.5; generally degradation with increasing WER


RITel System

[Audio example]


RITel System: Information Retrieval by Telephone [Rosset et al, 2005, 2006] ritel.limsi.fr
• Fully integrated open-domain interactive information retrieval system
• Speech detection and recognition
• Non-contextual analysis (for texts and speech transcripts)
• General information retrieval
• IR on databases and dialogic interactions
• Management of dialog history
• Natural language generation
• Data collection platform (natural, real-time)



Question: What is the Vlaams Blok?
Manual transcript: the Belgian Supreme Court has upheld a previous ruling that declares the Vlaams Blok a criminal organization and effectively bans it .
Answer: criminal organisation

Extracted portion of an automatic transcript (CTM file format):
(...)
20041115 1705 1735 EN SAT 1 1018.408 0.440 Vlaams 0.9779
20041115 1705 1735 EN SAT 1 1018.848 0.300 Blok 0.8305
20041115 1705 1735 EN SAT 1 1019.168 0.060 a 0.4176
20041115 1705 1735 EN SAT 1 1019.228 0.470 criminal 0.9131
20041115 1705 1735 EN SAT 1 1019.858 0.840 organisation 0.5847
20041115 1705 1735 EN SAT 1 1020.938 0.100 and 0.9747
(...)
Answer: 1019.228 1019.858
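Recovering the answer string from the transcript is then a matter of selecting the CTM tokens whose start times fall inside the answer span. A small sketch, assuming the standard CTM field order (source, channel, start, duration, word, confidence) and writing the source id as a single token for simplicity:

```python
# Sketch: pull the answer words out of CTM lines given the answer time span.
# Assumes standard CTM field order (source, channel, start, duration, word,
# confidence), with the source id written as a single token here.
def answer_words(ctm_lines, t_start, t_end):
    words = []
    for line in ctm_lines:
        f = line.split()
        start, word = float(f[-4]), f[-2]
        if t_start <= start <= t_end:
            words.append(word)
    return " ".join(words)

lines = [
    "20041115_1705_1735_EN_SAT 1 1019.228 0.470 criminal 0.9131",
    "20041115_1705_1735_EN_SAT 1 1019.858 0.840 organisation 0.5847",
]
print(answer_words(lines, 1019.228, 1019.858))   # -> criminal organisation
```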


Speech Analytics
• Statistical analyses of huge speech corpora [Infom@gic/Callsurf]
• Client satisfaction, service efficiency, ...
• Extraction of metadata (content, topics, quality of interaction, ...)
• Confidentiality and privacy issues
• Anonymized transcriptions


Speech Analytics Sample Extract (CallSurf)
A: oui bonjour Monsieur, EDF pro à l'appareil.
C: oui. ah oui oui oui.
A: ouais je je vous appelle Monsieur, je vous avais eu il y a quelques jours au téléphone,
C: tout à fait.
A: et je vous avais transféré vers euh des conseillers euh nouvelle offre qui m'ont renvoyé un mail ce matin, en me disant de vous recontacter puisque la demande de mise en service par le premier conseiller que vous aviez eu à l'origine n'avait pas été faite.
C: ah bon
A: ouais. donc euh bon c'est pas coupé sur place. c'est pas le problème ...


Detailed Transcription
edfp h5 fre 00000051 spk1 16.510 17.144 allô oui
edfp h5 fre 00000051 spk2 17.144 18.140 {fw} Monsieur Dupont
edfp h5 fre 00000051 spk1 18.140 18.563 oui
edfp h5 fre 00000051 spk2 18.563 20.827 oui Ma(demoiselle) Mademoiselle Brun EDF pro
edfp h5 fre 00000051 spk1 21.159 21.612 oui
edfp h5 fre 00000051 spk2 21.612 27.769 {breath} {fw} je vous rappelle je viens d'avoir le service technique {fw} concernant {fw} notre discussion {fw} de ce matin
edfp h5 fre 00000051 1 spk2 28.403 37.035 donc {fw} après vérification en fait le prédécesseur qui était donc un client {fw} particulier en référence particulier {breath} a été...


Quick Anonymized Transcription
A/ Sylviane XX EDF Pro bonjour .
C/ allô bonjour
A/ oui
C/ je vous donne ma référence client ?
A/ s'il vous plait
C/ rrrrr
A/ rr rrr
C/ rrr
A/ oui
C/ rrr tiret r r


Training with Reduced Information
• Transcriptions assumed to be imperfect
• Alignment allows all reference words to be optionally skipped
• Filler words and breath can be inserted between words
• Anonymized forms absorbed by a generic model
• Training uses a word lattice generated by the decoder
• Word error rate currently about 30%


Some Perspectives
• State-of-the-art transcription systems have WERs ranging from 10-30% depending upon the task
• Comparable results obtained for several languages (English, French, German, Mandarin, Spanish, Dutch, ...)
• Performance is sufficient for some tasks (indexing huge amounts of audio/video, spoken document retrieval, speech-to-speech translation)
• Requires substantial training resources, but recent research has demonstrated the potential of semi- and unsupervised methods
• Processing of more varied data (podcasts, voice mail, meetings)
• Dealing with different speaker accents (regional and non-native)
• Keeping models up-to-date
• Developing systems for oral languages (without a written form)


Over 25 Years of Progress

[Timeline:]
1980: isolated words, voice commands (single speaker, 2-30 words)
1990: connected words (speaker independent, 10 words); controlled dialog (speaker independent, 10-100 words); isolated word dictation (single speaker, 20k words)
2000: continuous dictation (single speaker, 60k words); conversational telephone speech (speaker independent); transcription for indexation (TV & radio)
2008: speech analytics, audio mining; unlimited domain transcription (Internet audio); unlimited domain speech-to-speech translation

