TFT: An algorithm for the spectral compression of ...

3:33

Q5. A speech synthesizer for rule-based synthesis. Yukio Mitome (C&C Information Technology Research Laboratories, NEC Corporation, 4-1-1 Miyazaki, Miyamae-ku, Kawasaki, 213 Japan) and Noriko Umeda (Department of Linguistics, New York University, 719 Broadway, Room 505, New York, NY 10003)

A new speech synthesizer is proposed that is suitable for speech synthesis by rule. This synthesizer is based on a source-filter model and consists of two sources--a glottal source and a noise source--and three cascade filters. Each source can be connected with any of the filters dynamically. Each filter is used as a speech component, which is produced as a source signal filtered by a slowly varying system and has a limited duration. The speech signal consists of some components overlapping each other in the time-frequency domain. The experimental system was developed on a 32-bit personal computer. The software system has two subsystems, a speech synthesizer with a formant pattern interpreter and a speech analysis system with a speech database. The formant pattern interpreter reads a formant pattern definition file and controls a speech synthesizer routine. Speech analysis is based on the maximum entropy method (MEM), and the database contains 25 min of recorded male voice samples. Experimental results showed that this synthesizer can easily realize the spectral discontinuity between phonemes.
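As a rough illustration of the source-filter arrangement in abstract Q5 (one of two sources feeding three cascaded filters), the following Python sketch may help. It is not the authors' system: the filters are assumed to be standard two-pole digital resonators, the names `resonator` and `synthesize` are hypothetical, the sampling rate, pitch, and formant values are invented, and the dynamic source-to-filter routing is reduced to a single `voiced` switch.

```python
import math
import numpy as np

def resonator(x, freq_hz, bw_hz, fs):
    """Two-pole digital resonator with unity gain at DC (assumed filter type)."""
    c = -math.exp(-2 * math.pi * bw_hz / fs)
    b = 2 * math.exp(-math.pi * bw_hz / fs) * math.cos(2 * math.pi * freq_hz / fs)
    a = 1 - b - c
    y = np.zeros(len(x))
    y1 = y2 = 0.0
    for i, xi in enumerate(x):
        # y[n] = a*x[n] + b*y[n-1] + c*y[n-2]
        y[i] = a * xi + b * y1 + c * y2
        y2, y1 = y1, y[i]
    return y

def synthesize(n, fs=10000, f0=100, voiced=True, seed=0,
               formants=((500, 80), (1500, 100), (2500, 120))):
    """Drive three cascaded resonators with a glottal or noise source.

    All default values are illustrative assumptions, not from the abstract.
    """
    if voiced:
        source = np.zeros(n)
        source[::fs // f0] = 1.0          # impulse train as a crude glottal source
    else:
        source = np.random.default_rng(seed).standard_normal(n)  # noise source
    out = source
    for freq_hz, bw_hz in formants:       # cascade: output of one feeds the next
        out = resonator(out, freq_hz, bw_hz, fs)
    return out

vowel_like = synthesize(2000, voiced=True)
fricative_like = synthesize(2000, voiced=False)
```

In a rule-based setting, the formant tuples would presumably be updated over time from a pattern definition file; here they stay fixed for the whole signal.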

3:57

Q7. TFT: An algorithm for the spectral compression of natural speech signals. Richard R. Hurtig (Department of Speech Pathology and Audiology, University of Iowa, Iowa City, IA 52242)

An earlier report [R. R. Hurtig, J. Acoust. Soc. Am. Suppl. 1 81, S78 (1986)] using synthesized syllables demonstrated that naive subjects can discriminate and identify spectrally compressed vowel segments under auditory and vibrotactile conditions. These findings are consistent with the view that the identification of the spectral shape of the speech segment may be independent of its frequency range. A computational algorithm was developed to achieve spectral compression of natural speech. The algorithm includes calculation of an n-point FFT, padding the result with the spectrum of a Hamming window, calculation of a 2n-point IFFT, and outputting the first half of the resultant time domain signal. The size of the pad determines the amount of compression achieved while the placement of the pad determines the direction of the frequency shift. Naive subjects had no difficulty recognizing simple sentences in a closed set for speech signals compressed to 2500 or 1250 Hz bands. After a few hours of listening, open set recognition is achieved. The implementation of the algorithm for sensory aids will be demonstrated.
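The four steps named in abstract Q7 (n-point FFT, pad, 2n-point IFFT, keep the first half) can be sketched for a single frame as follows. This is a minimal sketch under stated assumptions, not the author's implementation: the pad here is plain zeros rather than the spectrum of a Hamming window, it is placed at the high-frequency end so that content shifts downward, and the function name `tft_compress_frame` and the `factor` parameter are hypothetical.

```python
import numpy as np

def tft_compress_frame(frame, factor=2):
    """Compress the spectrum of one frame toward low frequencies.

    ASSUMPTION: the pad is zeros; the abstract specifies the spectrum of a
    Hamming window instead, whose exact form is not given there.
    """
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    spectrum = np.fft.rfft(frame)                 # n-point FFT (non-negative bins)
    # irfft with a longer output length zero-pads the spectrum at the top,
    # which is equivalent to a (factor * n)-point IFFT of the padded spectrum.
    stretched = np.fft.irfft(spectrum, n=factor * n) * factor
    return stretched[:n]                          # keep the first n samples

# A tone at bin 8 of a 64-sample frame should land at bin 4 after compression.
n = 64
t = np.arange(n)
tone = np.cos(2 * np.pi * 8 * t / n)
compressed = tft_compress_frame(tone)
peak_bin = int(np.argmax(np.abs(np.fft.rfft(compressed))))
```

Played back at the original sampling rate, the retained samples cover the same duration as the input frame while every spectral component sits at `1 / factor` of its original frequency, which matches the abstract's remark that the pad size sets the amount of compression.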

3:45

Q6. Improved excitation prediction and quantization in optimal amplitude multipulse coders. Daniel Lin (International Mobile Machines Corporation, 2130 Arch Street, Philadelphia, PA 19103)

Improvements to the excitation prediction and quantization procedure in optimal amplitude multipulse coders are described. The new procedure successively reoptimizes the pulse amplitudes and predictor gain of the long-delay correlation filter as each new pulse is found. Furthermore, the amplitudes and gain are requantized at each step so that the new pulse amplitudes and locations can correct for the quantization errors in the existing excitation. To perform the joint amplitude-predictor reoptimization, a nonrecursive pitch prediction structure is used in our analysis. The predictor is implemented as an adaptive vector quantizer whose codebook is populated with past excitation sequences. This structure allows for a computationally efficient closed-loop analysis procedure. The results showed that at 9.6 kbps, the new technique achieved an average segmental SNR of over 18.5 dB. Informal subjective tests indicated that the reconstructed speech is toll quality.

S44

J. Acoust. Soc. Am. Suppl. 1, Vol. 85, Spring 1989

4:09

Q8. Statistical tree-based modeling of phonetic segment durations. Michael D. Riley (Department of Linguistics, AT&T Bell Laboratories, Murray Hill, NJ 07974)

Segmental durations are affected by many factors: phonetic context, speaking rate, stress, word and phrasal position, etc. Regression trees [L. Breiman et al., Classification and Regression Trees (Wadsworth, Monterey, CA, 1984)] are well suited to capturing these effects, since they (1) statistically select the most significant features, (2) permit both categorical and continuous factors to be considered, (3) provide "honest" estimates of their performance, and (4) allow human interpretation and exploration of the result. In particular, transcribed databases of 400 utterances from a single speaker and 4000 utterances from 400 speakers of American English were used to build optimal decision trees that predict segment durations based on such factors. Over 70% of the durational variance for the single speaker and over 60% for the multiple speakers were accounted for by this method when using information only at the word level and below. These trees were used to derive durations for a text-to-speech synthesizer and were found to give more faithful results than the existing heuristically derived duration rules. Since tree building and evaluation is rapid once the data are collected and the candidate features specified, the technique can be readily applied to other feature sets and to other languages.
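To make the regression-tree idea in abstract Q8 concrete, here is a small Python sketch of a tree that mixes a categorical factor (stress) with a continuous one (speaking rate) to predict a duration. It is a toy under stated assumptions: the data rows, feature names, and the helpers `build_tree` and `predict` are invented for illustration, splits are chosen by plain squared-error reduction, and the pruning and cross-validated ("honest") error estimates described by Breiman et al. are omitted.

```python
from statistics import mean

def sse(ys):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not ys:
        return 0.0
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)

def candidate_tests(rows, feature):
    """Yield (description, predicate) pairs for one feature."""
    for v in sorted({r[feature] for r in rows}, key=str):
        if isinstance(v, str):   # categorical factor: equality test
            yield (feature, "==", v), (lambda r, v=v: r[feature] == v)
        else:                    # continuous factor: threshold test
            yield (feature, "<=", v), (lambda r, v=v: r[feature] <= v)

def build_tree(rows, target, features, min_leaf=1):
    """Greedily split on the test that most reduces squared error."""
    ys = [r[target] for r in rows]
    best = None
    for f in features:
        for desc, test in candidate_tests(rows, f):
            left = [r for r in rows if test(r)]
            right = [r for r in rows if not test(r)]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            cost = sse([r[target] for r in left]) + sse([r[target] for r in right])
            if best is None or cost < best[0]:
                best = (cost, desc, test, left, right)
    if best is None or best[0] >= sse(ys):
        return {"leaf": mean(ys)}          # no useful split: predict the mean
    _, desc, test, left, right = best
    return {"test": desc, "fn": test,
            "yes": build_tree(left, target, features, min_leaf),
            "no": build_tree(right, target, features, min_leaf)}

def predict(tree, row):
    """Walk from the root to a leaf and return its mean duration."""
    while "leaf" not in tree:
        tree = tree["yes"] if tree["fn"](row) else tree["no"]
    return tree["leaf"]

# Invented toy data: durations in ms, rate in syllables per second.
rows = [
    {"stress": "yes", "rate": 4.0, "dur_ms": 120.0},
    {"stress": "yes", "rate": 6.0, "dur_ms": 100.0},
    {"stress": "no", "rate": 4.0, "dur_ms": 70.0},
    {"stress": "no", "rate": 6.0, "dur_ms": 50.0},
]
tree = build_tree(rows, "dur_ms", ["stress", "rate"])
estimate = predict(tree, {"stress": "yes", "rate": 5.0})
```

Because each internal node stores a readable test such as `("stress", "==", "no")`, the fitted tree can be inspected directly, which is the interpretability property the abstract points to.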

117th Meeting: Acoustical Society of America

