An unsupervised method for learning to track tongue position from an acoustic signal

the system of optimal quality is investigated. Another objective of this study is to explore whether a sometimes difficult task of free number production required in magnitude estimations could be replaced by cross-modality matches using lines of various lengths produced by subjects on a computer screen. The main concern here was related to the unknown effects of the limited width of the computer screen on the magnitude estimation task. [This research was made possible by grant No. 2589 from the EEC Esprit SAM project.] a) Also at Univ. of Iowa, Iowa City, IA.

2:35-2:50 Break

2:50

7SP7. An unsupervised method for learning to track tongue position from an acoustic signal. John Hogden, Philip Rubin, and Elliot Saltzman (Haskins Labs., 270 Crown St., New Haven, CT 06511)

A procedure for learning to recover the relative positions of the articulators from speech signals is demonstrated. The algorithm learns without supervision, that is, it does not require information about which articulator configurations created the acoustic signals in the training set. The procedure consists of vector quantizing short time windows of a speech signal, then using multidimensional scaling to represent quantization codes that were temporally close in the encoded speech signal by nearby points in a continuity map. Since temporally close sounds must have been produced by similar articulator configurations, sounds which were produced by similar articulator positions should be represented close to each other in the continuity map. Using an articulatory speech synthesizer to produce acoustic signals from known articulator positions, relative articulator positions were estimated from synthesized acoustic signals and compared to the synthesizer's actual articulator positions. High rank-order correlations, ranging from 0.92 to 0.99, were found between the estimated and actual articulator positions. Reasonable estimates of relative articulator positions were made using 32 categories of sound, and the accuracy improved when more sound categories were used.
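A minimal sketch of this pipeline follows, assuming k-means as the vector quantizer and metric MDS for the embedding; the adjacency-based dissimilarity, the parameter choices, and the scikit-learn calls are illustrative assumptions, not the authors' implementation. Rank-order agreement with known articulator positions could then be checked with, e.g., scipy.stats.spearmanr.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

def continuity_map(frames, n_codes=32, n_dims=2, seed=0):
    """Sketch: vector-quantize short-time spectral frames, then embed
    the codes so that codes occurring close together in time land
    close together in the low-dimensional continuity map."""
    # Step 1: vector quantization of the frame sequence.
    vq = KMeans(n_clusters=n_codes, random_state=seed).fit(frames)
    codes = vq.predict(frames)          # one code per frame

    # Step 2: derive a dissimilarity between codes from temporal
    # adjacency: codes that often follow one another count as close.
    counts = np.zeros((n_codes, n_codes))
    for a, b in zip(codes[:-1], codes[1:]):
        counts[a, b] += 1
        counts[b, a] += 1
    freq = counts / counts.sum()
    dissim = 1.0 / (1.0 + freq)         # high co-occurrence -> small distance
    np.fill_diagonal(dissim, 0.0)

    # Step 3: multidimensional scaling places each code at a point in
    # the continuity map; map each frame via positions[codes].
    mds = MDS(n_components=n_dims, dissimilarity="precomputed",
              random_state=seed)
    positions = mds.fit_transform(dissim)
    return codes, positions
```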

3:05

7SP8. Description of contextual factors affecting vowel duration. Jan P. H. van Santen (AT&T Bell Labs., 2D-452, 600 Mountain Ave., P.O. Box 636, Murray Hill, NJ 07974-0636)

As an initial phase of a project on duration models for predicting segmental durations from contextual factors, two natural speech databases produced by a male and a female speaker, with more than 50 000 manually measured segmental durations, were analyzed. This large quantity of data made it possible to perform a detailed analysis of the effects of several contextual factors, including lexical stress, word accent, the identities of adjacent segments, the syllabic structure of a word, and proximity to a syntactic boundary. Among the key results were the following. (1) The contextual factors accounted for up to 90% of the variance, and reduced the within-vowel standard deviation by a factor of 3. (2) There were complex interactions between factors, in particular between boundary proximity and postvocalic consonant identity, and between lexical stress and syllabic word structure. (3) The effects of adjacent segments were reducible to the effects of voicing and manner of production; effects of place of articulation were negligible. (4) Proximity to a boundary should be measured in terms of syllabic and segmental position, not in terms of the sum of the intrinsic durations of segments between the target and the boundary. The results were compared with data reported by Crystal and House [J. Acoust. Soc. Am. 83, 1551-1573 and 1574-1585 (1988)].
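For concreteness, here is a minimal sketch of how a "variance accounted for" figure like the 90% above can be computed, assuming the contextual factors are discrete and each token is assigned to a cell of factor values; this eta-squared-style calculation and its names are illustrative, not van Santen's actual duration model.

```python
import numpy as np
from collections import defaultdict

def variance_accounted_for(durations, factors):
    """Proportion of duration variance explained by discrete contextual
    factors: between-cell variance over total variance.  `factors` is a
    list of tuples, one per token, e.g. (stress, accent, boundary_pos)."""
    durations = np.asarray(durations, dtype=float)
    cells = defaultdict(list)
    for d, f in zip(durations, factors):
        cells[f].append(d)

    grand_mean = durations.mean()
    ss_total = ((durations - grand_mean) ** 2).sum()
    ss_between = sum(len(v) * (np.mean(v) - grand_mean) ** 2
                     for v in cells.values())
    return ss_between / ss_total
```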

3:20

7SP9. Improved vowel recognition using the discrete cosine transform. D. J. Burr (Comput. Systems Res. Dept., Bellcore, 445 South St., Morristown, NJ 07960)

Some experiments are described that use a discrete cosine transform (DCT) to represent vowel spectra for classification by a neural network. The recognition accuracy of a neural network trained by back propagation was compared to that of a simple Gaussian classifier. Twelve English vowel categories from the TIMIT database were used: {aa, ae, ah, ao, ax, eh, er, ih, ix, iy, uh, uw}. Training was done in a speaker-dependent manner using 152 male speakers from all geographic regions. Eight 16-ms DFT spectra were computed at 8-ms intervals about the center of each vowel. Sixteen DCT coefficients were derived from the 250- to 4250-Hz interval in each 8-kHz spectrum. Average DCT and delta DCT vectors over the eight frames were used as input features. Additional features included the first six peak frequencies of the DCT spectrum and two pitch parameters from the DCT coefficients. Best performance was obtained by the neural network with all 40 input features, resulting in 58.2% recognition accuracy. This compares favorably to a cochleagram representation using the same vowel classes [Muthusamy et al., ICASSP-90]. The DCT also appears to require fewer coefficients than an equivalent cepstrum-based vowel classifier [Burr, ICASSP-92].
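A minimal sketch of the feature extraction described above, assuming 16-kHz TIMIT audio (so each DFT spectrum spans 8 kHz), a Hamming window, and log-magnitude spectra; the peak-frequency and pitch features that complete the 40-dimensional input are omitted, and all names are illustrative rather than Burr's implementation.

```python
import numpy as np
from scipy.fft import rfft, dct

def dct_vowel_features(segment, fs=16000, n_frames=8,
                       frame_ms=16, hop_ms=8, n_coef=16,
                       f_lo=250, f_hi=4250):
    """Eight 16-ms DFT spectra at 8-ms hops around the vowel center,
    a DCT of the 250-4250 Hz band of each, then average and delta DCT
    vectors.  Assumes the segment is long enough to cover all frames."""
    frame_len = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    center = len(segment) // 2
    start = center - (n_frames // 2) * hop - frame_len // 2

    coefs = []
    for i in range(n_frames):
        frame = segment[start + i * hop : start + i * hop + frame_len]
        spectrum = np.abs(rfft(frame * np.hamming(frame_len)))
        freqs = np.linspace(0, fs / 2, len(spectrum))
        band = spectrum[(freqs >= f_lo) & (freqs <= f_hi)]
        coefs.append(dct(np.log(band + 1e-9), norm="ortho")[:n_coef])
    coefs = np.array(coefs)             # shape (n_frames, n_coef)

    avg = coefs.mean(axis=0)            # average DCT vector
    delta = coefs[1:] - coefs[:-1]      # frame-to-frame differences
    return np.concatenate([avg, delta.mean(axis=0)])   # 32 features
```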

3:35

7SP10. Automatic measurement of intonation. Gerard W. G. Spaai, Arent Storm, and Dik J. Hermes (Inst. for Perception Res./IPO, P.O. Box 513, NL 5600 MB, Eindhoven, The Netherlands)

If speech intonation is only represented by an unprocessed series of pitch measurements, the interpretation can be hampered by three factors. First, because of the presence of unvoiced parts in an utterance, the continuously perceived pitch contour is disturbed by interruptions. Second, in many cases, speech is characterized by involuntary pitch perturbations that either cannot be heard at all, or do not contribute to the perception of intonation. Third, the perceptual meaning of a pitch movement depends upon its position within the syllable, in many cases with respect to the vowel onset. A correct interpretation of a pitch contour, therefore, requires the position of the vowel onsets to be known, too. These problems can be solved by interpolating the pitch at unvoiced parts from the adjacent pitch measurements, by removing all perceptually irrelevant details from the contour, and by indicating the vowel onsets. Besides presenting the procedures, a system will be presented which performs these tasks in real time, and which is currently used in an evaluation of its usefulness in teaching intonation to deaf persons. The applicability of this intonation meter for training intonation in foreign languages will be indicated. [Work supported by Instituut voor Doven, St.-Michielsgestel, The Netherlands.]
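A minimal sketch of the first two processing steps, interpolation across unvoiced stretches and removal of small perturbations, assuming an F0 track in which unvoiced frames are coded as 0; the median filter and the log-frequency interpolation are illustrative choices, and the third step (marking vowel onsets) is not shown.

```python
import numpy as np

def clean_pitch_contour(f0, smooth_win=5):
    """Interpolate pitch across unvoiced stretches (f0 == 0) from the
    neighboring voiced frames, then smooth away small perturbations
    with a median filter over `smooth_win` frames."""
    f0 = np.asarray(f0, dtype=float)
    voiced = f0 > 0
    idx = np.arange(len(f0))

    # Linear interpolation across unvoiced gaps, done in log frequency
    # so the interpolation is perceptually more uniform.
    logf0 = np.interp(idx, idx[voiced], np.log(f0[voiced]))

    # Median smoothing removes isolated pitch perturbations.
    pad = smooth_win // 2
    padded = np.pad(logf0, pad, mode="edge")
    smoothed = np.array([np.median(padded[i:i + smooth_win])
                         for i in range(len(logf0))])
    return np.exp(smoothed)
```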

