Automated Screening of Speech Development Issues in Children By Identifying Phonological Error Patterns

Lauren Ward (1, 2), Alessandro Stefani (1, 3), Daniel Smith (1), Andreas Duenser (1), Jill Freyne (4), Barbara Dodd (5), Angela Morgan (6)

1 Data61, CSIRO, Hobart, Australia
2 Acoustics Research Centre, University of Salford, Manchester, UK
3 RMIT, Melbourne, Australia
4 Health & Biosecurity, CSIRO, Sydney, Australia
5 University of Melbourne, Melbourne, Australia
6 Murdoch Children's Research Institute, Melbourne, Australia

www.data61.csiro.au

State of Speech Pathology in Australia

1:5000
Ratio of speech pathologists to the general population

1 in 20
The number of children presenting to primary school with a speech disorder

18%
Percentage of children waiting more than one year for an initial speech assessment in the public sector

Motivation

There is a need for a broad speech assessment tool to triage children for professional assessment.

State of Clinical Speech Recognition

Disorder Specific
Most existing tools are disorder specific [1][2] and do not accommodate broader screening.

Data Scarcity
Standard ASR approaches require large speech corpora, which is impractical for small populations.

Phonological Error Patterns
Studies show that knowledge of the specific error pattern, i.e. the specific substituted phoneme in a substitution error, is key to predicting future speech and literacy outcomes [3].

Architecture and Approach

A proof of concept system (Figure 1) was developed to evaluate the feasibility of identifying phonological error patterns in children's speech using a limited data set.

Figure 1: Architecture of the proof of concept system

Input Speech
• Target words are based on a validated screening protocol, the Diagnostic Evaluation of Articulation and Phonology [3], and are elicited using a picture naming task
• The corpus used included 114 Australian child speakers, aged 3-14 years, with various underlying disorders such as Cerebral Palsy
• 39.50% of utterances are misarticulated
• Training data were segmented into individual phonemes

Hierarchical Neural Networks
• Acoustic models were trained using a Hierarchical Neural Network with split temporal context features, as in [4]
• Left and right contexts were classified using independent neural networks before being merged in a third neural network (a minimal sketch follows below)
• 13 MFCCs and 500 hidden-layer neurons were used
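As a rough illustration of the split temporal context arrangement described above, the sketch below classifies left and right context blocks with independent networks and merges them with a third. It is a minimal sketch only: the use of PyTorch, the context length, the phoneme inventory size and the activation functions are assumptions, not details from the poster; only the 13 MFCC features and 500 hidden-layer neurons are taken from the text.

import torch
import torch.nn as nn

class SplitContextPhonemeNet(nn.Module):
    """Sketch of a hierarchical network with split temporal context:
    left and right context blocks are classified by independent
    networks, then merged by a third network (cf. [4])."""
    def __init__(self, n_mfcc=13, context=15, hidden=500, n_phonemes=40):
        # context and n_phonemes are assumed values for illustration only
        super().__init__()
        in_dim = n_mfcc * context  # stacked MFCC frames per context block
        self.left_net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Sigmoid(),
                                      nn.Linear(hidden, n_phonemes))
        self.right_net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Sigmoid(),
                                       nn.Linear(hidden, n_phonemes))
        self.merge_net = nn.Sequential(nn.Linear(2 * n_phonemes, hidden), nn.Sigmoid(),
                                       nn.Linear(hidden, n_phonemes))

    def forward(self, left_frames, right_frames):
        left = self.left_net(left_frames.flatten(1))
        right = self.right_net(right_frames.flatten(1))
        # merged, unnormalised phoneme scores
        return self.merge_net(torch.cat([left, right], dim=1))

# Example: a batch of 8 frames, each with 15 left and 15 right context frames of 13 MFCCs
net = SplitContextPhonemeNet()
scores = net(torch.randn(8, 15, 13), torch.randn(8, 15, 13))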

Constrained Hidden Markov Model (HMM) Decoder

This decoder integrates both generalised speech pathology knowledge and the most likely error patterns for each target word.

In conjunction with expert speech pathologists, phoneme dictionaries were created for each target word, for example:

teeth.dictionary = {0: ['T'], 1: ['IY'], 2: ['TH', 'F', 'T']}

Figure 2 shows an example state transition and the weights used. These weights were manually optimized using the phoneme dictionary, rather than trained, facilitating the use of small data sets. Optimization of the weights yielded Ws:We = 5:1 and Wu = 1.

Figure 2: An example of a state transition in the Constrained Hidden Markov Model Decoder
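Such a dictionary lists, for each phoneme position of the target word, the expected phoneme followed by the substitutions the decoder is allowed to recognise. As a hypothetical illustration (not the poster's implementation), the helper below expands a dictionary into the pronunciation variants the constrained decoder may output; the weighting of these alternatives (Ws:We = 5:1, Wu = 1) is applied on the state transitions shown in Figure 2 and is not reproduced here.

from itertools import product

# Per-position phoneme dictionary for the target word "teeth", as in the poster:
# position 0 must be T, position 1 must be IY, position 2 may be TH (expected)
# or one of the common substitutions F or T.
teeth_dictionary = {0: ['T'], 1: ['IY'], 2: ['TH', 'F', 'T']}

def pronunciation_variants(word_dictionary):
    """Expand a per-position dictionary into all phoneme sequences the
    constrained decoder is permitted to recognise (hypothetical helper)."""
    positions = [word_dictionary[i] for i in sorted(word_dictionary)]
    return [list(variant) for variant in product(*positions)]

print(pronunciation_variants(teeth_dictionary))
# [['T', 'IY', 'TH'], ['T', 'IY', 'F'], ['T', 'IY', 'T']]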

Phonological Error Pattern Detection

Regions of phoneme substitution are determined by globally aligning the recognized phonemes with the expected target word. Whether a substitution is a typical or an atypical error pattern is then determined using a decision tree and the phoneme dictionary.

Error Types
• TH → F (e.g. the final phoneme of "teeth"): typical fronting type error (age 3yrs to 3yrs;11mths)
• TH → S: atypical backing type error (age 3yrs to 3yrs;11mths)
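A minimal sketch of this detection step is given below, using the "teeth" dictionary and the two error types above. The global alignment is implemented here as a simple Needleman-Wunsch dynamic program, and the poster's decision tree is reduced to a dictionary lookup, so this is an illustration of the idea rather than the system's actual logic; all names are illustrative.

def global_align(expected, recognized, gap=-1, match=1, mismatch=-1):
    """Needleman-Wunsch global alignment of two phoneme sequences.
    Returns aligned pairs, with None marking an insertion or deletion."""
    n, m = len(expected), len(recognized)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i-1][j-1] + (match if expected[i-1] == recognized[j-1] else mismatch)
            score[i][j] = max(diag, score[i-1][j] + gap, score[i][j-1] + gap)
    # Trace back to recover the aligned phoneme pairs
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i-1][j-1] + (match if expected[i-1] == recognized[j-1] else mismatch):
            pairs.append((expected[i-1], recognized[j-1])); i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i-1][j] + gap:
            pairs.append((expected[i-1], None)); i -= 1
        else:
            pairs.append((None, recognized[j-1])); j -= 1
    return list(reversed(pairs))

def classify_substitutions(target_dictionary, expected, recognized):
    """Label each substituted phoneme as a typical (in-dictionary) or
    atypical (out-of-dictionary) error pattern (illustrative only)."""
    labels, pos = [], 0
    for exp, rec in global_align(expected, recognized):
        if exp is None:  # insertion: not a substitution
            continue
        if rec is not None and rec != exp:
            allowed = target_dictionary.get(pos, [exp])
            labels.append((pos, exp, rec, "typical" if rec in allowed else "atypical"))
        pos += 1
    return labels

# "teeth" realised with a final F (typical fronting) vs a final S (atypical backing)
teeth_dictionary = {0: ['T'], 1: ['IY'], 2: ['TH', 'F', 'T']}
print(classify_substitutions(teeth_dictionary, ['T', 'IY', 'TH'], ['T', 'IY', 'F']))
print(classify_substitutions(teeth_dictionary, ['T', 'IY', 'TH'], ['T', 'IY', 'S']))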

Experimental Results

Constrained HMM Decoder

The developed decoder's performance was tested against the HVite decoder for small data sets:
• The number of utterances in the training set was varied between 10 and 70 utterances per word
• Half of the utterances exhibited common mispronunciations

Figure 3 shows that the developed decoder consistently outperforms the reference decoder. Training sets of fewer than 30 utterances per word are not recommended, as at this level the overall PER exceeds 25%.

33.12%
Average improvement in Phoneme Error Rate (PER) achieved by the developed decoder

19.03%
Minimum PER

Figure 3: Phoneme error rate (PER) for different data quantities using the HVite HMM and constrained HMM decoders
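Phoneme error rate is conventionally computed as the phoneme-level edit distance normalised by the reference length; a minimal sketch of the metric is shown below. The poster does not specify the exact scoring tool used, so this is illustrative only.

def phoneme_error_rate(reference, hypothesis):
    """PER = (substitutions + deletions + insertions) / reference length,
    computed with a standard Levenshtein edit distance over phoneme symbols."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i-1] == hypothesis[j-1] else 1
            dist[i][j] = min(dist[i-1][j-1] + cost,  # substitution / match
                             dist[i-1][j] + 1,       # deletion
                             dist[i][j-1] + 1)       # insertion
    return dist[n][m] / n

# e.g. "teeth" recognised with a substituted final phoneme: PER = 1/3
print(phoneme_error_rate(['T', 'IY', 'TH'], ['T', 'IY', 'F']))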

Phonological Error Pattern Detection

Further testing showed that phonological error patterns could be detected with an accuracy ranging from 63.12% to 94.12% for training sets ranging in size from 30 to 70 utterances.

A cohort of six evaluation speakers was used to test the system's clinical accuracy. These speakers had been deemed by an expert speech pathologist to have either low-risk speech (correct) or moderate-risk speech (with age-appropriate errors).

Recognition accuracy for correctly articulated phonemes: 92.86% for the Low-Risk speakers and 87.26% for the Moderate-Risk speakers.

Table 1: Recognition accuracy for fronting and gliding type phonological error patterns for the Moderate-Risk evaluation speakers

              Speaker 4    Speaker 5    Speaker 6
Fronting      100%         100%         80%
Gliding       N/A          60%          100%

Conclusions

The proof of concept system demonstrates that phonological error patterns can be detected with good accuracy despite limited training data. Specifically:
• The developed Constrained HMM decoder can achieve a minimum 19.03% PER and an average 33.12% improvement
• Specific phonological error patterns can be detected with up to 94.12% accuracy

Future Work

Development of a full screening tool which includes:
• An increased target word set to enable detection of more varied phonological error patterns
• Greater integration of expert speech pathologist knowledge, such as developmental norms
• Mobile device compatibility and gamification

FOR FURTHER INFORMATION
Andreas Duenser, e: [email protected]

REFERENCES
[1] M. Shahin, B. Ahmed, A. Parnandi, V. Karappa, J. McKechnie, K. J. Ballard, and R. Gutierrez-Osuna, "Tabby Talks: An automated tool for the assessment of childhood apraxia of speech," Speech Communication, vol. 70, pp. 49-64, 2015.
[2] T. DiCicco and R. Patel, "Automatic landmark analysis of dysarthric speech," J. Medical Speech-Language Pathology, vol. 16, pp. 213-219, 2008.
[3] A. Morgan, K. Ttofari-Eecen, P. Eadie, S. Reilly, and B. Dodd, "Speech sound error type at 4 predicts speech outcome at 7: findings from the Early Language in Victoria community study," (in preparation).
[4] P. Schwarz, P. Matejka, and J. Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proc. IEEE 2006 Int. Conf. on Acoustics, Speech and Signal Processing, Toulouse, France, 2006, pp. 325-328.

WE ACKNOWLEDGE THE COLLABORATION AND SUPPORT OF