Automated Screening of Speech Development Issues in Children by Identifying Phonological Error Patterns

Lauren Ward 1,2, Alessandro Stefani 1,3, Daniel Smith 1, Andreas Duenser 1, Jill Freyne 4, Barbara Dodd 5, Angela Morgan 6

1 Data61, CSIRO, Hobart, Australia
2 Acoustics Research Centre, University of Salford, Manchester, UK
3 RMIT, Melbourne, Australia
4 Health & Biosecurity, CSIRO, Sydney, Australia
5 University of Melbourne, Melbourne, Australia
6 Murdoch Children's Research Institute, Melbourne, Australia

www.data61.csiro.au

State of Speech Pathology in Australia

1:5000
Ratio of Speech Pathologists to the general population

1 in 20
Children presenting to primary school with a speech disorder

18%
Children waiting more than one year for an initial speech assessment in the public sector
Motivation

State of Clinical Speech Recognition
• Disorder Specific: most existing tools are disorder specific [1][2] and do not accommodate broader screening
• Data Scarcity: standard ASR approaches require large speech corpora, which is impractical for small populations
• Phonological Error Patterns: studies show that knowledge of the specific error pattern, i.e. the specific substituted phoneme in a substitution error, is key to predicting future speech and literacy outcomes [3]

There is a need for a broad speech assessment tool to triage children for professional assessment.

Architecture and Approach

A proof of concept system (Figure 1) was developed to evaluate the feasibility of identifying phonological error patterns in children's speech using a limited data set.

Figure 1: Architecture of the proof of concept system

Input Speech
• Target words are based on a validated screening protocol, the Diagnostic Evaluation of Articulation and Phonology [3], and are elicited using a picture naming task
• The corpus comprised 114 Australian child speakers, aged 3-14 years, with various underlying disorders such as cerebral palsy
• 39.50% of utterances are misarticulated
• Training data were segmented into individual phonemes

Hierarchical Neural Networks
• Acoustic models were trained using a Hierarchical Neural Network with split temporal context features, as in [4]
• Left and right contexts were classified by independent neural networks before being merged in a third neural network
• 13 MFCC features and 500 hidden-layer neurons were used (a code sketch of this structure follows below)
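As a concrete illustration of the split temporal context design of [4], the sketch below classifies the left and right context blocks with independent networks and merges their posteriors in a third network. This is a hedged illustration in PyTorch, not the authors' implementation: the 13 MFCCs and 500 hidden neurons come from the poster, while the context length, phoneme inventory size and activation choices are assumptions.

import torch
import torch.nn as nn

class SplitTemporalContextNet(nn.Module):
    """Hierarchical phoneme classifier in the style of [4]: independent
    left- and right-context networks merged by a third network."""
    def __init__(self, n_mfcc=13, context_frames=15, n_hidden=500,
                 n_phonemes=40):  # context_frames and n_phonemes are assumed
        super().__init__()
        in_dim = n_mfcc * context_frames  # one half of the feature window
        def context_net():
            return nn.Sequential(
                nn.Linear(in_dim, n_hidden), nn.Sigmoid(),
                nn.Linear(n_hidden, n_phonemes), nn.Softmax(dim=-1))
        self.left_net = context_net()    # classifies the left context alone
        self.right_net = context_net()   # classifies the right context alone
        self.merge_net = nn.Sequential(  # merges the two posterior estimates
            nn.Linear(2 * n_phonemes, n_hidden), nn.Sigmoid(),
            nn.Linear(n_hidden, n_phonemes))

    def forward(self, left_ctx, right_ctx):
        posteriors = torch.cat([self.left_net(left_ctx),
                                self.right_net(right_ctx)], dim=-1)
        return self.merge_net(posteriors)  # per-frame phoneme logits

In the original scheme the context networks are trained before the merging network; the sketch is only meant to make the data flow explicit.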
Constrained Hidden Markov Model (HMM) Decoder

This decoder integrates both generalised speech pathology knowledge and the most likely error patterns for each target word.

In conjunction with expert speech pathologists, phoneme dictionaries were created for each target word, for example:

teeth.dictionary = { 0: ['T'], 1: ['IY'], 2: ['TH', 'F', 'T'] }
Figure 2 shows an example state transition and the weights used. These weights were manually optimized using the phoneme dictionary, rather than trained, facilitating the use of small data sets. Optimization of the weights yielded Ws:We = 5:1 and Wu = 1.

Figure 2: An example of a state transition in the Constrained Hidden Markov Model Decoder
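The decoding constraint can be made concrete with a small sketch. The following Python is an assumption-laden illustration, not the poster's decoder: each word position is restricted to the phonemes its dictionary allows, and a left-to-right Viterbi pass labels the frames. The per-frame phoneme log-probabilities and the w_stay/w_advance weights are hypothetical stand-ins, since the exact roles of Ws, We and Wu are not detailed here.

import math

# Per-position phoneme dictionary from the poster; the expected phoneme is
# listed first, followed by clinically likely substitutions.
teeth_dictionary = {0: ['T'], 1: ['IY'], 2: ['TH', 'F', 'T']}

def constrained_viterbi(frames, dictionary, w_stay=5.0, w_advance=1.0):
    """Decode a left-to-right chain whose state i may only emit the phonemes
    allowed at word position i. `frames` is a list of dicts mapping phonemes
    to per-frame log-probabilities (e.g. from the acoustic model). Returns
    the best (position, phoneme) label for every frame.
    NOTE: w_stay/w_advance are illustrative, not the poster's Ws, We, Wu."""
    n = len(dictionary)
    log_stay, log_adv = math.log(w_stay), math.log(w_advance)

    def best_emit(pos, frame):
        # Highest-scoring phoneme the dictionary allows at this position.
        return max((frame.get(p, float('-inf')), p) for p in dictionary[pos])

    # delta[pos] = (path log-score, path), path = [(pos, phoneme), ...]
    score, ph = best_emit(0, frames[0])
    delta = [(score, [(0, ph)])] + [(float('-inf'), [])] * (n - 1)
    for frame in frames[1:]:
        new_delta = []
        for pos in range(n):
            score, ph = best_emit(pos, frame)
            # Reach `pos` either by staying there or advancing from pos - 1.
            options = [(delta[pos][0] + log_stay + score, delta[pos][1])]
            if pos > 0:
                options.append((delta[pos - 1][0] + log_adv + score,
                                delta[pos - 1][1]))
            best_score, best_path = max(options, key=lambda t: t[0])
            new_delta.append((best_score, best_path + [(pos, ph)]))
        delta = new_delta
    return delta[-1][1]  # best path that ends in the final word position

Collapsing the per-frame labels then yields one recognized phoneme per word position.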
Phonological Error Pattern Detection

Regions of phoneme substitution are determined by globally aligning the recognized phonemes with the expected target word. Whether a substitution is typical or atypical is then determined using a decision tree and the phoneme dictionary.
Error Types
• TH → F, e.g. the final phoneme of "teeth": typical fronting type error (age 3 yrs 0 mths to 3 yrs 11 mths)
• TH → S: atypical backing type error (age 3 yrs 0 mths to 3 yrs 11 mths)
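As a rough sketch of this step, the recognized sequence can be globally aligned to the target with Needleman-Wunsch and each substitution checked against the phoneme dictionary. Note this is illustrative only: a plain dictionary lookup stands in for the poster's decision tree, and the scoring constants are assumed.

def global_align(target, recognized, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch alignment of two phoneme sequences. Returns a list
    of (target_phoneme_or_None, recognized_phoneme_or_None) pairs."""
    n, m = len(target), len(recognized)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = match if target[i - 1] == recognized[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + d,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        d = match if i > 0 and j > 0 and target[i - 1] == recognized[j - 1] \
            else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + d:
            pairs.append((target[i - 1], recognized[j - 1])); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((target[i - 1], None)); i -= 1
        else:
            pairs.append((None, recognized[j - 1])); j -= 1
    return pairs[::-1]

def classify_substitutions(target, recognized, dictionary):
    """Label each substituted position as a known (typical) pattern or not.
    A dictionary lookup stands in for the poster's decision tree."""
    labels, pos = [], 0
    for tgt, rec in global_align(target, recognized):
        if tgt is not None and rec is not None and tgt != rec:
            known = rec in dictionary.get(pos, [])
            labels.append((pos, tgt, rec, 'typical' if known else 'atypical'))
        if tgt is not None:
            pos += 1  # advance the target position except on insertions
    return labels

# e.g. "teeth" recognized with a fronted final phoneme:
print(classify_substitutions(['T', 'IY', 'TH'], ['T', 'IY', 'F'],
                             {0: ['T'], 1: ['IY'], 2: ['TH', 'F', 'T']}))
# -> [(2, 'TH', 'F', 'typical')]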
Experimental Results

Constrained HMM Decoder
The developed decoder's performance was tested against the HVite decoder on small data sets:
• The number of utterances in the training set was varied between 10 and 70 per word
• Half of the utterances exhibited common mispronunciations

Figure 3 shows that the developed decoder consistently outperforms the reference decoder. Using fewer than 30 utterances per word is not recommended, as at that level the overall PER exceeds 25%.

33.12%
Average improvement in Phoneme Error Rate (PER) achieved by the developed decoder

19.03%
Minimum PER

Figure 3: Phoneme error rate (PER) for different data quantities using the HVite HMM and constrained HMM decoders
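For reference, PER here is the standard phoneme-level Levenshtein distance normalised by the reference length; a minimal helper (illustrative, not the evaluation code behind Figure 3):

def phoneme_error_rate(reference, hypothesis):
    """PER = (substitutions + insertions + deletions) / len(reference),
    i.e. edit distance over phoneme symbols."""
    n, m = len(reference), len(hypothesis)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dist[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[n][m] / n

# e.g. one substitution in a three-phoneme word:
print(phoneme_error_rate(['T', 'IY', 'TH'], ['T', 'IY', 'F']))  # 0.333...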
Phonological Error Pattern Detection
A cohort of six evaluation speakers was used to test the system's clinical accuracy. These speakers had been deemed by an expert speech pathologist to have either low-risk speech (correct) or moderate-risk speech (with age-appropriate errors).

Recognition accuracy for correctly articulated phonemes:
Low-Risk: 92.86%    Moderate-Risk: 87.26%

Further testing showed phonological error patterns could be detected with an accuracy ranging from 63.12% to 94.12% for training sets ranging in size from 30 to 70 utterances.

Table 1: Recognition accuracy for fronting and gliding type phonological error patterns for the moderate-risk evaluation speakers

            Speaker 4    Speaker 5    Speaker 6
Fronting    100%         100%         80%
Gliding     N/A          60%          100%
Conclusions
The proof of concept system demonstrates that phonological error patterns can be detected with good accuracy despite limited training data. Specifically:
• The developed Constrained HMM decoder achieves a minimum PER of 19.03% and an average improvement of 33.12%
• Specific phonological error patterns can be detected with up to 94.12% accuracy

Future Work
Development of a full screening tool, which includes:
• An increased target word set to enable detection of more varied phonological error patterns
• Greater integration of expert Speech Pathologist knowledge, such as developmental norms
• Mobile device compatibility and gamification
FOR FURTHER INFORMATION
Andreas Duenser
e [email protected]

REFERENCES
[1] M. Shahin, B. Ahmed, A. Parnandi, V. Karappa, J. McKechnie, K. J. Ballard, and R. Gutierrez-Osuna, "Tabby Talks: An automated tool for the assessment of childhood apraxia of speech," Speech Communication, vol. 70, pp. 49–64, 2015.
[2] T. DiCicco and R. Patel, "Automatic landmark analysis of dysarthric speech," J. Medical Speech-Language Pathology, vol. 16, pp. 213–219, 2008.
[3] A. Morgan, K. Ttofari-Eecen, P. Eadie, S. Reilly, and B. Dodd, "Speech sound error type at 4 predicts speech outcome at 7: findings from the Early Language in Victoria community study," (in preparation).
[4] P. Schwarz, P. Matejka, and J. Cernocky, "Hierarchical structures of neural networks for phoneme recognition," in Proc. IEEE 2006 Int. Conf. on Acoustics, Speech and Signal Processing, Toulouse, France, 2006, pp. 325–328.