An Auditory-Feedback-Based Neural Network Model of Speech Production That Is Robust to Developmental Changes in the Size and Shape of the Articulatory System Daniel E. Callan ATR Human Information Processing Research Laboratories Kyoto, Japan and ATR-I Brain Activity Imaging Center Kyoto, Japan
Ray D. Kent Department of Communicative Disorders University of Wisconsin– Madison
Frank H. Guenther Department of Cognitive and Neural Systems Boston University Boston, MA
Houri K. Vorperian Department of Communicative Disorders University of Wisconsin– Madison
The purpose of this article is to demonstrate that self-produced auditory feedback is sufficient to train a mapping between auditory target space and articulator space under conditions in which the structures of speech production are undergoing considerable developmental restructuring. One challenge for competing theories that propose invariant constriction targets is that it is unclear what teaching signal could specify constriction location and degree so that a mapping between constriction target space and articulator space can be learned. It is predicted that a model trained by auditory feedback will accomplish speech goals, in auditory target space, by continuously learning to use different articulator configurations to adapt to the changing acoustic properties of the vocal tract during development. The Maeda articulatory synthesis part of the DIVA neural network model (Guenther et al., 1998) was modified to reflect the development of the vocal tract by using measurements taken from MR images of children. After training, the model was able to maintain the 11 English vowel targets in auditory planning space, utilizing varying articulator configurations, despite morphological changes that occur during development. The vocal-tract constriction pattern (derived from the vocal-tract area function) as well as the formant values varied during the course of development in correspondence with morphological changes in the structures involved with speech production. Despite changes in the acoustical properties of the vocal tract that occur during the course of development, the model was able to demonstrate motor-equivalent speech production under lip-restriction conditions. The model accomplished this in a self-organizing manner even though there was no prior experience with lip restriction during training.
KEY WORDS: speech production, development, neural network, auditory, vowels
Journal of Speech, Language, and Hearing Research • Vol. 43 • ?–? • June 2000 • ©American Speech-Language-Hearing Association • 1092-4388/00/4303-0001
Callan et al.: Auditory-Feedback Model of Speech Production
One challenge that models of perception and motor systems need to address is that of developmental change. Models of motor control must account for ongoing changes in the size, shape, and muscle innervation pattern of the articulators associated with accomplishing some functional goal in the face of alterations in associated sensory cues during development. This issue was addressed by Bernstein
(1967), in relation to how the functional goal of gait transition from walking to running is accomplished by adaptation of the motor system to alterations in the biomechanical properties of the limbs during development. Bernstein (1967) demonstrated that different articulatory patterns are used throughout development to compensate for ongoing biomechanical change, resulting from alterations in the morphology of the articulators necessary for carrying out the functional goal. Given the individual variability in the size and rate of morphologic growth, it is unlikely that specific innate mechanisms can be used to solve the challenge of development. Rather, an organism must learn through adaptive interaction with the environment to compensate for change in the relevant cues in order to accomplish some functional goal (Edelman, 1987; Sporns & Edelman, 1993; Thelen, 1995). With regard to speech motor control, the developmental challenge can be stated as the necessity for a model to be able to account for how functional speech goals are accomplished under conditions in which the size, shape, and muscle innervation pattern of the articulators undergo considerable developmental restructuring (Honda, 1998; Kent, 1981, 1984; Thelen, 1991). There are quite remarkable differences between the structures involved with speech production of the infant and those of the adult (see Goldstein, 1980; Kent, 1999; and Kent & Vorperian, 1995 for an extensive review of the development of the structures involved with speech production). The infant vocal tract is not just a miniaturized version of the adult vocal tract. There are significant differences in the shape, size, innervation, and functional aspects of the structures involved with speech production in the infant and adult. It is important to note that the various structures develop following different time courses. 
The changes in the structures associated with speech production that occur during development will influence the acoustic output to varying degrees. These changes may require that the child constantly learn to use different articulatory configurations to achieve the same speech goals. Not only must a model of speech production be able to demonstrate the ability to compensate for developmental changes in the size and shape of the vocal tract, it must also be able to demonstrate motor-equivalent behavior throughout development. In everyday life, children demonstrate motor equivalence in their ability to produce speech with various foreign objects in their mouth by utilizing differing articulatory configurations. It has been demonstrated that children 5 and 8 years of age show articulatory compensation to the same vowels under normal and bite-block conditions (Baum & Katz, 1988). (For a more extensive discussion of motor equivalence see Bernstein, 1967; Folkins & Abbs, 1975; Kelso,
Tuller, Vatikiotis-Bateson, & Fowler, 1984; Lindblom, Lubker, & Gay, 1979; Sporns & Edelman (1993); Turvey, Fitch, & Tuller, 1982; and Whiting, 1984). The individual specific changes in the size and shape of the structures related to speech that occur during development preclude the utility of an innately coded, invariant mapping that could be used to move the articulators to accomplish some functional goal. The speech production system must establish a mapping that is able to move the articulators in a manner to reach learned targets, taking into account the current context of the system. Auditory feedback of self-produced speech may serve as an adaptive signal that could establish a mapping that guides the movements of the articulators in order to reach auditory targets (Guenther, Hampson, & Johnson, 1998; Perkell et al., 1997). For a discussion of an alternative view that maintains that the targets of speech production are articulatory gestures, see Browman and Goldstein (1990) and Saltzman and Munhall (1989). It has been pointed out that it is not clear how a mapping that guides the movement of the articulators in order to reach gestural targets could be established during development unless one considers auditory feedback as a training signal (Guenther et al., 1998). The purpose of the current study is to demonstrate that self-produced auditory feedback is sufficient to train a mapping between auditory target space and articulator space under conditions in which the structures of speech production undergo considerable developmental restructuring. There are currently no models of speech production that attempt to solve the problem of how functional goals are maintained under these conditions. Some models are not concerned with how establishment of the mapping from target reference space to articulator space can be learned (Browman & Goldstein, 1990; Saltzman & Munhall, 1989). 
Rather, for these models (the gestural and task dynamic models), the mapping between constriction target space and articulator space is hand set by the experimenters. Other models of speech production that use a neural-network-based approach and an auditory target reference space (Guenther et al., 1998; Johnson, 1998; Markey, 1994), and in some cases a visual target space as well (Bailly, 1997), are able to account for the learning of the mapping from target reference space to articulator reference space. However, these models, as they are currently implemented, do not attempt to deal with the developmental problem of how the speech production system can accomplish the same functional goals under conditions of considerable morphological restructuring. As such, the model proposed here that extends the DIVA (Directions in auditory space to Velocities in Articulator space) model (Guenther et al., 1998) is a unique contribution in that it is one of the first models that attempts to accomplish this goal. The
morphological restructuring of the speech production system is modeled by altering the vocal tract dimensions of the Maeda (1990) articulatory part of the DIVA model during the course of developmental training. The method used to alter these dimensions will be discussed further below.
Figure 1. Overview of the DIVA model (adapted from Guenther et al., 1998). Learned mappings are indicated by asterisks. See text for details.
The DIVA Model of Speech Production
The DIVA model consists of a training (babbling) stage and a performance stage. The DIVA model is able to produce vowels during the performance stage by means of the acquisition of three learned mappings during the training stage (see Figure 1). These three mappings are the articulator-to-auditory mapping, the auditory-to-articulatory directional mapping, and the phoneme-to-auditory mapping. These three mappings are learned using a training (babbling) process in which the positions of the seven-parameter articulation model (Maeda, 1990; see Figure 2 for description) are randomly set and the acoustic consequence of the vocal tract shape is determined and used as feedback. The first two mappings (the articulator-to-auditory mapping and the auditory-to-articulatory directional mapping) are learned by means of neural networks that use aspects of auditory feedback as a training signal. The phoneme-to-auditory mapping is learned by means of a Hebbian neural network that correlates activation of nodes in the speech-sound map with movement directions in auditory planning space (Planning Direction Vector) under conditions in which a random babbled articulator position produces an auditory signal that falls within one of the 11 vowel target regions (the Miller, 1989, formant log ratio transform serves as the auditory target reference frame). The DIVA model is able to produce a user-defined vowel during the performance stage in the following manner: Given the current position of the articulators and the corresponding position in the auditory target reference frame (determined by either auditory feedback or the articulator-to-auditory mapping), the DIVA model iteratively moves the articulators in a direction that minimizes the distance to the vowel target. The vowel targets are defined by hyperplanes in auditory reference space.
The distance to the vowel target is measured as the distance from the current position of the model in the auditory target reference frame to the nearest point of the vowel hyperplane in the same reference frame. The relationship between movement directions in auditory space and movement directions of the articulators is determined by the auditory-to-articulatory directional mapping. After several iterations through this process the DIVA model is able to produce the target vowel. The auditory-to-articulatory directional mapping and the use of hyperplanes as targets instead of points allow for motor-equivalent speech production. In the case of a bite block,
Figure 2. Direction of movement for the 7 Maeda (1990) articulation parameters derived from a factor analysis of cineradiographic and labiofilm data. The 7 articulatory parameters can be individually shifted between –3 and +3 standard deviations to derive different vocal tract shapes. It is important to note that the Maeda (1990) model is a two-dimensional model working in the midsagittal plane. The cross-sectional area is determined using a scaling factor (Maeda, 1990). The area function of the vocal tract shape is used to determine the acoustic output of the model; this is the source of auditory feedback that is used to determine formant values for the speech-recognition system of the DIVA model.
in which the jaw height parameter is fixed, the other articulators are able to spontaneously compensate by moving in a direction that reduces the distance to the nearest region of the hyperplane of the target phoneme in the speech-sound map. For an extensive discussion of the DIVA model and all of its components see Callan (1998), Guenther (1995), Guenther et al. (1998), and Johnson (1998).
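The iterative movement toward a target region, and the compensation that follows when one articulator is fixed, can be illustrated with a deliberately small sketch. Everything in it is an assumption made for illustration: a hypothetical linear map `FORWARD` stands in for the Maeda synthesizer, a box of ranges stands in for a vowel target region, and a Jacobian-transpose step is swapped in for DIVA's learned directional mapping. It is not the DIVA implementation.

```python
# Toy illustration only. FORWARD, TARGET, and the update rule are hypothetical
# stand-ins, not the Maeda model or DIVA's learned mappings.

# Hypothetical articulator-to-auditory map: 3 articulators -> 2 auditory dimensions.
FORWARD = [[1.0, 0.5, 0.3],
           [0.2, -0.8, 0.6]]


def auditory(art):
    """Auditory position produced by an articulator configuration."""
    return [sum(w * a for w, a in zip(row, art)) for row in FORWARD]


def vector_to_region(point, region):
    """Vector from point to the nearest point of a box-shaped target region."""
    return [min(max(p, lo), hi) - p for p, (lo, hi) in zip(point, region)]


def reach(region, start, frozen=(), rate=0.2, max_steps=2000, tol=1e-6):
    """Move the free articulators until the auditory position enters the region."""
    art = list(start)
    for _ in range(max_steps):
        err = vector_to_region(auditory(art), region)
        if max(abs(e) for e in err) < tol:
            break
        for j in range(len(art)):
            if j in frozen:
                continue  # a restricted articulator (e.g., bite block) never moves
            # Jacobian-transpose step: move articulator j in the direction that
            # reduces the remaining auditory distance to the region.
            art[j] += rate * sum(FORWARD[i][j] * err[i] for i in range(len(err)))
    return art


TARGET = [(0.8, 1.0), (-0.6, -0.4)]          # hypothetical target ranges
free = reach(TARGET, [0.0, 0.0, 0.0])
blocked = reach(TARGET, [0.0, 0.0, 0.0], frozen={0})

print("unrestricted:", [round(a, 3) for a in free])
print("articulator 0 fixed:", [round(a, 3) for a in blocked])
```

In this toy setting both runs end inside the same target region while using different articulator configurations, which is the sense of motor equivalence described above. Because the target is a region rather than a point, the controller stops at the nearest edge it reaches instead of being forced to a single configuration.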
Developmental Measures Used to Modify the Dimensions of the Maeda Model
In order to simulate the developmental restructuring of the speech-production system, the vocal-tract dimensions of the Maeda (1990) model were altered during the course of training. Measures of vocal-tract dimensions were approximated from mid-sagittal MRI slices of four males at ages 3, 7, 15, 24, 36, and 45 months; 216 months represents the adult Maeda articulation dimensions. The MRI scans of the children were acquired as part of a medical examination at the University of Wisconsin–Madison Hospital. The semipolar coordinate gridlines (shown in Figure 3) were made using programs developed by Mark Tiede at Advanced Telecommunications Research Institute that run on the public domain image analysis software Scion Image (http://www.scioncorp.com/). Coordinate grids are spaced by 0.5 cm in the two linear dimensions and by 11.2 degrees in the polar region. It can be seen by comparison of A and B in Figure 3 that there are considerable differences in the geometrical configuration of the four reference points that are used to define the placement of the semipolar grid. This results in a different number and respective ratio of gridlines within each section during the course of development (see Figure 4, A–C). Developmental curves used to modify the dimensions of the Maeda articulation model were approximated from measures taken from the MR images as well as data given in Kent and Vorperian (1995) (see Figure 4, A–C). Spline
interpolation was used to smooth the curves. It is important to note that the same ratio is used to calculate the cross-sectional diameter within each respective region (Palatal-Dental, Velar, and Pharynx). From the various developmental curves, the coordinate system defining the midsagittal outline of the vocal tract was modified to reflect the corresponding dimensions for each age. The factor patterns determining the respective shape that results from manipulation of the Maeda (1990) articulatory parameters were not altered during the course of development; adult-like movement patterns of the articulators were used. It is recognized that these developmental changes used to alter the dimensions of the Maeda (1990) articulatory model are only gross approximations of actual developmental changes that occur in the structures involved with speech production. However, for the purpose of demonstrating that auditory feedback is sufficient to train a mapping between auditory reference space and articulatory reference space under conditions in which the structures of speech production undergo considerable morphological restructuring during the course of development, the changes made in the dimensions of the Maeda (1990) articulatory model are adequate.
Specific Predictions of the Extended DIVA Model With Reference to the Goals of This Study
1. The model will be able to maintain vowel targets during the course of development, under conditions of
Figure 3. A. Vocal tract of 24-month-old. B. Vocal tract of 216-month-old, based on coordinates given in the Maeda (1990) articulation model. Semipolar coordinate grids (shown in white) used to make measurements of midsagittal MRI slices of four males at ages 3, 7, 15, 24, 36, and 45 months. These measurements were later used to construct developmental curves for various features of the vocal tract. Reference points (shown in white) used to define the coordinates of the semipolar grid were placed at (a) the bottom of the alveolar ridge, (b) the maxilla above the hard palate, (c) the rear of the pharyngeal wall, and (d) the rear of the pharyngeal wall above the aryepiglottic tissue. The three regions are defined by the semipolar coordinates: Palatal-Dental Region (Linear), Velar Region (Polar), Pharynx Region (Linear).
Figure 4. Developmental curves used to modify the dimensions of the Maeda (1990) model. A. Change in the length of the three different regions from 3 to 60 months of age. Max corresponds to the value of the section with the maximum width for that region. Min corresponds to the value of the section with the minimum width for that region. B. Change in the length of the three different regions by the addition of new sections spaced by .5 cm from 3 to 50 months of age. C. Change in the ratio to the adult size for lip height, lip protrusion, lip width, and the larynx dimensions from 3 to 60 months of age. See text for details.
considerable morphological restructuring of the speech production system, by means of the use of self-produced auditory feedback as a training signal. This will be accomplished by adaptation of the weights in the various modifiable mappings of the DIVA model. Vowels produced by the network will show an orderly arrangement in formant and ratio space, with formant values consistent with those produced by children.
2. The model will utilize different articulatory configurations to produce the same functional goals during the course of development to accommodate changes in the acoustical properties of the vocal tract that result from morphological change.
3. The model will be able to account, in a self-organizing manner, for motor-equivalent speech production under conditions in which one or more of the articulators is restricted, even though it did not encounter these conditions during training. The lip parameters that are influential in forming typical vocal-tract constriction patterns for the vowel /u/ are restricted. The Maeda (1990) lip-aperture parameter is fixed in an open position (SD = 1.0), and the Maeda (1990) lip-protrusion parameter
is fixed in a neutral position (SD = 0.0). In order for the model to produce the proper formant values for the vowels under the lip-restricted condition, somewhat different constriction patterns will be required. These conditions are similar to the lip tube experiment conducted by Savariaux, Perrier, and Orliaguet (1995). This may have some implications for speech production and perception models that are dependent on constriction-based targets.
Training of the Neural Networks
Training was carried out for 11 English vowels on vocal tracts of 12-month-olds to 60-month-olds in steps of 3 months. The weights from the previous vocal-tract step in development are used as initial values for training on each successive vocal-tract step in development. By 60 months of age the shape of the child vocal tract has reached essentially adult proportions (Kent, 1999). The 11 English vowels used consist of /i/, /ɪ/, /e/, /ɛ/, /æ/, /ɑ/, /o/, /ʊ/, /u/, /ʌ/, and /ɝ/. Most English-speaking children are able to produce these English vowels by 60 months of age (Kent, 1992). The only vowel at this age that typically may be difficult is /ɝ/ (Kent, 1992). Vowel target regions are based on adult male formant values for the 11 vowels taken from Peterson and Barney (1952) with the exception of /e/ and /o/, which are taken from Hillenbrand, Getty, Clark, and Wheeler (1995). In this network the auditory-target reference frame consists of adult ratios, calculated using the Miller (1989) transform, with a constant fundamental frequency of 100 Hz. Using an adult fundamental frequency for the target reference frames is thought to be appropriate because children are likely to learn the vowel targets from hearing adult productions. This method of training was used to ensure that the learned mappings in the model reflect the goal of producing the targets of adult speech. Age-appropriate fundamental frequency as reported in Kent (1997) is used to test the network during the performance phase (see Figure 6, bottom). Using an auditory space that can normalize between adult and child speech is an essential part of this model. The model presupposes that children learn the rudimentary target regions in auditory perceptual space for the various aspects of speech before the acquisition of speech production can be accomplished.
There are several lines of evidence that suggest that infants are able to perceive the rudimentary aspects of speech before intentional speech production is acquired (Jusczyk, 1993; Kuhl, 1983; Kuhl & Meltzoff, 1996; Polka & Werker, 1994; Werker & Polka, 1993). These studies demonstrate that infants are able to discriminate, and in some cases to show categorical perception of, many speech sounds at a very early age before speech production begins. It should be noted that there is a considerable degree of evidence
suggesting that the phonetic categories of children change significantly as language is acquired (Stager & Werker, 1997; Werker & Polka, 1993). Furthermore, it has been demonstrated that children younger than 6 months of age can normalize between adult male/female and child speech (Kuhl, 1979, 1983; Kuhl & Meltzoff, 1996). Developmental learning rate and decay rate parameters are modeled after data given by Huttenlocher (1993). The decay and learning rate parameters are used to determine the magnitude of change of the weights of the three learned mappings during training. The values based on Huttenlocher (1993) were used as a plausible constraint for changes that occur in plasticity during the course of development. It is recognized that many other factors may be involved with the degree of plasticity than are accounted for here.
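The training schedule described in this section, in which each 3-month step starts from the weights learned at the previous step, can be sketched with a one-weight toy problem. The sketch below is only an assumption-laden illustration: `tract_gain` is a hypothetical stand-in for how the acoustics of the growing vocal tract drift with age, the delta rule stands in for the neural-network learning in DIVA, the learning rate is held constant, and the decay parameter is omitted, so the Huttenlocher (1993)-based plasticity schedule is not reproduced.

```python
import random


def tract_gain(age_months):
    """Hypothetical acoustic gain that drifts as the toy vocal tract grows."""
    return 2.0 - 0.01 * age_months


def babble_and_learn(weight, gain, rate, n_babbles, rng):
    """Delta-rule update of a one-weight forward model from random babbles."""
    for _ in range(n_babbles):
        position = rng.uniform(-3.0, 3.0)    # random articulator setting (SD units)
        heard = gain * position              # self-produced auditory feedback
        predicted = weight * position        # what the current mapping expects
        weight += rate * (heard - predicted) * position
    return weight


rng = random.Random(0)                       # fixed seed for repeatability
weight = 0.0                                 # untrained mapping
history = {}
for age in range(12, 61, 3):                 # 12 to 60 months in 3-month steps
    # Warm start: the weight from the previous age is the initial value here.
    weight = babble_and_learn(weight, tract_gain(age), rate=0.05,
                              n_babbles=200, rng=rng)
    history[age] = weight

for age in (12, 36, 60):
    print(age, round(history[age], 4), round(tract_gain(age), 4))
```

Because each step begins from the previous solution, the mapping only has to absorb the small change in the toy acoustics between adjacent ages rather than relearn from scratch.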
Results and Discussion of the DIVA Neural Network Simulations
Testing the Performance of the Extended DIVA Neural Network Model
The production performance of the extended DIVA model was evaluated for each of the learned vowels in 3-month intervals between 12 and 60 months of age. The performance of the model was tested using age-appropriate fundamental frequencies (see Figure 6, bottom). Performance of the model was evaluated on the basis of formant and ratio values, articulator configuration patterns, and vocal-tract constriction patterns (area functions). A neutral articulatory configuration pattern (the SDs of all 7 Maeda articulatory parameters were set to 0.0; see Figure 2) was used as the starting position of the articulators to test the performance of the model on each of the 11 vowels. The asterisks in Figure 5(A & B) denote the position of the neutral articulatory configuration in formant and ratio space. Figure 6 (bottom) indicates a decrease in formant frequencies as the vocal tract increases in size during development. This is consistent with what one would expect given the changes in the acoustic properties of the vocal tract as it increases in size during development.
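The expected lowering of formant frequencies with vocal-tract growth can be illustrated with the textbook idealization of a uniform lossless tube closed at the glottis and open at the lips, whose resonances are F_n = (2n − 1)c / (4L). This is a sketch under that idealization, not the Maeda synthesis used in the model; the tube lengths below are illustrative values and are not the measurements taken from the MR images.

```python
SPEED_OF_SOUND_CM_S = 35000.0   # approximate speed of sound in warm, moist air


def tube_formants(length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


# Illustrative lengths only (hypothetical, not the paper's developmental data).
for length in (8.0, 12.0, 17.5):
    print(length, [round(f) for f in tube_formants(length)])
```

Under this idealization every resonance scales inversely with tube length, so a longer tract yields uniformly lower formants. The developmental changes in the model also alter the shape of the tract, which this uniform-tube sketch does not capture.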
The Production of Vowel Formants During the Course of Development
The first objective of this study was to demonstrate that vowel targets can be maintained during the course of development under conditions in which the structures involved with speech production undergo considerable morphological restructuring by means of self-produced auditory feedback as a training signal. Figure 5(A & B) shows the simulation results for each of the 11 target
Figure 5. A. Results of the network for each of the 11 vowels in F1 by F2 formant space for each of the stages of development, 3 to 60 months of age (small circles). Ellipses circle the main clustering for each vowel produced by the network as well as the child formant values (large circles). The formant values corresponding to the neutral articulatory configuration are displayed as small black asterisks. The big circles represent mean child formant values taken from Peterson and Barney (1952) with the exception of /e/ and /o/ taken from Hillenbrand et al. (1995). B. Results of the network for each of the 11 vowels in R1 by R2 by R3 ratio auditory target space from 3 to 60 months of age (small circles). The ratio values corresponding to the neutral articulatory configuration are displayed as small black asterisks. The rectangles represent the hyperplane target regions derived from using the Miller (1989) transform of formant values. SR is a ratio transform of the fundamental frequency defined by Miller (1989) to normalize between male and female child speech.
vowels for the various stages throughout development. Figure 6 shows the formant frequencies in hertz produced by the network for each of the 11 vowels from 3 to 60 months of age. The formant frequencies produced by the network are comparable to those of children and show a tendency to decrease (in correspondence with acoustics) as the vocal tract increases in size and the fundamental frequency decreases (see Figure 5[A]: ellipses circle the main clustering for each vowel produced by the network; also see Figure 6). Throughout the course of development, the network produced vowels that are tightly clustered with only minor overlap and with the same respective configuration as child and adult vowels. Given the lowering of formant frequency with development it is not surprising that there is some degree of overlap in F1 by F2 space. High vowels produced by the network tend to show greater correspondence with child formant values than do low vowels. The low vowels /æ/ and /a/ tend to show lower F1 formant values than is expected given the “age” of the network. It has been published elsewhere (Guenther et al., 1998) that the DIVA model has difficulty producing the vowel /a/, possibly as a result of restrictions of the Maeda articulation model. Figure 5(B) shows the results of the 11 vowels produced by the network throughout development in the Miller (1989) log ratio auditory-target reference space. It can be seen that most of the vowels produced by the network fall within the hyperplane targets with the exception of vowels /ei/, /æ/, /a/, and /ɝ/. For the vowels /ei/ and /æ/ the error results from an undershoot in R1 space. For the vowel /a/ there appears to be some degree of error in both R1 [log(F1/SR)] and R2 [log(F2/F1)] space. SR is a ratio transform of the fundamental frequency defined by Miller (1989) to normalize between child and male and female adult speech. The error for the vowel /ɝ/ is mostly in R3 [log(F3/F2)] space.
The values shown as black asterisks in Figure 5(A & B) represent the position in acoustic/auditory space of the neutral articulatory configuration during the course of development. The formant values of the neutral articulatory configuration show a tendency to decrease during the course of development. One limitation of the model is that it learned too well in some respects. It learned to produce vowels earlier in development than is commonly found in children. At 12 months of age, the model was already able to produce 9 of the 11 vowels with fairly good accuracy. This type of mastery is not commonly seen in children until around 36 months of age (Kent, 1992). It is hypothesized that properties reflecting biophysical constraints of the developing vocal tract and aspects of motor control as well as cognitive and linguistic factors need to be incorporated for more accurate modeling of the course of acquisition for the various vowels. It is interesting to note, however, that the vowels /e/, /æ/, and /ɝ/ that show a fair degree of error in
reaching ratio targets are also vowels that are acquired somewhat later in life by children (Kent, 1992).
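The log ratio coordinates used in this section, R1 = log(F1/SR), R2 = log(F2/F1), and R3 = log(F3/F2), can be computed as in the sketch below. Two details are assumptions rather than statements from this article: the form of the sensory reference, SR = 168 (F0/168)^(1/3), which should be checked against Miller (1989), and the use of base-10 logarithms. The formant values and the target ranges in `REGION` are illustrative and hypothetical; they are not the target regions used by the network.

```python
import math


def sensory_reference(f0_hz, k=168.0):
    """Assumed form of Miller's sensory reference: SR = k * (F0 / k) ** (1/3)."""
    return k * (f0_hz / k) ** (1.0 / 3.0)


def log_ratios(f1, f2, f3, f0):
    """Return (R1, R2, R3) = (log(F1/SR), log(F2/F1), log(F3/F2)), base 10 assumed."""
    sr = sensory_reference(f0)
    return (math.log10(f1 / sr), math.log10(f2 / f1), math.log10(f3 / f2))


def in_target(point, region):
    """True if every ratio lies inside its (low, high) target range."""
    return all(lo <= p <= hi for p, (lo, hi) in zip(point, region))


# Illustrative adult-male-like formants (Hz) with the 100 Hz F0 used for targets.
ratios = log_ratios(270.0, 2290.0, 3010.0, 100.0)

# Hypothetical target ranges for (R1, R2, R3), for demonstration only.
REGION = [(0.0, 0.5), (0.8, 1.0), (0.0, 0.2)]

print([round(r, 3) for r in ratios], in_target(ratios, REGION))
```

R2 and R3 depend only on ratios between formants, so they are unchanged when all formants are scaled by a common factor; under the assumed SR, only R1 depends on the fundamental frequency.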
Developmental Articulator Configuration Patterns and Vocal Tract Constriction Patterns
The second objective of the study was to demonstrate that different articulatory configurations are used during the course of development by the model to compensate for changes in the acoustical properties of the vocal tract that result from morphological change. It can be seen in Figure 5(A & B) and Figure 6 that the position in auditory planning space corresponding to the neutral articulator parameter values changes considerably during the course of development. Changes in the starting position will cause the DIVA model to follow differing trajectories to reach the closest region of a hyperplane target of a particular speech goal. Therefore, because the nearest region of the hyperplane target in relation to the position in auditory planning space corresponding to the neutral vocal-tract configuration (used as the starting position) changes during the course of development, it is expected that there will be some change in the direction (and the corresponding final position) that the articulators are moved to reach the vowel target. This will result in variability in the configuration of the various articulatory parameters used to produce a speech goal during the course of development. Variability in speech production during the course of development has been noted in studies conducted by Green (1998) as well as Sharkey and Folkins (1985). The values of the 7 Maeda articulator parameters used to produce each vowel during the course of development are displayed in Figure 7. These values denote the movement deviation of the 7 Maeda articulatory parameters which range from –3 to +3, with 0 being the neutral value. One should refer to Figure 2 for the definition and movement patterns of each of the 7 Maeda articulators. It can be seen for each of the 11 vowels in Figure 7 that there is variability in one or more articulatory parameters during the course of development.
Front vowels are predominantly characterized by variation in lip protrusion (LP) and larynx height (LH) during the course of development (Articulator Deviation values for LP and LH in Figure 7); whereas back vowels are predominantly characterized by variability in jaw height (JH), tongue body shape (TBS), tongue tip position (TTP), LP, and LH (see Articulator Deviation values for JH, TBS, TTP, LP, and LH in Figure 7). The greater degree of variability for back vowels may reflect a greater flexibility of the movement patterns that can be used to make a backward pharyngeal constriction that would produce area functions corresponding to the desired acoustics of the target vowels.
Figure 6. Formant frequencies in hertz produced by the network for each of the 11 vowels from 3 to 60 months of age. Also shown is a plot of the formant frequencies for the neutral articulator position from 3 to 60 months of age as well as the fundamental frequency value used during the performance phase for each step of development. Empty regions at the initial part of the plots indicate that the phoneme-to-auditory map for that vowel has not been acquired yet. Plot of vowels is made in F1 by F2 space.
Figure 7. Articulator deviation patterns in standard deviation units (JH = Jaw Height; TBP = Tongue Body Position; TBS = Tongue Body Shape; TTP = Tongue Tip Position; LA = Lip Aperture; LP = Lip Protrusion; LH = Larynx Height) produced by the network for each of the 11 vowels from 3 to 60 months of age. Empty regions at the initial part of the plots indicate that the phoneme-to-auditory map for that vowel has not been acquired yet. A negative standard deviation (SD) value of jaw height (JH) corresponds to lowering of the jaw, a negative SD value of tongue body position (TBP) corresponds to an anterior position, a negative SD value of tongue body shape (TBS) corresponds to a flat tongue, a negative SD value of tongue tip position (TTP) corresponds to a posterior tongue tip position, a negative SD value of lip aperture (LA) corresponds to a closure of the lips, a negative SD value of lip protrusion (LP) corresponds to a retraction of the lips, and a negative SD value of larynx height (LH) corresponds to a lengthening of the larynx. Plot of vowels is made in F1 by F2 space.
Visual inspection of the results of the DIVA model permits the following general observations regarding speech production performance. (a) The DIVA model had more stable articulatory position values for front vowels than for back vowels during the course of development. Back vowels had more open JH values as well as more posterior tongue body position (TBP) values than did front vowels. This is consistent with what is seen in adult speakers (Kent, Dembowski, & Lass, 1996). (b) For front vowels, the tongue body position moves from anterior to posterior as one goes from high vowels to low vowels. Consistent with constriction patterns found in adult speech (Wood, 1979; see Figure 8 for a summary of the constriction patterns of the various vowels arranged in F1 by F2 space), the DIVA model produced front vowels characterized by relatively more anterior vocal-tract constrictions than back vowels. High back vowels are also characterized by tight constriction at the lips. In contrast to vocal-tract constriction patterns found in adult speech (Wood, 1979), the DIVA model produced high back vowels with narrow constrictions that become progressively less narrow in the direction of low back vowels.
Visual inspection of the formant values and the vocal-tract constriction patterns during the course of development reveals a considerable degree of change that reflects the developmental measures used to modify the Maeda articulation model (see Figures 6 and 8 and Figure 4[A–C]). As the vocal tract increases in size, the formant frequencies are lowered. In addition, as the configuration of the vocal tract changes during development, the locations of the nodes and antinodes of the vocal-tract resonances shift. This causes shifts to occur in the vocal-tract constriction pattern of the DIVA model responsible for producing the various vowels.
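The inverse relation between vocal-tract size and formant frequency can be illustrated with the textbook uniform-tube approximation (a tube closed at the glottis and open at the lips), whose resonances are F_n = (2n - 1)c / 4L. The sketch below is not the Maeda synthesizer used in the simulations; the function name, the tube lengths, and the speed-of-sound value are illustrative assumptions.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed approximate value for warm, humid air


def uniform_tube_formants(length_cm, n_formants=3):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


# Illustrative lengths only (not the MRI-derived measurements used in the paper).
for length_cm in (8.0, 12.0, 17.5):
    formants = uniform_tube_formants(length_cm)
    print(length_cm, [round(f) for f in formants])
```

Under this idealization every formant scales with 1/L, so lengthening the tube lowers all resonances together. A real vocal tract also changes shape during growth, which is why the constriction patterns in the simulations shift as well.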
It is important to note that the DIVA model is able to produce stable constriction patterns throughout the course of development without any internal rules specifying the degree and location of constriction. The only feedback the model receives regarding the area function of the vocal tract is indirectly through the produced acoustic signal. This serves as strong evidence that even though consistent constriction patterns are produced throughout development, they might not serve as target values but rather just be a product of area functions that produce acoustic properties that reflect desired auditory targets (see also Guenther et al., 1998).
Motor-Equivalent Speech Production Under Restricted Articulator Movement
One condition in which motor equivalence is demonstrated by the DIVA model during the course of development involves fixing the LA in an open position (SD = 1.0) and fixing LP in a neutral position (SD = 0.0) so that a labial constriction cannot be made to produce the vowel /u/. This is similar to the experiment conducted by Savariaux et al. (1995), in which it was demonstrated that when typical constriction patterns to produce the vowel /u/ cannot be used because of restriction of the articulators by a lip tube, some individuals are able to compensate by using atypical vocal-tract constriction patterns. This type of restriction differs from other demonstrations of motor-equivalent behavior, such as perturbation caused by a bite block, in that it requires not only that different articulatory configurations be used to produce the appropriate speech goal but also that a different constriction pattern be used. It can be seen that LA values in the unrestricted condition reflect lips that are more tightly closed than in the lip-restricted condition, in which the lips are fixed in an open position (see Figure 9[B]). The degree of LP in the normal unrestricted condition fluctuates, being at times greater and at times less than in the lip-restricted condition throughout the course of development (see Figure 9[B]). The productions of the model appear to fluctuate between the rounded and unrounded variants of the vowel /u/ during the course of development. It can be seen in Figure 9(B) that there is a considerable degree of articulatory compensation in the remaining five Maeda articulatory parameters for the vowel /u/ under the lip-restricted condition. The JH is more tightly closed, the TBP is more posterior, the TBS is more bunched, the TTP is more posterior, and the LH is for the most part lower in the lip-restricted condition than in the normal speech condition. 
Under normal, unrestricted conditions, the area function shows tight constrictions at the lips and the midpalate region and maximum areas in the alveolar ridge region and the pharynx region of the vocal tract (see Figure 9[A]).
Figure 8. Vocal tract constriction pattern (area function) produced by the network for each of the 11 vowels from 3 to 60 months of age. Empty regions at the initial part of the plots indicate that the phoneme-to-auditory map for that vowel has not been acquired yet. Plot of vowels is made in F1 by F2 space. The graphs displaying the vocal tract constriction pattern are constructed by interpolating the vocal tract area function into 17 divisions. Therefore, the x axis only represents the relative position along the vocal tract, not the actual length of the vocal tract. By using this method the relative location of constriction and location of the maximum area can be compared throughout development. See text for details.
Figure 9(A–E). Results of the network under the lip-restricted condition (LA = 1.0; LP = 0.0) for the vowel /u/.
As one would expect, there is a fair degree of variability in the vocal-tract configuration that corresponds to the fluctuations (trading relations) in the articulatory configurations that occur during the course of development (see Figure 9[A]). In the lip-restricted condition, it is not possible for a tight lip constriction to be made. The model compensates for this by producing an area function with a tighter degree of constriction in the midpalate region and a steeper maximum area in the alveolar ridge region and the pharynx region (see Figure 9[A]). The area at the glottis is also considerably larger in the lip-restricted condition than in the normal unrestricted condition. This is also characterized by a lower LH in the restricted condition than in the normal unrestricted condition (see Figure 9[B]). Although the constriction areas at the lips and the glottis are considerably larger than in the normal speech condition, the area function formed by the resultant articulatory configuration is sufficient to produce nearly identical formant values throughout the course of development despite changes in the size of the vocal tract (Figure 9[C]). This is consistent with the Wood (1986) study demonstrating that, with lip rounding for the Swedish vowel /y/, larynx depression is used in a complementary maneuver to allow for better control of resonant conditions. Conversely, in a similar manner, under the lip-restricted condition the network decreases the depression at the larynx, which allows for better control of resonant conditions, resulting in a low F2 value (see Figure 9[A & C]). Figure 9(D & E) demonstrates that the formant values produced using the compensatory articulatory values are much closer to the original formant values produced by the network than are formant values produced using the original articulator deviation values with the exception of the two fixed-lip parameters (LA and LP). 
The results of the network for the vowel /u/ under the lip-fixed condition demonstrate that somewhat different constriction patterns (area functions) can be used to produce the same vowel target. This is consistent with the findings reported by Savariaux et al. (1995).
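The compensation described above can be sketched with a toy redundant system: when more articulators are available than auditory dimensions, holding one articulator fixed still leaves enough freedom to reach the same auditory target with a different configuration. The example below is a minimal sketch under strong assumptions. It uses a hypothetical linear map `J` with two auditory dimensions and three articulators in place of the Maeda model, and a minimum-norm (pseudoinverse) step as one common way to turn a desired auditory direction into an articulator direction; neither is taken from the trained network.

```python
def forward(jacobian, x):
    """Toy linear articulator-to-auditory map: y = J x."""
    return [sum(j * xi for j, xi in zip(row, x)) for row in jacobian]


def min_norm_step(jacobian, delta_y, free):
    """Smallest change of the free articulators that moves the 2-D auditory
    output by delta_y: dx = Jf^T (Jf Jf^T)^-1 delta_y."""
    jf = [[row[j] for j in free] for row in jacobian]
    a = sum(v * v for v in jf[0])                 # entries of the 2x2 matrix Jf Jf^T
    b = sum(u * v for u, v in zip(jf[0], jf[1]))
    d = sum(v * v for v in jf[1])
    det = a * d - b * b
    w0 = (d * delta_y[0] - b * delta_y[1]) / det
    w1 = (a * delta_y[1] - b * delta_y[0]) / det
    return [jf[0][k] * w0 + jf[1][k] * w1 for k in range(len(free))]


def reach(jacobian, target, x0, free, steps=60, gain=0.5):
    """Repeatedly step toward the auditory target, moving only free articulators."""
    x = list(x0)
    for _ in range(steps):
        y = forward(jacobian, x)
        delta_y = [gain * (t - yi) for t, yi in zip(target, y)]
        for k, dx in zip(free, min_norm_step(jacobian, delta_y, free)):
            x[k] += dx
    return x


# Hypothetical map; columns stand for lips, tongue, and jaw (illustrative values).
J = [[1.0, 0.5, -0.3],
     [0.2, -0.4, 0.8]]
target = [0.6, -0.2]

unrestricted = reach(J, target, [0.0, 0.0, 0.0], free=[0, 1, 2])
restricted = reach(J, target, [1.0, 0.0, 0.0], free=[1, 2])  # lips held at 1.0

print(forward(J, unrestricted), unrestricted)
print(forward(J, restricted), restricted)
```

Both runs end at the same auditory output, but the restricted run does so with different tongue and jaw values because the lip column is never updated. No special rule for the restricted case is needed, which parallels the point that the network had no prior experience with lip restriction.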
Conclusion
The results of the neural network simulations demonstrate that a model using auditory perceptual targets to plan articulation can account for how some of the same speech goals can be met despite the changes that the vocal tract configuration undergoes during development. The model accomplishes this task by planning the movement direction of the articulators with reference to the direction in auditory space needed to reach the auditory targets. Even though log ratio values based on adult formant targets were used to train the extended DIVA model, it is able to produce the 11 American English vowels with a considerable degree of consistency throughout the course of development. The ability to produce formant frequencies for vowels that closely approximate values of children, even though the model was not trained with child formant values, is made possible by means of normalization in the auditory planning space. The auditory space used in this model, the Miller (1989) space, uses a log ratio transform to normalize the formant values of adult and child speech. This auditory representation is shown to be sufficient to allow for an orderly arrangement of formant frequencies for the various vowels that approximate values children produce throughout the course of development.
This model provides a first attempt to solve the problem of how functional goals are maintained throughout development under conditions in which the structures of the speech production system are undergoing considerable morphological restructuring. It also suggests that the production of speech goals may be accomplished throughout development by means of establishing a mapping from a direction of movement in auditory planning space to the direction of movement in articulator planning space (defined by the 7 dimensions of the Maeda articulation model) needed to produce it. The results of the simulations demonstrate that during the course of development the learned mappings in the extended DIVA model are able to produce motor-equivalent speech under conditions of lip restriction in a self-organizing manner even though the model did not encounter these conditions during training. Of particular interest is the finding that under the lip-restriction condition somewhat different constriction patterns (area functions) are used to produce the same vowel than under the unrestricted condition.
This finding has implications for speech production models that are dependent on constriction-based targets, such as the one proposed by Saltzman and Munhall (1989). It also has implications for speech perception models that are dependent on constriction-based targets. Many researchers have suggested that speech perception is mediated by an ability to directly pick up information pertaining to the articulatory gesture responsible for production (Fowler, 1986) or, alternatively, to transform the acoustic signal in reference to the articulatory gesture responsible for the production (Liberman & Mattingly, 1985). Both of these positions, the direct realist theory (Fowler, 1986) and the revised motor theory (Liberman & Mattingly, 1985), maintain that the articulatory gestures responsible for speech production do not vary considerably for the same speech goal. However, the results of the experiment conducted by Savariaux et al. (1995), which are consistent with the results of the simulations reported here, suggest that multiple constriction patterns (articulatory gestures) can be used to produce the same functional speech goals.
There are several limitations of the present neural network simulations that require the DIVA model to be extended in future projects.
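Before turning to those limitations, the normalization property attributed to the log ratio transform can be made concrete. The sketch below is an illustration under stated assumptions, not the transform as implemented in the model: the three coordinates, the `reference` frequency argument, and all numeric values are hypothetical, and the exact definition of the Miller (1989) dimensions is not given in this section.

```python
import math


def log_ratio_coordinates(f1, f2, f3, reference):
    """Three log-ratio coordinates of a vowel. `reference` is a speaker-dependent
    reference frequency in Hz (an assumed input for this sketch)."""
    return (math.log10(f1 / reference),
            math.log10(f2 / f1),
            math.log10(f3 / f2))


# Hypothetical adult formants (Hz) and reference frequency.
adult = log_ratio_coordinates(300.0, 900.0, 2400.0, reference=150.0)

# A hypothetical shorter vocal tract, idealized as scaling every frequency equally.
scale = 1.4
child = log_ratio_coordinates(300.0 * scale, 900.0 * scale, 2400.0 * scale,
                              reference=150.0 * scale)

print(adult)
print(child)
```

Because only ratios enter the coordinates, a uniform scaling of all frequencies leaves them unchanged up to rounding. Actual growth is not a uniform scaling, so for real speakers the normalization is approximate.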
Given the simple nature of the vocal-tract model used (the 7-parameter Maeda articulatory model), it is unlikely that a direct comparison to individual performance can be made. In the current implementation, the movement patterns of the articulators and their relation to the other articulators are based on the adult values of the Maeda articulation model. To be able to directly compare the performance of the model with that of individuals, it is necessary to incorporate a more accurate, individual-specific vocal-tract model (by means of MRI data) that includes aspects of muscle control in order to better characterize patterns of speech development in infants and children.
Another limitation of the neural network model presented in this paper is that the pattern of babbling used for training is too simplistic. In the model presented in this paper, babbling was simulated by randomly setting the value of the 7 Maeda articulatory parameters and then determining the resultant formant values. The mappings learned in the model are based on the statistical regularities between articulation and the corresponding auditory consequences. This suggests that different patterns of babbling will have effects on the type of mappings that are learned by the model. It would be interesting to restrict the pattern of babbling that occurs during development into front, neutral, and back frames, as is suggested by MacNeilage and Davis (1990). It would then be possible to explore the consequences that differing babbling types have on the acquisition of the production of vowels.
More extensive research needs to be conducted regarding the relationship between speech perception and speech production in order to discern possible underlying processes. The results of the simulations carried out here suggest that speech production may be carried out by a mapping between directions in auditory reference space and directions in articulator reference space needed to reach auditory speech goals. The ability of an organism to learn to accomplish functional goals during the course of development despite morphological change may provide a considerable degree of flexibility in the mappings formed to carry out these goals under various conditions even in the adult. This research suggests that a better understanding of possible mappings that may be used by humans to drive speech production will come from experiments designed to investigate the relationship between auditory and articulatory space.
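The random babbling procedure mentioned among the limitations (randomly set the 7 articulatory parameters, then record the resulting formants) can be sketched as follows. This is a minimal illustration under assumptions: `toy_synthesizer` is a hypothetical linear stand-in and not the Maeda synthesizer, and the uniform sampling range of plus or minus 3 SD units, the seed, and the function names are choices made here for the example.

```python
import random

# The seven articulatory parameters named in the Figure 7 caption.
ARTICULATORS = ("JH", "TBP", "TBS", "TTP", "LA", "LP", "LH")


def toy_synthesizer(config):
    """Hypothetical stand-in for an articulatory synthesizer.
    Returns made-up (F1, F2) values in Hz; NOT the Maeda model."""
    f1 = 500.0 - 80.0 * config["JH"] + 30.0 * config["LA"]
    f2 = 1500.0 - 250.0 * config["TBP"] - 100.0 * config["LP"]
    return (f1, f2)


def babble(n_samples, synthesizer, sd_range=3.0, seed=0):
    """Draw random articulator settings (in SD units) and pair each with the
    formants the synthesizer returns for it."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_samples):
        config = {name: rng.uniform(-sd_range, sd_range) for name in ARTICULATORS}
        pairs.append((config, synthesizer(config)))
    return pairs


training_pairs = babble(1000, toy_synthesizer)
print(len(training_pairs), training_pairs[0][1])
```

The resulting articulation and formant pairs are the kind of data from which statistical regularities between articulation and its auditory consequences could be learned. Restricting the sampling (for example, to front, neutral, or back configurations) would amount to changing how `config` is drawn.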
Acknowledgments
This work was supported in part by NIDCD grant 1R29 DC02852 and also in part by NIH grant 5R01 DC00319. We would like to thank Shinji Maeda and Mark Tiede for the use of their code.
References
Bailly, G. (1997). Learning to speak: Sensori-motor control of speech movements. Speech Communication, 22, 251–267.
Baum, S. R., & Katz, W. F. (1988). Acoustic analysis of compensatory articulation in children. Journal of the Acoustical Society of America, 84, 1662–1668.
Bernstein, N. (1967). The coordination and regulation of movements. Oxford, U.K.: Pergamon.
Browman, C. P., & Goldstein, L. (1990). Gestural specification using dynamically-defined articulatory structures. Journal of Phonetics, 18, 411–424.
Callan, D. E. (1998). An auditory-feedback based model of speech production in the developing child. Dissertation. University of Wisconsin–Madison.
Edelman, G. M. (1987). Neural Darwinism: The theory of neuronal group selection. New York: Basic Books.
Folkins, J. W., & Abbs, J. H. (1975). Additional observations on responses to resistive loading of the jaw. Journal of Speech and Hearing Research, 18, 207–220.
Fowler, C. (1986). An event approach to the study of speech perception. Journal of Phonetics, 14, 3–28.
Goldstein, U. G. (1980). An articulatory model for the vocal tracts of growing children. Dissertation. Massachusetts Institute of Technology.
Green, J. R. (1998). Physiologic development of speech motor control: Articulatory coordination of lips and jaw. Dissertation. University of Washington.
Guenther, F. H. (1995). Speech sound acquisition, coarticulation, and rate effects in a neural network model of speech production. Psychological Review, 102, 594–621.
Guenther, F. H., Hampson, M., & Johnson, D. (1998). A theoretical investigation of reference frames for the planning of speech movements. Psychological Review, 105, 611–633.
Hillenbrand, J., Getty, L., Clark, M., & Wheeler, K. (1995). Acoustic characteristics of American English vowels. Journal of the Acoustical Society of America, 97, 3099–3111.
Honda, K. (1998). Form and function: Another view of speech production. ATR Technical Report, TR-H-235.
Huttenlocher, P. (1993). Morphometric study of human cerebral cortex development. In M. Johnson (Ed.), Brain development and cognition: A reader (pp. 112–124). Cambridge, MA: Blackwell.
Johnson, C. D. (1998). Investigations of formant and wavelet representations for speech movement planning. Dissertation. Boston University.
Jusczyk, P. W. (1993). From general to language-specific capacities: The WRAPSA Model of how speech perception develops. Journal of Phonetics, 21, 3–28.
Kelso, J. S., Tuller, B., Vatikiotis-Bateson, E., & Fowler, C. A. (1984). Functionally specific articulatory adaptation to jaw perturbations during speech: Evidence for coordinative structures. Journal of Experimental Psychology, 10, 812–832.
Kent, R. D. (1981). Sensorimotor aspects of speech development. In R. N. Aslin, J. R. Alberts, & M. R. Peterson (Eds.), Development of perception (Vol. 2, pp. 161–189). New York: Academic Press.
Kent, R. D. (1984). Psychobiology of speech development: Coemergence of language and a movement system. American Journal of Physiology, 246, R888–894.
Kent, R. D. (1992). The biology of phonological development. In C. A. Ferguson, L. Menn, & C. Stoel-Gammon (Eds.), Phonological development: Models, research, implications (pp. 65–90). Timonium, MD: York Press.
Kent, R. D. (1997). The speech sciences. San Diego, CA: Singular Publishing Group.
Kent, R. D. (1999). Motor control: Neurophysiology and functional development. In A. J. Caruso & E. A. Strand (Eds.), Clinical management of motor speech disorders in children (pp. 29–71). New York: Thieme Medical and Scientific Publishers.
Kent, R. D., Dembowski, J., & Lass, N. J. (1996). The acoustic characteristics of American English. In N. J. Lass (Ed.), Principles of experimental phonetics (pp. 185–225). St. Louis: Mosby.
Kent, R. D., & Vorperian, H. K. (1995). Anatomic development of the craniofacial-oral-laryngeal systems: A review. Journal of Medical Speech-Language Pathology, 3, 145–190.
Kuhl, P. (1979). Speech perception in early infancy: Perceptual constancy for spectrally dissimilar vowel categories. Journal of the Acoustical Society of America, 66, 1668–1679.
Kuhl, P. (1983). Perception of auditory equivalence classes for speech in early infancy. Infant Behavior and Development, 6, 263–285.
Kuhl, P., & Meltzoff, A. (1996). Infant vocalizations in response to speech: Vocal imitation and developmental change. Journal of the Acoustical Society of America, 100, 2425–2438.
Liberman, A. M., & Mattingly, I. G. (1985). The motor theory of speech perception revised. Cognition, 21, 1–36.
Lindblom, B., Lubker, J., & Gay, T. (1979). Formant frequencies of some fixed-mandible vowels and a model of speech motor programming by predictive simulation. Journal of Phonetics, 7, 147–161.
MacNeilage, P. F., & Davis, B. F. (1990). Acquisition of speech production: Frames, then content. In M. Jeannerod (Ed.), Attention and performance XIII: Motor representation and control (pp. 453–476). Hillsdale, NJ: Lawrence Erlbaum.
Maeda, S. (1990). Compensatory articulation during speech: Evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model. In W. Hardcastle & A. Marchal (Eds.), Speech production and speech modeling (pp. 131–149). The Netherlands: Kluwer Academic Publishers.
Markey, K. L. (1994). Acoustic-based syllabic representation and articulatory gesture detection: Prerequisites for early childhood phonetic and articulatory development. In A. Ram & K. Eiselt (Eds.), Proceedings of the 16th Annual Conference of the Cognitive Science Society (pp. 595–600). Hillsdale, NJ: Lawrence Erlbaum Associates.
Miller, J. (1989). Auditory-perceptual interpretation of the vowel. Journal of the Acoustical Society of America, 85, 2114–2134.
Perkell, J., Matthies, M., Lane, H., Guenther, F., Wilhelms-Tricarico, R., Wozniak, J., & Guiod, P. (1997). Speech motor control: Acoustic goals, saturation effects, auditory feedback and internal models. Speech Communication, 22, 227–250.
Peterson, G., & Barney, H. (1952). Control methods used in a study of the vowels. Journal of the Acoustical Society of America, 24, 175–184.
Polka, L., & Werker, J. F. (1994). Developmental changes in perception of nonnative vowel contrasts. Journal of Experimental Psychology: Human Perception and Performance, 20, 421–435.
Saltzman, E. L., & Munhall, K. G. (1989). A dynamical approach to gestural patterning in speech production. Ecological Psychology, 1, 333–382.
Savariaux, C., Perrier, P., & Orliaguet, J. P. (1995). Compensation strategies for the perturbation of the rounded vowel [u] using a lip tube: A study of the control space in speech production. Journal of the Acoustical Society of America, 98, 2428–2442.
Sharkey, S. G., & Folkins, J. W. (1985). Variability of lip and jaw movements in children and adults: Implications for the development of speech motor control. Journal of Speech and Hearing Research, 28, 8–15.
Sporns, O., & Edelman, G. M. (1993). Solving Bernstein’s problem: A proposal for the development of coordinated movement by selection. Child Development, 64, 960–981.
Stager, C. L., & Werker, J. F. (1997). Infants listen for more phonetic detail in speech perception than in word-learning tasks. Nature, 388, 381–382.
Thelen, E. (1991). Motor aspects of emergent speech: A dynamic perspective. In N. A. Krasnegor, D. M. Rumbaugh, R. L. Schiefelbusch, & M. Studdert-Kennedy (Eds.), Biological and behavioral determinants of language development (pp. 339–362). Hillsdale, NJ: Lawrence Erlbaum Associates.
Thelen, E. (1995). Motor development: A new synthesis. American Psychologist, 50, 79–95.
Turvey, M. T., Fitch, H. L., & Tuller, B. (1982). The Bernstein perspective: I. The problems of degrees of freedom and context-conditioned variability. In J. S. Kelso (Ed.), Human motor behavior: An introduction (pp. 239–252). Hillsdale, NJ: Lawrence Erlbaum Associates.
Werker, J. F., & Polka, L. (1993). Developmental changes in speech perception: New challenges and new directions. Journal of Phonetics, 21, 83–101.
Whiting, H. T. A. (Ed.). (1984). Human motor actions: Bernstein reassessed. Amsterdam: North-Holland.
Wood, S. (1979). A radiographic analysis of constriction locations for vowels. Journal of Phonetics, 7, 25–43.
Wood, S. (1986). The acoustical significance of tongue, lip, and larynx maneuvers in rounded palatal vowels. Journal of the Acoustical Society of America, 80, 391–401.
Received March 9, 1999
Accepted October 21, 1999
Contact author: Daniel E. Callan, PhD, ATR Human Information Processing Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-0288, Japan. Email:
[email protected]