International Journal of Neural Systems, Vol. 0, No. 0 (April, 2000) 00–00
© World Scientific Publishing Company

A NEURAL NETWORK MODEL FOR THE SIMULATION OF WORD PRODUCTION ERRORS OF FINNISH NOUNS

Antti Järvelin∗
Department of Computer Sciences, FIN-33014 University of Tampere, Finland
E-mail: [email protected]

Martti Juhola
Department of Computer Sciences, FIN-33014 University of Tampere, Finland
E-mail: [email protected]

Matti Laine
Department of Psychology, Åbo Akademi University, FIN-20500 Åbo, Finland
E-mail: [email protected]

∗Corresponding author.

Received (to be inserted by Publisher)
Revised (to be inserted by Publisher)

We constructed a connectionist model with multilayer perceptron networks for the simulation of word production errors in the Finnish language. Such errors occur as slips of the tongue in healthy subjects and even more often in aphasic patients suffering from brain damage. Word production errors are of theoretical interest, as they open an avenue to the inner workings of the language production system. Here we present our coding schemes for the semantic and phoneme information of words, investigate the general properties of the model, and compare the results of the model against empirical naming error data from ten aphasic patients. The model performed reasonably in simulating word production errors in this heterogeneous group of aphasic patients.

1. Introduction

Word production is a multistaged process in which a speaker transforms the semantic representation of a target concept into its phonological representation and finally articulates it. Most models highlight two major stages in this process.1–4 At the first stage, the conceptual representation of the word to be produced is mapped onto its abstract syntactic-semantic representation, the lemma (lemma access). At the second stage, the lemma is transformed into its phonological form, the lexeme (phonological access). The intricate word production system is also quite sensitive to impairment, and therefore word-finding difficulty (anomia) is a common concomitant of left hemisphere damage. Besides halting or empty spontaneous speech, anomia can be discerned in a picture naming task. Here the types of errors produced are of particular interest, since they may inform us about the underlying causes of the patient's anomia. Commonly encountered error types include semantic errors (mouse → cheese), formal errors (mouse → house), neologistic (nonword) errors (mouse → mees) and omission errors (the patient says nothing or "I don't know", etc.).

Computational models are needed to better understand the inner dynamics of word production and its problems. Neural networks have been applied to this modeling task,5–8 since distributed computation seems to be efficient for this purpose and their topology or structure loosely corresponds to the functioning of neural cell assemblies. With regard to patient simulations, "lesioning" or "damaging" a neural network model can reveal how the underlying word production system should be organized in order to give rise to a specific symptomatology during malfunction.

In general, models of lexicalization can be divided into two classes depending on the interaction they allow between the two major stages of lexicalization. The discrete two-stage models6,7 assume that lemma access and phonological access are independent processes: all lexical-semantic processing precedes phonological processing, and only a single lemma is translated into the corresponding lexeme. In interactive activation models,5,8 lexical-semantic processing and phonological processing overlap in time and can influence each other. Nonetheless, some criticism has been presented against these models concerning their result analysis and structures.9 The most influential neural network models of lexicalization have been the Weaver model7 and the Interactive Activation model,10 which represent so-called localist connectionist models. We have also studied a localist Weaver-type model of lexicalization to model naming error data from Finnish-speaking aphasic patients.11–14

Although the above-mentioned neural network models of lexicalization have been successful in simulating various aspects of word production, there are some important phenomena, such as word learning in children and the re-acquisition of naming ability through training, which cannot be modelled with traditional non-learning models of lexicalization. To address these phenomena we have developed a multilayer perceptron based model of lexicalization which is able to learn, in the machine learning sense, the semantic and phonological features of the words in a given input data corpus. The developed model resembles Miikkulainen's SOM-based DISLEX model for simulating dyslexic and category-specific aphasic impairments.15 His model included three Self-Organizing Maps (SOMs), which modeled lexical-semantic, orthographic and phonological processing. The semantic concept map was connected to both the orthographic map and the phonological map with bidirectional connections. Therefore, lexical processing in the DISLEX model can be started by activating either the semantic, phonological or orthographic representation of a target word in the corresponding SOM. Miikkulainen showed how different kinds of dyslexic and aphasic errors can be generated by lesioning suitable connections between the SOMs, or the input connections of the SOMs. However, Miikkulainen did not attempt to model the naming distributions of actual aphasic or dyslexic patients.

Our goal has been to build a model of lexicalization to simulate the naming data of actual aphasic patients. Since we are concentrating on the modeling of aphasic picture naming errors, our model does not include an orthographic processing layer; this layer is typically excluded from models of lexicalization. Another difference to Miikkulainen's model is that in our model the processing always proceeds from the lexical-semantic representation of the target word towards its phonological representation. Therefore, e.g., the effects of phonological cueing cannot be simulated with our model.

In the next sections we first describe the developed model and present simulation results against empirical naming error data from a heterogeneous group of aphasic individuals. We begin by briefly reviewing our solution to the problem of semantic and phonological coding of words, described in detail in our previous work,16 since it has a crucial role in the model building. We then present our model in action and describe how word production errors are created by "lesioning" the model. Finally, we investigate the general properties of the model and present simulation results for the data of ten aphasic subjects.

2. Semantic and phonological coding of input words

In neural network modeling, symbolic representations such as words have to be transformed into a numerical form. In our case, this entails transforming the semantic features and phonemes of the target words into input and output vectors.


Our training set consisted of 279 Finnish nouns, a part of which is listed in Table 1. Most of them were from our empirical data set, but 89 words were translated into Finnish from the corpus of Snodgrass and Vanderwart.17

Table 1. A sample of the 279 nouns applied in the simulations.

aasi (donkey)           nenä (nose)
eskimo (Eskimo)         paita (shirt)
hauki (pike (fish))     peruna (potato)
huilu (flute)           pöllö (owl)
käärme (snake)          riikinkukko (peacock)
katto (roof, ceiling)   sipuli (onion)
kirsikka (cherry)       takki (coat)
kurpitsa (pumpkin)      tuuba (tuba)
leipä (bread)           viinilasi (wine glass)
lusikka (spoon)         vyö (belt)

2.1. Semantic encoding

In theory, the simplest way to code the semantics of words would be to apply local coding, where every word is coded in its own dimension.18 This would yield an input vector with a dimension of 279 for the current word set, which would lead to too many network nodes. Likewise, the so-called feature based coding of semantics19 would produce dozens of semantic features (dimensions of the input vectors) for a word set of 279 nouns, and therefore we had to seek another kind of coding scheme for the semantics. To overcome the semantic encoding problem, we developed a new distributed coding system using a hierarchical semantic tree and random vectors. The coding scheme is addressed at length in our previous work,16 and therefore we only review it briefly here.

A semantic tree is comprised of superordinate and subordinate terms that hierarchically define the categories into which words are sorted. The root of the tree is the superordinate term of all other nodes of the tree. The deeper in the tree a node is located, the narrower the corresponding concept is. Every word in our training data set was sorted into the node of the semantic tree that best defines its location in the tree, determined by the relation between the parent of a node and, on the other hand, its children. A section of the semantic tree is shown in Fig. 1.

Fig. 1. A segment of the semantic tree generated from the corpus used.

Based on the semantic tree, we implemented an algorithm that constructs the semantic coding by recursively applying the K-means clustering algorithm20 to a set of random input vectors. Because in theory the dimension of the input vectors for the algorithm can be chosen freely, very compact semantic representations can be produced. In our simulations we used three-dimensional random vectors.

The basic idea of the algorithm is as follows. First a set S of random vectors is assigned to the root of the semantic tree. This set forms the basis for the semantic representations in the whole tree. Next the semantic codes are formed for all the words located in the root; let us assume that there are w such words. To generate the semantic codes, the set S is clustered into w clusters with the K-means algorithm, and the final cluster centers produced by the K-means algorithm are used as the semantic codes of the words. After the codes for the words in the root have been created, the codes of the words in the root's children have to be formed. This is done recursively for each child of the root. Let us assume that the number of children of the root is c. To create an input set for each child, the set S is clustered again with the K-means algorithm into c subsets Si ⊂ S, where i ∈ {1, . . . , c}. The semantic codes of the words in the children can then be produced by applying the algorithm recursively to each child with the corresponding subset Si. As a result of the algorithm, each word in the semantic tree has been associated with a vector representation.
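The following Python sketch illustrates this recursive coding scheme. It is not the authors' original implementation (which was written in Matlab); the tree structure, the use of scikit-learn's KMeans and all names are illustrative assumptions.

# Illustrative sketch of the recursive semantic-coding algorithm; the tree
# structure, scikit-learn usage and all names are assumptions, not the
# authors' original (Matlab) implementation.
import numpy as np
from sklearn.cluster import KMeans

class Node:
    def __init__(self, words=(), children=()):
        self.words = list(words)        # words sorted directly into this node
        self.children = list(children)  # subordinate category nodes

def assign_codes(node, vectors, codes):
    """Recursively derive semantic codes from a pool of random vectors."""
    # Codes for the words of this node: cluster the pool into as many
    # clusters as there are words and use the cluster centers as codes.
    if node.words:
        km = KMeans(n_clusters=len(node.words), n_init=10).fit(vectors)
        for word, center in zip(node.words, km.cluster_centers_):
            codes[word] = center
    # Split the pool among the children and recurse, so that words in
    # nearby tree nodes get codes drawn from the same or nearby clusters.
    if node.children:
        km = KMeans(n_clusters=len(node.children), n_init=10).fit(vectors)
        for i, child in enumerate(node.children):
            assign_codes(child, vectors[km.labels_ == i], codes)

# Usage with three-dimensional random vectors, as in the paper
# (the toy tree below is hypothetical):
root = Node(words=["esine"], children=[Node(words=["aasi", "hauki"]),
                                       Node(words=["huilu", "tuuba"])])
codes = {}
assign_codes(root, np.random.rand(500, 3), codes)
print(codes["aasi"])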


The semantic codes of words located close to each other in the semantic tree should also lie close to each other, because the algorithm has picked the codes for those words from the same or a nearby cluster. However, it is obvious that the developed coding scheme is very coarse and limited, since it does not allow modeling, e.g., associative relations between words (like "dog → bone").

We analyzed the correlations between the semantic representations by calculating the average distance of each word's semantic representation to all other semantic representations. The minimum average distance was 0.52 and the maximum was 0.98. The mean of the average distances was 0.65 with a standard deviation of 0.08. Therefore, there exist some unwanted correlations between the semantic representations of the words (i.e. some words will on average be more error prone than others), which will affect the error distribution of the model. However, this effect is minor because of the small standard deviation.

2.2. Phonological encoding

The phonological coding was straightforward, because the Finnish language has a practically fully regular grapheme-phoneme correspondence. The phonemes can be classified on the basis of features such as voiced/voiceless and place and manner of articulation.21 As an additional simplification, the phonemes /c, q, w, x, z, å/, which occur in Finnish only in loan words or proper nouns, were excluded. Hence, 23 phonemes (or actually graphemes) suffice to encode all the relevant words in our study.

Thyme22 presented a coding system for the Finnish phonemes. We adopted that coding for the consonants (Table 2), but for the vowels we devised a novel scheme, since Thyme's coding does not allow us to model Finnish vowel harmony in a systematic way. According to Finnish vowel harmony, no simple word form (e.g. a non-compound word) is allowed to contain both front vowels /y, ö, ä/ and back vowels /u, o, a/.21 As vowel harmony is retained in normals' slips of the tongue and even in aphasic patients,6 we developed a method that prevents vowel harmony violations. Instead of front vowels and back vowels, we coded their archiphonemes /A, O, U/ and the neutral vowels /i, e/. Archiphoneme /A/ stands for the vowels /a, ä/, /O/ for /o, ö/, and /U/ for /u, y/.

Table 2. The coding scheme for the consonants adapted from Thyme.22

Phoneme   Voiced   Manner of articulation   Place of articulation
/b/       1        1 1                      1 1
/d/       1        1 1                      1 0
/f/       0        1 0                      1 1
/g/       1        1 1                      0 0
/h/       0        0 1                      0 0
/j/       1        0 1                      0 0
/k/       0        1 1                      0 0
/l/       1        0 1                      0 1
/m/       1        0 0                      1 1
/n/       1        0 0                      1 0
/p/       0        1 1                      1 1
/r/       1        0 1                      1 0
/s/       0        1 0                      1 0
/t/       0        1 1                      1 0
/v/       1        1 0                      1 1

Only the archiphonemes are taught to the vowel network. When it is known whether a target noun includes front vowels or back vowels, the correct surface form is selected during the simulation process according to the following rules. If a word includes front vowels, the archiphonemes /A, O, U/ are interpreted as /ä, ö, y/ and the neutral vowels /i, e/ are processed unchanged. If a word includes back vowels, /A, O, U/ are interpreted as /a, o, u/ and the neutral vowels are processed unchanged. If there are only neutral vowels, the archiphonemes /A, O, U/ generated by the vowel network are interpreted as the front vowels /ä, ö, y/ and /i, e/ are used as such. These rules guarantee that front vowels and back vowels cannot get mixed up with each other. For example, for the word "aasi" (donkey) the vowel network gives /AAsi/ and for the word "käärme" (snake) it gives /kAArme/. The resulting morphophonemes can be coded with two attributes: roundness and narrowness (see Table 3).

Table 3. The archiphoneme-based coding scheme for the vowels.

Archiphoneme or neutral vowel   Roundness   Narrowness
/A/                             0           0
/e/                             0           0.5
/i/                             0           1
/O/                             1           0.5
/U/                             1           1
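A minimal sketch of the surface-form rules described above is given below; it is a hypothetical Python rendering, not the authors' code, and the function and variable names are assumptions.

# Hypothetical sketch of the vowel-harmony surface rules described above
# (not the authors' code; names are illustrative).
FRONT = {"A": "ä", "O": "ö", "U": "y"}
BACK = {"A": "a", "O": "o", "U": "u"}

def surface_form(archi_output: str, target_word: str) -> str:
    """Rewrite the archiphonemes /A, O, U/ as front or back vowels."""
    has_front = any(v in target_word for v in "äöy")
    has_back = any(v in target_word for v in "aou")
    # Back vowels only if the target contains back vowels and no front
    # vowels; otherwise front vowels (this also covers targets that
    # contain only the neutral vowels /i, e/).
    mapping = BACK if (has_back and not has_front) else FRONT
    return "".join(mapping.get(ch, ch) for ch in archi_output)

print(surface_form("AAsi", "aasi"))      # -> aasi (donkey)
print(surface_form("kAArme", "käärme"))  # -> käärme (snake)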

To investigate the correlations between the consonants, we calculated the average distance of each consonant to all other consonants (see Table 4). The differences between the consonants' average distances are only slight, and therefore there should be no major unwanted correlations between the consonants that would affect the error distribution of the model.

Table 4. The average Euclidean distances between the consonants.

Phoneme   Average distance
/b/       1.46
/d/       1.39
/f/       1.62
/g/       1.5
/h/       1.63
/j/       1.55
/k/       1.56
/l/       1.64
/m/       1.63
/n/       1.57
/p/       1.52
/r/       1.46
/s/       1.57
/t/       1.45
/v/       1.56

As with the consonants, the correlations between the vowels were also analyzed by calculating the average distances of each vowel to all other vowels (see Table 5). The average distances in Table 5 show that /e/ correlates most with the other phonemes and is therefore more error prone in a noisy environment. For the same reason also /i/ and /O/ should be more error prone than /A/ and /U/.

Table 5. The average Euclidean distances between the archiphonemes.

Archiphoneme or neutral vowel   Average distance
/A/                             1.01
/e/                             0.78
/i/                             0.9
/O/                             0.93
/U/                             1.01

3. Architecture of the model

We constructed the model as a discrete two-stage system, shown in Fig. 2. The model contains a lexical-semantic network as the first stage and two separate phoneme networks (one for vowels and one for consonants) as the second stage. All neural networks of the model are autoassociative, i.e., their target output is identical to their input.

Fig. 2. The lexicalization model consisting of three neural networks: the lexical-semantic network and two separate phoneme networks.

The first stage of the model (the lexical-semantic network) receives an input word coded into numerical form. The output of the lexical-semantic network is decoded into a string of graphemes that are fed one by one to either the consonant network or the vowel network. The final output is obtained from them as a string of phonemes.

The lexical-semantic network has three input and output nodes, since three-dimensional input and output vectors are used in the simulations. Hence, only the number of hidden layers and especially the number of hidden nodes could be varied when optimizing the network topology with respect to the success of network training. The sum of squared errors was used as the cost function of the backpropagation learning algorithm. Network training was defined to be successful when the sum of squared errors produced by the network decreased below 0.005, because at that point the trained network made no errors when simulating the search for a lemma (the output was close enough to the input). During every training epoch the whole corpus of 279 nouns was presented to the network. Testing several network topologies led to an alternative with two hidden layers and 10 hidden nodes in each layer. This network structure had the best average training success rate when 100 network candidates of the same structure were trained with no more than 3000 iterations per network.
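As an illustration, the sketch below builds such an autoassociative 3-10-10-3 perceptron and checks the training criterion described above. This is an assumed reimplementation (the original model was programmed in Matlab); the choice of scikit-learn and all training settings are ours, not the paper's.

# Assumed reimplementation sketch of the autoassociative lexical-semantic
# network; scikit-learn's MLPRegressor and the settings below are
# illustrative choices, not the authors' Matlab code.
import numpy as np
from sklearn.neural_network import MLPRegressor

semantic_codes = np.random.rand(279, 3)   # stand-in for the real 3-d codes

# Two hidden layers of 10 nodes each, as selected in the paper.
net = MLPRegressor(hidden_layer_sizes=(10, 10), activation="logistic",
                   solver="sgd", learning_rate_init=0.1, max_iter=3000)
net.fit(semantic_codes, semantic_codes)   # autoassociative: target = input

# Training is considered successful when the sum of squared errors over the
# whole 279-word corpus drops below 0.005 (with the toy random data above
# this criterion may of course not be reached).
sse = np.sum((net.predict(semantic_codes) - semantic_codes) ** 2)
print("training successful:", sse < 0.005)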

A Neural Network Model for the Simulation of Word Production Errors of Finnish Nouns

The structure of the two phoneme networks followed that of the lexical-semantic network. Five input and output nodes are needed to code the consonants (see Table 2) and, correspondingly, two nodes to code the vowels (Table 3). Thus, we constructed a consonant network with one hidden layer of five hidden nodes and a vowel network with one hidden layer of two nodes.

The three multilayer perceptron networks of the model were computed in the usual way.23 However, to produce a "lesion" or disorder in the model, we inserted random noise (normally distributed pseudo random numbers from N(0, 1)), controlled by the parameters αL and αP, into the weights of the lexical-semantic and phoneme networks, respectively. Manipulation of these parameters enabled modeling of various degrees of severity of anomia. Moreover, to exclude outputs of the lexical-semantic network that were far away (in the sense of Euclidean distance) from any word in the lexicon, we applied a threshold value τ between the lexical-semantic and phoneme networks. For large threshold values the result was defined as undefined, corresponding to an omission (no response).

The effect of the threshold can be defined more formally as follows. Let o be an output of the lexical-semantic network and n the closest vector to o in the model's lexicon, and let d = ||o − n|| be the Euclidean distance between the vectors o and n. The thresholded output r of the lexical-semantic network is calculated as

  r = n            if d = 0,
  r = n            if d ≠ 0 and τ ≤ 1/(c·d),
  r = undefined    otherwise,                                   (1)

where c is a scaling factor; in the simulations we set c = 100. The effect of Eq. (1) is as described above: if the inverse of the scaled distance between the vectors o and n becomes smaller than the threshold τ, the output of the lexical-semantic network is discarded and an omission occurs.

In sum, the outcome of a network in the model is computed as usual in multilayer perceptrons, but with the possible addition of random noise to simulate aphasic mis-namings and of a threshold to yield omissions. The final output of a network was the closest vector in the target vector set to the output vector, unless the result was an omission. For the lexical-semantic network, a target vector was formed for each of the 279 nouns used in the simulations. For the phoneme networks, a target vector was formed for each phoneme used, either consonant or vowel. Therefore, the phoneme networks were executed as many times as there were phonemes in the outcome of the lexical-semantic network, i.e., successively one by one for each phoneme. The closeness of the output and target vectors was evaluated by Euclidean distance.
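A small sketch of the lesioning and thresholding step follows. It assumes that the noise parameters scale N(0, 1) perturbations of the weights; this interpretation, the data and all names are illustrative, not the authors' Matlab implementation.

# Illustrative sketch of weight "lesioning" and of the threshold rule in
# Eq. (1); the interpretation of alpha as a noise scaling factor and all
# names are assumptions.
import numpy as np

C = 100.0  # scaling factor c of Eq. (1)

def lesion(weight_matrices, alpha):
    """Add N(0,1) noise controlled by alpha to every weight matrix."""
    return [w + alpha * np.random.randn(*w.shape) for w in weight_matrices]

def threshold_output(o, lexicon, tau):
    """Eq. (1): return the nearest lexicon vector, or None for an omission."""
    dists = np.linalg.norm(lexicon - o, axis=1)
    i = int(np.argmin(dists))
    d = dists[i]
    if d == 0.0 or tau <= 1.0 / (C * d):
        return lexicon[i]   # correct word or a semantic error
    return None             # omission (no response)

# Usage with a toy 3-dimensional lexicon and a slightly perturbed output:
lexicon = np.random.rand(279, 3)
output = lexicon[0] + 0.01 * np.random.randn(3)
print(threshold_output(output, lexicon, tau=0.2))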

4. Properties of the model

The capability of the model to simulate word production errors was evaluated in two ways. First, we examined the influence of the simulation parameter selections on the error distributions (frequencies of different error types) of the model. Second, we explored the adaptation of the model to the aphasic patient data published by Laine et al.6 The networks in the model were trained with the backpropagation learning algorithm, with the sum of squared errors as the cost function, and the programs were implemented with the Matlab 6.0 software.

We tested the effect of the simulation parameters on the error distributions at three stages. First, we investigated the behavior of the lexical-semantic network. We then investigated the behavior of the two phoneme networks, and finally we analyzed the dynamics of the entire model.

4.1. Properties of the lexical-semantic network

The analysis of the lexical-semantic network was performed by regulating the random noise αL and the threshold τ. For this test we created ten instances of the lexical-semantic network, and each of them was tested ten times with different pairs of (αL, τ); altogether 100 tests were thus made, and every target word was used once in each test. The values of αL were increased in steps of 0.005 from 0 to 0.1, and those of τ in steps of 0.05 from 0 to 1. Both parameters thus received 21 values, so there were 441 parameter combinations and complete runs of the lexical-semantic networks in one test iteration (441 · 100 = 44,100 test iterations altogether).
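This grid evaluation can be pictured with a short sketch like the following; the helper evaluate_network and the averaging over instances are hypothetical stand-ins for one full run over the 279 words.

# Hypothetical sketch of the (alpha_L, tau) parameter sweep; the helper
# evaluate_network(instance, alpha_L, tau) is assumed to run one network
# instance over all 279 words and return its error-type percentages.
import numpy as np

alpha_values = np.arange(0.0, 0.1001, 0.005)   # 21 noise levels
tau_values = np.arange(0.0, 1.0001, 0.05)      # 21 threshold levels

def sweep(instances, evaluate_network):
    results = {}
    for alpha_L in alpha_values:
        for tau in tau_values:
            runs = [evaluate_network(inst, alpha_L, tau) for inst in instances]
            # Average correct/semantic/omission percentages over instances.
            results[(round(alpha_L, 3), round(tau, 2))] = np.mean(runs, axis=0)
    return results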


Fig. 3 shows the effect of the simulation parameters αL and τ on the accuracy of the lexical-semantic network. For the sake of clarity, only results for the intervals αL ∈ [0, 0.05] and τ ∈ [0, 0.5] are shown. If the threshold τ is equal to zero and only the random noise is tuned, the ratio of correct responses can be dropped well below 10 %; almost all errors are then semantic. When both parameters are raised at the same time, the ratio of correct responses rapidly collapses to zero. The main reason for this is the increasing noise, which makes the outputs of the network diverge from the targets, so that a larger threshold eliminates these responses as omissions. A third-degree curve is drawn in Fig. 3 to represent the boundary at which the ratio of correct responses is circa 2 %:

  αL = 0.463τ³ + 0.694τ² − 0.369τ + 0.078.

Fig. 3. Correct responses of the lexical-semantic network in percent as a function of the model parameters αL and τ. The third-degree curve represents the boundary at which the ratio of correct responses is circa 2 %.

We present the standard deviations of the correct responses as contour lines in Fig. 4. The greatest standard deviation was found in the random noise interval αL ∈ [0.005, 0.015], where different network instances behaved in different ways. When the noise exceeds 0.015, the differences between the network instances smooth out because of the increased randomness in the weight values of the neural network. When the noise is equal to zero, an increase of the threshold τ has no influence on the standard deviations, since the network produces an omission only if it is noisy. If both noise and threshold are tuned, the standard deviations increase to their maximum in the area αL ∈ [0.005, 0.01] and τ ∈ [0, 0.3]. Thereafter, for increasing parameter values the standard deviations decrease to zero (no correct responses) because of the growing number of omissions. The maximum standard deviation is approximately 26 % for αL = 0.005 and τ = 0.20.

Fig. 4. Standard deviation of the correct responses in percent as contour lines for the model parameters αL and τ.

Fig. 5. Semantic errors of the lexical-semantic network in percent as a function of the model parameters αL and τ.

The results in Fig. 5 present the semantic errors in percent. When the noise is equal to zero, no semantic errors occur, since the threshold only produces omissions. If the threshold is equal to zero and the noise is raised, the ratio of semantic errors increases up to 98 %. The network is sensitive to noise, and already αL = 0.01 gives about 55 % semantic errors.


When the threshold value is increased, the number of semantic errors diminishes, while the number of omissions increases. If the threshold value is above 0.25, semantic errors disappear, because the threshold then discards the responses distorted by noise, resulting in omissions.

Fig. 6. Standard deviation of the semantic errors in percent as contour lines for the model parameters αL and τ.

Contour lines are again applied to show the standard deviation of the semantic errors in Fig. 6. There is no standard deviation when the noise αL = 0, because there are then no semantic errors. When the threshold value is equal to zero, the standard deviations increase with the noise up to 0.020. Thereafter, the standard deviations decrease, because increasing noise smooths out the differences between the network instances. If the threshold value is increased, the standard deviations decrease rapidly because of the growing number of omissions. Nevertheless, when the noise is less than 0.015, the change is slower, since in this situation some of the network instances tolerate the influence of thresholding better. When the threshold value is above 0.15, the standard deviations disappear. The maximum standard deviation is about 20 % for αL = 0.01 and τ = 0.

Next, we consider the ratio of omissions in Fig. 7. Of course, if the threshold is equal to zero, there are no omissions. Interestingly, some noise is also necessary to generate omissions at all, although even without noise the ratio of omissions could naturally be raised up to 100 % if threshold values above one were applied. When both threshold and noise are increased, the ratio of omissions rapidly grows close to 100 %.

Fig. 7. Omissions of the lexical-semantic network in percent as a function of the parameters αL and τ.

Fig. 8. Standard deviation of the omissions in percent as contour lines for the parameters αL and τ.

The standard deviations of the omissions are shown as contour lines in Fig. 8. Since thresholding is necessary for omissions, there is of course no standard deviation for a threshold value of zero. If there is no noise, the standard deviations are very small regardless of the threshold value. The standard deviations rise strongly when the threshold is just above 0.1 and the noise is above zero. The topmost area lies at αL ∈ [0.005, 0.015] and τ ∈ [0.1, 0.4], because the noise polarizes the effect of the threshold τ across the different instances of the model. For greater parameter values semantic errors disappear and omissions become more frequent. Ultimately, with still larger values only omissions occur and, as a result, the standard deviation disappears.


The maximum standard deviation is about 23 % for αL = 0.005 and τ = 0.2.

On the basis of the preceding analysis, we can draw conclusions on parameter selections for the average behavior of the lexical-semantic network. Observing the magnitude of the standard deviations for large noise and threshold values is particularly important, because we can classify the instances of the lexical-semantic network according to their noise tolerance. Furthermore, the error distributions of the lexical-semantic network suggest that it is suitable for the modeling of aphasic patient data.

4.2. Properties of the phoneme networks

Next, we studied the properties of the phoneme networks by regulating the random noise parameter αP. In the test we used six noise values, i.e., αP ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. Again, ten instances were created for both phoneme networks and every phoneme was tested 100 times for each noise value, resulting in a total of 1000 test runs per noise value. The results are listed in Tables 6 and 7. A noisy consonant network behaved quite regularly: the three most error-prone consonants were /g/, /b/ and /r/. These consonants also had a smaller average distance to the other consonants than the consonants had on average (see Table 4), which explains their error proneness. The vowel network was more unstable. Particularly /e/ tended to produce errors. The reason for this is again the smallest average distance between /e/ and the other vowels (see Table 5), which renders it more sensitive to the noise than the other vowels.

Although errors in single phonemes are rather improbable, the phoneme networks yield more word-level phoneme errors than the single-phoneme error rates would suggest. This follows from the fact that the networks produce the phonemes of a word sequentially: the probability of producing a word correctly at a given noise value is equal to the product of the probabilities of its phonemes. This is illustrated by the word 'vyö' (belt) at αP = 0.40, whose total probability, when the phoneme probabilities are taken from Tables 6 and 7, is

  P(/vyö/) = P(/v/) · P(/y/) · P(/ö/) = 0.939 · 0.961 · 0.883 ≈ 0.797.
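The calculation can be written out as a small sketch; it is illustrative only and simply encodes the independence assumption stated above.

# Illustrative sketch: under the independence assumption stated above, the
# probability of producing a whole word correctly is the product of its
# phoneme probabilities.
def word_correct_probability(phoneme_probs):
    p = 1.0
    for q in phoneme_probs:
        p *= q
    return p

# 'vyö' at noise level 0.40, probabilities read from Tables 6 and 7:
p = word_correct_probability([0.939, 0.961, 0.883])
print(round(p, 3), round(1 - p, 3))   # ~0.797 correct, ~0.203 error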

Table 6. The effect of the noise αP on the correct outputs of the consonant network. The results are based on 1000 test runs for each noise value.

          Correct consonants % at αP
Phoneme   0.00    0.20    0.40    0.60    0.80    1.00
/b/       100.0   98.8    92.2    80.7    72.3    63.5
/d/       100.0   100.0   97.7    92.1    82.2    73.6
/f/       100.0   99.3    91.9    81.2    72.8    69.6
/g/       100.0   97.7    90.4    77.5    63.0    59.9
/h/       100.0   100.0   99.7    97.6    95.4    89.5
/j/       100.0   100.0   99.7    97.5    93.8    89.1
/k/       100.0   100.0   99.4    97.8    95.1    89.4
/l/       100.0   99.5    95.3    89.3    82.1    74.2
/m/       100.0   99.9    99.0    96.2    90.2    88.6
/n/       100.0   99.8    98.5    94.9    89.4    82.1
/p/       100.0   98.9    94.3    88.4    83.7    74.4
/r/       100.0   99.7    94.0    84.8    75.4    67.7
/s/       100.0   100.0   99.4    95.0    89.7    82.7
/t/       100.0   99.7    92.3    85.6    76.7    70.8
/v/       100.0   99.4    93.9    82.8    76.0    68.7

Table 7. The effect of the noise αP on the correct outputs of the vowel network. The results are based on 1000 test runs for each noise value.

               Correct vowels % at αP
Archiphoneme   0.00    0.20    0.40    0.60    0.80    1.00
/A/            100.0   100.0   99.5    98.4    93.6    89.6
/e/            100.0   97.2    79.3    62.2    49.6    42.4
/i/            100.0   100.0   98.6    93.1    85.4    77.8
/O/            100.0   98.9    88.3    75.7    73.8    65.7
/U/            100.0   100.0   96.1    87.0    76.9    69.4

Consequently, 'vyö' is quite error-prone at noise level αP = 0.40, since the probability of an error is about 1 − 0.797 = 0.203. Likewise, 'hampurilainen' (hamburger) obtains with the same noise value a probability of 0.593, giving 0.407 for an error. Generally, longer words receive errors more frequently than short ones because of the independent sequential processing of the phonemes. Although in certain types of aphasia word length is correlated with the occurrence of errors in phonological output, our model does not really model this effect, since the statistical independence that our model assumes in phoneme production is a much stronger relationship than the mere statistical correlation found in the patient data.


4.3. Properties of the whole model

Thirdly, we examined the behavior of the whole model while tuning the noise αL of the lexical-semantic network and the noise αP of the phoneme networks. To reduce running times, the threshold was discarded (τ = 0), but its influence is noted later on. We tested ten instances of the networks ten times for every pair (αL, αP), which equals 100 test runs with the whole word set for each (αL, αP) pair. The noise αL of the lexical-semantic network took the values αL ∈ {0.000625, 0.00125, 0.0025, 0.005, 0.01, 0.03, 0.05} and the noise αP of the phoneme networks the values αP ∈ {0.0, 0.2, 0.4, 0.6, 0.8, 1.0}. Thus, there were 42 simulation parameter combinations, all of which were used in each test run.

Table 8. Error distributions in percent of the model with the two smallest and the greatest value of the noise parameter αL (with τ = 0). The results are based on 100 runs of the whole word set for each parameter value pair (N = 27,900).

αP     Correct   Semantic errors   Phonological errors

αL = 0.000625, τ = 0
0.0    98.9      1.1               0.0
0.2    91.4      0.8               7.8
0.4    66.5      0.5               32.9
0.6    43.8      0.4               55.8
0.8    28.5      0.3               71.1
1.0    18.9      0.2               80.9

αL = 0.00125, τ = 0
0.0    95.8      4.2               0.0
0.2    88.6      3.7               7.7
0.4    64.8      2.3               32.9
0.6    42.6      1.5               55.8
0.8    27.9      1.0               71.0
1.0    18.1      0.7               81.2

···

αL = 0.05, τ = 0
0.0    12.4      87.6              0.0
0.2    11.7      81.0              7.2
0.4    8.7       58.2              33.1
0.6    5.6       37.6              56.8
0.8    3.9       24.3              71.9
1.0    2.3       16.2              81.5

The results are partially given in Table 8, which contains the results for the two smallest values of αL and for the greatest value of αL. Since we did not apply the threshold and the phoneme processing does not interact with the semantic processing, the only factor affecting phoneme errors is the noise αP; therefore, the distributions of the phoneme errors are quite independent of αL. If large noise values αL are used, most phoneme errors arise after semantic errors, which yields semantic-phonological errors, but this was not called for in the simulations against the aphasic data.

The effect of the threshold can be estimated on the basis of Table 8, because it principally changes other response types into omissions. For example, if we adopt αL = 0.005 and αP = 0.20, then on average 72.3 % of the responses of the network are correct, 20.2 % are semantic errors, and 7.5 % are phonological errors. If a threshold were used to produce 20 % omissions, approximately two thirds of these omissions would come from the correct responses, one fifth from the semantic errors, and the remaining cases from the phonological errors.

5. Adaptation to patient data

5.1. Adaptation algorithm

We prepared an algorithm for the adaptation of the model to patient data. To enable entirely automatic classification of the errors, we applied a slightly simplified categorization of the error types into the three following classes. An outcome was an omission if the lexical-semantic network gave no response to an input. It was a semantic error if the lexical-semantic network generated a word that differed from the original input. An output was a phonological error if the phoneme networks output a result that differed from the output of the lexical-semantic network. If the model first gave a semantic error and a phonological error then followed, the outcome was simply classified as a phonological error.

Let the semantic noise of the lexical-semantic network be αL(n), its threshold τ(n), and the phonological noise of the phoneme networks αP(n) at adaptation iteration n. Moreover, let ds, dp and do be the numbers of semantic errors, phonological errors and omissions of the patient d, and ys(n), yp(n) and yo(n) the corresponding numbers of the simulation y.


The error values es, ep and eo between the distributions of the patient and simulation data are then defined as their differences. Finally, a constant t is used to calculate the learning rate (coefficient) η in the adaptation algorithm (Algorithm 1).

Algorithm 1. Adapt the model to the patient data.
Input: Initial learning rate η0, time parameter t > 0 used to calculate the learning rate η, initial model parameters αL(0), τ(0) and αP(0), and target quantity value χ²target.
Output: A naming distribution computed with the model.
 1: begin
 2:   n ← 0, χ² ← ∞.
 3:   while χ² > χ²target do
 4:     Evaluate the model with the parameter values αL(n), τ(n) and αP(n).
 5:     {Compute the errors:}
 6:     es ← ds − ys(n).
 7:     ep ← dp − yp(n).
 8:     eo ← do − yo(n).
 9:     Let χ² be the test value between the distributions of the patient data (ds, dp, and do) and the simulated data (ys(n), yp(n), and yo(n)).
10:     if χ² > χ²target then
11:       η ← η0/(1 + n/t).
12:       αL(n + 1) ← αL(n) + η · es.
13:       τ(n + 1) ← τ(n) + η · eo.
14:       αP(n + 1) ← αP(n) + η · ep.
15:     fi
16:     n ← n + 1.
17:   od
18: end
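A minimal Python sketch of this adaptation loop is given below. It is not the authors' Matlab code: run_model is an assumed stand-in that evaluates the lesioned model over the word set and returns the error counts (or proportions; the update scale below is an assumption), and the χ² value follows Eq. (2) defined further on.

# Hypothetical sketch of Algorithm 1; run_model is an assumed helper that
# evaluates the lesioned model and returns the (semantic, phonological,
# omission) error counts of the simulation.
def chi_square(observed, expected):
    """Chi-square test value of Eq. (2) over the error categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)

def adapt(run_model, patient, eta0=0.5, t=5.0, chi2_target=3.0,
          alpha_L=0.005, tau=0.2, alpha_P=0.2):
    """patient = (d_s, d_p, d_o): semantic, phonological, omission counts."""
    d_s, d_p, d_o = patient
    n = 0
    while True:
        y_s, y_p, y_o = run_model(alpha_L, tau, alpha_P)
        chi2 = chi_square((y_s, y_p, y_o), (d_s, d_p, d_o))
        if chi2 <= chi2_target:
            return alpha_L, tau, alpha_P
        eta = eta0 / (1.0 + n / t)      # decaying learning rate
        alpha_L += eta * (d_s - y_s)    # semantic errors drive alpha_L
        tau += eta * (d_o - y_o)        # omissions drive the threshold
        alpha_P += eta * (d_p - y_p)    # phonological errors drive alpha_P
        n += 1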

The adaptation algorithm first establishes an error distribution for given starting values of the model parameters. Thereafter, it computes the differences between the patient data and the outputs of the model, and updates the model parameters according to these differences. The algorithm is run until the distributions of the patient data and the model outputs are sufficiently similar. The adaptation can thus be seen as a supervised machine learning task. The similarity of the patient data and the simulated data is measured with the statistical χ² test, which is the cost function of the adaptation algorithm and is calculated with Eq. (2):

  χ² = Σ_{i=1}^{k} (Oi − Ei)² / Ei,                             (2)

in which Oi (observed) is the observed frequency of category i, Ei (expected) is the expected frequency of category i, and k is the number of categories.

5.2. Patient data simulations

To test the model's actual capability to simulate the naming error patterns of aphasic patients, we evaluated the model with the data of three anomic aphasics, two Broca's aphasics, two conduction aphasics, and three Wernicke's aphasics reported by Laine et al.6 Because there are slight variations between the weight values of the network instances and other randomness present in the model, we trained eleven instances of the model for each patient with different random starting weights. Every model instance was adapted once to the data of each patient. During one adaptation iteration the model generated an output for each of its input words five times for the simulated patient. Thus 1395 words (5 · 279) were processed, which was enough to smooth the results of the noisy model. The differences between the patient data and the simulated data were tested with the χ² test, and the results presented here correspond to the instances with the median χ² test values. We set χ²target = 3.0, which helps to eliminate overly exceptional model distributions; in fact, this ensures that there are no statistically significant differences between the patient data and the model outputs. A smaller value of χ²target would markedly increase the running time of the adaptation algorithm. For the adaptation algorithm we set η0 = 0.5 and t = 5.

Table 9 shows that for the two Broca's aphasics omissions were their main problem. Omissions were the primary error category also for the three anomic aphasics (see Table 9). Phonological errors were the most frequent type for the conduction aphasics. With regard to the Wernicke's aphasics, the prominent error types were phonological errors and omissions; patient W3 had the highest rate of omissions, while W2 also produced quite a few semantic errors.


As mentioned above, the adaptation algorithm applied the χ² test to guarantee that there is no statistically significant difference between the patient data and the simulations.

Table 9. Actual picture naming error patterns of individual aphasic patients (P) and corresponding averaged simulation results (M) in percent. B1 and B2 are cases with Broca's aphasia, A1, A2 and A3 with anomic aphasia, C1 and C2 with conduction aphasia, and W1, W2 and W3 with Wernicke's aphasia. The number of words used in the simulations was 1395.

          Correct %      Semantic %     Omission %     Phonological %
Patient   P      M       P      M       P      M       P      M
B1        77.1   77.5    3.0    2.5     19.3   19.4    0.6    0.6
B2        87.3   87.7    3.0    2.7     8.4    8.7     1.2    0.9
A1        51.7   53.3    4.2    4.4     44.1   42.4    0.0    0.0
A2        77.1   76.5    8.4    8.5     13.9   14.1    0.6    0.9
A3        39.2   39.2    0.6    0.4     59.6   59.8    0.6    0.6
C1        65.7   66.4    3.6    3.2     7.2    7.6     23.5   22.8
C2        84.3   85.4    0.6    0.6     4.8    4.0     10.2   9.9
W1        54.8   53.8    4.8    5.1     20.5   20.6    19.9   20.6
W2        42.9   44.2    13.9   13.9    21.1   21.0    22.2   20.9
W3        36.7   36.7    0.6    0.6     37.3   39.0    25.3   23.7

To compare the patient case simulations, we calculated medians of the simulation parameters over all instances of the model for each patient (see Table 10).

Table 10. The medians of the simulation parameters αL, τ, and αP.

Patient   αL       τ        αP
B1        0.0036   0.2247   0.1415
B2        0.0032   0.1910   0.1361
A1        0.0051   0.2434   0.0000
A2        0.0051   0.1525   0.1469
A3        0.0030   0.4247   0.1666
C1        0.0039   0.1584   0.3381
C2        0.0020   0.2498   0.2844
W1        0.0047   0.1814   0.3442
W2        0.0078   0.1211   0.3526
W3        0.0026   0.3351   0.4206

In Fig. 9 the ten simulated patient cases and an artificially constructed 'median case' patient M are plotted with respect to the medians of their simulation parameters from Table 10. The parameters of the 'median case' patient correspond to the medians of the simulation parameters. Patients B1, B2, A2 and C2, in this order, are the closest to the 'median case'.


Fig. 9. The patient cases explored with the model parameters αL , αP and τ . The Broca’s aphasics are marked with crosses, the anomic aphasics with diamonds, the conduction aphasics with circles, and the Wernicke’s aphasics with squares. M is the artificially constructed ‘median case’.

Finally, we calculated medians and ranges over all instances and patients. For the semantic noise αL the median was 0.0037 and the range [0.0003, 0.0171]; for the threshold value τ the median was 0.20 and the range [0.12, 0.49]; and for the phonological noise αP the median was 0.19 and the range [0, 0.55]. Note that the ranges of the phonological noise and the threshold were relatively large, whereas the range of the semantic noise was small. This shows that the lexical-semantic network is more sensitive to noise than the phoneme networks.

6. Concluding remarks

The present paper describes a perceptron-based neural network model for simulating the naming errors of Finnish-speaking aphasic patients. The critical difference between the developed model and previous models of lexicalization is its capability to learn the semantic and phonological representations of the words in a given input data corpus. Therefore the model could be used to simulate, e.g., word learning in children and the re-acquisition of naming ability after brain damage; these features cannot be modelled with non-learning models of lexicalization. The performed analysis showed that the model provided an equally good fit to the aphasic data as our earlier non-learning model did.6 In the future, we shall extend our research to more patient cases and add a more detailed error categorization.


We need more patient data to specify patient subclasses according to the model parameter values, as in Fig. 9. A more specific error classification would give further information on the model's actual ability to simulate the presented patient data. It would also be interesting to investigate whether the model's parameter values could be used to identify the different aphasia types of the patients. Naturally, the learnability feature of the model calls for word learning and rehabilitation studies.

In the meantime, one should not forget other types of neural network architectures that are valuable for the modeling of lexicalization. Kohonen's networks (self-organizing maps) might be suitable especially for the modeling of receptive language skills, because they are known to be effective in various classification and pattern recognition tasks. In addition, as stated earlier, they have successfully been applied to the modeling of dyslexic patients.15 In theory, it would also be possible to extend the model to words other than nouns and to forms other than the basic form. For example, in Finnish there are about 15 cases plus additional suffixes, yielding about 2000 possible forms for each noun. With regard to clinical work, an interesting possibility could be to 'turn everything upside down', i.e. to develop a pattern recognition algorithm that recognizes and classifies the utterances of patients into various error types and would thus be an intelligent software tool for neuropsychologists and speech therapists.

Acknowledgments

The first author wishes to thank the Academy of Finland for financial support (project # 78676).

References

1. D. Chilant, A. Costa and A. Caramazza, "Models of naming," in: A. E. Hillis, ed., The Handbook of Adult Language Disorders, 123–142, Psychology Press, New York (2002).
2. M. Goldrick and B. Rapp, "A restricted interaction account (RIA) of spoken word production: the best of both worlds," Aphasiology, 16, 20–55 (2002).
3. T. Harley, The Psychology of Language: From Data to Theory, Psychology Press, New York (2001).
4. B. Rapp and M. Goldrick, "Discreteness and interactivity in spoken word production," Psychol. Rev., 107, 460–499 (2000).

5. G. S. Dell, M. F. Schwartz, N. Martin, E. M. Saffran and D. A. Gagnon, "Lexical access in aphasic and nonaphasic speakers," Psychol. Rev., 104, 801–838 (1997).
6. M. Laine, A. Tikkala and M. Juhola, "Modeling anomia by the discrete two-stage word production model," J. Neurolinguist., 11, 275–294 (1998).
7. W. J. M. Levelt, A. Roelofs and A. S. Meyer, "A theory of lexical access in speech production," Behav. Brain Sci., 22, 1–38 (1999).
8. D. Foygel and G. S. Dell, "Models of impaired lexical access in speech production," J. Mem. Lang., 43, 182–216 (2000).
9. W. Ruml and A. Caramazza, "An evaluation of a computational model of lexical access: comments on Dell et al. (1997)," Psychol. Rev., 107, 609–634 (2000).
10. G. S. Dell, "A spreading-activation theory of retrieval in sentence production," Psychol. Rev., 93, 283–321 (1986).
11. M. Juhola, A. Vauhkonen and M. Laine, "Simulation of aphasic naming errors in Finnish language with neural networks," Neural Networks, 8, 1–9 (1995).
12. A. Tikkala and M. Juhola, "A neural network simulation method of aphasic errors: properties and behaviour," Neural Comput. Appl., 3, 192–201 (1995).
13. A. Tikkala and M. Juhola, "A neural network simulation of aphasic naming errors: network dynamics and control," Neurocomputing, 13, 11–29 (1996).
14. A. Vauhkonen and M. Juhola, "Convergence of a spreading activation neural network with application of simulating aphasic naming errors in Finnish language," Neurocomputing, 30, 323–332 (2000).
15. R. Miikkulainen, "Dyslexic and category-specific aphasic impairments in a self-organizing feature map model of the lexicon," Brain Lang., 59, 334–366 (1997).
16. A. Järvelin, M. Juhola and M. Laine, "Neural network modelling of word production in Finnish: coding semantic and non-semantic features," Neural Comput. Appl., in press.
17. J. G. Snodgrass and M. Vanderwart, "A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity and visual complexity," J. Exp. Psychol. Human Learning Memory, 6, 174–215 (1980).
18. K. Swingler, Applying Neural Networks: A Practical Guide, Academic Press, San Diego, CA (1996).
19. G. E. Hinton and T. Shallice, "Lesioning an attractor network: investigations of acquired dyslexia," Psychol. Rev., 98, 74–95 (1991).
20. D. Hand, H. Mannila and P. Smyth, Principles of Data Mining, MIT Press, Cambridge, MA (2001).
21. F. Karlsson, Suomen kielen äänne- ja muotorakenne (Finnish Phonology and Morphology), WSOY, Juva, Finland (1983).
22. A. E. Thyme, "A connectionist approach to nominal inflection: paradigm patterning and analogy in Finnish," Ph.D. Thesis, University of California, San Diego (1993).
23. S. Haykin, Neural Networks: A Comprehensive Foundation, Prentice-Hall, London (1999).
