Diagnosis of Aphasia Using Neural and Fuzzy Techniques

Jan Jantzen, Hubertus Axer* and Diedrich Graf von Keyserlingk*

Technical University of Denmark, Dept. of Automation, bldg 326, DK-2800 Kongens Lyngby, Denmark. Phone: +45-4525-3561, Fax: +45-4588-1295, email: [email protected]

*RWTH Aachen, Dept. of Anatomy I, Pauwelsstr. 30, D-52057 Aachen, Germany. Phone: +49-241-8089100, Fax: +49-241-8888431, email: {hubertus, keyser}@cajal.medizin.rwth-aachen.de

ABSTRACT: The language disability aphasia has several sub-diagnoses such as amnestic, Broca, global, and Wernicke aphasia. Data concerning 265 patients is available in the form of test scores and diagnoses made by physicians according to the Aachen Aphasia Test. A neural network model has been built, which is available for consultation on the World Wide Web. In this paper the neural network model is compared with a fuzzy model. Rather than concluding which method provides the best approximation, the paper acts as an example solution useful for other benchmark studies.

KEYWORDS: Amnestic, Anomic, Broca, Wernicke, AAT, perceptron, MLP, FCM

INTRODUCTION

Aphasia is the inability to use or comprehend words, usually resulting from brain damage. For example, if a patient uses wrong words and empty phrases, it can be a sign of a moderate to severe aphasia. Other signs can be difficulties with communication (utterances) or articulation (melody, rhythm). Damage to brain regions related to language causes aphasia (Figure 1). The damage can be due to a stroke, a head injury, or a cerebral tumour. There are several types of aphasia, and a skilled clinician can diagnose which type a patient has.

Figure 1: Magnetic resonance (MR) image of a human head, taken frontally through the ears, nose pointing out of the paper plane. The markers (upper right) indicate the location of a lesion (Broca).

This paper is a result of the ERUDIT network of excellence, a European network for fuzzy and intelligent technologies (www.erudit.de). The so-called Human, Medical and Healthcare committee supplied the data and domain knowledge while the Training and Education committee built a model. The dual objective is i) to teach medical students about soft computing, and ii) to teach engineering students about a medical application of soft computing. A companion paper describes the aphasia problem from the medical point of view (Axer, Jantzen, Berks, Südfeld, and Keyserlingk, 2000).

Medical researchers have developed different models of aphasia for more than 100 years, the most outstanding by Wernicke in 1874. Nowadays the brain is modelled as vast neuronal networks, some of which are responsible for language functions. Although damage to different brain regions causes different types of aphasia, single functions, such as the expression and comprehension of speech, cannot be assigned to specific locations.

Moreover, not only the deficiency of the damaged brain region can affect the aphasic symptoms, but also compensatory reactions of the intact brain tissue. Thus, the cerebral location of aphasic syndromes is rather indistinct, contrary to the classical anatomical concept (Willmes and Poeck, 1993), although some coarse patterns of typical brain lesions can be distinguished.

A physician makes the clinical diagnosis of the aphasia type in a free interview. The aphasic syndromes can be described verbally, for example:

Broca's aphasia. Non-fluent speech with laboured, slow and impaired articulation; telegram style.
Wernicke's aphasia. Fluent speech, but almost meaningless sentences and words.
Global aphasia. Severe disturbance; often communication is impossible.
Anomic (= amnestic) aphasia. Difficulties with finding words.
Conduction aphasia. The patient has difficulties repeating words spoken to him.

To compare the severity of aphasic syndromes of different patients and to evaluate the progression over time, it is necessary to quantify the defects and standardise the examinations. Major language tests in English speaking countries are the Western Aphasia Battery (WAB) and the Boston Diagnostic Aphasia Examination (BDAE). In German speaking countries, the Aachen Aphasia Test (AAT) is the commonly used test battery. The latter, the AAT (Huber, Poeck, and Weniger, 1983; 1984), consisting of six sub-tests, is the source of our data (Table I). The first sub-test is an assessment of spontaneous speech. The second sub-test, the token test, is a general test of comprehension of language. The patient has to choose the right token out of a set of tokens different in shape, colour or size. The token test has five sub-tests of increasing levels of difficulty. The third sub-test is a test of repetition; the patient has to repeat different sounds, words or sentences. The fourth sub-test is a written language test, an evaluation of reading and writing functions. The fifth sub-test, the confrontation naming test, is an assessment of the ability to describe things, situations or actions with adequate words. The last test, the comprehension test, assesses the patient's understanding of words or sentences. Each test results in a numerical score within the ranges given in Table I.

Code      Test                                                                        Range of score
          Spontaneous speech
P0          Communicative behaviour                                                   0 – 5 [points]
P1          Articulation and prosody (melody of speech)                               0 – 5 [points]
P2          Automatised language (e.g. stereotypes, automatisms)                      0 – 5 [points]
P3          Semantic structure (e.g. verbal paraphasias, word retrieval difficulties) 0 – 5 [points]
P4          Phonologic structure (e.g. literal paraphasias, neologisms)               0 – 5 [points]
P5          Syntactic structure (structure of sentences, grammar)                     0 – 5 [points]
T0        Token test                                                                  0 – 100 [%]
T1 – T5     Token sub-tests                                                           0 – 10 [points]
N0        Repetition                                                                  0 – 100 [%]
N1          Single phonemes                                                           0 – 30 [points]
N2          Monosyllabic nouns                                                        0 – 30 [points]
N3          Loan and foreign words                                                    0 – 30 [points]
N4          Compound words                                                            0 – 30 [points]
N5          Sentences                                                                 0 – 30 [points]
C0        Written language                                                            0 – 100 [%]
C1          Reading aloud                                                             0 – 30 [points]
C2          Selecting/combining on dictation                                          0 – 30 [points]
C3          Writing on dictation                                                      0 – 30 [points]
B0        Confrontation naming                                                        0 – 100 [%]
B1          Nouns                                                                     0 – 30 [points]
B2          Colour terms                                                              0 – 30 [points]
B3          Compound nouns                                                            0 – 30 [points]
B4          Sentences                                                                 0 – 30 [points]
V0        Comprehension                                                               0 – 100 [%]
V1          Auditory for words and sentences                                          0 – 60 [points]
V2          Reading for words and sentences                                           0 – 60 [points]

Table I: All AAT sub-tests, 30 scores in total.

Figure 2: The aphasia database is available on the Web. Columns outside of the window area include, for each patient, the test scores of Table I.

A database of AAT profiles, collected by the Department of Anatomy since 1986, contains 265 aphasic patients (Figure 2). The database contains an expert's diagnosis of aphasia type, disease, and the AAT scores. We have omitted personal information, such as name, age, and sex, in order to protect the patients' privacy. The anatomical lesions of 146 of those patients were analysed with computed tomography and standardised (Keyserlingk, Naujokat, Niemann, Huber and Thron, 1997; Niemann, Keyserlingk and Wasel, 1988). This way the lesion data, independently of individual brain size and technical settings of the imaging equipment, could be projected onto 16 equidistant magnetic resonance imaging slices of the brain of a healthy volunteer. The aphasia database resides on the so-called TEDServer, the training and education server for fuzzy and intelligent technologies (http://fuzzy.iau.dtu.dk/aphasia.nsf).

Ideally, we aim at a model that can learn from the aphasia database how to make the right diagnosis (Figure 3). The inputs are the numerical scores from the AAT, and the output is a diagnosis (e.g., Broca or Wernicke).

Figure 3: An automated classifier (AAT scores in, aphasia type out) should decide on the correct aphasic syndrome to an acceptable degree of accuracy.

As mentioned earlier, the location of aphasic syndromes is rather indistinct. On the other hand, the aphasia database contains related symptoms, diagnoses, and regions. The problem is therefore to model approximately the relationship between symptoms and diagnoses in the database. Since there are 30 test scores for each patient, an additional problem is to select a smaller, yet adequate set of AAT scores for training the model. The model output should accommodate a blend of aphasic syndromes for a given input, since the localisation of aphasic syndromes is indistinct and a lesion can cover several regions of the brain. The work plan has therefore been:

1. To investigate the data by inspection.
2. To try a fuzzy approach.
3. To compare with a neural network approach.

The purpose of the first step is to acquire insight and domain knowledge. The second step is to see whether fuzzy logic can accommodate the indistinct nature of the problem. The third step provides a reference model for comparison.

DATA ANALYSIS BY INSPECTION

In principle, we wish to see and compare cases in the database that are similar. Assuming that patients with similar syndromes receive similar AAT scores, our first step towards revealing clusters in the data is simply to plot them. In order to compare scores, the data are scaled using the limits of the ranges in Table I, such that all scaled numbers are in the interval [0, 1]. It would now seem logical that each diagnosis has a characteristic profile with respect to the test scores in the sense that, say, a Broca aphasia shows large scores in tests x, y, z, while global aphasia shows large scores in tests p, q, r. If the data are consistent, i.e., we trust the expert's diagnosis, we should be able to find characteristic peaks and valleys in the profiles by taking the mean value of the scores within a diagnosis group (a short code sketch of this scaling and averaging follows the list below). If the patients with Broca aphasia consistently score high values for x, y, z, then the mean values would also be high. Figure 4 shows a characteristic plot for each of the diagnoses: conduction (L), anomic (A), Wernicke (W), Broca (B), and global (G). We decided to disregard patients with the diagnoses 0, N, R and T in the database for the following reasons.

• 0 means that there was no aphasia at all; not interesting.
• N means Not classified. The expert was not able to characterise the aphasia type as one of the classical ones (G, B, W, A). Thus, the group is heterogeneous.
• R means Restaphasie: the patient had aphasia in the past, but the symptoms have improved to nearly normal language skills. This group is also heterogeneous.
• T means Transcortical aphasia. This diagnosis does not exist in English speaking countries. It is rare, and the theoretical background of this type is inconsistent.
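The scaling and per-diagnosis averaging described above can be sketched as follows. This is a minimal illustration only, assuming the scores sit in a pandas DataFrame with the column codes of Table I plus a "diagnosis" column; the actual database layout and file format are not reproduced here.

```python
import pandas as pd

# Upper limits of the AAT score ranges from Table I (lower limits are all 0).
# Assumed column names; the real database layout may differ.
UPPER = {
    **{f"P{i}": 5 for i in range(6)},                   # spontaneous speech, 0-5
    "T0": 100, **{f"T{i}": 10 for i in range(1, 6)},
    "N0": 100, **{f"N{i}": 30 for i in range(1, 6)},
    "C0": 100, **{f"C{i}": 30 for i in range(1, 4)},
    "B0": 100, **{f"B{i}": 30 for i in range(1, 5)},
    "V0": 100, **{f"V{i}": 60 for i in range(1, 3)},
}

def scale_scores(df: pd.DataFrame) -> pd.DataFrame:
    """Scale every AAT score to the interval [0, 1] using the Table I ranges."""
    scaled = df.copy()
    for code, upper in UPPER.items():
        scaled[code] = df[code] / upper
    return scaled

def mean_profiles(df: pd.DataFrame) -> pd.DataFrame:
    """Mean scaled score per AAT sub-test within each diagnosis group (cf. Figure 4)."""
    keep = df[df["diagnosis"].isin(["A", "B", "G", "W", "L"])]   # drop 0, N, R, T
    return keep.groupby("diagnosis")[list(UPPER)].mean()

# Example usage (hypothetical file name):
# data = pd.read_csv("aphasia.csv")
# profiles = mean_profiles(scale_scores(data))
# profiles.T.plot()   # one profile curve per diagnosis, as in Figure 4
```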

The other five aphasia types are interesting. Some further observations:

• W, L, and A are fluent aphasias while B and G are non-fluent aphasias. This difference occurs only in P0 and P1 (communicative behaviour and articulation).
• G is generally bad in all sub-tests.
• W and G: bad comprehension (shows up in V0 – V2 and T0 – T5).
• B: relatively good comprehension.
• L and A: good comprehension.
• L: repetition relatively bad (N0 – N5).
• A: word retrieval problems (P3 relatively bad).

Figure 4: Profiles of diagnoses. Horizontal axes carry the 30 AAT sub-tests in sequence. Vertical axes represent the mean values of the scores within each class A, B, G, W, and L.

Notice that in practice the physician performs the first AAT sub-test, spontaneous speech, in a free interview, while the remaining sub-tests are comprehensive and may take several hours to complete. It is therefore of practical interest whether the first sub-test alone is sufficient to make a diagnosis, as this would save time and effort for the physician and the patient. Based on the plots in Figure 4 we have characterised verbally the relationship between the scores P0, P1, …, P5 and the classical aphasias (A, B, G, and W); see Table II. The table shows that A and W can be difficult to distinguish from each other, likewise B and G. On the other hand, if P0 and P5 are high the diagnosis is A or W. Similarly, if P0 and P1 are low the diagnosis is B or G. In order to simplify, the conduction aphasia (L) is left out; we concentrate on just the classical aphasias.

Diagnosis   P0     P1     P2    P3     P4     P5
A           High   ?      ?     Low    ?      High
B           Low    Low    ?     ?      ?      Low
G           Low    Low    ?     Low    Low    Low
W           ?      High   ?     Low    Low    ?

Table II: Verbal characterisation of the classical diagnoses (A, B, G, W) with respect to the first AAT sub-test, spontaneous speech (P0 – P5). Positions containing a question mark (?) can take any value (= "don't care").

An intermediate conclusion, based on the observations above, is to design two models: 1) a simple model based on the spontaneous speech sub-test only (P0 – P5), and 2) a more comprehensive model that may use test scores from any sub-test, in the hope that it will improve the accuracy.
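Table II can be read as a small set of "don't care" patterns over the spontaneous speech scores. The sketch below is purely illustrative and not part of the models built in this paper; in particular, the threshold of 0.5 on the scaled scores used to decide High versus Low is an assumption.

```python
# Illustrative "don't care" matching of Table II; the 0.5 threshold is an assumption.
PATTERNS = {                 # P0,    P1,    P2,   P3,    P4,    P5   (None = don't care)
    "A": ["High", None,  None, "Low", None,  "High"],
    "B": ["Low",  "Low", None, None,  None,  "Low"],
    "G": ["Low",  "Low", None, "Low", "Low", "Low"],
    "W": [None,   "High", None, "Low", "Low", None],
}

def label(score: float, threshold: float = 0.5) -> str:
    """Turn a scaled score in [0, 1] into a crisp High/Low label (assumed threshold)."""
    return "High" if score >= threshold else "Low"

def matching_diagnoses(p_scores):
    """Return the diagnoses from Table II compatible with the six scaled P-scores."""
    labels = [label(s) for s in p_scores]
    return [d for d, pattern in PATTERNS.items()
            if all(want is None or want == got
                   for want, got in zip(pattern, labels))]

# Example: high P0 and P5 rule out B and G; the low P1 further rules out W.
print(matching_diagnoses([0.9, 0.4, 0.5, 0.2, 0.6, 0.8]))   # -> ['A']
```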

NEURAL NETWORK MODELS

In both models, the engineering problem is to fit a neural network to the input-output relationship given by the aphasia database. We used multi-layer perceptrons and a commercial software package to generate the networks. Before building the comprehensive model, we had to select a small number of significant features with the aid of a software package. The spontaneous speech model achieved an accuracy of 86 percent, while the more comprehensive model achieved an accuracy of 92 percent.

SPONTANEOUS SPEECH MODEL

The objective is to obtain a decent classifier based on the spontaneous speech test only, because the test is easy to perform. It follows that such a classifier is easy to consult, because it requires only a few inputs that are relatively easy to obtain. Inputs to the classifier are the six test scores P0, P1, …, P5. The outputs are the diagnoses A, B, G, W, i.e., four classes. The classifier should approximate the data with the highest possible accuracy; still, a certain degree of error is to be expected. In order to estimate the accuracy of the classifier, the remaining 146 patient records were split roughly into two halves, one for training and one for testing. They were split such that each set had roughly the same number of diagnoses from each class A, B, G, and W. The measure of accuracy is the percentage of correct diagnoses when applying the classifier to the (unseen) test data set.

A single-layer perceptron network (see for instance the textbook by Haykin, 1994), consisting of an input layer and a layer of output neurons, is a simple choice of network topology. It can only approximate classes separable by straight lines, or hyperplanes in the multi-dimensional case. We therefore chose a multi-layer perceptron, which theoretically is able to approximate any functional relationship with arbitrary accuracy. The design choices we made are as follows.

Network topology. A multi-layer perceptron with an input layer, a hidden layer, and an output layer. Inputs: spontaneous speech test scores P0, P1, …, P5. Outputs: the four classes A, B, G, and W.
Neurons. 6-5-4 neurons read layer-wise from input layer to output layer, with linear-sigmoid-sigmoid activation functions.
Learning method. Back-propagation with momentum.
Software. DataEngine (MIT, 1997).
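As a rough, non-authoritative illustration of the design above, the following sketch builds a comparable classifier with scikit-learn rather than the DataEngine package used in the paper. The stratified split, the 6-5-4 topology and the logistic (sigmoid) hidden layer mirror the description; scikit-learn's MLPClassifier uses a softmax output layer rather than independent sigmoid output units, so this is only an approximation of the original setup, and the variable names are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# X: scaled spontaneous speech scores P0..P5, shape (n_patients, 6)
# y: expert diagnoses, e.g. an array of "A", "B", "G", "W"
def train_spontaneous_speech_model(X: np.ndarray, y: np.ndarray, seed: int = 0):
    # Split roughly in half, keeping the class proportions similar in both sets.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.5, stratify=y, random_state=seed)

    # One hidden layer with 5 neurons (6-5-4 topology for a 4-class problem),
    # logistic (sigmoid) hidden activation, gradient descent with momentum.
    net = MLPClassifier(hidden_layer_sizes=(5,),
                        activation="logistic",
                        solver="sgd",
                        momentum=0.9,
                        learning_rate_init=0.1,
                        max_iter=5000,
                        random_state=seed)
    net.fit(X_train, y_train)

    # Accuracy = fraction of correct diagnoses on the unseen test set.
    accuracy = net.score(X_test, y_test)
    return net, accuracy
```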

Figure 5: Learning curve for the spontaneous speech model. The error (red line) stabilises quite early. The RMS error on the test data set was 14 percent, corresponding to 86 percent accuracy.

Figure 6: Learning curve for the comprehensive model. The error (red line) stabilises faster than for the first model. The RMS error on the test data set was 7.6 percent, corresponding to 92.4 percent accuracy.

We tested other topologies and neuron activation functions on a trial and error basis, but the choices above seemed to achieve the highest accuracy on the test data set. With those design choices, the learning progressed as shown in Figure 5. In summary, the accuracy is 86 percent correct diagnoses. The reliability depends on our particular split between training data and test data; another set of training data and/or test data may result in a different accuracy. In addition, it might be possible to achieve a better accuracy with other network topologies. We purposely constrained ourselves to spontaneous speech test scores, and the question is now whether other test scores will improve the accuracy.

COMPREHENSIVE MODEL

The objective is to build a classifier with a better accuracy than the previous classifier. The choice of network topology remains the same for simplicity, but we remove the constraint on the inputs and use any test scores necessary. Too many inputs, however, will deteriorate the accuracy due to noise in the data. A major problem, then, is to select a small but good set of inputs. In principle, we could build a classifier for each possible combination of the 30 inputs and then select the best. Since that is a complex task, we used the software package FeatureSelector; it tries out many input combinations and compares error rates. The search is not exhaustive, however, and there is no guarantee of finding the optimal combination of inputs. Apparently there exists a good set of four test scores: P1 (melody of speech), P5 (grammar), N0 (repetition), and C1 (reading aloud). Other combinations, for example P5, N4, B0, and V1, are also good according to the feature selector and provide a similar accuracy (a sketch of this kind of subset search follows at the end of this subsection). This time the design choices were as follows.

Network topology. A multi-layer perceptron with an input layer, a hidden layer, and an output layer. Inputs: test scores P1, P5, N0, and C1. Outputs: the four classes A, B, G, and W.
Neurons. 4-5-4 neurons read layer-wise from input layer to output layer, with linear-sigmoid-sigmoid activation functions.
Learning method. Back-propagation with momentum.
Software. The DataEngine package with the FeatureSelector add-in.

With those design choices, the learning progressed as shown in Figure 6. In summary, the accuracy is 92 percent correct diagnoses. The result was based on the same sets of training data and test data that we used for the spontaneous speech classifier. The accuracy decreases if we include the conduction aphasia (L) as a fifth diagnosis class.
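FeatureSelector is a commercial add-in and its algorithm is not described here; the sketch below merely illustrates the general idea of a wrapper-style search that scores randomly sampled four-score subsets with a cross-validated classifier. All function and variable names are assumptions, and the procedure is not the one used in the paper.

```python
import random

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

ALL_CODES = ([f"P{i}" for i in range(6)] + [f"T{i}" for i in range(6)] +
             [f"N{i}" for i in range(6)] + [f"C{i}" for i in range(4)] +
             [f"B{i}" for i in range(5)] + [f"V{i}" for i in range(3)])   # 30 scores

def search_feature_subsets(X: np.ndarray, y: np.ndarray, codes=ALL_CODES,
                           subset_size: int = 4, n_trials: int = 200, seed: int = 0):
    """Score randomly sampled subsets of AAT scores. Like the search described in
    the paper, this is not exhaustive, so the best subset found may not be optimal."""
    rng = random.Random(seed)
    index = {code: i for i, code in enumerate(codes)}
    results = []
    for _ in range(n_trials):
        subset = rng.sample(codes, subset_size)
        cols = [index[c] for c in subset]
        net = MLPClassifier(hidden_layer_sizes=(5,), activation="logistic",
                            max_iter=2000, random_state=seed)
        score = cross_val_score(net, X[:, cols], y, cv=5).mean()
        results.append((score, sorted(subset)))
    return sorted(results, reverse=True)[:10]   # the ten best subsets found
```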

FUZZY MODEL

Fuzzy logic is a way to accommodate partial membership of several classes at a time. This seems appropriate because of the indistinct origin of the problem; a combination of diagnoses, each associated with a degree of belief, is natural. The fuzzy c-means algorithm (FCM) provides a classification with degrees of membership (see for instance Bezdek in Jang et al., 1997). The algorithm is usually applied in order to find clusters in data; that is, given a number of clusters to look for, it will find the cluster centres and a membership of every object to every cluster. In our case the clusters are already given. The training data records contain the correct diagnosis, so the mean values of the scores within a class form the cluster centre. A very simple classifier would then assign any new object, from the test data set for instance, to the nearest cluster centre. This procedure remains to be attempted. Assuming the objects are scattered and intermixed, such that the cluster centres are unclear, another approach is to assign a class membership similar to that of the nearest neighbours; we could call it a fuzzy nearest neighbour (FNN) algorithm. The idea is to assign a class membership depending on the neighbours, such that the classifier is independent of the cluster shape. If there are many neighbours of class A, the test point will be assigned to A, but if there are also class B points in the neighbourhood, the class assignment will be shared, each with a degree less than one. Training points far away from the test point influence the class assignment very little. The weighting function, which depends on distance, can be the same function used to assign membership in the FCM algorithm,

m_{ik} = \frac{1}{\sum_{j=1}^{c} \left( d_{ik} / d_{jk} \right)^{2/(q-1)}}     (1)

Here m_{ik} is the membership, in the interval [0, 1], of data point k to cluster i, and d_{ik} is the Euclidean distance between the ith cluster centre and the kth data point,

d_{ik} = \| u_k - c_i \|     (2)

The factor q, often referred to as the fuzziness factor, is a weighting exponent in the interval [1, ∞); it defines more or less the extent of the neighbourhood. Thus, the FNN algorithm is an FCM algorithm with each training point as a cluster centre:

Step 1. Assign c cluster centres c_i (i = 1, 2, …, c), where c is the number of objects in the training set, and each training object is one centre c_i.
Step 2. Compute the memberships m_{ik} using (1).

The design choices were as follows.

Inputs: test scores P1, P5, N0, and C1. Outputs: the four classes A, B, G, and W.
Fuzziness factor: q = 1.1.
Software. Matlab.

With the same training and test data as previously, and the four test scores from the comprehensive model, the result is shown in Figure 7. The accuracy is 91 percent, almost the same as for the neural network classifier. The fuzziness factor q = 1.1 is equivalent to a narrow neighbourhood; apparently more distant neighbours disturb the classification. That also explains why the classifier assigns one diagnosis only to the majority of the cases, rather than the expected blend of diagnoses. The FNN algorithm has the advantage of having only one design parameter, q. In comparison, the network model has many more design choices (network topology, number of neurons, type of activation functions, training method).
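The FNN classifier described in this section translates into only a few lines of code. The following is a minimal sketch in Python rather than the Matlab implementation used in the paper; it assumes scaled score vectors held in NumPy arrays, and the way the per-centre memberships are summed into class memberships is one plausible reading of the description above.

```python
import numpy as np

def fnn_memberships(x, train_X, train_y, q=1.1, classes=("A", "B", "G", "W")):
    """Fuzzy nearest neighbour: every training point acts as a cluster centre.

    x        : test point, shape (n_features,)
    train_X  : training points, shape (n_train, n_features)
    train_y  : training diagnoses, length n_train
    Returns a dict of class memberships that sum to one.
    """
    d = np.linalg.norm(train_X - x, axis=1)      # Euclidean distances, eq. (2)
    d = np.maximum(d, 1e-12)                     # guard against a zero distance
    # Membership of the test point to each training point (centre), eq. (1):
    # m_i = 1 / sum_j (d_i / d_j)^(2/(q-1)), which equals d_i^(-p) / sum_j d_j^(-p)
    w = d ** (-2.0 / (q - 1.0))
    m = w / w.sum()
    # Class membership: sum of the memberships of the centres in that class.
    train_y = np.asarray(train_y)
    return {c: float(m[train_y == c].sum()) for c in classes}

def fnn_classify(x, train_X, train_y, q=1.1):
    memb = fnn_memberships(x, train_X, train_y, q)
    return max(memb, key=memb.get), memb         # crisp label plus the blend
```

With q = 1.1 the exponent 2/(q-1) equals 20, so the nearest training points dominate almost completely, which matches the observation above that most test cases receive one diagnosis only.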

Figure 7: Fuzzy nearest neighbour classification. The bottom plot is the expert's diagnosis, and the top plot is the result of running the classifier on the test data set. The four classes A, B, G, and W are represented by shades of grey, ordered from left to right in the bottom plot.

CONCLUSIONS

The dual objectives were i) to teach medical students about soft computing, and ii) to teach engineering students about a medical application of soft computing. Making the models available on the World Wide Web meets the first objective. The neural network classifiers can be consulted there, and the site contains a tutorial; together with this paper, the neural network side is thus covered. The fuzzy nearest neighbour model remains to be placed on the Web. The second objective is met by the same material, as well as by a companion paper on the medical aspects of the application. The next step is to apply other methods, including machine learning and evolutionary computing, to the same database in order to build a collection of test cases for benchmark studies.

REFERENCES

Axer, H., Jantzen, J., Berks, G., Südfeld, D. and Graf von Keyserlingk, D. (2000), The Aphasia Database on the Web: Description of a Model for Problems of Classification in Medicine. Proc. ESIT 2000 (accepted).
Haykin, S. (1994), Neural Networks: A Comprehensive Foundation. Macmillan College Publishing Company, Inc., 866 Third Ave, New York, NY 10022.
Huber, W., Poeck, K. and Weniger, D. (1983), Aachener Aphasie Test (AAT). Hogrefe, Göttingen.
Huber, W., Poeck, K. and Weniger, D. (1984), The Aachen Aphasia Test. In: Rose, F.C. (ed.), Advances in Neurology, Vol. 42: Progress in Aphasiology. Raven, New York.
Jang, J.-S. R., Sun, C.-T. and Mizutani, E. (1997), Neuro-Fuzzy and Soft Computing. ISBN 0-13-261066-3, Prentice Hall, Upper Saddle River, NJ, USA.
Keyserlingk, A.G. v., Naujokat, C., Niemann, K., Huber, W. and Thron, A. (1997), Global aphasia - with and without hemiparesis. A linguistic and CT scan study. European Neurology 38, pp. 259-267.
MIT (1997), DataEngine: Part II, Tutorials. MIT GmbH, Promenade 9, D-52076 Aachen, Germany.
Niemann, K., Keyserlingk, D.G. v. and Wasel, J. (1988), Superimposition of an averaged three-dimensional pattern of brain structures on CT scans. Acta Neurochirurgica 93, pp. 61-67.
Willmes, K. and Poeck, K. (1993), To what extent can aphasic syndromes be localized? Brain 116, pp. 1527-1540.