eHealth Beyond the Horizon – Get IT There S.K. Andersen et al. (Eds.) IOS Press, 2008 © 2008 Organizing Committee of MIE 2008. All rights reserved.
On Machine Learning Classification of Otoneurological Data Martti JUHOLA Department of Computer Sciences, 33014 University of Tampere, Finland
Abstract. A dataset comprising cases of six otoneurological diseases was analysed using machine learning methods to investigate the classification of these diseases and to compare the effectiveness of different methods on this data. Linear discriminant analysis was the best method, followed by multilayer perceptron neural networks, provided that the data was input into the network in the form of principal components. Nearest neighbour searching, k-means clustering and Kohonen neural networks achieved almost as good results as these two, while decision trees were slightly worse. Thus, these methods fared well, but the Naïve Bayes rule could not be used since some data matrices were singular. Otoneurological cases of the six diseases considered can be reliably distinguished.
Keywords: Machine learning, Classification
Introduction
Vertigo or dizziness and other balance disorders are a common nuisance and can be symptoms of a serious disease. These problems are studied in otoneurology [1] to find their causes, to devise treatments and to prevent accidents arising from them. For this purpose, computational classification methods can be used to identify a patient’s disease and to separate disease cases from each other.
There are several classification methods that use various approaches to assign objects to classes. Traditionally, discriminant analysis, cluster analysis, nearest neighbour searching, the Naïve Bayesian rule and decision trees have been applied [2]. Later, new methods such as neural networks were developed [3]. However, it has seldom been studied whether these methods differ in their ability to classify various datasets. The aim of the current study was to determine how well different machine learning methods are able to classify otoneurological data. The general aim of this classification is to support otoneurological diagnostics. These diseases are often difficult for even experienced specialists to diagnose and distinguish from each other.
Earlier we investigated the classification of otoneurological data with multilayer perceptron neural networks [4,5]. We then encountered problems stemming from a biased class distribution and a relatively small number of cases in the dataset. Small disease classes were more or less lost in classification; in other words, the large classes dominated the small ones. These difficulties are fairly common in medical datasets, since there are frequent and infrequent diseases. When the dataset included six diseases and their share
was not 1/6, but clearly greater or less, this biased class distribution made it difficult for perceptron neural networks to learn the small classes. In principle, the number of hidden nodes, and thus of connections (weights), could be increased in a neural network to learn the features of a dataset accurately enough, but the relatively small number of learning cases restricted this. To overcome these structural problems of the dataset, we later constructed a set of multilayer perceptron neural networks [5], a separate network for each disease class, where each network was trained with both the data of a disease and artificial data of its counterpart in the variable space, but the process was complicated. In the present study, a comparison of machine learning methods was performed and, secondly, the data was preprocessed with principal component analysis (mainly for perceptron networks) to address the problems caused by the structural features of the data.
1. Data
The dataset of 815 cases consisted of 38 attributes recording replies to queries regarding patients’ symptoms, medical history, results of laboratory measurements and clinical findings. These attributes were found to be the most important of a larger attribute set in our earlier research [6] and seldom contained missing values: 11 % of the values in the whole dataset were missing. They were imputed using modes for the 11 binary attributes and the one nominal attribute, and medians for the 10 ordinal and 16 quantitative attributes. The single four-valued nominal attribute was then replaced by three binary attributes to enable the strict use of the Euclidean measure, as this measure cannot be applied to nominal attributes other than binary ones. Imputation was carried out class by class, because it is important to follow the class-wise distributions: differences between classes are, after all, the basis of any classification.
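The class-wise imputation described above can be sketched as follows. This is a minimal illustration in Python rather than the Matlab code of the study; the function name `impute_classwise`, the `None` marker for missing values and the toy data are assumptions made for the example.

```python
from collections import Counter
from statistics import median

def impute_classwise(rows, labels, kinds):
    """Fill None entries class by class.

    rows   : list of lists, one list of attribute values per case
    labels : class label of each case
    kinds  : per-attribute rule, "mode" (binary or nominal attributes)
             or "median" (ordinal or quantitative attributes)
    """
    filled = [list(r) for r in rows]
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        for j, kind in enumerate(kinds):
            observed = [rows[i][j] for i in idx if rows[i][j] is not None]
            if not observed:
                continue  # no observed value in this class: leave the gap
            if kind == "mode":
                fill = Counter(observed).most_common(1)[0][0]
            else:
                fill = median(observed)
            for i in idx:
                if filled[i][j] is None:
                    filled[i][j] = fill
    return filled

# Toy data (hypothetical): attribute 0 is binary, attribute 1 is quantitative.
rows = [[1, 2.0], [None, 4.0], [1, None], [0, 10.0], [0, None], [None, 30.0]]
labels = ["A", "A", "A", "B", "B", "B"]
imputed = impute_classwise(rows, labels, ["mode", "median"])
```

Because the fill values are computed within each class, the missing quantitative value of class A is replaced by the class A median and not by a median over the whole dataset, which would blur the differences between classes.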
Using modes and medians in imputation is simple, but sufficient in this case, because the number of missing values was fairly small and we observed earlier [7] that the more sophisticated techniques of linear regression and Expectation Maximisation (EM) gave no better results when discriminant analysis was used for the classification of otoneurological data. Another plausible reason was that the physicians who collected the data did not determine some attributes, regarding them as less important for certain cases. The dataset was gathered at the Department of Otorhinolaryngology, Helsinki University Central Hospital, Finland. The six diseases and their frequencies are presented in Table 1. The subset of Menière’s disease is far larger than the two small subsets of sudden deafness and traumatic vertigo.

Table 1. Frequencies of the six diseases in the dataset.

| Disease | Number | % |
| --- | --- | --- |
| vestibular schwannoma | 130 | 16 |
| benign positional vertigo | 146 | 18 |
| Menière’s disease | 313 | 38 |
| sudden deafness | 41 | 5 |
| traumatic vertigo | 65 | 8 |
| vestibular neuritis | 120 | 15 |
| total | 815 | 100 |
2. Methods
The following methods were examined in the comparison: k-nearest neighbour searching, discriminant analysis, the Naïve Bayesian decision rule, k-means clustering, decision trees, multilayer perceptron neural networks and Kohonen networks (self-organising maps). However, the data matrices required by the Bayesian method were inevitably singular, since for some classes certain attributes contained only zeroes, and these were crucial variables that could not be eliminated. Singular matrices cannot be inverted as the method requires. Thus, the Bayesian method was excluded. Likewise, discriminant analysis suffered from the same complication with the Mahalanobis (generalised Euclidean) and quadratic discriminant functions. Therefore, only linear discriminant analysis was executed. All implementations were programmed with Matlab using various values of control parameters such as the number of clusters or of network hidden nodes.
For testing, the dataset was divided into ten pairs of a learning subset (90 % of cases) and a test subset (10 %) in accordance with 10-fold cross-validation, so that every case was included in exactly one test subset. The selection into the subsets was performed randomly, but in accordance with the class distribution. Sensitivities, specificities and total accuracies were computed. Most machine learning methods involve random initialisations. Therefore, ten runs were performed for each cross-validation pair, and thus 100 runs altogether, with one exception: two of the ten subset pairs were discarded for discriminant analysis because they produced matrices that were not positive definite. Accordingly, 80 runs were executed for it. The results are shown in condensed form in the following section.
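The test arrangement can be illustrated with a short Python sketch: a stratified 10-fold split and one-vs-rest measures for a single class. The function names, the class abbreviations and the reading of "total accuracy" as the one-vs-rest accuracy (TP + TN) / N are assumptions made for this example; the original Matlab implementation is not shown in the paper.

```python
import random
from collections import defaultdict

def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of test indices; each case is in exactly one."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    offset = 0
    for y in sorted(by_class):
        idx = by_class[y]
        rng.shuffle(idx)
        # Deal the shuffled cases of this class round-robin over the folds.
        for pos, i in enumerate(idx):
            folds[(offset + pos) % n_folds].append(i)
        offset += len(idx)
    return folds

def one_vs_rest_metrics(true, pred, cls):
    """Sensitivity, specificity and total accuracy (in %) for one class.

    Assumes the test subset holds cases of cls and of at least one other class.
    """
    pairs = list(zip(true, pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    tn = sum(t != cls and p != cls for t, p in pairs)
    sensitivity = 100.0 * tp / (tp + fn)
    specificity = 100.0 * tn / (tn + fp)
    accuracy = 100.0 * (tp + tn) / len(pairs)
    return sensitivity, specificity, accuracy

# Class sizes of Table 1 under hypothetical abbreviations.
sizes = {"VS": 130, "BPV": 146, "MD": 313, "SD": 41, "TV": 65, "VN": 120}
labels = [c for c, n in sizes.items() for _ in range(n)]
folds = stratified_folds(labels)
```

With 815 cases every test subset holds 81 or 82 cases, and each class is spread over the folds as evenly as its size allows, for example four or five cases of sudden deafness per fold. The learning subset of a pair is simply the union of the other nine folds.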
3. Results
Nearest neighbour searching was tested with one, three and five nearest neighbours. The average results are presented in Table 2. Increasing the number k of nearest neighbours did not improve the otherwise satisfactory results, which can happen depending on the data. The values of vestibular schwannoma (acoustic neurinoma) and benign positional vertigo varied with k, which partly reflects that their cases were sometimes confused between the two classes. When the total accuracy value of each class was weighted with its number of cases, the mean weighted total accuracies were 93.5±2.3, 93.0±2.9 and 93.2±2.8 % for k equal to 1, 3 and 5.
The results of linear discriminant analysis are given in Table 3. It was exceptionally successful for the class of sudden deafness, which can be hard to distinguish medically and is also the smallest class. The other classes were also detected well. A mean weighted total accuracy of 95.5±1.8 % was obtained.
Table 4 presents the results of k-means clustering for three of the several values of k tested. The minimum was six because of the number of classes. Apart from sudden deafness, numbers k of clusters between 12 and 40 produced approximately equally high average results. For sudden deafness, 20 clusters sufficed to raise sensitivity to 89 %. Benign positional vertigo remained the most poorly recognised class here. Mean weighted total accuracies of 90.9±3.8, 92.8±3.5 and 92.9±3.6 % were achieved for k equal to 6, 12 and 20.
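As a worked example of the weighting, the following Python lines combine the class sizes of Table 1 with the per-class total accuracies reported for 1-nearest-neighbour searching in Table 2. The helper name `weighted_mean` is introduced here for illustration only.

```python
# Class sizes from Table 1 and per-class total accuracies (%) of
# 1-nearest-neighbour searching from Table 2, in the same class order.
class_sizes = [130, 146, 313, 41, 65, 120]
total_accuracy = [96.4, 91.8, 90.4, 99.4, 97.8, 96.6]

def weighted_mean(values, weights):
    """Mean of values, each weighted by the number of cases in its class."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

weighted = weighted_mean(total_accuracy, class_sizes)
print(round(weighted, 2))
```

The result is about 93.6 %, close to the reported 93.5 %. A small difference is to be expected, since the reported figure was presumably averaged over the individual test runs and not computed from the rounded table entries.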
Table 2. Means and standard deviations of nearest neighbour searching in percent for 100 test runs.

| Number of neighbours | Statistic | vestibular schwannoma | benign positional vertigo | Menière’s disease | sudden deafness | traumatic vertigo | vestibular neuritis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1-nearest | Sensitivity | 88.5±9.8 | 80.7±19.7 | 87.6±6.5 | 92.5±12.1 | 82.6±19.7 | 86.7±13.1 |
| 1-nearest | Total accuracy | 96.4±1.6 | 91.8±3.6 | 90.4±4.8 | 99.4±1.0 | 97.8±2.2 | 96.6±2.1 |
| 3-nearest | Sensitivity | 89.2±10.4 | 77.3±23.6 | 87.9±5.6 | 92.5±12.1 | 78.8±16.5 | 84.2±13.9 |
| 3-nearest | Total accuracy | 96.0±2.5 | 91.2±4.3 | 89.6±5.2 | 99.4±1.0 | 97.6±2.4 | 96.5±2.5 |
| 5-nearest | Sensitivity | 79.2±16.2 | 83.3±18.9 | 89.5±5.8 | 90.5±12.3 | 80.2±16.4 | 90.0±11.7 |
| 5-nearest | Total accuracy | 96.4±2.7 | 92.2±3.4 | 89.1±6.0 | 99.4±0.9 | 97.9±1.7 | 97.1±2.2 |
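For readers unfamiliar with nearest neighbour searching, a minimal Python sketch with the Euclidean measure follows. It is an illustration and not the Matlab implementation of the study; the function name `knn_predict` and the toy data are assumptions.

```python
from collections import Counter
from math import dist

def knn_predict(train_x, train_y, query, k=1):
    """Class of `query` by majority vote among its k nearest training cases."""
    order = sorted(range(len(train_x)), key=lambda i: dist(train_x[i], query))
    # Counter keeps insertion order, so a tied vote goes to the class
    # whose case is nearest to the query.
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well separated classes in two dimensions.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = ["a", "a", "a", "b", "b", "b"]
near_a = knn_predict(train_x, train_y, (0.5, 0.5), k=3)
near_b = knn_predict(train_x, train_y, (5.5, 5.5), k=1)
```

The Euclidean distance used here is also the reason why the four-valued nominal attribute was replaced by binary attributes in the data section.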
Table 3. Means and standard deviations of linear discriminant analysis in percent for 80 test runs.

| Statistic | vestibular schwannoma | benign positional vertigo | Menière’s disease | sudden deafness | traumatic vertigo | vestibular neuritis |
| --- | --- | --- | --- | --- | --- | --- |
| Sensitivity | 87.5±8.6 | 87.5±15.5 | 89.6±4.8 | 100.0±0.0 | 90.5±13.3 | 95.8±8.4 |
| Total accuracy | 97.5±1.5 | 93.2±2.4 | 93.4±3.1 | 99.5±0.6 | 98.8±1.2 | 98.2±1.7 |
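The reason why only the linear variant of discriminant analysis could be run, as stated in the Methods section, can be seen in a small Python example. An attribute that is zero for every case of a class makes the covariance matrix of that class singular, while the pooled within-class covariance matrix used by linear discriminant analysis may still be invertible. The toy data and the helper names are assumptions made for this illustration.

```python
def covariance(rows):
    """Sample covariance matrix of a list of equally long rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [[sum((r[a] - means[a]) * (r[b] - means[b]) for r in rows) / (n - 1)
             for b in range(d)] for a in range(d)]

def det2(m):
    """Determinant of a 2 x 2 matrix."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# Toy data (hypothetical): the second attribute is zero in all cases of class A.
class_a = [[1.0, 0.0], [2.0, 0.0], [4.0, 0.0]]
class_b = [[1.0, 1.0], [3.0, 0.0], [5.0, 1.0]]

cov_a = covariance(class_a)
cov_b = covariance(class_b)

# Pooled within-class covariance, shared by all classes in the linear method.
n_a, n_b = len(class_a), len(class_b)
pooled = [[((n_a - 1) * cov_a[i][j] + (n_b - 1) * cov_b[i][j]) / (n_a + n_b - 2)
           for j in range(2)] for i in range(2)]

print(det2(cov_a), det2(pooled))
```

The determinant of the class A matrix is zero, so it cannot be inverted as a class-wise (quadratic or Mahalanobis) discriminant function would require. The pooled matrix has a positive determinant because the attribute varies in class B.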
Table 4. Means and standard deviations of k-means clustering in percent for 100 test runs.

| Number k of clusters | Statistic | vestibular schwannoma | benign positional vertigo | Menière’s disease | sudden deafness | traumatic vertigo | vestibular neuritis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 6 | Sensitivity | 81.2±14.2 | 66.2±33.4 | 92.8±7.3 | 12.0±32.7 | 53.8±44.4 | 84.8±20.2 |
| 6 | Total accuracy | 93.7±3.2 | 88.1±6.0 | 88.0±6.7 | 95.5±1.6 | 95.3±3.4 | 94.9±5.1 |
| 12 | Sensitivity | 80.8±13.1 | 71.7±28.5 | 93.3±6.2 | 68.3±40.0 | 87.2±16.3 | 86.3±14.4 |
| 12 | Total accuracy | 95.7±2.7 | 92.5±4.9 | 88.7±6.2 | 98.2±2.1 | 97.8±2.6 | 96.4±3.2 |
| 20 | Sensitivity | 81.2±13.3 | 74.9±26.0 | 91.2±8.1 | 88.8±17.5 | 85.7±18.4 | 87.2±13.6 |
| 20 | Total accuracy | 96.0±2.4 | 92.3±5.4 | 88.6±6.4 | 99.3±1.0 | 97.9±2.2 | 96.8±2.2 |
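The paper does not state how clusters were turned into class decisions. One common way, assumed here purely for illustration, is to label each cluster centre with the majority class of the learning cases assigned to it and to classify a test case by its nearest centre. The Python sketch below follows that assumption; all function names and the toy data are hypothetical, and the initial centres are passed in explicitly, whereas in practice they would be drawn at random.

```python
from collections import Counter
from math import dist

def nearest_centre(x, centres):
    """Index of the centre closest to x in the Euclidean measure."""
    return min(range(len(centres)), key=lambda c: dist(x, centres[c]))

def kmeans(points, centres, n_iter=50):
    """Plain k-means iterations starting from the given initial centres."""
    centres = list(centres)
    for _ in range(n_iter):
        groups = [[] for _ in centres]
        for p in points:
            groups[nearest_centre(p, centres)].append(p)
        # Move each centre to the mean of its cluster; keep it if the cluster is empty.
        centres = [tuple(sum(v) / len(g) for v in zip(*g)) if g else c
                   for g, c in zip(groups, centres)]
    return centres

def label_centres(centres, train_x, train_y):
    """Majority class of the learning cases assigned to each centre."""
    votes = [Counter() for _ in centres]
    for x, y in zip(train_x, train_y):
        votes[nearest_centre(x, centres)][y] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]

def classify(x, centres, centre_labels):
    return centre_labels[nearest_centre(x, centres)]

# Toy data (hypothetical): two well separated classes in two dimensions.
train_x = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
train_y = ["a", "a", "a", "b", "b", "b"]
centres = kmeans(train_x, [train_x[0], train_x[3]])
centre_labels = label_centres(centres, train_x, train_y)
```

Under this scheme a small class can only be recognised if at least one cluster is dominated by it, which is consistent with the low sensitivity for sudden deafness at k equal to 6 and its improvement with more clusters.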
Table 5. Means and standard deviations of decision trees in percent for 100 test runs.

| Statistic | vestibular schwannoma | benign positional vertigo | Menière’s disease | sudden deafness | traumatic vertigo | vestibular neuritis |
| --- | --- | --- | --- | --- | --- | --- |
| Sensitivity | 72.7±22.4 | 63.8±32.5 | 87.4±10.7 | 43.6±39.4 | 81.1±17.3 | 80.1±17.0 |
| Total accuracy | 95.3±3.5 | 89.0±4.5 | 82.6±4.8 | 94.9±2.4 | 96.6±2.1 | 95.4±3.1 |
Next, decision trees (Table 5) were built and pruned, with the best tree size estimated according to residual variance. The best size was 36.2 leaves on average. On average, decision trees were not effective for benign positional vertigo and
especially sudden deafness, although their variances were large, i.e. some of the trees were very good but others poor. A weighted mean total accuracy of 89.4±2.5 % was obtained.
Multilayer perceptron networks as such were incapable of classifying the dataset, tending to put most cases into the largest class, Menière’s disease, and to lose the two smallest classes entirely. Hence, principal component analysis was computed, and its results were used as input data. Every perceptron network comprised 40 input nodes (attributes) and six output nodes, one for each disease. The only free parameter in the network topology was the number of hidden nodes. Adaptive learning and a momentum coefficient were used with the backpropagation training algorithm, which was run for no more than 200 epochs to prevent overlearning. Tests were conducted using 4-16 hidden nodes. A known recommendation states that the number of connections (weights) in a perceptron neural network should not be greater than one tenth of a training set for learning to succeed efficiently. Eight hidden nodes was such an upper bound, but the networks with 6-16 hidden nodes produced valid results (Table 6). Benign positional vertigo was the most difficult to detect. Mean weighted total accuracies of 92.9±3.5, 94.9±2.9 and 95.0±3.0 % were achieved for 4, 6 and 10 hidden nodes.

Table 6. Means and standard deviations of multilayer perceptron networks in percent for 100 test runs.

| Number of hidden nodes | Statistic | vestibular schwannoma | benign positional vertigo | Menière’s disease | sudden deafness | traumatic vertigo | vestibular neuritis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 4 | Sensitivity | 86.5±15.7 | 77.6±21.3 | 91.9±5.9 | 32.8±47.0 | 58.6±43.2 | 87.5±20.3 |
| 4 | Total accuracy | 95.2±4.5 | 91.6±5.1 | 90.9±5.9 | 96.4±2.2 | 95.3±3.1 | 94.9±4.2 |
| 6 | Sensitivity | 89.3±9.6 | 79.0±20.8 | 91.53±5.5 | 93.2±23.4 | 86.9±16.7 | 91.9±10.9 |
| 6 | Total accuracy | 97.7±1.8 | 93.0±4.1 | 91.7±5.4 | 99.3±1.2 | 98.2±1.6 | 97.3±2.4 |
| 10 | Sensitivity | 89.2±9.0 | 80.3±18.6 | 91.8±5.7 | 99.8±2.5 | 88.9±12.9 | 92.8±10.6 |
| 10 | Total accuracy | 97.7±1.5 | 93.5±4.0 | 92.3±5.3 | 99.7±0.5 | 98.4±1.4 | 97.6±2.0 |
Table 7. Means and standard deviations of Kohonen networks in percent for 100 test runs.

| Number of nodes | Statistic | vestibular schwannoma | benign positional vertigo | Menière’s disease | sudden deafness | traumatic vertigo | vestibular neuritis |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 33 | Sensitivity | 78.5±17.2 | 66.1±27.2 | 94.1±6.6 | 32.2±28.5 | 13.9±27.6 | 84.5±16.5 |
| 33 | Total accuracy | 94.3±2.8 | 86.6±4.9 | 87.2±7.0 | 93.9±2.3 | 91.6±2.9 | 96.2±2.8 |
| 55 | Sensitivity | 82.2±14.1 | 71.4±27.3 | 90.1±7.1 | 57.4±32.3 | 80.0±19.2 | 82.8±15.4 |
| 55 | Total accuracy | 95.1±2.5 | 91.1±5.0 | 88.2±5.8 | 96.3±2.4 | 96.8±2.4 | 96.1±2.7 |
| 88 | Sensitivity | 86.2±12.9 | 74.6±24.5 | 88.1±7.7 | 79.4±26.6 | 84.8±18.2 | 84.6±15.4 |
| 88 | Total accuracy | 95.9±2.5 | 91.5±5.2 | 88.7±5.9 | 98.4±1.8 | 97.7±2.3 | 96.4±2.6 |
Finally, Kohonen neural networks were assessed, varying their sizes from 33 to 99 nodes. The latter was effectively the maximum, since there were 81 or 82 cases in each training set. A hexagonal neighbourhood pattern was employed as the
topological structure with link distance. For every network, 400 learning epochs were completed. According to the results, 77 nodes were sufficient for the present data, and as few as 55 for the classes other than sudden deafness. Table 7 includes a breakdown for three network sizes. Benign positional vertigo and sudden deafness obtained slightly poorer results than the remaining four classes. Mean weighted total accuracies of 90.3±3.7, 92.1±3.5 and 92.7±3.3 % were attained for the three sizes.
The running time (100 tests with a 2.6 GHz processor) of discriminant analysis was approximately 10 s. It was about 100 s for clustering and nearest neighbour searching, 5.5 min for decision trees and multilayer perceptron networks and 26 h 8 min for Kohonen networks of 88 nodes.
4. Discussion and conclusion
Taking the best results from Tables 2-7, linear discriminant analysis was the best for the current data, classifying correctly with a mean weighted accuracy of 95.5 %, though the differences between the methods were small. When fed the principal components computed from the data, the perceptron neural networks with one hidden layer of 10 hidden nodes reached the next highest level of 95.0 %. Moreover, nearest neighbour searching with one neighbour achieved 93.5 %, k-means clustering with 20 clusters obtained 92.9 %, Kohonen neural networks of 88 nodes produced 92.7 % and decision trees 89.4 %. According to the t test (0.001 significance level), there were statistically significant differences among the first three methods mentioned and between them and the others. The running times favoured all methods except the Kohonen networks. On the other hand, the time-consuming part of the latter was the training phase, which for practical purposes is performed only once or updated infrequently.
Methods other than multilayer perceptron networks did not benefit from using principal components as input. Perhaps this was because the results were very good even without them. In fact, for linear discriminant analysis the results would be identical regardless of this preprocessing, owing to its linear character in relation to the principal component procedure.
The results cannot be generalised to arbitrary medical datasets. Still, they showed that simple discriminant analysis was highly effective in both accuracy and time.
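The invariance of linear discriminant analysis under the principal component transform can be verified in a few lines, assuming that all principal components are retained so that the transform is invertible. The symbols $A$ (transform matrix), $m$ (overall mean), $\mu_k$ (mean of class $k$) and $\Sigma$ (pooled covariance matrix) are introduced here for illustration and do not appear in the original text.

```latex
% Principal component transform with all components kept:
% y = A (x - m), with A invertible (orthogonal for plain PCA).
\[
  \mu_k' = A(\mu_k - m), \qquad \Sigma' = A \Sigma A^{\top}
\]
% Squared Mahalanobis distance to the mean of class k after the transform:
\[
  (y - \mu_k')^{\top} \Sigma'^{-1} (y - \mu_k')
  = (x - \mu_k)^{\top} A^{\top} \left( A \Sigma A^{\top} \right)^{-1} A \, (x - \mu_k)
  = (x - \mu_k)^{\top} \Sigma^{-1} (x - \mu_k)
\]
```

Linear discriminant analysis assigns a case according to these distances to the class means (plus prior terms that do not depend on the case), so the assigned class is the same before and after the transform. If only some of the components were kept, $A$ would not be invertible and the argument would no longer apply.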
References
[1] J. Furman (ed.), Otoneurology, in: Neurologic Clinics, vol. 23-2, Elsevier, Netherlands, 2005.
[2] A.R. Webb, Statistical Pattern Recognition, 2nd edition, John Wiley & Sons, Chichester, England, 2002.
[3] K.J. Cios, W. Pedrycz, R.W. Swiniarski and L.A. Kurgan, Data Mining, A Knowledge Discovery Approach, Springer Science+Business Media, New York, NY, USA, 2007.
[4] M. Juhola, K. Viikki, J. Laurikkala, I. Pyykkö and E. Kentala, On classification capability of neural networks: a case study with otoneurological data, in: Proc. of Medical Informatics 2001, London, UK, IOS Press, Amsterdam, Netherlands, 2001, 474-474.
[5] M. Siermala, M. Juhola, J. Laurikkala, K. Iltanen, E. Kentala and I. Pyykkö, Evaluation and classification of otoneurological data with new data analysis methods based on machine learning, Information Sciences 177 (2007) 1963-1976.
[6] E. Kentala, I. Pyykkö, Y. Auramo and M. Juhola, Database for vertigo, Otolaryngology – Head and Neck Surgery 112 (1995) 383-390.
[7] J. Laurikkala, E. Kentala, M. Juhola, I. Pyykkö and S. Lammi, Usefulness of imputation for the analysis of incomplete otoneurological data, International Journal of Medical Informatics 58-59 (2001) 235-242.