ON THE CLASSIFICATION OF MENTAL TASKS: A PERFORMANCE COMPARISON OF NEURAL AND STATISTICAL APPROACHES

Guilherme A. Barreto, Rewbenio A. Frota and Fátima N. S. de Medeiros
Department of Teleinformatics Engineering, Federal University of Ceará
Campus do Pici, 60455-760, Fortaleza, Ceará, Brazil
Phone: +55 85 288 9467, Fax: +55 85 288 9468
E-mails: rewbenio, fsombra, [email protected]

Abstract. Electroencephalogram (EEG) signals represent an important class of biological signals whose behavior can be used to diagnose anomalies in brain activity. The goal of this paper is to find a concise representation of EEG data, corresponding to 5 mental tasks performed by different individuals, for classification purposes. For that, we propose the use of Welch's periodogram as a powerful feature extractor and compare the performance of SOM- and MLP-based neural classifiers with that of the standard Bayes optimal classifier. The results show that Welch's periodogram allows all classifiers to achieve higher classification rates (73%-100%) than those presented so far in the literature (≤ 71%).
1. INTRODUCTION

The EEG signal is a useful tool in clinical medicine and research. For instance, it can be used to determine the global activity of the cerebral cortex and, to some extent, to locate abnormal activity in relatively small cortical areas. It also serves as an important auxiliary source of information for the diagnosis of sleep disturbances and epilepsy, and to differentiate between coma and brain death [9]. In engineering-oriented scenarios, EEG signals are used for the classification of mental tasks performed by subjects [3, 2, 12] and for the design of man-machine interfaces [11, 12]. For suitable use in the aforementioned applications, it is important to have a good representation of the EEG data; such representations have been obtained, for example, by principal component analysis [15], autoregressive (AR) models [2], the wavelet transform [4] and power spectral density (PSD) analysis [12, 7]. All of them have provided acceptable results in extracting and classifying different patterns from EEG signals. However, especially for the discrimination of several mental tasks, the classification rates are not satisfactory. This is mainly due to the noisy and nonstationary nature of EEG signals, which are very often disturbed by power-line interference, by movements of the eyes and of the electrodes on the scalp of the subject, as well as by vocalization of thoughts and loss of concentration during the recording of brain activity [4].

In this paper we argue that a simple but powerful preprocessing method, capable of handling both the noisy and nonstationary natures of EEG signals while maintaining the "useful" information, can alleviate the burden placed on the classifier design. For this, we use Welch's periodogram [18], a classical PSD estimation method, and analyze the resulting benefits by comparing the performance of SOM- and MLP-based neural classifiers with that of the standard Bayes optimal classifier. The results show that Welch's periodogram allows all classifiers to achieve higher classification rates than those presented so far in the literature.

The remainder of the paper is organized as follows. In Section 2, the EEG data acquisition process is described. In Section 3, we briefly present Welch's averaged modified periodogram method. The classifiers whose performances are analyzed in this paper are described in Section 4. In Section 5 we present the simulation results and discuss them with respect to those reported in the literature, focusing on the pros and cons of the proposed approach. We conclude the paper in Section 6.
2. DATA ACQUISITION

The data set used in this study comprises EEG signals from five subjects performing five different mental tasks. The subjects are seated in an Industrial Acoustics Company sound-controlled booth with dim lighting and noiseless fans (for ventilation). An Electro-Cap elastic electrode cap is used to record EEG signals from positions C3, C4, P3, P4, O1 and O2, defined by the 10-20 system of electrode placement [5]. The impedance of all electrodes is kept below 5 kΩ. These recordings have been used before by [2, 8, 12] and are available online at http://www.cs.colostate.edu/~anderson. Fig. 1 shows the electrode placement and the measurement procedure, which is made with reference to electrically linked mastoids, A1 and A2. The electrodes are connected through a bank of amplifiers (Grass 7P511), whose band-pass analog filters are set from 0.1 to 100 Hz. The data are sampled at 250 Hz with a Lab Master 12-bit A/D converter mounted in a computer. Before each recording session, the system is calibrated with a known voltage. Signals are recorded for 10 seconds during each task, and each task is repeated for a varying number of sessions, held in different weeks. Since the sampling rate is 250 Hz, each EEG signal provides 2500 samples per channel. Thus, each mental task is described by the signals obtained from each one of the electrodes (also called channels). In a session, each subject is requested to perform the following five mental tasks:

• Baseline Task: the subject is asked to relax as much as possible, make as few movements as possible and think of nothing in particular.
Figure 1: Signal acquisition.
• Letter Task: the subject is asked to mentally compose a letter to a known person (e.g. father, mother or friend) without vocalizing or making any movements.

• Multiplication Task: the subject is asked to solve nontrivial multiplication problems, such as 87 times 69, without vocalizing or making any movements.

• Visual Counting Task: the subject is asked to visualize black Arabic numerals on a white background, sequentially in ascending order, with the previous numeral being erased before the next is written.

• Geometric Figure Rotation Task: the subject is asked to visualize three-dimensional figures rotating about an axis.

Due to its noisy nature, it is difficult to classify brain activity just by visual inspection. Furthermore, it is well known that EEG signals are highly nonstationary [14], ruling out the use of most classical frequency-domain techniques. Most time-domain techniques, like AR models, are also prone to the same criticism, since they rely strongly on the hypothesis of stationarity of the signal [4]. This difficulty can be alleviated if we assume piecewise stationarity of EEG signals [2]. In this paper we also make use of the assumption of piecewise stationarity to justify the successful application of Welch's periodogram method [18] as a preprocessing method for EEG data classification. This is possible because, to apply Welch's procedure, the original nonstationary EEG signal is segmented into shorter (quasi-)stationary sequences. The periodogram of each sequence is computed and then averaged to obtain the final result. A by-product of this procedure is a reduction in the length of the sequences to be processed by the classifiers, as we briefly describe next.
3. FEATURE EXTRACTION VIA WELCH'S PERIODOGRAM

Welch's procedure [18] for estimating the PSD of a stochastic signal combines windowing and averaging in order to obtain a smooth spectral estimate, free of the random fluctuations resulting from the estimation process itself [16].
The original data sequence of each channel is divided into a number $K$ of possibly overlapping segments. A window $v[n]$ is defined over each of these segments and the corresponding periodograms are computed and then averaged. If $x^{(k)}[n]$ represents the sample $x[n]$ of the $k$-th data segment (of length $N$), then the modified periodogram for that segment is defined as

$$\hat{P}_x^{(k)}(\omega) = \frac{1}{N}\left|\sum_{n=0}^{N-1} v[n]\,x^{(k)}[n]\,e^{-j\omega n}\right|^2, \qquad k = 1,\ldots,K \qquad (1)$$
where $\omega = 2\pi f$ (in rad/s) is the angular frequency, and the window $v$ should obey the following normalization property: $(1/N)\sum_{n=0}^{N-1} v[n]^2 = 1$. Then the estimate of the PSD of the signal, for each frequency $\omega$, is taken as

$$\hat{S}_x(\omega) = \frac{1}{K}\sum_{k=1}^{K} \hat{P}_x^{(k)}(\omega) \qquad (2)$$
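To make the procedure concrete, the following sketch implements Eqs. (1)-(2) directly in Python/NumPy. It is an illustration only, not the code used in the experiments: the names (welch_psd, gaussian_window, seg_len, sigma) are ours, and the parameterization of the Gaussian window is an assumption.

```python
import numpy as np

def gaussian_window(N, sigma=0.4):
    """Gaussian window of length N, scaled so that (1/N) * sum(v**2) = 1.
    `sigma` is the standard deviation relative to half the window length
    (this parameterization is an assumption, not taken from the paper)."""
    n = np.arange(N) - (N - 1) / 2.0
    v = np.exp(-0.5 * (n / (sigma * (N - 1) / 2.0)) ** 2)
    return v / np.sqrt(np.mean(v ** 2))   # enforce the normalization property

def welch_psd(x, seg_len=250, window=None, n_overlap=0, nfft=256):
    """Average of modified periodograms, Eqs. (1)-(2): split x into K
    (possibly overlapping) segments, window each one, and average the
    squared FFT magnitudes."""
    window = gaussian_window(seg_len) if window is None else window
    step = seg_len - n_overlap
    segments = [x[i:i + seg_len] for i in range(0, len(x) - seg_len + 1, step)]
    pgrams = [np.abs(np.fft.rfft(window * seg, n=nfft)) ** 2 / seg_len
              for seg in segments]        # Eq. (1), with zero-padded FFT
    return np.mean(pgrams, axis=0)        # Eq. (2); length nfft//2 + 1
```

With nfft = 256, each call returns 129 PSD samples per channel, matching the periodogram length used below.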
After preprocessing a total of six EEG signals, corresponding to channels C3, C4, P3, P4, O1 and O2, containing 2500 samples each, we obtain six periodograms of length $N_t/2 + 1 = 129$, where $N_t = 256$ is the number of points of the FFT¹ used to calculate the PSD estimates. The feature vectors for training and testing the classifiers are then composed of periodogram samples collected at a given frequency $\omega$:

$$\mathbf{x}(\omega)^T = [x_1(\omega)\; x_2(\omega)\; x_3(\omega)\; x_4(\omega)\; x_5(\omega)\; x_6(\omega)] = \left[\hat{S}_x^{C3}(\omega)\; \hat{S}_x^{C4}(\omega)\; \hat{S}_x^{P3}(\omega)\; \hat{S}_x^{P4}(\omega)\; \hat{S}_x^{O1}(\omega)\; \hat{S}_x^{O2}(\omega)\right] \qquad (3)$$
Since we are interested in the classification of 5 mental tasks and 129 samples per periodogram are available, we have 5 × 129 = 645 feature vectors per subject. Thus, since data were collected from 5 subjects, a set of 645 × 5 = 3225 feature vectors is made available for the design of the classifiers.
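As a hedged illustration of how Eq. (3) assembles the feature vectors, the sketch below stacks one Welch periodogram per channel, reusing the welch_psd and gaussian_window helpers sketched above; the array layout and the order of the normalization steps (taken from Section 5) are our assumptions.

```python
import numpy as np

CHANNELS = ["C3", "C4", "P3", "P4", "O1", "O2"]

def trial_feature_vectors(eeg_trial, sigma=0.4):
    """eeg_trial: array of shape (6, 2500), one row per channel.
    Returns a (129, 6) array whose row for frequency bin omega is the
    feature vector x(omega) of Eq. (3)."""
    psds = [welch_psd(eeg_trial[c], seg_len=250,
                      window=gaussian_window(250, sigma), nfft=256)
            for c in range(len(CHANNELS))]
    X = np.column_stack(psds)          # shape (129, 6)
    # Section 5: unit-variance normalization, then square-root transform
    X = np.sqrt(X / X.std(axis=0))
    return X
```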
4. NEURAL AND STATISTICAL CLASSIFIERS

For all the classifiers to be described next, the goal is to classify an incoming feature vector into 1 out of 5 classes, corresponding to the 5 mental tasks of interest. A given feature vector $\mathbf{x} \in \mathbb{R}^6$ is then assigned to class $C_k$ if

$$g_k(\mathbf{x}) > g_i(\mathbf{x}), \qquad \forall i \neq k \qquad (4)$$

where $g_i(\cdot)$ is the discriminant function associated with class $C_i$ [17]. In order to help the evaluation of the classifiers, we develop an analysis based on the type of discriminant function each one implements.

¹ FFT = Fast Fourier Transform. Note that since $N < N_t$, zero-padding is used for the purpose of FFT computations.
Thus, in statistical pattern recognition, the condition in (4) is generally implemented through the maximum a posteriori (Bayes) criterion for optimal classification:

$$p(C_k|\mathbf{x}) > p(C_i|\mathbf{x}), \qquad \forall i \neq k \qquad (5)$$

where $p(C_i|\mathbf{x})$ is the posterior density function defining the probability that, given the feature vector $\mathbf{x}$, it belongs to class $C_i$. By means of Bayes' rule, condition (5) can be written as

$$p(\mathbf{x}|C_k)\,p(C_k) > p(\mathbf{x}|C_i)\,p(C_i), \qquad \forall i \neq k \qquad (6)$$

where $p(\mathbf{x}|C_i)$ is the likelihood function of class $C_i$, which gives the probability that it is this class that best "explains" the vector $\mathbf{x}$, and $p(C_i)$ is the prior probability of class $C_i$.
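In code, conditions (4)-(6) amount to an argmax over per-class scores. A minimal sketch follows; the dictionary-of-callables interface is purely illustrative.

```python
import numpy as np

def bayes_decide(x, log_likelihood, log_prior):
    """Eq. (6) as an argmax: choose the class C_k maximizing
    log p(x|C_k) + log p(C_k). `log_likelihood[c]` is a callable
    returning log p(x|c); `log_prior[c]` is log p(c)."""
    classes = list(log_likelihood)
    scores = [log_likelihood[c](x) + log_prior[c] for c in classes]
    return classes[int(np.argmax(scores))]
```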
4.1 The Quadratic Gaussian Classifier (QGC). A classifier designed according to (6) is called a Bayes optimal classifier [13]. Assuming equal prior probabilities and Gaussian likelihood functions for all classes, we get:

$$p(\mathbf{x}|C_i) = \frac{1}{(2\pi)^{q/2}\,|\mathbf{C}_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \mathbf{C}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i)\right) \qquad (7)$$

where $\boldsymbol{\mu}_i = E[\mathbf{x}|C_i]$ is the mean vector and $\mathbf{C}_i = E[(\mathbf{x}-\boldsymbol{\mu}_i)(\mathbf{x}-\boldsymbol{\mu}_i)^T]$ is the covariance matrix of a given class $C_i$, $|\mathbf{C}_i|$ is the determinant of $\mathbf{C}_i$, and $q$ is the dimension of $\mathbf{x}$. Taking the natural logarithm of both sides of (7) and eliminating terms that are independent of the index $i$, since they have no influence on the decision rule in (4), we can write the discriminant function of class $C_i$ as:

$$g_i(\mathbf{x}) = \ln p(\mathbf{x}|C_i) = -\frac{1}{2}\ln(|\mathbf{C}_i|) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \mathbf{C}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \qquad (8)$$

If we further assume a diagonal form for $\mathbf{C}_i$ and a common variance $\sigma^2$ for all components of $\mathbf{x}$, i.e. $\mathbf{C}_i = \sigma^2 \mathbf{I}$, the discriminant function reduces to:

$$g_i(\mathbf{x}) = -\frac{1}{2\sigma^2}(\mathbf{x}-\boldsymbol{\mu}_i)^T(\mathbf{x}-\boldsymbol{\mu}_i) = -\frac{1}{2\sigma^2}\,\|\mathbf{x}-\boldsymbol{\mu}_i\|^2 \qquad (9)$$

where $\|\cdot\|$ is the Euclidean vector norm. It is well known that the discriminant function in (8) generates quadratic decision surfaces between classes, while the one in (9) generates linear decision surfaces [13, 17]; a minimal sketch of a classifier based on (8) is shown below.
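The following NumPy sketch implements the QGC defined by Eq. (8), assuming equal priors; the small ridge term added to each covariance matrix is our own numerical-stability safeguard, not part of the paper.

```python
import numpy as np

class QuadraticGaussianClassifier:
    """Sketch of the QGC of Eq. (8): one Gaussian per class, equal priors."""

    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.mu_, self.inv_cov_, self.logdet_ = {}, {}, {}
        for c in self.classes_:
            Xc = X[y == c]
            self.mu_[c] = Xc.mean(axis=0)
            cov = np.cov(Xc, rowvar=False)
            cov += 1e-6 * np.eye(X.shape[1])   # ridge term (our addition)
            self.inv_cov_[c] = np.linalg.inv(cov)
            self.logdet_[c] = np.linalg.slogdet(cov)[1]
        return self

    def discriminant(self, x, c):
        # g_i(x) = -0.5*ln|C_i| - 0.5*(x-mu_i)^T C_i^{-1} (x-mu_i), Eq. (8)
        d = x - self.mu_[c]
        return -0.5 * self.logdet_[c] - 0.5 * d @ self.inv_cov_[c] @ d

    def predict(self, X):
        return np.array([max(self.classes_,
                             key=lambda c: self.discriminant(x, c))
                         for x in X])
```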
4.2 The Multilayer Perceptron (MLP). The MLP classifier, using the logistic function as nonlinearity, implements very general nonlinear discriminant functions and computes the posterior probabilities in condition (5) directly [13], provided that the training set is large enough and the learning process does not get stuck at a local minimum. For training the MLP, a 5-dimensional target vector represents the desired class $C_k$ of a given feature vector: (class 1) [1 0 0 0 0], (class 2) [0 1 0 0 0], ..., (class 5) [0 0 0 0 1]. However, to speed up convergence, we offset the lower and upper limits of the target vectors by some amount $\varepsilon$; that is, we make the replacements $0 \to \varepsilon$ and $1 \to 1-\varepsilon$, where we adopt $\varepsilon = 0.05$ (see the sketch below). For testing purposes, assuming that output neuron $k$ is indeed approximating the posterior class probability $p(C_k|\mathbf{x})$, we use condition (5) to decide the class assigned to the current input vector $\mathbf{x}$.
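The offset target encoding is straightforward to implement; a small sketch follows (the function name and zero-based label convention are ours).

```python
import numpy as np

EPS = 0.05  # offset used in the paper to speed up convergence

def encode_targets(labels, n_classes=5, eps=EPS):
    """1-of-5 targets with limits offset from {0, 1} to {eps, 1 - eps}."""
    t = np.full((len(labels), n_classes), eps)
    t[np.arange(len(labels)), labels] = 1.0 - eps
    return t

# e.g. encode_targets([0, 2]) -> [[0.95 0.05 0.05 0.05 0.05],
#                                 [0.05 0.05 0.95 0.05 0.05]]
```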
4.3 The Self-Organizing Map (SOM). The Self-Organizing Map (SOM) [10] is an unsupervised neural algorithm widely used in clustering, vector quantization and pattern recognition tasks. Neurons in this network are placed in an output layer, $A$, geometrically arranged in arrays of one, two or three dimensions. In addition, each neuron $i \in A$ has a weight vector $\mathbf{w}_i$ with the same dimension as the input vector $\mathbf{x}$. The SOM learning procedure can be summarized as follows:

1. Search for the winning neuron, $i^*(t)$:

$$i^*(t) = \arg\min_{i \in A}\{g_i(\mathbf{x}(t))\}, \qquad \text{where } g_i(\mathbf{x}(t)) = \|\mathbf{x}(t) - \mathbf{w}_i(t)\| \qquad (10)$$
2. Update the weights:

$$\Delta\mathbf{w}_i(t) = \alpha(t)\,h(i^*, i; t)\,[\mathbf{x}(t) - \mathbf{w}_i(t)] \qquad (11)$$

where $\alpha(t)$ is the learning rate and $h(i^*, i; t) = \exp\left(-\|\mathbf{r}_i(t) - \mathbf{r}_{i^*}(t)\|^2/\sigma^2(t)\right)$ is a Gaussian neighborhood function, in which $\mathbf{r}_i(t)$ and $\mathbf{r}_{i^*}(t)$ are the positions of neurons $i$ and $i^*$ in the output array, respectively. The variables $0 < \alpha(t), \sigma(t) < 1$ should decay in time for convergence purposes.

For classification purposes we have to assign class labels to the neurons of the SOM. This is done in a post-training stage, called the labelling phase [10], in which all the training vectors are presented once more to the SOM and the corresponding winners are found. By voting, the class label of neuron $i$ is the class label of the training vectors for which it was the winner most often. No weight updating is performed at this stage. During testing, the class of an incoming feature vector is the class of the corresponding winning neuron (a minimal sketch of this labelling-and-classification scheme is given below).

It is worth noting that the class decision rule in (10) is computationally equivalent to that in (5) and (9), because the discriminant function depends only on the Euclidean distance. Thus, the SOM-based classifier (SBC) is essentially a linear Gaussian classifier (LGC). However, the LGC uses only one discriminant function per class, while the SBC generally uses more than one neuron per class (the exact number depends on the labelling phase). Hence, the SBC is better understood as a piecewise linear Gaussian classifier [17].
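A minimal sketch of the labelling phase and the SBC decision rule, assuming the SOM weight vectors have already been trained (the flat array layout is our assumption):

```python
import numpy as np
from collections import Counter

def label_som(weights, X_train, y_train):
    """Labelling phase: assign to each neuron the majority class of the
    training vectors for which it is the winner. `weights` is an
    (n_neurons, dim) array of trained SOM prototypes."""
    votes = [Counter() for _ in range(len(weights))]
    for x, y in zip(X_train, y_train):
        winner = np.argmin(np.linalg.norm(weights - x, axis=1))  # Eq. (10)
        votes[winner][y] += 1
    # neurons that never win get the dummy label -1
    return np.array([v.most_common(1)[0][0] if v else -1 for v in votes])

def som_classify(weights, neuron_labels, x):
    """Class of a test vector = class of its winning neuron."""
    return neuron_labels[np.argmin(np.linalg.norm(weights - x, axis=1))]
```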
5. SIMULATION RESULTS

The PSD of each EEG signal was estimated using Welch's periodogram, as described in Section 3, over equally spaced non-overlapping segments of the signal. A Gaussian window was chosen for computing Welch's periodogram. The length of each segment and, hence, of the Gaussian window was set to N = 250 points, corresponding to one second (1 s) of brain activity. Each set of periodogram samples is normalized to unit variance and the square root² of the amplitudes is taken before using them to build the feature vectors. We further computed periodogram samples for 6 different values of the standard-deviation of the Gaussian window, thus augmenting the number of feature vectors to 6 × 3225 = 19350. From this total, 80% are used for training the classifiers and 20% for testing them (hold-out validation). To estimate the classification rates we performed 100 runs of the training/testing procedure, randomly selecting feature vectors for the training and testing sets at each run (a minimal sketch of this protocol is given below). Classification rates are given per subject and per standard-deviation of the Gaussian window (standard deviations of the rates in brackets), as shown in Tables 1 to 3. The overall classification rate of a classifier is averaged over the five subjects. For these simulations the training parameters were the following:

MLP: 6 input neurons, 30 hidden neurons and 5 output neurons were used. The learning rate and momentum factor were set to 0.35 and 0.85, respectively. Hidden and output neurons have logistic activation functions. The Levenberg-Marquardt method was used to train the MLP according to the input-output representation described in Sections 3 and 4. A training run is stopped when MSE ≤ 0.001 or 1000 epochs are reached. Only training runs which converged (i.e. MSE ≤ 0.001) were used to compute the classification rates during the testing phase. Convergence occurred for 60% of the training runs.

SOM: 6 input neurons and 112 output neurons in a 28 × 4 rectangular grid were used. Initial and final learning rates were set to 0.1 and 0.0001, respectively, with an exponential decay between these values. Training is carried out in batch mode and each run lasted 1000 epochs.

From the tables we infer that the MLP classifier performed much better than the QGC and SBC algorithms. It is worth noting that even the SBC algorithm performed, on average, better than previously reported classifiers for the case of 5 mental tasks. For instance, in [1], Anderson and Sijercic, also using the MLP, were able to identify which of five mental tasks a person was performing with only about 70% accuracy for two out of the four subjects tested, and near 40% for the other two. Using a smaller number of mental tasks, Anderson et al. [2] achieved classification rates ranging from 86.1% to 91.4%, but only for the classification of two mental tasks with the MLP classifier. Palaniappan et al. [12] reported classification rates higher than 94%, but only for three mental tasks, using the Fuzzy ARTMAP classifier.

The superior performance of the MLP can be explained by recalling that the designs of the QGC and SBC algorithms (Section 4) are based on the assumption of Gaussianity of the data. However, Johnson and Long [6] recently demonstrated that the probability density of the spectral estimates computed by Welch's periodogram is highly non-Gaussian. Since the MLP classifier makes no assumption about the distribution of the preprocessed data, it is able to perform better in the non-Gaussian case, building very general nonlinear discriminant functions. In turn, the QGC performs better than the SBC because it extracts extra information from the data by computing the covariance matrix of each class.

² The distribution of sample values of a random variable can be made more Gaussian by applying a simple square root transformation.
Table 1: Classification rates and standard-deviations (QGC algorithm).

                      Standard-Deviation of Gaussian Window
            0.1          0.2          0.4          0.6          0.8          0.9
Subject 1   84.8 [2.68]  97.7 [1.09]  88.7 [2.67]  99.7 [0.60]  79.9 [3.34]  80.6 [3.03]
Subject 2   99.9 [0.35]  95.7 [1.40]  80.7 [3.40]  77.4 [3.13]  74.4 [3.64]  74.5 [3.27]
Subject 3   99.8 [0.50]  99.9 [0.08]  100.0 [0.00] 99.6 [0.90]  98.3 [1.39]  97.8 [1.42]
Subject 4   99.7 [0.56]  99.7 [0.52]  97.7 [1.19]  94.5 [1.76]  94.3 [1.88]  93.7 [1.89]
Subject 5   99.8 [0.52]  99.9 [0.30]  99.2 [0.77]  98.8 [0.97]  99.2 [1.00]  99.3 [0.99]
Average     96.8 [0.92]  98.6 [0.68]  93.3 [1.61]  94.0 [1.47]  89.2 [2.25]  89.2 [2.12]
Table 2: Classification rates and standard-deviations (SBC algorithm).
                      Standard-Deviation of Gaussian Window
            0.1          0.2          0.4          0.6          0.8          0.9
Subject 1   70.3 [3.88]  72.2 [4.15]  69.8 [3.76]  74.9 [3.38]  68.7 [4.31]  69.2 [3.84]
Subject 2   56.7 [4.40]  51.7 [3.99]  54.7 [3.72]  56.5 [4.17]  56.9 [4.62]  56.2 [4.05]
Subject 3   79.6 [3.05]  82.5 [3.11]  84.4 [3.28]  85.7 [3.38]  86.3 [3.28]  86.5 [2.94]
Subject 4   84.4 [3.50]  83.2 [3.09]  83.2 [3.71]  80.1 [3.17]  81.8 [3.43]  81.3 [3.47]
Subject 5   78.3 [3.51]  77.4 [3.84]  78.8 [3.65]  81.1 [3.04]  79.6 [3.23]  79.5 [3.65]
Average     73.8 [3.67]  73.4 [3.63]  74.2 [3.62]  75.7 [3.43]  74.7 [3.77]  74.6 [3.59]
Table 3: Classification rates and standard-deviations (MLP algorithm).
                      Standard-Deviation of Gaussian Window
            0.1          0.2          0.4          0.6          0.8          0.9
Subject 1   89.3 [3.27]  98.1 [1.45]  93.1 [2.87]  99.4 [0.27]  85.3 [2.88]  86.5 [3.16]
Subject 2   99.2 [0.22]  97.9 [1.65]  90.4 [2.86]  86.6 [3.11]  84.1 [3.81]  82.6 [3.24]
Subject 3   99.7 [0.13]  99.5 [0.72]  97.3 [1.92]  96.7 [2.12]  95.6 [2.25]  96.8 [1.56]
Subject 4   99.0 [0.26]  98.7 [0.99]  95.9 [2.09]  94.0 [1.99]  92.6 [2.30]  92.6 [1.88]
Subject 5   99.3 [0.77]  98.7 [1.16]  96.5 [1.52]  95.9 [1.55]  95.4 [1.61]  95.4 [2.30]
Average     97.3 [0.93]  98.6 [1.19]  94.6 [2.25]  94.5 [1.81]  90.6 [2.57]  90.8 [2.43]
The last set of simulations aims to give a rough idea of the "speed of knowledge acquisition" of each algorithm, i.e., how much information from the inputs it needs to achieve a reasonable classification rate. Figure 2 shows the results, which were obtained by averaging the classification rates (computed on the remaining testing vectors) over 100 training runs for each size (as a fraction of the total number of vectors) of the training set. It is worth noting that both the MLP and the QGC require less information to discriminate the data than the SBC algorithm, and the MLP classifier achieved higher classification rates than the QGC and SBC algorithms. As a general conclusion for this simulation, one can say that the usual approach of partitioning the data vectors into two sets, using 75-80% of the available vectors for training and the remaining 25-20% for testing, may be too conservative for some classifiers (e.g. MLP and QGC). This approach indeed favors the SBC algorithm.

Figure 2: Sensitivity of the classifiers to the size of the training set (classification accuracy, in %, versus training-set fraction from 0.1 to 0.9, for the MLP, QGC and SOM classifiers).
6. CONCLUSION

This paper aimed to find a concise representation of EEG data, corresponding to five mental tasks performed by different individuals, for classification purposes. For that, we proposed the use of Welch's periodogram method as a powerful feature extractor and compared the performance of SOM- and MLP-based neural classifiers with that of the Quadratic Gaussian (optimal) classifier. The results have shown that Welch's periodogram allowed all classifiers to achieve higher classification rates than those presented so far in the literature.
Acknowledgment

The authors would like to thank CNPq (DCR: 305275/2002-0) and the Federal University of Ceará (UFC) for the financial support.

REFERENCES

[1] C. Anderson and Z. Sijercic, "Classification of EEG signals from four subjects during five mental tasks," in Proceedings of the International Conference on Engineering Applications of Neural Networks (EANN), London, England, 1996, pp. 407-414.

[2] C. W. Anderson, E. A. Stolz and S. Shamsunder, "Multivariate autoregressive models for classification of spontaneous electroencephalographic signals during mental tasks," IEEE Transactions on Biomedical Engineering, vol. 45, no. 3, pp. 277-286, 1998.

[3] D. Garrett, D. A. Peterson, C. W. Anderson and M. H. Thaut, "Comparison of linear, nonlinear and feature selection methods for EEG signal classification," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 2, pp. 141-144, 2003.

[4] N. Hazarika, J. Z. Chen, A. C. Tsoi and A. Sergejew, "Classification of EEG signals using the wavelet transform," Signal Processing, vol. 59, pp. 61-72, 1997.

[5] H. Jasper, "The ten-twenty electrode system of the international federation," Electroencephalography and Clinical Neurophysiology, vol. 10, pp. 371-375, 1958.

[6] P. E. Johnson and D. G. Long, "The probability density of spectral estimates based on modified periodogram averages," IEEE Transactions on Signal Processing, vol. 47, no. 5, pp. 2429-2438, 1999.

[7] S.-L. Joutsiniemi, S. Kaski and T. A. Larsen, "Self-organizing map in recognition of topographic patterns of EEG spectra," IEEE Transactions on Biomedical Engineering, vol. 42, no. 11, pp. 1062-1068, 1995.

[8] Z. A. Keirn and J. I. Aunon, "A new mode of communication between man and his surroundings," IEEE Transactions on Biomedical Engineering, vol. 37, pp. 1209-1214, 1990.

[9] R. E. Kingsley, Concise Text of Neuroscience, Baltimore: Lippincott Williams and Wilkins, 2nd edn., 2000.

[10] T. Kohonen, Self-Organizing Maps, Berlin: Springer-Verlag, 2nd edn., 1997.

[11] K.-R. Müller, C. W. Anderson and G. E. Birch, "Linear and nonlinear methods for brain-computer interfaces," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 11, no. 2, pp. 165-169, 2003.

[12] R. Palaniappan, R. Paramesran, S. Nishida and N. Saiwaki, "A new brain-computer interface design using Fuzzy ARTMAP," IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 10, no. 3, pp. 141-148, 2002.

[13] J. C. Príncipe, N. R. Euliano and W. C. Lefebvre, Neural and Adaptive Systems: Fundamentals through Simulations, John Wiley & Sons, 2000.

[14] R. M. Rangayyan, Biomedical Signal Analysis: A Case-Study Approach, Wiley-Interscience, 2nd edn., 2002.

[15] A. C. K. Soong and Z. J. Koles, "Principal-component localization of the sources of the background EEG," IEEE Transactions on Biomedical Engineering, vol. 42, no. 1, pp. 59-67, 1995.

[16] C. W. Therrien, Discrete Random Signals and Statistical Signal Processing, New Jersey: Prentice-Hall, 1992.

[17] A. Webb, Statistical Pattern Recognition, Wiley & Sons, 2002.

[18] P. D. Welch, "The use of the fast Fourier transform for the estimation of power spectra," IEEE Transactions on Audio and Electroacoustics, vol. 15, pp. 70-73, 1967.