Pattern Recognition Letters xxx (2007) xxx–xxx www.elsevier.com/locate/patrec
An experimental evaluation of ensemble methods for EEG signal classification

Shiliang Sun a,*, Changshui Zhang b, Dan Zhang b

a Department of Computer Science and Technology, East China Normal University, 3663 Zhongshan (North) Road, Shanghai 200062, China
b Department of Automation, Tsinghua University, Beijing 100084, China

Received 12 March 2007; received in revised form 19 June 2007
Communicated by R.C. Guido
Abstract

Ensemble learning for improving weak classifiers is an important direction in current machine learning research, and bagging, boosting and random subspace are three powerful and popular representatives of it. They have so far proven effective in many practical classification problems. However, for electroencephalogram (EEG) signal classification with application to brain–computer interfaces (BCIs), there are almost no studies investigating their feasibility. The present study systematically evaluates the performance of the three ensemble methods for EEG signal classification of mental imagery tasks. With k-nearest-neighbor, decision tree and support vector machine as base classifiers, classification experiments are carried out on real EEG recordings. The experimental results suggest that ensemble classification methods are feasible, and we also derive some valuable conclusions on the performance of ensemble methods for EEG signal classification.

© 2007 Elsevier B.V. All rights reserved.

Keywords: Brain–computer interface (BCI); EEG signal classification; Bagging; Boosting; Random subspace
1. Introduction

The primary goal of brain–computer interface (BCI) research is to provide severely motor-disabled people with a communication and control channel that connects the brain directly to external devices, without the participation of peripheral nerves and muscles. During the last few years, it has attracted increasing interest from different fields, e.g., neuroscience, biomedical engineering, clinical rehabilitation and computer science (Ebrahimi et al., 2003; Nicolelis, 2001; Vaughan, 2003; Wolpaw et al., 2000, 2002). Each of these disciplines contributes to BCI research from a different point of view, and improvements in BCI technology benefit from this interdisciplinary cooperation.
* Corresponding author. Tel.: +86 21 62232485; fax: +86 21 62861049. E-mail addresses: [email protected], [email protected] (S. Sun).
Classification methods or translation algorithms, which convert electrophysiological inputs from users into external device commands, are a central component of a BCI (Vaughan, 2003). In the present study, we focus on the problem of electroencephalogram (EEG) signal classification for application in general EEG-based BCIs. EEG signals are electrical brain activities recorded from electrodes placed on the scalp. Compared to magnetoencephalography (MEG), optical imaging, positron emission tomography (PET) and functional magnetic resonance imaging (fMRI), electroencephalography is a relatively inexpensive and convenient means of monitoring the brain's activities. Although the recorded EEG signals may suffer from a low signal-to-noise ratio, electroencephalography is currently a widely accepted (ethical and non-invasive) way to access brain signals (Wolpaw et al., 2002). With respect to EEG signal classification, the applicability of many linear and nonlinear single classifiers has
already been assessed, such as Fisher discriminant analysis, neural networks, support vector machines, hidden Markov models, Bayes classifiers, and source analysis (Garrett et al., 2003; Kamousi et al., 2005; Millán et al., 2004; Müller et al., 2003; Obermaier et al., 2001; Sun and Zhang, 2005, 2006a,b). However, it may be difficult to build a good single classifier if the data are of high dimensionality and the training set is comparatively small. Usually, a classifier built on a training set that is small compared to the data dimensionality is biased and has a large variance due to the insufficient estimation of the related parameters (Skurichina and Duin, 2002). As a consequence, such a classifier may be weak (having a poor performance) and unstable (changing greatly with respect to the choice of training sets). Skurichina and Duin (2002) have summed up the factors causing bad performance of a classifier:

• incorrect model assumptions when building a classifier,
• too low a complexity of the classification rule to solve the problem at hand,
• incorrect settings or estimations of classifier parameters, and
• instability of the classifier.

The concept "weak classifier" can refer to any of the above cases. Ensemble learning, which builds many weak classifiers and then combines their outcomes, is among the approaches to improving the performance of a weak classifier.

For EEG signal classification used in BCIs, the phenomenon of high dimensionality combined with relatively small training sets is indeed widespread. Usually, the EEG signals are recorded from an electrode cap with many electrodes, such as 64, 128, or 256. Even if some electrodes unrelated to the desired task are removed, the number of remaining electrodes may still be large. The data dimensionality is composed of features from these electrodes, and is therefore often high. Moreover, BCI users always wish to learn to control a BCI in as little time as possible. This suggests a guideline for BCI designers: train a classifier with less time, namely less training data. For these reasons, the obtained EEG training sets tend to be of high dimensionality and comparatively small size. Consequently, ensemble learning can in principle play an important role in EEG signal classification. Besides, there is another reason justifying ensemble learning for EEG signal classification: EEG signals are inherently time-varying. This characteristic makes it hazardous to employ only one classifier trained on the given training data to predict the labels of forthcoming samples. Despite this, as one of the principal current directions in machine learning research (Dietterich, 1997), ensemble learning has not received enough attention in previous BCI studies. To the best of
our knowledge, there is almost no literature systematically reporting results of EEG signal classification by ensemble methods. As a consequence, it is not known whether ensemble methods are effective for EEG signal classification, and if so, to what extent. Besides, comparisons among different ensemble methods for EEG signal classification are also necessary to help practitioners pick an appropriate one. The present study is a first attempt to fill these gaps. Three well-known and popular ensemble learning methods, bagging (Breiman, 1996a), boosting (Freund and Schapire, 1997) and random subspace (Ho, 1998), are employed for an empirical evaluation of EEG signal classification. Although they have demonstrated their effectiveness in a variety of application problems, it is not easy to judge in general which one is best (Dietterich, 2000). Nevertheless, for a specific task, e.g., EEG signal classification, it is plausible to give a relative performance ranking, at least to some extent.

2. Methods

To assess ensemble methods for EEG signal classification, three widely used classifiers, k-nearest-neighbor, C4.5 decision tree and support vector machine, are adopted as base classifiers. In this section, after briefly reviewing the three ensemble classification methods bagging, boosting and random subspace, we provide a brief note on the base classifiers. Finally, we report the existing theoretical insight justifying ensemble classification, which will help explain the subsequent experimental results.

2.1. Ensemble classification methods

Bagging: The bagging predictor, an acronym of "bootstrap aggregating" introduced by Breiman (1996a), is a popular ensemble approach which uses the bootstrap sampling technique to manipulate the training data. Each time, it selects Ntrain samples at random with replacement from the original training set of Ntrain samples to learn an individual classifier. The prediction of a test sample by the ensemble is given by uniform majority voting over the individual classifiers. Theoretically, if the bootstrap can induce significant differences among the constructed individual classifiers, the accuracy of bagging will improve greatly (Breiman, 1996a).
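To make the procedure concrete, the following minimal sketch (ours, not part of the original paper; Python with NumPy and scikit-learn is used purely as an illustrative tool, not the software used by the authors) implements bootstrap resampling with uniform majority voting, assuming integer-coded class labels.

```python
import numpy as np
from sklearn.base import clone
from sklearn.neighbors import KNeighborsClassifier

def bagging_fit(base, X_train, y_train, n_estimators=25, rng=None):
    """Train an ensemble of base classifiers on bootstrap resamples of the training set."""
    rng = np.random.default_rng(rng)
    n = X_train.shape[0]
    members = []
    for _ in range(n_estimators):
        # Draw Ntrain indices at random *with replacement*.
        idx = rng.integers(0, n, size=n)
        members.append(clone(base).fit(X_train[idx], y_train[idx]))
    return members

def majority_vote(members, X_test):
    """Uniform majority voting over the individual classifiers.
    Class labels are assumed to be integer-coded (e.g. 0, 1, 2 for C1-C3)."""
    votes = np.stack([clf.predict(X_test) for clf in members])  # shape (n_estimators, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T.astype(int)])

# Hypothetical usage with the 5-nearest-neighbor base classifier used in the paper:
# members = bagging_fit(KNeighborsClassifier(n_neighbors=5), X_train, y_train)
# y_pred = majority_vote(members, X_test)
```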
Boosting (AdaBoost): The AdaBoost family of algorithms, also known as boosting, is another category of powerful ensemble methods (Freund and Schapire, 1997). It explicitly alters the distribution of training data fed to each individual classifier, specifically the weight of each training sample. Initially the weights are uniform over all the training samples. During the boosting procedure, they are adjusted after the training of each classifier is completed: the weights of misclassified samples are increased, while those of correctly classified samples are decreased. The final ensemble is constructed by combining the individual classifiers according to their own accuracies. In the present study, the AdaBoost.M1 algorithm is adopted (Freund and Schapire, 1997).

Random Subspace (RanSub): The random subspace ensemble method, proposed by Ho (1998), applies random selection of feature subspaces to construct individual classifiers. This method can take advantage of high dimensionality, and is an effective countermeasure against the traditional problem of the curse of dimensionality (Ho, 1998). Its merit can be attributed to the high ensemble diversity, which compensates for the possible deficiency of accuracy in the individual classifiers (Tsymbal et al., 2005). In random subspace, feature subspaces are picked at random from the original feature space, and individual classifiers are built only on the attributes in the chosen feature subspaces, using the original training set. The outputs of the different individual classifiers are combined by uniform majority voting to give the final prediction. For random subspace, how to select the optimal dimensionality of the feature subspaces is still an open problem. In the present study we fix the size of the feature subspaces at 50% of the number of original attributes, since Ho has shown that for a variety of data sets adopting half of the feature components usually yields good performance (Ho, 1998).

The ensemble sizes for bagging, boosting and random subspace are all taken as 25, since it has been shown that for many ensemble classification problems the biggest gain in accuracy is already obtained with this number of individual classifiers (Bauer and Kohavi, 1999; Tsymbal et al., 2005). Besides, considering that the ensemble methods involve random sampling, we run each ensemble 5 times and use the average outcome over the 5 rounds for performance evaluation.

2.2. Base classifiers

K-Nearest-Neighbor (KNN): The KNN classifier is a traditional classification rule which assigns a test sample the majority label of its k nearest neighbors from the training set (Duda et al., 2000). As the purpose of this paper is to evaluate the performance of ensemble methods, there is no need to tune the parameter k elaborately. Therefore, we simply set k to 5; that is, we use the 5-nearest-neighbor rule for classification.

C4.5 Decision Tree: The C4.5 algorithm is the most popular in a series of tree-growing methods for classification, e.g., CART and ID3 (Duda et al., 2000). In this paper, the Weka implementation of C4.5 with the default configuration is used (Witten and Frank, 2005).

Support Vector Machine (SVM): The support vector machine classifier adopts the large margin principle to construct decision rules, with a solid foundation in statistical learning theory (Vapnik, 1998). Depending on the choice of kernel function, different classifiers, including linear and nonlinear ones, can be constructed. For the convenience of evaluation we simply employ linear classifiers and set the penalty parameter in SVM training (useful when the training data are not linearly separable) to 1.
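For reference, the three base classifiers configured as above could be instantiated roughly as in the hedged sketch below; scikit-learn's DecisionTreeClassifier only approximates Weka's C4.5 (J48) and is used here as a stand-in.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC

# Base classifiers configured as in Section 2.2 (illustrative sketch):
knn = KNeighborsClassifier(n_neighbors=5)   # 5-nearest-neighbor rule
tree = DecisionTreeClassifier()             # stand-in for Weka's C4.5 (J48), default settings
svm = LinearSVC(C=1.0)                      # linear SVM, penalty parameter C = 1
```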
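Complementing the bagging sketch above, the next fragment (again an illustrative sketch under the same assumptions, not the authors' implementation) outlines the AdaBoost.M1 sample-weight update described in Section 2.1 and the random feature-subset selection used by random subspace.

```python
import numpy as np
from sklearn.base import clone

def adaboost_m1_fit(base, X, y, n_estimators=25):
    """AdaBoost.M1 (Freund and Schapire, 1997): reweight training samples after each round."""
    n = X.shape[0]
    w = np.full(n, 1.0 / n)                  # initially uniform sample weights
    members, alphas = [], []
    for _ in range(n_estimators):
        clf = clone(base).fit(X, y, sample_weight=w)   # base learner must accept sample_weight
        wrong = clf.predict(X) != y
        eps = np.sum(w[wrong])               # weighted training error
        if eps <= 0 or eps >= 0.5:           # M1 stops when the weak learner is perfect or too weak
            break
        beta = eps / (1.0 - eps)
        w[~wrong] *= beta                    # shrink weights of correctly classified samples
        w /= w.sum()                         # renormalize (equivalently, misclassified weights grow)
        members.append(clf)
        alphas.append(np.log(1.0 / beta))    # classifier votes are weighted by their accuracy
    return members, alphas

def random_subspaces(n_features, n_estimators=25, frac=0.5, rng=None):
    """Random subspace: each ensemble member sees a random 50% subset of the feature indices."""
    rng = np.random.default_rng(rng)
    k = int(round(frac * n_features))
    return [rng.choice(n_features, size=k, replace=False) for _ in range(n_estimators)]
```

For base learners that do not accept sample weights (e.g., KNN), the reweighting is commonly realized by weighted resampling of the training set instead.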
2.3. Theoretical insights

How to rigorously explain the effectiveness of ensemble classification with the discrete outputs of individual classifiers has long been intractable, though it is widely acknowledged that an effective ensemble classification system should consist of individuals that are not only highly accurate but also diverse. To this end, Breiman (2001) brings bagging, boosting and random subspace into his framework of random forests and derives a bound on the error rate. According to Breiman (2001), the upper bound on the generalization error of the family of random forest ensemble methods is given as

PE ≤ ρ̄ (1 − s²)/s²,   (1)

where s stands for the strength of the individual classifiers and ρ̄ is the averaged correlation between them. This inequality shows that a large s and a small ρ̄ lead to a low upper bound on the generalization error.
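As a purely illustrative example (the numbers are hypothetical, not measured on the EEG data): individually weak but highly correlated members with s = 0.4 and ρ̄ = 0.3 give PE ≤ 0.3 × (1 − 0.16)/0.16 ≈ 1.58, a vacuous bound, whereas stronger and less correlated members with s = 0.7 and ρ̄ = 0.2 give PE ≤ 0.2 × (1 − 0.49)/0.49 ≈ 0.21. The bound thus only becomes informative when strength is high and correlation is low, which is the reading used later to interpret the experimental results.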
3. Experiments

The data set used contains EEG recordings from 3 normal subjects (denoted by S1, S2 and S3, respectively) during 3 kinds of mental imagery tasks: imagination of repetitive self-paced left hand movements (class C1), imagination of repetitive self-paced right hand movements (class C2), and generation of different words beginning with the same random letter (class C3) (Chiappa and Millán, 2005; Millán, 2004). The subjects sit in a normal chair with relaxed arms resting on their legs. For a given subject, there are 4 non-feedback sessions recorded on the same day. Each session lasts about 4 min, with breaks of 5–10 min in between. The subjects perform a given task for about 15 s and then switch randomly to the next task, which is appointed by the operator. The raw EEG potentials (sampling rate 512 Hz) are first spatially filtered by means of a surface Laplacian (Millán, 2004). To provide an idea of the nature of the signals, Fig. 1 gives an illustration of the EEG signals recorded on channel C3 during the three kinds of mental imagery tasks. Then, every 62.5 ms the power spectral density in the band 8–30 Hz is estimated over the last second of data with a frequency resolution of 2 Hz for the 8 centro-parietal channels (intimately related to the current mental tasks) C3, Cz, C4, CP1, CP2, P3, Pz and P4. As a result, 12 frequency components are obtained from each channel, and an EEG sample whose features come from the 8 channels is a 96-dimensional vector. The entire preprocessing mechanism is summarized in Fig. 2. The present study uses these 96-dimensional data as input.
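A minimal sketch of this feature extraction step is given below. It is not the authors' code; scipy.signal.welch is assumed here as one reasonable PSD estimator yielding a 2 Hz resolution from a 1 s window, and the exact estimator of the original preprocessing may differ.

```python
import numpy as np
from scipy.signal import welch

FS = 512          # sampling rate (Hz)
CHANNELS = 8      # C3, Cz, C4, CP1, CP2, P3, Pz, P4

def psd_features(window):
    """window: array of shape (CHANNELS, FS) -- the last second of data for the 8 channels.
    Returns a 96-dimensional feature vector (8 channels x 12 components in 8-30 Hz)."""
    feats = []
    for ch in window:
        # 256-sample segments give a 2 Hz frequency resolution (512 / 256 = 2 Hz).
        f, pxx = welch(ch, fs=FS, nperseg=256)
        band = (f >= 8) & (f <= 30)      # 8, 10, ..., 30 Hz -> 12 components per channel
        feats.append(pxx[band])
    return np.concatenate(feats)         # length 8 * 12 = 96

# Hypothetical usage: slide over the recording in 62.5 ms (32-sample) steps.
# samples = [psd_features(eeg[:, t - FS:t]) for t in range(FS, eeg.shape[1], 32)]
```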
Fig. 1. An illustration of the EEG signals recorded on channel C3 during the three kinds of mental imagery tasks. The time horizon is half a second and thus includes 256 sampling points. The vertical axis is the voltage value in μV. (a) Class C1; (b) Class C2; (c) Class C3.
The numbers of samples in the 4 sessions for S1, S2 and S3 are respectively 3488/3472/3568/3504, 3472/3456/3472/3472, and 3424/3424/3440/3488 (see Chiappa and Millán, 2005). For the purpose of evaluation, we construct 9 data sets from the above data of the three subjects S1, S2 and S3. Each of the 9 data sets consists of a training set and a test set. The first 3 data sets are formed using the data of the four sessions from S1, and so on, for a total of 9 data sets. Specifically, data sets 1, 2 and 3 are composed of sessions 1–2, 2–3 and 3–4 of S1, respectively. For each data set, the former session serves as the training set and the latter session as the test set. Data sets 4–9 are constructed in the same way from subjects S2 and S3.
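For concreteness, this session pairing can be written out as in the following sketch (illustrative only; `sessions` is a hypothetical container holding the preprocessed 96-dimensional samples and labels of each session).

```python
# sessions[subject][i] = (X_i, y_i) for session i (0-based) of that subject -- hypothetical layout.
def build_datasets(sessions):
    """Return the 9 (train, test) pairs: consecutive sessions 1-2, 2-3, 3-4 per subject."""
    datasets = []
    for subject in ("S1", "S2", "S3"):
        for i in range(3):                   # session pairs (1,2), (2,3), (3,4)
            train = sessions[subject][i]     # former session -> training set
            test = sessions[subject][i + 1]  # latter session -> test set
            datasets.append((train, test))
    return datasets                          # data sets 1-9 in the paper's numbering
```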
The Bayes classifier is a classical method which has recently been applied to EEG signal classification (Millán, 2004; Sun and Zhang, 2005). Therefore, besides the methods discussed above, we also conduct experiments with the Bayes classifier to provide an auxiliary reference. Specifically, Gaussian mixture models (GMMs) are used to approximate the probability density of each class, and test samples are then classified according to their posterior probabilities. The parameters of the GMMs are initialized as in Millán (2004); the k-means clustering method is used to estimate the means and variances of the GMMs, which are further optimized by gradient descent on the corresponding training set (Sun and Zhang, 2005).
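The following sketch shows the general idea of this reference classifier: one GMM per class, with the class posterior used for prediction. It is only a rough stand-in; scikit-learn's GaussianMixture uses k-means initialization but refines the parameters by EM rather than by the gradient-based procedure of Millán (2004), so it does not reproduce the authors' setup exactly, and the number of mixture components below is an illustrative choice.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_class_gmms(X, y, n_components=4):
    """Fit one GMM per class; n_components is an illustrative choice, not the paper's setting."""
    classes = np.unique(y)
    gmms, priors = {}, {}
    for c in classes:
        gmm = GaussianMixture(n_components=n_components, init_params="kmeans")
        gmms[c] = gmm.fit(X[y == c])
        priors[c] = np.mean(y == c)
    return classes, gmms, priors

def predict_bayes(classes, gmms, priors, X_test):
    """Classify by the largest (log) posterior: log p(x|c) + log p(c)."""
    scores = np.column_stack([gmms[c].score_samples(X_test) + np.log(priors[c]) for c in classes])
    return classes[np.argmax(scores, axis=1)]
```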
Fig. 2. The block diagram for the entire preprocessing process. PSD denotes power spectral density.

Table 2
A win–loss–tie comparison between different ensemble methods

                     Single   Bagging  AdaBoost
KNN
  Single (53.06)
  Bagging (53.12)    4–5–0
  AdaBoost (51.90)   0–8–1    1–8–0
  RanSub (56.35)     9–0–0    9–0–0    9–0–0
C4.5
  Single (50.16)
  Bagging (56.31)    8–1–0
  AdaBoost (55.49)   8–1–0    2–7–0
  RanSub (56.14)     7–2–0    4–5–0    5–4–0
SVM
  Single (56.37)
  Bagging (56.47)    7–2–0
  AdaBoost (57.21)   6–1–2    6–3–0
  RanSub (55.20)     1–8–0    0–8–1    0–9–0
Bayes (53.45)
The experimental results of ensemble classification with the different base classifiers on these 9 data sets are given in Table 1, where the results of the Bayes classifier are also listed. Therein 'Single' means an individual base classifier trained on all the intact training data. Note that since the present classification problem involves three classes, the expected accuracy rate of random guessing is 33.33%. Besides, to provide a general view of the performance of the different ensemble classification methods ('Single' can be regarded as a special ensemble method), we carry out a win–loss–tie comparison between every pair of ensemble methods based on Table 1, which is shown in Table 2. A result in bold means that the performance difference between the corresponding algorithms is statistically significant at the 95% confidence level under a one-tailed paired t-test. For example, in Table 2 with the base classifier KNN the win–loss–tie value between 'RanSub' and 'Single' is 9–0–0, which means that all 9 results of random subspace are superior to those of the single KNN classifier, and the superiority is statistically significant. Moreover, as an auxiliary comparison, the averaged accuracy rates across the 9 data sets are also listed in Table 2 (in parentheses after the method names). From Table 2, we draw the following conclusions about the performance of the different ensemble methods for EEG signal classification. Because ensemble performance can depend on the selection of base classifiers, and the parameters of each base classifier usually admit a great many options, in this paper we do not assess the relative performance of ensemble methods across different base classifiers.
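The win–loss–tie counting and the significance test can be reproduced along the following lines (an illustrative sketch; scipy's ttest_rel returns a two-sided p-value, so it is halved here for the one-tailed test, and the direction of the difference is checked explicitly).

```python
import numpy as np
from scipy.stats import ttest_rel

def win_loss_tie(acc_a, acc_b):
    """Compare two methods' accuracies over the 9 data sets."""
    acc_a, acc_b = np.asarray(acc_a), np.asarray(acc_b)
    return int(np.sum(acc_a > acc_b)), int(np.sum(acc_a < acc_b)), int(np.sum(acc_a == acc_b))

def one_tailed_paired_t(acc_a, acc_b, alpha=0.05):
    """One-tailed paired t-test: is method A significantly better than method B?"""
    t, p_two_sided = ttest_rel(acc_a, acc_b)
    return (t > 0) and (p_two_sided / 2 < alpha)

# Example with the KNN rows of Table 1 (RanSub vs. Single):
ransub = [67.86, 72.95, 72.79, 46.81, 56.47, 60.74, 46.91, 38.37, 44.25]
single = [64.63, 68.33, 65.27, 45.78, 54.23, 57.49, 43.55, 37.24, 41.00]
print(win_loss_tie(ransub, single))   # -> (9, 0, 0), as reported in Table 2
```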
Table 1
Test set accuracy rates (%) of ensembles with base classifiers KNN, C4.5 and SVM

Method      Data set
            1      2      3      4      5      6      7      8      9
(KNN)
Single      64.63  68.33  65.27  45.78  54.23  57.49  43.55  37.24  41.00
Bagging     63.97  68.29  65.16  46.15  54.58  57.94  44.04  36.99  40.96
AdaBoost    62.59  66.55  62.62  44.40  53.92  55.65  43.55  37.02  40.83
RanSub      67.86  72.95  72.79  46.81  56.47  60.74  46.91  38.37  44.25
(C4.5)
Single      56.74  63.54  59.45  49.45  50.84  52.59  41.41  39.53  37.93
Bagging     66.95  72.34  70.79  52.60  58.45  59.56  47.77  38.58  39.75
AdaBoost    65.94  71.61  69.17  51.44  57.06  57.93  46.17  39.40  40.71
RanSub      66.39  73.22  72.41  49.40  56.85  60.62  47.21  38.55  40.58
(SVM)
Single      64.92  72.81  68.44  50.06  57.60  63.80  48.71  36.08  44.87
Bagging     64.98  72.68  68.42  50.09  57.75  63.91  48.88  36.15  45.33
AdaBoost    69.32  73.52  69.40  50.39  57.60  63.80  48.70  36.59  45.53
RanSub      64.98  72.22  66.22  49.31  57.04  62.98  47.59  34.41  42.04
Bayes       73.13  65.17  73.48  45.77  48.65  52.13  43.75  39.47  39.54
(1) With the base classifier KNN, only random subspace brings significant performance improvements compared to a single classifier. The bagging ensemble hardly changes the performance, while boosting deteriorates it. This should be attributed to the stability of KNN with respect to changes of the training set. KNN is a stable classifier (Breiman, 1996b), while ensemble methods constructed by subsampling the training examples (e.g., bagging and boosting) do not work well for stable classifiers (Dietterich, 1997; Bay, 1999). High stability implies a high averaged correlation between the individual classifiers, which cannot contribute much to the performance of ensemble methods, as the inequality in (1) indicates. However, KNN does benefit from random subspace, since random subspace carries out classification in subspaces of much lower dimensionality, which can diminish the negative influence of noise on the calculation of neighbors. That is, the neighbors can be determined more accurately in a space of low dimensionality. Thus the strength of the individual classifiers rises, which favors the performance of the ensemble, as the inequality in (1) shows.

(2) With the base classifier C4.5, all three considered ensemble methods bring great performance improvements compared to a single classifier. Bagging is the best, while boosting brings a comparatively small improvement; the performance of random subspace lies in between. Since the C4.5 decision tree is an unstable classifier (Breiman, 1996b; Dietterich, 1997), the averaged correlation between the individual classifiers in the ensembles is low. Thus the three ensemble methods tend to improve the classification performance. As to why applying bagging to the C4.5 decision tree is the best among the three considered methods, we give the following explanation. Dietterich (2000) has shown that in applications with large amounts of classification noise, applying bagging to C4.5 is the best method compared to boosting and randomization (which bears some similarity to random subspace), since noise improves the diversity of bagging. The current EEG signals are recorded from mental imagery tasks, and there is no guarantee that the subjects do not perform another task when an appointed task should be performed. In other words, the recorded samples are very likely to be assigned incorrect labels, and thus contain large amounts of classification noise. Therefore, in such a situation applying bagging to the C4.5 decision tree would be the best choice, as our experiments indicate.

(3) With the base classifier linear SVM, both bagging and boosting achieve better performance than a single classifier, and boosting is slightly superior to bagging. However, the random subspace ensemble deteriorates the performance. Skurichina and Duin (2002) have shown that the training sample size can
also influence the performance of the bagging, boosting and random subspace ensemble methods when combining linear classifiers. They obtain a series of valuable remarks on the effect of the training sample size for ensembles of linear classifiers. With the linear SVM classifier, we validate one of their conclusions, namely that boosting is the best for large training sample sizes. Since bagging and random subspace are mainly useful for critical training sample sizes (Skurichina and Duin, 2002), their performance is inferior to that of boosting for the EEG classification problem considered in this paper.

(4) The selection of base classifiers greatly influences the performance of ensemble methods. Depending on the choice of base classifier, an ensemble method can outperform a single classifier or suffer the reverse. Generally speaking, ensemble learning for EEG signal classification is effective, though different ensemble methods display different performance, and some combinations of base classifiers and ensemble methods may even deteriorate the performance.

4. Conclusion

In this paper, we evaluate three ensemble methods, bagging, boosting and random subspace, in the context of EEG signal classification. The effectiveness of ensemble methods over a single base classifier is shown empirically. The experimental results also indicate that the capability of ensemble methods depends on the type of base classifier. The findings of the present study are helpful in guiding the choice of classification algorithms for future BCI applications. In addition, since ensemble methods contain more individual classifiers than a single base classifier, they incur a higher computational burden, especially for classifier training. Parallel computing techniques can be employed to remedy this disadvantage, with the potential to make the computational time of ensemble methods nearly the same as that of a single classifier. Finally, as the focus of the present study is to evaluate the performance of ensemble methods, it is beyond the scope of this paper to find the best parameter configuration and adaptation method for each subject. In the future, the issues of parameter tuning and online evaluation for further performance improvements should be investigated. Besides, in this paper we adopt three representative ensembles with their standard fusion mechanisms for the empirical studies. This makes it easy for practitioners to apply these methods directly in their own research. As combining is costly, for practical use a more significant improvement over single classifiers is preferred. Therefore, investigating the application of more advanced fusion methods for ensembles would be beneficial in the future, such as the methods in (Alkoot and Kittler, 2002; Kittler and Alkoot, 2003). Apart from the currently considered base classifiers, the perfor-
mance of some other classifiers, such as fuzzy ID3 (Chang and Pavlidis, 1977; Janikow, 1998; Umanol et al., 1994; Wang et al., 2001), can be further investigated as well.

Acknowledgements

The authors appreciate very much the valuable comments from the editor and the anonymous reviewers, which helped improve this work. The authors also thank the IDIAP Research Institute of Switzerland for providing the analyzed data.

References

Alkoot, F., Kittler, J., 2002. Modified product fusion. Pattern Recognition Lett. 23 (8), 957–965.
Bauer, E., Kohavi, R., 1999. An empirical comparison of voting classification algorithms: Bagging, boosting, and variants. Mach. Learn. 36 (1–2), 105–139.
Bay, S., 1999. Nearest neighbor classification from multiple feature subsets. Intell. Data Anal. 3 (3), 191–209.
Breiman, L., 1996a. Bagging predictors. Mach. Learn. 24 (2), 123–140.
Breiman, L., 1996b. Heuristics of instability and stabilization in model selection. Ann. Statist. 24 (6), 2350–2383.
Breiman, L., 2001. Random forests. Mach. Learn. 45 (1), 5–32.
Chang, R.L.P., Pavlidis, T., 1977. Fuzzy decision tree algorithms. IEEE Trans. Syst. Man Cybern. 7 (1), 28–35.
Chiappa, S., Millán, J.R., 2005. Data set V, mental imagery, multi-class.
Dietterich, T.G., 1997. Machine learning research – Four current directions. AI Mag. 18 (4), 97–136.
Dietterich, T.G., 2000. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Mach. Learn. 40 (2), 139–157.
Duda, R.O., Hart, P.E., Stork, D.G., 2000. Pattern Classification, 2nd ed. John Wiley and Sons, New York.
Ebrahimi, T., Vesin, J.M., Garcia, G., 2003. Brain–computer interface in multimedia communication. IEEE Signal Proc. Mag. 20 (1), 14–24.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55 (1), 119–139.
Garrett, D., Peterson, D.A., Anderson, C.W., Thaut, M.H., 2003. Comparison of linear, nonlinear, and feature selection methods for EEG signal classification. IEEE Trans. Neural Syst. Rehabil. Eng. 11 (2), 141–144.
Ho, T.K., 1998. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 20 (8), 832–844.
Janikow, C.Z., 1998. Fuzzy decision trees: Issues and methods. IEEE Trans. Syst. Man Cybern. Part B – Cybern. 28 (1), 1–14.
Kamousi, B., Liu, Z., He, B., 2005. Classification of motor imagery tasks for brain–computer interface applications by means of two equivalent
dipoles analysis. IEEE Trans. Neural Syst. Rehabil. Eng. 13 (2), 166–171.
Kittler, J., Alkoot, F., 2003. Sum versus vote fusion in multiple classifier systems. IEEE Trans. Pattern Anal. Mach. Intell. 25 (1), 110–115.
Millán, J.R., 2004. On the need for on-line learning in brain–computer interfaces. Proc. Int. Joint Conf. Neural Networks, Budapest, Hungary, pp. 2877–2882.
Millán, J.R., Renkens, F., Mouriño, J., Gerstner, W., 2004. Brain-actuated interaction. Artif. Intell. 159 (1–2), 241–259.
Müller, K.-R., Anderson, C.W., Birch, G.E., 2003. Linear and nonlinear methods for brain–computer interfaces. IEEE Trans. Neural Syst. Rehabil. Eng. 11 (2), 165–169.
Nicolelis, M.A.L., 2001. Actions from thoughts. Nature 409 (6818), 403–407.
Obermaier, B., Munteanu, C., Rosa, A., Pfurtscheller, G., 2001. Asymmetric hemisphere modeling in an offline brain–computer interface. IEEE Trans. Syst. Man Cybern. Part C – Appl. Rev. 31 (4), 536–540.
Skurichina, M., Duin, R.P.W., 2002. Bagging, boosting and the random subspace method for linear classifiers. Pattern Anal. Appl. 5 (2), 121–135.
Sun, S., Zhang, C., 2005. Learning on-line classification via decorrelated LMS algorithm: Application to brain–computer interfaces. Lecture Notes Comput. Sci. 3735, 215–226.
Sun, S., Zhang, C., 2006a. Adaptive feature extraction for EEG signal classification. Med. Biol. Eng. Comput. 44 (10), 931–935.
Sun, S., Zhang, C., 2006b. An optimal kernel feature extractor and its application to EEG signal classification. Neurocomputing 69 (13–15), 1743–1748.
Tsymbal, A., Pechenizkiy, M., Cunningham, P., 2005. Diversity in search strategies for ensemble feature selection. Inf. Fusion 6 (1), 83–98.
Umanol, M., Okamoto, H., Hatono, I., Tamura, H., Kawachi, F., Umedzu, S., Kinoshita, J., 1994. Fuzzy decision trees by fuzzy ID3 algorithm and its application to diagnosis systems. Proc. 3rd IEEE Int. Conf. Fuzzy Syst., Orlando, FL, USA, pp. 2113–2118.
Vapnik, V., 1998. Statistical Learning Theory. John Wiley and Sons, New York.
Vaughan, T.M., 2003. Guest editorial brain–computer interface technology: A review of the second international meeting. IEEE Trans. Neural Syst. Rehabil. Eng. 11 (2), 94–109.
Wang, X.Z., Yeung, D.S., Tsang, E.C.C., 2001. A comparative study on heuristic algorithms for generating fuzzy decision trees. IEEE Trans. Syst. Man Cybern. Part B – Cybern. 31 (2), 215–226.
Witten, I.H., Frank, E., 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. Morgan Kaufmann, San Francisco.
Wolpaw, J.R., Birbaumer, N., Heetderks, W.J., McFarland, D.J., Peckham, P.H., Schalk, G., Donchin, E., Quatrano, L.A., Robinson, C.J., Vaughan, T.M., 2000. Brain–computer interface technology: A review of the first international meeting. IEEE Trans. Rehab. Eng. 8 (2), 164–173.
Wolpaw, J.R., Birbaumer, N., McFarland, D.J., Pfurtscheller, G., Vaughan, T.M., 2002. Brain–computer interfaces for communication and control. Clin. Neurophysiol. 113 (6), 767–791.