On Pattern Classification in Motor Imagery-Based Brain-Computer Interfaces

Awarded by the Université de Montpellier (France)
Prepared within the doctoral school I2S – Information, Structures, Systèmes and the research unit Laboratoire de Génie Informatique et d'Ingénierie de Production, Ecole Nationale Supérieure des Mines d'Alès

Speciality: Computer Science
Presented by Sami Dalhoumi

On Pattern Classification in Motor Imagery-Based Brain-Computer Interfaces Méthodes d’apprentissage automatique en interfaces cerveau-machine basées sur l’imagerie motrice

Defended on 19/11/2015 before a jury composed of:

M. Antoine Cornuéjols, Professor, AgroParisTech – Reviewer
M. François Cabestaing, Professor, Université Lille 1 – Reviewer
M. Jérôme Boudy, Professor, Télécom SudParis – Examiner
M. Stéphane Perrey, Professor, Université de Montpellier – President of the jury
M. Jacky Montmain, Professor, Ecole Nationale Supérieure des Mines d'Alès – Thesis supervisor
M. Gérard Dray, Enseignant-Chercheur des Mines, Ecole Nationale Supérieure des Mines d'Alès – Thesis co-supervisor

To my parents…


Acknowledgments

First of all, I would like to thank all the members of the jury for agreeing to evaluate the work I carried out during these three years of my thesis. I thank the reviewers, Pr. Antoine Cornuéjols and Pr. François Cabestaing, and I also thank Pr. Jérôme Boudy and Pr. Stéphane Perrey for agreeing to examine this manuscript. Thank you all for your remarks and suggestions, which contributed to improving this manuscript. I especially thank my advisor Gérard Dray and my thesis supervisor Jacky Montmain. Thank you, Gérard, for the trust you placed in me, for your patience when I was stressed, and for your support throughout this thesis. Thank you, Jacky, for your comments, critiques and pertinent remarks, for your encouragement in difficult moments, and for your good humor. Thank you both for your human qualities, which made this adventure possible. I would like to thank the Ecole Nationale Supérieure des Mines d'Alès, which funded this thesis. I particularly thank the members of the LGI2P laboratory, where I spent half of my thesis. I thank our research director, Yannick Vimont, for his support of research activity in ICT and Health. I also thank the administrative team, notably Valérie Roman and Claude Badiou, who greatly facilitated administrative tasks. Thanks to all the faculty researchers of LGI2P with whom I had very useful exchanges. My thoughts also go to the laboratory's doctoral students, with whom I spent very pleasant moments. Thanks to everyone with whom I had exchanges and collaborations during this thesis. I am infinitely grateful to Stéphane Perrey, with whom I collaborated and who gave me the opportunity to work at the EuroMov research center during the second half of my thesis.
I also thank Gérard Derosière, with whom I worked at the beginning of my thesis and who introduced me to NIRS technology. Thanks to the director, Benoit Bardy, and to all the researchers and doctoral students of the EuroMov research center for the excellent welcome and the friendly atmosphere. A special dedication to my office mate Pierre Jean for his good humor and kindness. During this thesis, I also had exchanges with Pascal Poncelet, professor at LIRMM, whom I would like to thank for his advice and encouragement. Finally, I thank my family from the bottom of my heart. Thanks to my parents, who sacrificed their lives to provide me with a good living environment and a good education. Thanks to my brothers and sisters, who have always supported me and who, despite the distance, remain close to me.

Abstract

A brain-computer interface (BCI) is a system that establishes direct communication between the brain and an external device, bypassing the normal output pathways of the peripheral neuromuscular system. BCIs were originally meant to provide new means of communication for completely paralyzed patients who have lost all means of communication with their environment. Depending on the output signals used for interfacing with external devices, different types of BCIs exist in the literature. Among them, BCIs based on motor imagery (MI) are the most promising. They rely on self-regulation of sensorimotor rhythms through imagined movement of different limbs (e.g., left hand and right hand). MI-based BCIs are the best candidates for applications dedicated to severely paralyzed patients, but they are hard to set up because self-regulation of brain rhythms is not a straightforward task. In the early stages of BCI research, weeks and even months of user training were required in order to generate stable brain activity patterns that could be reliably decoded by the system. The development of user-specific supervised machine learning techniques made it possible to considerably reduce training periods in BCIs. However, these techniques still face challenging problems that limit the use of this technology in out-of-the-lab applications. In fact, user-specific techniques require a long calibration phase (20 to 30 minutes) before every use of the BCI in order to achieve good decoding performance. During this phase, the user interacts with the system in a cue-based mode, which allows collecting enough labeled data to calibrate the system. But even after a long calibration phase, the recorded brain signals are rarely stationary, so a user-specific classification algorithm may perform poorly in classifying brain activity patterns during self-paced interaction mode.
Reducing calibration time while maintaining good decoding performance has been one of the most challenging problems in BCI research in recent years. This problem particularly concerns BCIs in which the user voluntarily modulates his or her brain rhythms, such as motor imagery-based BCIs. Although many out-of-the-box machine learning techniques have been attempted, the problem remains unsolved. In this thesis, I thoroughly investigate supervised machine learning techniques that have been proposed to overcome the problems of long calibration time and brain signal non-stationarity in MI-based BCIs. These techniques fall into two main categories: techniques that are invariant to non-stationarity and techniques that adapt to the change. In the first category, techniques based on knowledge transfer between different sessions and/or subjects have attracted much attention in recent years. In the second category, different online adaptation techniques for classification models have been attempted. Among them, techniques based on error-related potentials are the most promising.

The two main contributions of this thesis are based on linear combinations of classifiers, so these methods receive particular attention throughout this manuscript. In the first contribution, I study the use of linear combinations of classifiers in knowledge transfer-based BCIs and propose a novel ensemble-based knowledge transfer framework for reducing calibration time in BCIs. I investigate the effectiveness of the classifier combination scheme used in this framework when performing inter-subjects classification in MI-based BCIs. Then, I investigate to what extent knowledge transfer is useful in BCI applications by studying situations in which knowledge transfer has a negative impact on the classification performance of the target learning task. In the second contribution, I propose an online inter-subjects classification framework that takes advantage of both knowledge transfer and online adaptation techniques. In this framework, called "adaptive accuracy-weighted ensemble" (AAWE), inter-subjects classification is performed using a weighted average ensemble in which base classifiers are learned from EEG signals recorded from different subjects and weighted according to their accuracy in classifying the brain signals of the new BCI user. Online adaptation is performed by updating the base classifiers' weights in a semi-supervised way, based on ensemble predictions reinforced by interaction error-related potentials.
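To make the accuracy-weighted ensemble idea concrete, here is a minimal Python sketch. It is an illustrative assumption, not the thesis implementation: the class name `AccuracyWeightedEnsemble` and the parameter `update_coefficient` are invented for this example, base classifiers are arbitrary objects exposing a scikit-learn-style `predict`, and real iErrP detection is abstracted away (the label passed to `update` stands in for the ensemble's own prediction, possibly flipped when an error-related potential is detected).

```python
import numpy as np


class AccuracyWeightedEnsemble:
    """Sketch of an accuracy-weighted ensemble for inter-subjects classification.

    Base classifiers are pre-trained on EEG data from other subjects. Each
    weight is set to the classifier's accuracy on the new user's small
    calibration set, then updated online from feedback labels.
    """

    def __init__(self, base_classifiers, update_coefficient=0.5):
        self.classifiers = base_classifiers
        self.uc = update_coefficient  # how fast weights track new feedback
        n = len(base_classifiers)
        self.weights = np.full(n, 1.0 / n)

    def calibrate(self, X_calib, y_calib):
        # Weight each source-subject classifier by its accuracy on the
        # target subject's calibration trials, then normalize.
        accs = np.array([np.mean(clf.predict(X_calib) == y_calib)
                         for clf in self.classifiers])
        self.weights = accs / accs.sum()

    def predict(self, x):
        # Weighted vote over the base classifiers' binary (0/1) predictions.
        votes = np.array([clf.predict(x[np.newaxis, :])[0]
                          for clf in self.classifiers])
        label = int(np.dot(self.weights, votes) >= 0.5)
        return label, votes

    def update(self, votes, label):
        # Semi-supervised update: `label` is the ensemble's own prediction,
        # corrected upstream when an error-related potential is detected.
        # Classifiers that agree with the (corrected) label gain weight.
        agree = (votes == label).astype(float)
        if agree.sum() > 0:
            target = agree / agree.sum()
            self.weights = (1 - self.uc) * self.weights + self.uc * target
            self.weights /= self.weights.sum()
```

With this scheme, a source classifier that keeps disagreeing with the (ErrP-corrected) ensemble output sees its weight decay geometrically, so the ensemble drifts toward the source subjects whose patterns best match the current user.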

Contents

1  Introduction
   1.1  Motivation
   1.2  Scope of this thesis
   1.3  Main contributions of this thesis
   1.4  Outline of this thesis
   1.5  List of publications
        1.5.1  Peer-reviewed international conferences and journals
        1.5.2  Other publications

2  Introduction to brain-computer interfaces
   2.1  Neurophysiological background
   2.2  Neuroimaging techniques in BCIs
        2.2.1  Electroencephalography
        2.2.2  Magnetoencephalography
        2.2.3  Electrocorticography
        2.2.4  Microelectrode arrays
        2.2.5  Functional magnetic resonance imaging
        2.2.6  Near-infrared spectroscopy
        2.2.7  Multimodal neuroimaging in BCIs
        2.2.8  The missing part of the puzzle
   2.3  Relevant signals in EEG-based BCIs
        2.3.1  Sensorimotor rhythms
        2.3.2  Slow cortical potentials
        2.3.3  Visual evoked potentials
        2.3.4  P300 evoked potentials
        2.3.5  Error-related potentials
   2.4  Different types of EEG-based BCIs
        2.4.1  Active vs. reactive vs. passive BCIs
        2.4.2  Dependent vs. independent BCIs
        2.4.3  Synchronous vs. asynchronous BCIs
   2.5  Applications of EEG-based BCIs
        2.5.1  Communication and environment control
        2.5.2  Locomotion
        2.5.3  Therapy
        2.5.4  Human-computer interaction
   2.6  From raw EEG signals to output commands: a case study of the signal processing, feature extraction and classification pipeline in motor imagery-based BCIs
        2.6.1  Calibration phase
        2.6.2  Feedback phase
   2.7  Conclusion

3  Supervised machine learning in motor imagery-based brain-computer interfaces
   3.1  The classical paradigm for solving supervised learning problems
        3.1.1  Different steps of learning
        3.1.2  Probabilistic interpretation
        3.1.3  Variance reduction using regularization
        3.1.4  Examples of supervised learning techniques used in MI-based BCIs
        3.1.5  Linear versus non-linear supervised learning techniques in MI-based BCIs
   3.2  Challenges of supervised learning in MI-based BCIs
        3.2.1  Using techniques that are invariant to the change
        3.2.2  Using techniques that adapt to the change
        3.2.3  Robustness versus adaptiveness of supervised machine learning techniques in MI-based BCIs
   3.3  Performance evaluation in supervised machine learning
        3.3.1  Evaluation in stationary environments
        3.3.2  Evaluation in non-stationary environments
   3.4  Conclusion

4  Ensemble-based knowledge transfer in brain-computer interfaces: from theory to practice
   4.1  Combining multiple learners
        4.1.1  Generating complementary learners
        4.1.2  Classifier combination techniques
   4.2  Theory of linear combiners
   4.3  Linear combiners in ensemble-based knowledge transfer: application to inter-subjects and/or inter-sessions classification in MI-based BCIs
        4.3.1  Dynamic weighting
        4.3.2  Static weighting using source data
        4.3.3  Static weighting using calibration data
   4.4  A novel ensemble-based knowledge transfer framework for reducing calibration time in brain-computer interfaces
        4.4.1  Motivation
        4.4.2  Methods
        4.4.3  Experiments
   4.5  Conclusion

5  On combining knowledge transfer and online adaptation in motor imagery-based brain-computer interfaces
   5.1  Motivation
   5.2  An adaptive accuracy-weighted ensemble for reducing calibration time in motor imagery-based BCIs
        5.2.1  Methods
        5.2.2  Experiments
   5.3  Study of the case when there is no calibration data from the target subject
        5.3.1  Methods
        5.3.2  Experiments
   5.4  Conclusion

6  Conclusion
   6.1  Summary of contributions
   6.2  Discussions
        6.2.1  Next-generation brain-computer interfaces
        6.2.2  Next-generation machine learning algorithms

Bibliography

List of figures

Figure 1.1: General architecture of a brain-computer interface
Figure 2.1: Simplified illustration of intracellular spreading of excitatory postsynaptic potentials from dendritic terminals to the axon in a neural cell of the human brain
Figure 2.2: Neural activation in the cerebral cortex, from pyramidal neurons to synchrony of large cell assemblies (adapted from Baillet et al., 2001)
Figure 2.3: Simplified illustration of the neurovascular coupling mechanism
Figure 2.4: Brodmann cortical areas (Trans Cranial Technologies Ltd., 2012)
Figure 2.5: Coronal section of the brain illustrating the mapping of body parts in the somatosensory cortex and motor cortex (adapted from Samek, 2014)
Figure 2.6: Electrode placement according to the 10-20 international system (adapted from Nicolas-Alonso and Gomez-Gil, 2012)
Figure 2.7: Different brain rhythms measured by electroencephalography (Lotte, 2008)
Figure 2.8: The problem of volume conduction in EEG-based BCIs
Figure 2.9: Band power time courses of EEG signals recorded from an electrode placed over the motor cortex during right index finger lifting (Pfurtscheller and Neuper, 2001)
Figure 2.10: Typical signature of an interaction error-related potential. (Left) Average EEG for the difference error-minus-correct at channel "FCz" for five subjects, plus their grand average. (Right) Scalp potential topographies, for the grand average EEG of the five subjects, at the occurrence of the peaks (Ferrez and Millán, 2008)
Figure 2.11: Applications of EEG-based BCIs based on the targeted user population (adapted from Nicolas-Alonso and Gomez-Gil, 2012)
Figure 2.12: Illustration of the signal processing and classification pipeline in MI-based BCIs (adapted from Blankertz et al., 2008)
Figure 2.13: The geometric interpretation of Linear Discriminant Analysis (adapted from Alpaydin, 2010)
Figure 3.1: Example of a linear hypothesis space
Figure 3.2: Decomposition of the generalization error in supervised learning. Left: relationship between generalization error and hypothesis space complexity. Right: relationship between generalization error and size of the training set (adapted from Alpaydin, 2010)
Figure 3.3: Optimal decision boundary for a linear SVM
Figure 3.4: Example in which data from different classes are not linearly separable in one-dimensional space but separable in two-dimensional space
Figure 3.5: Example illustrating the influence of non-stationarity on the generalization capacity of linear classifiers
Figure 3.6: Categorization of machine learning techniques used for performing knowledge transfer in MI-based BCIs
Figure 3.7: A review of state-of-the-art online classifier adaptation techniques in MI-based BCIs
Figure 3.8: Schema of a block-design experiment where data in the same block are stochastically dependent and data from different blocks are stochastically independent (Lemm et al., 2011)
Figure 3.9: Schema of an experiment in which data are recorded from different individuals and/or during different sessions
Figure 4.1: Illustration of the statistical reason for combining multiple learners (adapted from Kuncheva, 2004b)
Figure 4.2: True posterior probabilities around the boundary x* between classes y0 and y1 (solid lines) and estimated posteriors leading to the boundary xb (dashed lines). Lightly and darkly shaded areas represent the contribution of this class boundary to the Bayes error and to the added error, respectively (Fumera and Roli, 2005)
Figure 4.3: Synopsis of the proposed ensemble-based inter-subjects classification framework for motor imagery-based BCIs
Figure 4.4: Example illustrating the necessity of learning from other subjects in order to reduce the duration of the calibration phase in MI-based BCIs
Figure 4.5: Example illustrating inter-subjects variability of EEG signals
Figure 4.6: Average classification performance of the learning paradigms CALIB, TL and POOL
Figure 4.7: Detailed classification results for left hand vs. right hand motor imagery
Figure 4.8: Detailed classification results for left hand vs. both feet motor imagery
Figure 4.9: Detailed classification results for left hand vs. tongue motor imagery
Figure 4.10: Detailed classification results for right hand vs. both feet motor imagery
Figure 4.11: Detailed classification results for right hand vs. tongue motor imagery
Figure 4.12: Detailed classification results for both feet vs. tongue motor imagery
Figure 5.1: Illustration of base classifiers' weights adaptation using guided labeling
Figure 5.2: The two ways of using error-related potentials for improving BCI performance. (Left) They are used to trigger a corrective action, such as preventing execution of the last BCI action. (Right) They are used as reinforcement signals to update the BCI classification model (from Chavarriaga et al., 2014)
Figure 5.3: Structure of the hierarchical ensemble method for multi-class classification in MI-based BCIs
Figure 5.4: Example illustrating the use of the proposed framework for multi-class classification
Figure 5.5: Simulation procedure of iErrPs for binary classification tasks
Figure 5.6: Simulation procedure of iErrPs for multi-class classification
Figure 5.7: Example illustrating EEG signal non-stationarity between calibration and feedback phases of the same session
Figure 5.8: Example illustrating EEG signal non-stationarity during the feedback phase
Figure 5.9: Classification accuracies of the inter-subjects classification method and the baseline classification method for all binary classification tasks in data set 2A from BCI competition IV
Figure 5.10: Classification accuracies of the static ensemble method and the adaptive ensemble method using guided labeling for left hand vs. right hand motor imagery in data set 2A from BCI competition IV. The size of the calibration set from the target subject is 10 trials
Figure 5.11: Average classification accuracies of the static ensemble method and the adaptive ensemble method based on guided labeling for different values of the update coefficient UC in the motor imagery data set from the BNCI Horizon 2020 project
Figure 5.12: Comparative results for all binary classification tasks in data set 2A from BCI competition IV when the size of the calibration set is equal to 10 trials. The x-axis corresponds to different values of the update coefficient and the y-axis to the average classification accuracy
Figure 5.13: Comparative results for the binary classification data set from the BNCI Horizon 2020 project when the size of the calibration set is equal to 10 trials. The x-axis corresponds to different values of the update coefficient and the y-axis to the average classification accuracy over all subjects
Figure 5.14: The evolution of base classifiers' weights during the test session for two different subjects in data set 2A from BCI competition IV. (a) and (b) correspond to base classifiers' weights at the beginning and the end of the test session of subject 3. (c) and (d) correspond to base classifiers' weights at the beginning and the end of the test session of subject 7
Figure 5.15: Illustration of the advantage of combining inter-subjects classification and online adaptation in MI-based BCIs. The figure corresponds to comparative results for subject 2 in data set 2A from BCI competition IV when performing the right hand vs. feet motor imagery task
Figure 5.16: Comparative results of the baseline classification method and the static ensemble-based inter-subjects classification methods for all three-class classification tasks in data set 2A from BCI competition IV
Figure 5.17: Classification results of the different base classifiers' weights online adaptation techniques for all three-class classification tasks in data set 2A from BCI competition IV
Figure 5.18: Comparative results of the baseline classification method and the static ensemble-based inter-subjects classification methods for four-class classification in data set 2A from BCI competition IV
Figure 5.19: Classification results of the different base classifiers' weights online adaptation techniques for four-class classification in data set 2A from BCI competition IV
Figure 5.20: Comparative results of different classification schemes for all binary classification tasks in data set 2A from BCI competition IV
Figure 5.21: Comparative results of different classification schemes for the binary classification data set from the BNCI Horizon 2020 project
Figure 5.22: Comparative results of the adaptive accuracy-weighted ensemble using a small calibration set from the target subject (10 trials) and the adaptive accuracy-weighted ensemble without using calibration data from the target subject for all binary classification tasks in data set 2A from BCI competition IV
Figure 5.23: Normalized base classifiers' weights at the beginning of the feedback phase for different target subjects in the left hand vs. right hand motor imagery data set from BCI competition IV
Figure 5.24: Normalized base classifiers' weights at the end of the feedback phase for different target subjects in the left hand vs. right hand motor imagery data set from BCI competition IV
Figure 5.25: Initial states of the accuracy-weighted ensemble in all binary classification tasks in data set 2A from BCI competition IV when subject 1 is considered as target subject

List of tables Table 2.1: Different neuroimaging techniques used in brain-computer interfaces ........................ 12 Table 4.1: p-values of paired t-test over all subjects and all numbers of trials per class ..... Erreur ! Signet non défini. Table 5.1: Main features of the proposed AAWE framework....................................................... 85 Table 5.2: Classification accuracies of the inter-subjects classification method and baseline method when the size of calibration set is equal to 10 trials in data set 2A from BCI competition IV .................................................................................................................................................... 92 Table 5.3: Average classification accuracies of the baseline method and the inter-subjects classification method for different sizes of calibration set in the data set from BNCI Horizon 2020 project ............................................................................................................................................. 92 Table 5.4: Average classification accuracies of the static ensemble method and the adaptive ensemble method based on guided labeling for different values of the update coefficient in data set 2A from BCI competition IV. The size of calibration set from target sub ............................... 94 Table 5.5: Comparative results for all binary classification tasks in data set 2A from BCI competition IV when the size of calibration set is equal to 10 trials and the update coefficient is equal to 0.5 ..................................................................................................................................... 97 Table 5.6: Comparative results for the binary classification data set from BNCI Horizon 2020 project when the size of calibration set is equal to 10 trials and the update coefficient is equal to 0.5. 
.................................................................................................................................................. 98 Table 5.7: Comparative results of the AAWE method and a state-of-the-art adaptive classification method using realistic iErrPs detection for all binary classification tasks in data set 2A from BCI competition IV .............................................................................................................................. 100 Table 5.8: Comparative results of the AAWE method and a state-of-the-art adaptive classification method using realistic iErrPs detection for the binary classification data set from BNCI Horizon 2020 project .................................................................................................................................. 101 Table 5.9: Detailed classification results of the baseline method and the static ensemble-based inter-subjects classification method for all three-class classification tasks in data set 2A from BCI competition IV when the size of calibration set is equal to 20 trials ............................................ 103 Table 5.10: Detailed classification results of the baseline method and the static ensemble-based inter-subjects classification method for all three-class classification tasks in data set 2A from BCI competition IV when the size of calibration set is equal to 20 trials ............................................ 105 Table 5.11: Comparative results of different classification schemes for left hand vs. right hand motor imagery in data set 2A from BCI competition IV .............................................................. 112 Table 5.12: Comparative results of different classification schemes for left hand vs. feet motor imagery in data set 2A from BCI competition IV ........................................................................ 112 Table 5.13: Comparative results of different classification schemes for left hand vs. 
tongue motor imagery in data set 2A from BCI competition IV ........................................................................ 112 Table 5.14: Comparative results of different classification schemes for right hand vs. feet motor imagery in data set 2A from BCI competition IV ........................................................................ 113 Table 5.15: Comparative results of different classification schemes for right hand vs. tongue motor imagery in data set 2A from BCI competition IV .............................................................. 113 Table 5.16: Comparative results of different classification schemes for feet vs. tongue motor imagery in data set 2A from BCI competition IV ........................................................................ 113

Table 5.17: Comparative results of different classification schemes for the binary classification data set from BNCI Horizon 2020 project ................................................................................... 114

Abbreviations

AAWE: adaptive accuracy-weighted ensemble
ADIM: adaptive estimation of information matrix
ALDA: adaptive linear discriminant analysis
ALDEC: adaptive linear discriminant analysis with error correction
ALS: amyotrophic lateral sclerosis
AP: action potential
AWE: accuracy-weighted ensemble
BCI: brain-computer interface
BOLD: blood-oxygen-level dependent
CLIS: completely locked-in state
CSP: common spatial patterns
CV: cross-validation
DWEC: dynamic weighting according to classes’ centers
DWEN: dynamic weighting according to the neighborhood
ECoG: electrocorticography
EEG: electroencephalography
EPSP: excitatory postsynaptic potential
ERD: event-related desynchronization
ErrPs: error-related potentials
ERS: event-related synchronization
FES: functional electrical stimulation
fMRI: functional magnetic resonance imaging
HCI: human-computer interaction
HHb: deoxygenated hemoglobin
i.i.d.: independent and identically distributed
ICA: independent component analysis
iErrPs: interaction error-related potentials
KALDA: Kalman adaptive linear discriminant analysis
LDA: linear discriminant analysis
LOOCV: leave-one-out cross-validation
MEG: magnetoencephalography
MI: motor imagery
NIRS: near-infrared spectroscopy
O2Hb: oxygenated hemoglobin
PCA: principal component analysis
QDA: quadratic discriminant analysis
SCPs: slow cortical potentials
SMRs: sensorimotor rhythms
SSVEPs: steady-state visual-evoked potentials
SVM: support vector machines
VEPs: visual-evoked potentials

1 Introduction
1.1 Motivation

Human communication is the process of using words, sounds or movements to exchange information and express feelings. Communication is a central aspect of human life; without it, we could not live in society. According to today’s knowledge, the electrochemical activity of brain cells is the origin of our perception of the world and our interaction with it. Although the broad electrical activity patterns of large populations of human brain cells were discovered early (Berger, 1929), the idea of interfacing the brain in order to provide new means of communication was not explored until the development of modern digital signal processing tools. The term “brain-computer interface” (BCI, for short) first appeared in the pioneering work of Jacques J. Vidal in 1973 (Vidal, 1973). Since then, BCI research has grown rapidly. By definition, a brain-computer interface is a system that establishes direct communication between the brain and an external device, bypassing the normal output pathways of the peripheral neuromuscular system (Wolpaw et al., 2002). This general definition includes both systems that feed signals into the brain and systems that monitor signals from the brain. Cochlear implants are an example of BCIs that feed signals into the brain in order to restore hearing. This category of BCIs is beyond the scope of this thesis. In the rest of this manuscript, the term BCI refers to systems that monitor signals from the brain in order to establish communication with external devices. BCIs were originally meant to provide new means of communication for individuals in completely locked-in state. The locked-in state is a situation in which patients are aware of themselves and their environment but are completely paralyzed and locked into their bodies (Smith and Delargy, 2005).
This syndrome is caused by neuromuscular disorders such as amyotrophic lateral sclerosis (ALS) or severe stroke which damage the nervous tissue. Generally, the populations of neurons within the cerebral cortex responsible for sensorimotor functions remain intact but the output pathways between them and the different parts of the body are damaged. Thus, the idea of a BCI is to directly record signals from the cerebral cortex and use them to send commands to external devices, bypassing these damaged pathways. According to the modality of interaction, BCIs can be classified as active, reactive or passive (Tan, 2006). In active BCIs, users actively modulate their brain rhythms in order to interact with the system. Reactive BCIs are based on brain responses to visual, auditory or tactile stimuli. Passive BCIs monitor changes in
brain activity patterns that passively occur during cognitive or physical task execution. Active BCIs are the best candidates for applications dedicated to severely disabled persons because they provide a natural way of interaction. During the last two decades, BCI research has increased rapidly and many prototype systems dedicated to patients with less severe disabilities and even to able-bodied persons have been proposed. Today’s BCI applications include communication and control, therapy, entertainment, etc.

1.2 Scope of this thesis

Translating brain activity patterns into commands is not a straightforward task. Brain signals have to go through the different building blocks of the BCI before being decoded and translated. The decoding performance of the BCI is then assessed through feedback provided to the user. A BCI is generally composed of six building blocks (Lotte, 2008) (see figure 1.1):

- Signal acquisition: different neuroimaging techniques have been used in BCIs but electroencephalography (EEG) is still the most commonly used one (Nicolas-Alonso and Gomez-Gil, 2012). This thesis focuses particularly on EEG-based BCIs.
- Preprocessing: brain signals are very noisy and need to be preprocessed in order to extract relevant information from them.
- Feature extraction: since it is very difficult to extract relevant information from high dimensional data, a few descriptors are elicited from brain signals before decoding them.
- Pattern classification: classification assigns different class labels to features associated with different brain states.
- Device control: depending on the BCI application, different commands can be associated with different brain states.
- Feedback: feedback is necessary in BCI applications for assessing the performance of the system and allowing the user to improve his own performance.
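To make the chain concrete, here is a minimal sketch of the middle stages (preprocessing, feature extraction and classification) applied to a toy EEG trial. The band choice (mu band), the log band-power feature and the linear decision rule are common in motor imagery BCIs but are illustrative assumptions here, not the specific methods evaluated in this thesis; a crude FFT mask stands in for a proper band-pass filter.

```python
import numpy as np

def preprocess(eeg, fs=250.0, band=(8.0, 13.0)):
    """Crude band-pass via FFT masking (illustration only; real BCIs use
    proper temporal filters plus spatial filtering such as CSP)."""
    spectrum = np.fft.rfft(eeg, axis=-1)
    freqs = np.fft.rfftfreq(eeg.shape[-1], d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    spectrum[..., ~mask] = 0.0
    return np.fft.irfft(spectrum, n=eeg.shape[-1], axis=-1)

def extract_features(trial):
    """Log band-power per channel, a common motor imagery feature."""
    return np.log(np.var(trial, axis=-1) + 1e-12)

def classify(features, w, b):
    """Linear classifier: the sign of w.x + b maps features to a brain state."""
    return int(np.sign(features @ w + b))

# Toy usage: one 2-channel, 1-second trial at 250 Hz.
rng = np.random.default_rng(0)
trial = rng.standard_normal((2, 250))
x = extract_features(preprocess(trial))
command = classify(x, w=np.array([1.0, -1.0]), b=0.0)  # maps to a device command
```

The device-control stage would then translate the classifier output into an application command, and feedback would close the loop with the user.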


Figure 1.1: General architecture of a brain-computer interface

BCI research requires multidisciplinary skills spanning neuroscience, biomedical engineering and computer science. From a computer scientist’s perspective, a BCI is a pattern classification system that assigns different brain activity patterns to different brain states according to their spatio-temporal characteristics. In the early stages of BCI research, naïve techniques were used for decoding brain activity patterns. Extensive user training was required in order to generate stable brain activity patterns that could be reliably decoded by the system. The use of modern machine learning techniques in later stages allowed changing the paradigm from “let the user learn” to “let the machine learn” (Samek, 2014). The development of user-specific machine learning algorithms considerably reduced training periods in BCIs. However, these subject-specific approaches still face challenging problems that limit the use of this technology in every-day life. These problems are particularly acute in active BCI systems because self-regulation of brain rhythms is difficult. Motor imagery (MI)-based BCIs are the most commonly investigated active BCI systems. They are based on self-regulation of sensorimotor rhythms through imagined movements of different limbs (e.g., left hand and right hand). MI-based BCIs are the most promising BCI systems but also the hardest to set up because the majority of users fail to modulate their sensorimotor rhythms. Using subject-specific machine learning techniques, 20 to 30 minutes of calibration is required for user-system co-adaptation before every use of the BCI (Fazli et al., 2009). During this phase, the user interacts with the BCI in a “cue-based” mode which allows the user to learn to self-regulate his brain rhythms and the system to create a “robust” classification model. This model is then used to classify brain signals in a feedback phase during
which users interact with the application at free will (self-paced mode). Even after a long calibration period, many users remain unable to modulate their brain rhythms. This phenomenon, called “BCI illiteracy”, is one of the most challenging problems in MI-based BCIs. BCI illiteracy cannot be resolved using modern machine learning techniques; new experimental paradigms should rather be investigated in order to resolve this problem (Bamdadian et al., 2015). Machine learning techniques instead target another category of users, for whom brain activity patterns related to different cognitive tasks are elicited but are unstable and vary enormously from trial to trial (non-stationarity). Traditional machine learning approaches require a long calibration phase in order to reliably decode non-stationary brain signals. Because a long calibration phase is time-consuming and tedious, especially for users with impaired concentration ability, many out-of-the-box machine learning techniques have been attempted in order to reduce calibration time without decreasing the performance of the BCI. Although many approaches have been attempted during the last years, conceiving machine learning algorithms that reduce calibration time and manage non-stationarity in MI-based BCIs is still an open problem (Lotte, 2015). Motivated by the aforementioned challenges, this thesis focuses on machine learning techniques used for reducing calibration time and managing non-stationarity in motor imagery-based BCIs. A particular emphasis is given to techniques based on knowledge transfer between subjects, which consists of incorporating data recorded from different BCI users in the learning process of a new user. These techniques have attracted much attention in the BCI research community during the last years because they allow conceiving “subject-independent” classification models and consequently reduce calibration time.
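As an illustration of the knowledge-transfer idea (a generic sketch, not the exact AAWE algorithm developed later in this thesis), the code below trains one linear classifier per source subject and weights each one by its accuracy on a small calibration set from the target subject. The toy data generator, the minimal LDA variant and all numeric values are assumptions made for the example; only the "10 calibration trials" figure echoes the experiments reported in this manuscript.

```python
import numpy as np

def fit_lda(X, y):
    """Minimal two-class LDA with shared covariance; labels in {-1, +1}."""
    m_pos, m_neg = X[y == 1].mean(axis=0), X[y == -1].mean(axis=0)
    cov = np.cov(X.T) + 1e-6 * np.eye(X.shape[1])  # regularized pooled covariance
    w = np.linalg.solve(cov, m_pos - m_neg)
    b = -w @ (m_pos + m_neg) / 2.0
    return w, b

def make_subject(rng, shift, n=40):
    """Toy 2-D features for one subject; 'shift' mimics inter-subject variability."""
    y = np.repeat([-1, 1], n // 2)
    X = rng.standard_normal((n, 2)) + np.outer(y, [1.5, 0.0]) + shift
    return X, y

rng = np.random.default_rng(0)
sources = [make_subject(rng, shift) for shift in (0.0, 0.3, 5.0)]  # last one dissimilar
classifiers = [fit_lda(X, y) for X, y in sources]

# Small calibration set from the target subject (10 trials): weight each
# source classifier by its accuracy on these trials, then normalize.
X_cal, y_cal = make_subject(rng, 0.1, n=10)
acc = np.array([np.mean(np.sign(X_cal @ w + b) == y_cal) for w, b in classifiers])
weights = acc / acc.sum()

def ensemble_predict(x):
    """Weighted vote of the source-subject classifiers on one target trial."""
    votes = np.array([np.sign(x @ w + b) for w, b in classifiers])
    return int(np.sign(weights @ votes))
```

The dissimilar third source subject earns a low weight on the target's calibration data, so its votes barely influence the ensemble; an adaptive variant would keep updating these weights during the feedback phase.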

1.3 Main contributions of this thesis

This thesis aims at thoroughly investigating machine learning techniques used in MI-based BCIs and highlighting some aspects that should be taken into consideration in future BCI systems in order to bring this technology out of the lab. The main contributions of this thesis are the following:

- Performing a critical analysis of state-of-the-art supervised learning techniques in motor imagery-based brain-computer interfaces and highlighting some important issues that have to be taken into account in future BCI systems.
- Analyzing ensemble-based knowledge transfer techniques. This analysis focuses particularly on linear combinations of classifiers and their use in knowledge transfer-based BCI systems.
- Proposing a novel ensemble-based knowledge transfer framework for reducing calibration time in BCIs. The effectiveness of the classifiers weighting scheme used in this framework is assessed and situations in which knowledge transfer may have a negative impact on the decoding performance of BCI systems are investigated.
- Proposing a framework for online inter-subjects classification in MI-based BCIs. This framework takes advantage of both knowledge transfer and online adaptation techniques and consequently increases the performance of BCI systems.

1.4 Outline of this thesis

In this multidisciplinary research field, some prerequisites should be presented before communicating ideas. Thus, I will give an introduction to BCIs in chapter 2. A background on machine learning techniques and a review of state-of-the-art machine learning techniques in MI-based BCIs will be given in chapter 3. In chapter 4, I will present the first contribution of this thesis. Chapter 5 describes the second contribution of this thesis. Finally, chapter 6 presents the main conclusions and perspectives of this work. The detailed list of chapters is the following:

Chapter 2 gives an introduction to BCIs. It includes neurophysiological background, neuroimaging techniques, relevant signals, and the types and applications of BCIs. This introduction will be biased towards electroencephalography (EEG) as neuroimaging technique and motor imagery as modality of interaction.

Chapter 3 presents a background on supervised machine learning and introduces the challenges of supervised learning in MI-based BCIs. A review of state-of-the-art machine learning techniques in MI-based BCIs is given in this chapter.

Chapter 4 gives a background on ensemble methods in machine learning, makes a link between the theory of linear combinations of classifiers and their use for knowledge transfer in MI-based BCIs, and presents a novel inter-subjects classification framework for MI-based BCIs.

Chapter 5 presents a framework that combines inter-subjects classification and online adaptation in MI-based BCIs. In this framework, I show the advantages of combining both techniques in order to create robust and adaptive BCI systems.

Chapter 6 summarizes the main contributions of this thesis and discusses possible directions for future work.


1.5 List of publications

The following list of publications contains the contributions of the author to the field of pattern classification in BCIs. Only contributions related to EEG-based BCIs are presented in this manuscript [1-3]. Other contributions related to NIRS-based BCIs are not presented in order to maintain coherence [4-6].

1.5.1 Peer-reviewed international conferences and journals

[1] S. Dalhoumi, G. Dray, J. Montmain, S. Perrey, “A Framework for Online Inter-Subjects Classification in Endogenous Brain-Computer Interfaces,” in 22nd International Conference on Neural Information Processing, Istanbul, Turkey, 2015.
[2] S. Dalhoumi, G. Dray, J. Montmain, G. Derosière, S. Perrey, “An Adaptive Accuracy-Weighted Ensemble for Inter-Subjects Classification in Brain-Computer Interfacing,” in 7th International IEEE EMBS Neural Engineering Conference, Montpellier, France, 2015.
[3] S. Dalhoumi, G. Dray, J. Montmain, “Knowledge Transfer for Reducing Calibration Time in Brain-Computer Interfacing,” in 26th IEEE International Conference on Tools with Artificial Intelligence, Limassol, Cyprus, 2014.
[4] G. Derosière, S. Dalhoumi, S. Perrey, G. Dray, T. Ward, “Towards a near infrared spectroscopy-based estimation of operator attentional state,” PLoS ONE, vol. 9, no. 3, p. e92045, 2014.
[5] S. Dalhoumi, G. Derosière, G. Dray, J. Montmain, S. Perrey, “Graph-Based Transfer Learning for Managing Brain Signals Variability in NIRS-Based BCIs,” in 15th International Conference on Information Processing and Management of Uncertainty in Knowledge-Based Systems, Montpellier, France, 2014.

1.5.2 Other publications

[6] G. Derosière, S. Dalhoumi, M. Billot, S. Perrey, T. Ward, G. Dray, “Classification of NIRS-Measured Hemodynamics of The Cerebral Cortex to Detect Lapses in Attention,” in 16th International Conference on Near-Infrared Spectroscopy, La Grande Motte, France, 2013.


2 Introduction to brain-computer interfaces

“The human brain is an incredible pattern-matching machine.” [Jeff Bezos]

“Can these observable electrical brain signals be put to work as carriers of information in man-computer communication or for the purpose of controlling such external apparatus as prosthetic devices or spaceships? Even on the sole basis of present states of the art of computer science and neurophysiology, one may suggest that such a feat is potentially around the corner.” This visionary idea of Jacques J. Vidal in 1973 (Vidal, 1973) led to the growth of brain-computer interface research. Since then, research on BCIs has increased rapidly and today neuroscientists, computer scientists, medical doctors and engineers are working together to make the idea of Jacques J. Vidal a reality. In this chapter, I review the main aspects of BCI technology, from neurophysiological background to applications. An intractable number of papers on BCIs has been published during the last years, which makes it impossible to give a complete review of the topic in a few pages. This review will be biased towards BCIs based on electroencephalography as neuroimaging technique and motor imagery as control modality. More comprehensive reviews of the topic can be found in these papers (Wolpaw et al., 2002; Nicolas-Alonso and Gomez-Gil, 2012; Lance et al., 2012; Birbaumer and Cohen, 2007; Daly and Wolpaw, 2008; Millán et al., 2010).


2.1 Neurophysiological background

The power of the brain comes from its highly miniaturized and interconnected cellular components. The human brain contains about 10¹⁰ neurons, which are the fundamental elements of its communication networks, and 10¹¹ glial cells, which play an important role in regulating and protecting these networks (Douglas Fields, 2009). In order to understand the information processing mechanism within the brain, let us start with the structure of neurons and the way they communicate with each other. A neuron is a simple information processing unit that receives signals from sensory cells or other neurons through dendrites, integrates them in the cell body (soma) and forwards them to other neurons via the axon (Blum and Rutkove, 2007). This mechanism is governed by the electrochemical properties of the lipid bilayer membrane separating the cell body from the external environment. The proteins in this membrane act as ion gates that maintain the potential difference between the inside and the outside of the cell at around -70 mV (resting state). When the neuron is electrically or chemically stimulated at the dendrite level by a sensory cell or another neuron, an inflow of Na+ ions occurs which reduces the negativity of the membrane (depolarization) and produces an excitatory postsynaptic potential (EPSP). The cessation of the Na+ influx and the onset of a K+ efflux lead to a brief hyperpolarization followed by the re-establishment of the resting membrane potential. When the potential arrives at the cell body with sufficient amplitude, a series of action potentials (APs) is triggered through the axon in order to transmit excitatory signals to other neurons (see figure 2.1). One might ask: if the information processing units of the brain are that simple, where does its power come from?
To answer this question, we should see the whole picture rather than focusing on neurons as single information processing units. In fact, the power of the brain resides in the high interconnectivity between neurons, which allows information processing and transfer at a high rate. It was long assumed that neurons are hard-wired together, until Ramón y Cajal discovered that each neuron in our brain is an island unto itself that communicates with other neurons across a tiny gulf of saltwater that bathes every cell in our body (Douglas Fields, 2009). These gulfs of saltwater between axon terminals and dendrites, called synapses, allow each neuron to be connected to up to 10 000 other neurons. In order to take advantage of this highly interconnected network, neurons process and transmit information at a very high rate (an action potential lasts only 1 ms).
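The integrate-toward-threshold dynamic described above can be illustrated with a textbook leaky integrate-and-fire model. This is a deliberate simplification, not a model used in this thesis; apart from the -70 mV resting potential mentioned in the text, every parameter value below (threshold, reset, time constant, membrane resistance) is an illustrative assumption.

```python
import numpy as np

def simulate_lif(input_current_na, dt_ms=0.1, tau_ms=10.0, v_rest=-70.0,
                 v_thresh=-55.0, v_reset=-75.0, r_mohm=10.0):
    """Leaky integrate-and-fire neuron: the membrane potential relaxes toward
    the resting value (-70 mV, as in the text) and, when synaptic input
    depolarizes it past threshold, a spike is counted and the membrane is
    briefly hyperpolarized (reset below rest)."""
    v = v_rest
    n_spikes = 0
    for i_na in input_current_na:
        # Discretized membrane equation: tau * dv/dt = (v_rest - v) + R * I
        v += (dt_ms / tau_ms) * (v_rest - v + r_mohm * i_na)
        if v >= v_thresh:
            n_spikes += 1
            v = v_reset
    return n_spikes

# Strong sustained input (2 nA for 50 ms) drives the membrane past threshold
# and fires repeatedly; weak input (0.5 nA) never reaches threshold.
strong = simulate_lif(np.full(500, 2.0))
weak = simulate_lif(np.full(500, 0.5))
```

The model captures only the all-or-nothing triggering of action potentials, not the ionic Na+/K+ mechanism itself.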


Figure 2.1: Simplified illustration of intracellular spreading of excitatory postsynaptic potentials from dendritic terminals to the axon in a neural cell of the human brain

Electrochemical activity at the membrane of a neuron contributes to changes in extracellular electromagnetic fields that can be monitored by means of invasive neuroimaging techniques. The synchronous activity of large populations of neurons close to the surface of the scalp can produce changes in electromagnetic fields that can be monitored by means of non-invasive neuroimaging techniques (see section 2.2). These neurons, called pyramidal neurons, constitute the neural cells of the cerebral cortex, which is a 2 to 4 millimeter thick outer layer of the brain. The folded structure of the human cerebral cortex, composed of sulci (the grooves of the folding) and gyri (the ridges), gives it a large surface area confined in the volume of the skull. This fragile structure is encapsulated in several protecting layers: the meninges, the skull and the scalp. Figure 2.2 illustrates the structure of the cerebral cortex and electromagnetic activity within it, from a single pyramidal neuron to a large population of neurons.

Figure 2.2: Neural activation in the cerebral cortex from pyramidal neurons to synchrony of large cell assemblies (adapted from Baillet et al., 2001)


Brain activity requires a constant supply of energy, which is provided by a large vascular system covering all neural cells. Astrocytes are a particular type of glial cells that maintain brain metabolism by carrying glucose and oxygen from blood vessels to neurons (figure 2.3). The increased consumption of glucose and oxygen during neural activity is followed by an increase in blood flow. The supply of oxygenated hemoglobin to the location of activity is generally an order of magnitude higher than its consumption. This mechanism leads to a local increase in oxygenated hemoglobin (O2Hb) concentration and a decrease in deoxygenated hemoglobin (HHb) concentration. Changes in the hemodynamic activity of the brain can be monitored using different neuroimaging techniques in order to obtain indirect information about neural activity (see section 2.2).

Figure 2.3: Simplified illustration of the neurovascular coupling mechanism

For more than a century, anatomists and neuroscientists have been trying to understand the correlation between cerebral activity patterns and sensory information processing, motor control and high-level cognitive functions such as memory and attention. Since the pioneering work of the German anatomist Korbinian Brodmann in 1909 (Zilles and Amunts, 2010), a mapping of the cerebral cortex started to be established and it has continued to be discussed and refined ever since (Trans Cranial Technologies ltd, 2012). The study of the activation of large populations of neurons allowed locating neural networks related to different cognitive and sensorimotor functions in different areas of the cerebral cortex. Figure 2.4 illustrates the location of the different Brodmann areas in the left hemisphere of the cerebral cortex (the same areas are located symmetrically in the right hemisphere). These areas are highly interconnected and the execution of any function requires the exchange of information between them. Although the areas related to sensorimotor functions are well located, the location of the neural populations related to high-level cognitive functions such as attention is still debated (Derosière, 2014).


Figure 2.4: Brodmann cortical areas (Trans Cranial Technologies ltd., 2012)

The areas related to somatosensory and motor functions are of particular interest for conceiving brain-computer interfaces (see section 2.3). A direct association between neural networks in these areas and body parts is well established but their structure can vary between individuals and change due to neuroplasticity of the brain. Figure 2.5 illustrates the mapping of body parts in the somatosensory and motor cortex.

Figure 2.5: Coronal section of the brain illustrating the mapping of body parts in the somatosensory cortex and motor cortex (adapted from Samek, 2014)

2.2 Neuroimaging techniques in BCIs

Many neuroimaging techniques can be used to monitor brain activity in BCIs (Nicolas-Alonso and Gomez-Gil, 2012). The most commonly used ones are: electroencephalography (EEG), magnetoencephalography (MEG), electrocorticography (ECoG), microelectrode arrays, functional magnetic resonance imaging (fMRI) and near-infrared spectroscopy (NIRS). These techniques can be classified into different categories according to the measured activity, invasiveness, portability, etc. (see table 2.1). In this section, I will focus particularly on EEG and give a short


overview of other techniques. The development of multimodal neuroimaging and the use of new techniques in BCI technology will be discussed at the end of the section.

Table 2.1: Different neuroimaging techniques used in brain-computer interfaces

Technique                Measured activity   Invasiveness    Portability
EEG                      Electrical          Non-invasive    Portable
MEG                      Electrical          Non-invasive    Non-portable
ECoG                     Electrical          Invasive        Portable
Microelectrode arrays    Electrical          Invasive        Portable
fMRI                     Hemodynamic         Non-invasive    Non-portable
NIRS                     Hemodynamic         Non-invasive    Portable

2.2.1 Electroencephalography

Since the first measurements on a human subject by Hans Berger (Berger, 1929), EEG has been one of the most useful diagnostic tools for medical doctors. It was the first neuroimaging technique used to build a brain-computer interface (Vidal, 1973) and today most BCIs are based on EEG. EEG measures the electrical activity of large populations of pyramidal neurons by means of electrodes placed on the surface of the scalp. The potential differences between pairs of electrodes constitute the EEG signals, which are on the order of ±100 μV. EEG signals are generally recorded using multiple electrodes (up to 256) attached with an elastic cap and placed according to a standard positioning system called the 10-20 international system (see figure 2.6).

Figure 2.6: Electrodes placement according to the 10-20 international system (adapted from NicolasAlonso and Gomez-Gil, 2012)


Recorded EEG signals show prominent oscillatory activity related to the synchrony of activity of neural populations of the cerebral cortex. According to their spectral properties and spatial localization, these oscillatory rhythms can be classified into six categories (Başar et al., 2001) (Figure 2.7):

- Delta rhythm: situated in the 0.1-4 Hz frequency band and localized in the frontal and posterior regions of the brain, this rhythm is generally found in children and in adults during sleep.
- Theta rhythm: situated in the 4-8 Hz frequency band, this rhythm can be monitored in different regions of the brain and is found in children and in adults during sleep or drowsiness.
- Alpha rhythm: situated in the 8-13 Hz frequency band and in the posterior areas, this rhythm is prominent when the eyes are closed or in a relaxed state.
- Mu rhythm: in the same frequency band as the alpha rhythm but located over the sensorimotor cortex, this rhythm is altered by movement, motor imagery or tactile stimulation.
- Beta rhythm: situated in the 13-30 Hz frequency band and located in the frontal regions and the somatosensory cortex, this rhythm is observed during active concentration and can be altered by the performance of movements.
- Gamma rhythm: at frequencies beyond 30 Hz, this rhythm can be found in various locations and is associated with conscious attention and various cognitive processes.
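As a rough illustration of how these rhythms are quantified, the sketch below sums the periodogram power of a single-channel EEG segment within each band; the band limits follow the categories above, and the 45 Hz upper limit on gamma is an arbitrary cutoff chosen for the example.

```python
import numpy as np

# Frequency bands (Hz) following the six categories above; the 45 Hz
# upper limit on gamma is an arbitrary cutoff chosen for this sketch.
BANDS = {"delta": (0.1, 4), "theta": (4, 8), "alpha": (8, 13),
         "mu": (8, 13), "beta": (13, 30), "gamma": (30, 45)}

def band_powers(eeg, fs):
    """Summed periodogram power of a 1-D EEG segment in each band."""
    psd = np.abs(np.fft.rfft(eeg)) ** 2 / len(eeg)
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    return {name: psd[(freqs >= lo) & (freqs < hi)].sum()
            for name, (lo, hi) in BANDS.items()}
```

Note that mu and alpha share the same band; they are distinguished by scalp location, not by frequency.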

Figure 2.7: Different brain rhythms measured by electroencephalography (Lotte, 2008)


In addition to non-invasiveness and portability, the most important advantage of EEG is its high temporal resolution, on the order of milliseconds, which allows recording signals at high sampling frequencies (256 Hz and above). However, the main problems of EEG are its low spatial resolution and high sensitivity to artifacts. Due to the volume conduction caused by the different tissue layers separating the cerebral cortex from the electrodes, the neural activity of a pyramidal cell assembly (a source of activity) can spread to locations on the surface of the scalp up to 10 cm away from the source (Figure 2.8). Furthermore, the low amplitude of EEG signals makes them sensitive to artifacts caused by muscular activity, eye blinks, movement, etc. For these reasons, signal amplification, temporal filtering and spatial filtering are necessary before EEG signals can be analyzed (Lotte, 2014).

Figure 2.8: The problem of volume conduction in EEG-based BCIs

2.2.2 Magnetoencephalography
MEG was first used to record brain activity by David Cohen in 1968 (Cohen, 1968), and the first MEG-based online BCI experiment was performed in 2005 (Lal et al., 2005). MEG measures small variations of the magnetic field at the surface of the scalp (on the order of $10^{-15}$ tesla) caused by intracellular currents flowing through pyramidal neurons. Like EEG, MEG has a temporal resolution on the order of milliseconds, and its spatial resolution is better than that of EEG (≈ 5 mm), which makes it a promising technique for online recording of brain activity. However, this technique is rarely used in BCIs because the equipment is bulky and must be installed inside a magnetically shielded room.


2.2.3 Electrocorticography
ECoG is an invasive technology that records the electrical activity of the brain by means of electrodes (less than 1 mm in diameter) placed directly on the surface of the cortex (Jasper and Penfield, 1949). Compared to EEG, ECoG signals are less sensitive to spatial smearing by the tissue layers and to artifacts, and have higher temporal resolution. However, because of its invasiveness, most ECoG-based BCI studies have been conducted with animals (Margalit et al., 2003). Experiments with humans specifically target patients with severe neurological conditions such as epilepsy and tetraplegia (Leuthardt et al., 2006). ECoG-based BCIs are promising for performing complex tasks like 3D movement control, but the issue of long-term stability will have to be resolved before they can be used outside the lab.

2.2.4 Microelectrode arrays
Microelectrode array technology pushes the boundary of invasiveness further by measuring the electrical activity of a single neuron or a small group of neurons. Silicon-based microelectrode arrays of about 5 × 5 mm are inserted into the cerebral cortex, which allows targeting specific neurons. As with ECoG, the first attempts at microelectrode-based BCIs were with animals (Wessberg et al., 2000); the technology was also tested with tetraplegic patients for prosthetic device control (Hochberg et al., 2006). It provides the highest spatiotemporal resolution among the neuroimaging techniques used in BCIs. However, issues like the reaction of brain tissue to the microelectrodes and the death of neurons around the implant must be resolved in order to guarantee the long-term stability of this technology.

2.2.5 Functional magnetic resonance imaging
fMRI measures the hemodynamic activity of the brain by applying a strong magnetic field and recording the different responses of blood cells. Since oxygenated hemoglobin (O₂Hb) and deoxygenated hemoglobin (HHb) have different electromagnetic properties, their responses to the stimulation allow determining the blood-oxygen-level-dependent (BOLD) signal change. fMRI has a high spatial resolution and can measure hemodynamic activity in deep layers of the brain. However, its low temporal resolution (about 1 s) and the delay induced by the indirect measurement of electrical activity limit the use of this technology in real-time BCI experiments (Sitaram et al., 2009). Some fMRI-based BCI proof-of-concept experiments have been conducted (Weiskopf et al., 2004), but this technology cannot be used in realistic interaction settings because the equipment is bulky and very expensive.


2.2.6 Near-infrared spectroscopy
Since 1977, near-infrared spectroscopy (NIRS) has been used for clinical monitoring of tissue oxygenation (Jobsis, 1977). The first proof-of-concept NIRS-based BCI was introduced by Coyle et al. in 2004 (Coyle et al., 2004). Instead of using electromagnetic fields to measure the hemodynamic activity of the brain, NIRS uses near-infrared light (wavelengths between 700 and 1000 nm) to measure variations in the concentrations of O₂Hb and HHb. The intensity attenuation of near-infrared light sent at two different wavelengths, penetrating to a depth of 1-3 cm below the surface of the scalp, allows calculating these concentration changes. In order to measure hemodynamic activity over different regions of the cortex, a probe composed of multiple near-infrared light emitters and receivers is installed on the surface of the scalp. Although it has lower spatial resolution than fMRI, NIRS technology is portable and inexpensive. The delay induced by the hemodynamic measurement (5-10 s) is a limitation of this technology in comparison to EEG, but it is less sensitive to movement artifacts. Many successful applications of NIRS-based BCIs have been introduced (Sitaram et al., 2009; Abibullaev et al., 2013), which makes it a promising technology for out-of-the-lab applications.

2.2.7 Multimodal neuroimaging in BCIs
Each neuroimaging technique presented so far has its own advantages and limitations. Combining different techniques may allow taking advantage of each of them and enhancing BCI technology. On the one hand, combining neuroimaging techniques that measure the same type of activity may be useful in proof-of-concept experiments. As an example, experiments with a multimodal fMRI-NIRS-based BCI allowed investigating the use of different cognitive tasks in NIRS-based BCIs (Cui et al., 2011). On the other hand, combining neuroimaging techniques that measure electrical activity with techniques that measure hemodynamic activity has many advantages. First of all, it helps in understanding the relationship between the electrical and hemodynamic activity of the brain, known as neurovascular coupling (Bießmann et al., 2011). Moreover, combining two techniques such as EEG and NIRS may allow conceiving robust and practical BCIs that take advantage of the high temporal resolution of EEG, the good spatial resolution and lower sensitivity to artifacts of NIRS, and the portability of both (Fazli et al., 2012; Yu et al., 2013). Multimodal neuroimaging also has its own challenges: dedicated machine learning techniques are needed to exploit the potential of combining data with different temporal and spatial sampling rates and different physiological interpretations (Sui et al., 2012).


2.2.8 The missing part of the puzzle
All neuroimaging techniques used in BCIs measure the activity of neurons, directly through electrical activity or indirectly through hemodynamic activity. However, neurons represent only 15% of the cells in our brain. The rest are glial cells, which are divided into three categories: astrocytes, microglia and oligodendrocytes. For more than a century after their discovery, glial cells were believed to be mere support cells for neurons that do not participate in information processing inside the brain. Research has since shown that these cells do more than protect and maintain the metabolism of neural networks: they monitor and regulate the flow of electric currents through neural networks and, more importantly, communicate with each other via calcium ions (Douglas Fields, 2009). Integrating calcium imaging into BCI research may boost the development of this technology and reshape our understanding of the brain. To the best of my knowledge, the development of calcium imaging-based BCIs is starting to be concretized through experiments with mice (Leinweber et al., 2014).

2.3 Relevant signals in EEG-based BCIs
The ultimate goal of BCI research is to conceive systems that are able to detect and decode every intention of their users. Current BCI systems are still far from achieving that goal because they can discriminate only a few neurophysiological signals related to well-studied cognitive tasks. In this section, I review the most relevant physiological signals in EEG-based BCIs, with particular focus on sensorimotor rhythms. (Wolpaw et al., 2002; Nicolas-Alonso and Gomez-Gil, 2012) provide a more detailed review of these signals.

2.3.1 Sensorimotor rhythms
The idling oscillatory rhythms over the sensorimotor cortex in the 8-13 Hz (mu rhythm) and 13-30 Hz (beta rhythm) frequency bands are called sensorimotor rhythms (SMRs). Motor activity induces two types of amplitude modulation of SMRs: event-related desynchronization (ERD) and event-related synchronization (ERS). ERD corresponds to an amplitude suppression (power attenuation) of the mu and beta rhythms starting a few seconds before movement onset and reaching its maximum during movement. ERS consists of an amplitude augmentation (power increase) of the beta rhythm which reaches its maximum after movement onset. The location of these modulations over the sensorimotor cortex is contralateral to the limb performing the movement (i.e., movement of a limb on the left side of the body corresponds to SMR amplitude modulation on the right side of the sensorimotor cortex and vice versa). Figure 2.9 shows an example of power time courses of EEG signals recorded from an electrode placed at position C3 of the 10-20 positioning system during right index finger lifting.

Figure 2.9: Band power time courses of EEG signals recorded from an electrode placed over the motor cortex during right finger index lifting (Pfurtscheller and Neuper, 2001)

Amplitude modulation of sensorimotor rhythms can also be induced by motor imagery, which makes them useful in BCI applications (Pfurtscheller and Neuper, 2001; Pfurtscheller et al., 2006). Subjects can learn to modulate SMRs through motor imagery in order to perform different tasks using a BCI. However, self-modulation of SMRs is not a straightforward task. Most users fail to self-modulate their brain rhythms because they tend to imagine visual images of the real movement. Multiple training sessions using visual or auditory feedback are necessary for motor imagery-based BCI users to learn to perform kinesthetic motor imagery. Communication using a BCI becomes possible when the user has learned different motor imagery tasks (e.g., hands, feet, etc.). SMRs have been actively studied by different BCI research groups (Blankertz et al., 2006; Wolpaw et al., 2000) because motor imagery-based BCIs provide a natural way of interaction for the user.
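Band-power time courses such as those in figure 2.9 are classically expressed as a percentage change relative to a pre-movement baseline (negative values indicating ERD, positive values ERS). A minimal numpy sketch, assuming the trials are already band-pass filtered in the rhythm of interest and time-locked to movement onset; the ~100 ms smoothing window is an arbitrary choice for the example:

```python
import numpy as np

def erd_time_course(trials, fs, baseline_s):
    """ERD/ERS as % power change vs. a pre-movement baseline.

    trials: (n_trials, n_samples) band-pass filtered single-channel EEG.
    baseline_s: duration (s) of the reference interval at the trial start.
    """
    power = (trials ** 2).mean(axis=0)            # average power across trials
    k = max(1, int(0.1 * fs))                     # ~100 ms moving average (arbitrary)
    power = np.convolve(power, np.ones(k) / k, mode="same")
    ref = power[: int(baseline_s * fs)].mean()    # baseline (reference) power
    return 100.0 * (power - ref) / ref            # negative = ERD, positive = ERS
```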

2.3.2 Slow cortical potentials
Slow cortical potentials (SCPs) are negative or positive shifts in EEG signals at frequencies around or below 1 Hz. Negative shifts are related to an increase in cortical excitability while positive shifts correspond to a decrease in cortical excitability. Subjects can learn to voluntarily modulate SCPs, which makes it possible to build BCIs on them. However, SCP-based BCIs require much user training and do not offer the possibility to perform multiple tasks. A tool called the "thought translation device" was designed to allow users to learn self-regulation of SCPs (Hinterberger et al., 2004). It provides visual feedback using a cursor on a computer screen whose vertical position changes according to the amplitude of the SCPs.


2.3.3 Visual evoked potentials
Visual evoked potentials (VEPs) are brain rhythm modulations in the visual cortex generated by visual stimuli (Wolpaw et al., 2002). There are many types of VEPs, but the most commonly used in BCI applications are steady-state visual evoked potentials (SSVEPs) (Herrmann, 2001). Fixing attention on a stimulus flashing at a frequency between 6 Hz and 24 Hz elicits rhythms in the visual cortex at the same frequency. By displaying different targets on a computer screen, such as letters and digits flickering at different frequencies, and measuring the brain response of the user, SSVEP-based BCIs can detect which target the user is focusing on. SSVEP-based BCIs are easy to set up and need no user training. However, they depend on external stimuli, which do not offer a natural way of interaction.
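A minimal sketch of how an SSVEP-based BCI could identify the attended target, assuming the flicker frequencies fall on exact frequency bins of an occipital channel's spectrum (practical systems use more robust detectors, e.g., canonical correlation analysis):

```python
import numpy as np

def detect_ssvep_target(eeg, fs, flicker_freqs):
    """Return the candidate flicker frequency with the largest spectral power.

    eeg: 1-D EEG segment from an occipital channel.
    flicker_freqs: candidate stimulation frequencies in Hz.
    """
    psd = np.abs(np.fft.rfft(eeg)) ** 2
    freqs = np.fft.rfftfreq(len(eeg), d=1.0 / fs)
    # power at the bin closest to each candidate frequency
    powers = [psd[np.argmin(np.abs(freqs - f))] for f in flicker_freqs]
    return flicker_freqs[int(np.argmax(powers))]
```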

2.3.4 P300 evoked potentials
P300 evoked potentials are positive deflections in EEG signals occurring about 300 ms after infrequent visual or auditory stimuli (Eason, 1981). They are generally found in the parietal area of the brain (the attention area in figure 2.4). P300-based BCIs are typically used to conceive speller devices (Farwell and Donchin, 1988). A matrix of letters and numbers is displayed on a computer screen; the user is instructed to focus on the desired letter or number while the rows and columns of the matrix are flashed at random. A P300 is elicited when the row or column containing the target is flashed. Like SSVEP-based BCIs, P300-based BCIs do not require user training, but they are dependent on external stimuli.
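The row/column selection logic can be sketched as follows, assuming post-flash epochs have already been cut out and that the P300 appears in a 250-400 ms post-stimulus window (an assumed latency for this example):

```python
import numpy as np

def p300_select(row_epochs, col_epochs, fs):
    """Pick the (row, col) with the strongest averaged P300 deflection.

    row_epochs, col_epochs: (n_items, n_reps, n_samples) post-flash epochs
    at a parietal channel. The 250-400 ms scoring window is an assumed
    P300 latency window for this sketch.
    """
    lo, hi = int(0.25 * fs), int(0.40 * fs)
    def strongest(epochs):
        avg = epochs.mean(axis=1)                 # average over repetitions
        return int(np.argmax(avg[:, lo:hi].mean(axis=1)))
    return strongest(row_epochs), strongest(col_epochs)
```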

2.3.5 Error-related potentials
Error-related potentials (ErrPs) are changes in brain activity patterns that occur immediately after the BCI user becomes aware of an erroneous response (Chavarriaga and Millán, 2014). Different types of error potentials have been reported in the literature (Ferrez and Millán, 2008):

- Response ErrPs: occur when the user performs a wrong motor action.
- Feedback ErrPs: arise following the presentation of a stimulus that indicates incorrect performance.
- Observation ErrPs: occur when the operator makes an error during choice reaction tasks.
- Interaction ErrPs: arise when the user perceives that the BCI output contradicts his intent.

These potentials cannot be used as control signals for a BCI, but they can serve as reinforcement signals to assess the performance of the system or correct its output (Chavarriaga and Millán, 2014; Llera et al., 2011). Figure 2.10 shows a typical signature and location of an interaction error-related potential (iErrP).

Figure 2.10: Typical signature of an interaction error-related potential. (Left) Average EEG for the difference error-minus-correct at channel "FCz" for five subjects plus their grand average. (Right) Scalp potential topographies, for the grand average EEG of the five subjects, at the occurrence of the peaks (Ferrez and Millán, 2008).

2.4 Different types of EEG-based BCIs
EEG-based BCIs can be classified into different categories based on the recorded signals and the modality of interaction. Different terminologies are used in the literature; here I present the most commonly agreed upon ones.

2.4.1 Active vs. reactive vs. passive BCIs
In active BCIs, users actively modulate their brain rhythms, for example by performing motor imagery, in order to interact with the system (Tan, 2006). These BCIs are difficult to set up because self-regulation of brain rhythms is not a straightforward task, and a long training period is necessary to achieve optimal performance. Extensive research on active BCIs has been conducted in recent years because they offer a natural way of interaction for the users. Reactive BCIs are based on brain responses to visual, auditory or tactile stimuli, such as P300 evoked potentials. In contrast to active BCIs, reactive BCIs are easy to set up and do not require user training; however, they depend on external stimuli, which do not offer a natural way of interaction for the user. Passive BCIs monitor changes in brain activity patterns that occur passively during the execution of a cognitive or physical task. This type of BCI allows assessing changes in the cognitive or affective state of the user (Tan, 2006).


2.4.2 Dependent vs. independent BCIs
This categorization is based on the target population of users. Dependent BCIs require minimal muscle control from the user to produce the brain activity used for interaction (Machado et al., 2013). VEP-based BCIs are an example of dependent BCIs, as they depend on gaze control to produce the evoked potentials. Independent BCIs, such as motor imagery-based BCIs, do not require any control of the normal output pathways of the nervous system. This category of BCIs can be used by completely locked-in patients who do not have any muscular control.

2.4.3 Synchronous vs. asynchronous BCIs
In synchronous BCIs, also known as cue-paced BCIs, the user performs cognitive tasks during time periods predefined by the system (Nicolas-Alonso and Gomez-Gil, 2012; Tan, 2006). The BCI analyzes signals recorded during these periods; any signal recorded outside the predefined windows is ignored. Asynchronous (or self-paced) BCIs operate differently: the user interacts with the system at free will. These BCIs are hard to evaluate because the system does not have any information about which cognitive task the user is performing (Nicolas-Alonso and Gomez-Gil, 2012).

2.5 Applications of EEG-based BCIs
BCIs were originally meant to provide new means of communication and control for severely paralyzed persons. However, many proof-of-concept applications have shown that this technology can target all kinds of populations. These applications range from communication and environment control for severely paralyzed people to entertainment applications for able-bodied persons (Figure 2.11). In this section, I review the most common applications of EEG-based BCIs. Many literature reviews give more comprehensive details on these applications (Wolpaw et al., 2002; Nicolas-Alonso and Gomez-Gil, 2012; Lance et al., 2012; Millán et al., 2010; Daly and Wolpaw, 2008).


Figure 2.11: Applications of EEG-based BCIs based on the targeted user population (adapted from Nicolas-Alonso and Gomez-Gil, 2012)

2.5.1 Communication and environment control
EEG-based BCI applications for communication target severely paralyzed persons who have lost all motor control. These persons are generally in late stages of neurological disorders like cerebral palsy or amyotrophic lateral sclerosis (ALS), which leave them completely locked into their bodies. Writing words on a computer screen or performing simple commands such as turning the light on and off would be of great help for them. The best-known applications for communication are spelling devices: virtual keyboards on the screen that allow users to select letters by means of their brain activity. Different types of EEG signals have been used in these applications, such as slow cortical potentials in the thought translation device (Hinterberger et al., 2004), P300 evoked potentials (Farwell and Donchin, 1988) and sensorimotor rhythms (Obermaier et al., 2003). BCIs may also give persons in a completely locked-in state (CLIS) the ability to control surrounding devices and retain a minimal degree of autonomy. Pilot studies have shown that patients with severe motor disabilities can learn to control devices such as a mouse or joystick by voluntarily modulating their sensorimotor rhythms (Cincotti et al., 2008).

2.5.2 Locomotion
Not only persons in CLIS may take advantage of BCI technology, but also persons suffering from less severe neurological disorders, such as hemiplegic patients or amputees. A BCI may allow both categories of patients to restore locomotion by autonomously controlling a wheelchair (Galán et al., 2008). Portable EEG systems may allow using BCI-driven wheelchairs in domestic environments. Hemiplegic or tetraplegic patients may also restore voluntary control of paralyzed limbs by means of EEG-based BCIs. Pfurtscheller et al. (2003) combined an EEG-based BCI with functional electrical stimulation (FES) in order to control paralyzed hands: SMRs recorded during motor imagery are translated into electrical stimuli sent directly to the muscles of the paralyzed hand, which allowed patients to grasp a cylinder. In cases where the limb is amputated or cannot be stimulated, an EEG-based BCI can be used to control a neuroprosthesis (Birbaumer and Cohen, 2007).

2.5.3 Therapy
The induction of cortical plasticity through functional rehabilitation allows patients suffering from neurological damage, such as post-stroke patients, to restore movement. Physiotherapists help their patients perform movement tasks in order to allow the damaged cerebral pathways to reorganize and, consequently, the paralyzed limbs to restore movement. The rehabilitation process is generally very long and not very efficient. Nowadays, physiotherapists and neuroscientists are interested in BCI technology in order to better understand neuroplasticity and provide patients with more efficient rehabilitation protocols (Millán et al., 2010). Providing patients with real-time feedback about their brain activity during movement performance or motor imagery seems to be a promising direction that needs further exploration (Nicolas-Alonso and Gomez-Gil, 2012).

2.5.4 Human-computer interaction
Beyond the use of BCIs as assistive technologies for persons with different levels of neurological disorders, BCIs can be useful in many other applications related to the field of human-computer interaction (HCI). Researchers have already started to argue that BCIs are mature enough for HCI designers to integrate them into their input modalities (Tan, 2006). BCIs can provide new means for adjusting the dynamics of interaction in existing technology as a function of the reliability of the user's control capabilities. As an example, the use of BCIs as a new input modality in video games may make this technology accessible to different categories of the population (Nicolas-Alonso and Gomez-Gil, 2012). Furthermore, BCIs can provide implicit forms of input that are important for assessing the user's mental state in order to facilitate interaction and reduce the user's cognitive effort (Millán et al., 2010). In this area, BCIs can enhance the user's experience in video games by monitoring parameters such as frustration and excitement. It is also useful to monitor cognitive states such as attention deficit and mental fatigue in order to ensure the safety of persons such as pilots and car drivers (Tan, 2006).


2.6 From raw EEG signals to output commands: a case study of the signal processing, feature extraction and classification pipeline in motor imagery-based BCIs
The previous sections have answered many of the questions someone new to BCIs might ask. However, one of the most important questions is still without an answer: how are signals recorded from the cerebral cortex translated into information about the user's intent? In this section, I present an example of the traditional pipeline of EEG signal processing and classification for conceiving a motor imagery-based BCI, covering the most commonly used signal processing and feature selection techniques in this type of BCI. For more details about signal processing and feature selection in EEG-based BCIs, I refer readers to review papers on the topic (Bashashati et al., 2007; Nicolas-Alonso and Gomez-Gil, 2012; Lotte, 2014). Technical aspects of classification algorithms are detailed in the next chapter, as they are the main focus of this thesis. In order to explain the process of translating signals into commands in MI-based BCIs, let us take the example of a BCI that predicts whether the user is imagining moving the left hand or the right hand using only EEG signals recorded from his cortex (this can be easily generalized to more than two classes). Figure 2.12 illustrates the signal preprocessing, feature extraction and classification pipeline of this BCI.

Figure 2.12: Illustration of the signal processing and classification pipeline in MI-based BCIs (adapted from Blankertz et al., 2008)


2.6.1 Calibration phase
Before being able to reliably predict the user's intent from his brain activity, the BCI system needs to know which specific characteristics of the recorded EEG signals allow distinguishing the two motor imagery tasks. To this end, a calibration phase during which the user interacts with the BCI in a cue-based mode is necessary. During this phase, a cue appears repeatedly on a computer screen asking the user to perform one of the two motor imagery tasks during predefined time intervals. After many trials of each motor imagery task, the recorded EEG signals are labeled (e.g., 0 for left hand motor imagery and 1 for right hand motor imagery). These labeled signals are used by the system to learn a mapping function (classifier) that assigns the recorded signals the appropriate labels. But before the classifier can be learned, the raw EEG signals need to be preprocessed and the relevant features extracted.

2.6.1.1 Preprocessing
The first preprocessing step is the choice of the time interval in each trial from which relevant features will be extracted. This choice is generally made manually (e.g., 0.5 s to 3 s after the beginning of each trial) or optimized according to the accuracy of the classifier in decoding the user's brain signals. The second preprocessing step is spectral filtering. Since sensorimotor rhythms lie between 8 Hz and 30 Hz, signals can be filtered in the 8-30 Hz frequency band, or an optimal narrow band can be selected for each user. After preprocessing, a set of labeled EEG measurements $\{(X_1, y_1), \dots, (X_T, y_T)\}$ with $X_t \in \mathbb{R}^{D \times N}$ and $y_t \in \{0,1\}$ is collected. Each EEG measurement $X_t$, $t = 1 \dots T$, corresponds to $N$ sample points recorded from $D$ electrodes (channels). The relevant features can then be extracted from the EEG signals.
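These two steps can be sketched as follows. The 0.5-3 s epoch window and the 8-30 Hz band follow the text, while the crude FFT-domain band-pass is only a stand-in for the IIR/FIR filters used in real pipelines:

```python
import numpy as np

def bandpass_fft(X, fs, lo=8.0, hi=30.0):
    """Crude FFT-domain band-pass of (D, N) signals (a stand-in for a
    proper IIR/FIR filter)."""
    F = np.fft.rfft(X, axis=-1)
    freqs = np.fft.rfftfreq(X.shape[-1], d=1.0 / fs)
    F[..., (freqs < lo) | (freqs > hi)] = 0.0
    return np.fft.irfft(F, X.shape[-1], axis=-1)

def extract_epochs(raw, cue_onsets, fs, t0=0.5, t1=3.0):
    """Cut a (D, T) continuous recording into per-trial (D, N) epochs
    taken t0..t1 seconds after each cue onset (given in samples)."""
    a, b = int(t0 * fs), int(t1 * fs)
    return [bandpass_fft(raw[:, c + a:c + b], fs) for c in cue_onsets]
```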

2.6.1.2 Feature extraction
In MI-based BCIs, two types of information are necessary for distinguishing signals related to different motor imagery tasks: spatial information and spectral information (Lotte, 2014). As the neural populations related to different limbs are located in different regions of the sensorimotor cortex, selecting the best signal sources is important for extracting relevant information. This can be done in two ways. The simplest is to manually select the electrodes (channels) placed over the regions related to the limbs used for the motor imagery tasks (e.g., electrode C3 for the right hand and electrode C4 for the left hand; see figure 2.6). However, channel selection is not optimal, because relevant information recorded by other channels is lost and non-relevant information caused by volume conduction cannot be removed. EEG signal decoding is generally much more efficient when spatial filtering is used. Spatial filtering overcomes the problem of volume conduction by localizing the sources of the signals. Different techniques can be used, such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA), but the most widely used technique in MI-based BCIs is Common Spatial Patterns (CSP) (Samek, 2014).

2.6.1.2.1 Spatial filtering using the CSP method
Because of volume conduction and noise, a multichannel EEG measurement $X \in \mathbb{R}^{D \times N}$ is generally modeled as a noisy linear mixture of neural sources of brain activity (Blankertz et al., 2011):

$$X = A S + \aleph \qquad (2.1)$$

where
- $A \in \mathbb{R}^{D \times D}$ is the mixing matrix mapping neural sources of brain activity to electrodes;
- $S \in \mathbb{R}^{D \times N}$ contains $N$ samples from $D$ neural sources;
- $\aleph \in \mathbb{R}^{D \times N}$ is the noise term, in which each column is a sample drawn from a $D$-dimensional normal distribution.

Spatial filtering consists in localizing the sources of neural activity by projecting the EEG measurements into a $d$-dimensional subspace ($d < D$) using a spatial filtering matrix $W = [w_1 \dots w_d] \in \mathbb{R}^{D \times d}$ in which each column is a spatial filter:

$$\hat{S} = W' X \qquad (2.2)$$

The estimated sources of neural activity are assumed to be uncorrelated in the subspace spanned by $W$. Since the variance of band-pass filtered EEG signals corresponds to band power, the CSP algorithm computes spatial filters that maximize the variance of band-pass filtered EEG signals from one motor imagery task while minimizing the variance of signals from the other task. Let $\Sigma^0 \in \mathbb{R}^{D \times D}$ and $\Sigma^1 \in \mathbb{R}^{D \times D}$ be the covariance matrices of band-pass filtered EEG signals recorded during left hand and right hand motor imagery, respectively:

$$\Sigma^c = \sum_{t \in c} X_t X_t', \quad c \in \{0,1\} \qquad (2.3)$$

The CSP algorithm calculates spatial filters by maximizing the Rayleigh quotient:

$$R(w) = \frac{w^T \Sigma^0 w}{w^T \Sigma^1 w} \qquad (2.4)$$

Using Lagrange multipliers, this problem is equivalent to maximizing:

$$L(\lambda, w) = w^T \Sigma^0 w - \lambda \, (w^T \Sigma^1 w - 1) \qquad (2.5)$$

The solutions of this generalized eigenvalue problem with the largest eigenvalues $\lambda$ maximize the signal band power in the left hand motor imagery task while minimizing it in the other task, and inversely for the solutions with the smallest eigenvalues. Selecting $d$ spatial filters from both ends of the eigenvalue spectrum to form the matrix $W = [w_1 \dots w_d]$ preserves the discriminant information between the two motor imagery tasks and removes non-relevant information. Generally, three spatial filters are selected for each condition ($d = 6$). Different variants of the CSP algorithm have been proposed; see (Lotte and Guan, 2011) for a literature review.

2.6.1.2.2 Band-power feature extraction
After the spatial filters have been learned from calibration data, the normalized log-variance feature vector $x_t \in \mathbb{R}^d$ corresponding to the EEG measurement $X_t \in \mathbb{R}^{D \times N}$, $t = 1 \dots T$, is calculated as follows:

$$x_t = \log\left(\frac{\mathrm{diag}(W' X_t X_t' W)}{\mathrm{trace}(W' X_t X_t' W)}\right) \qquad (2.6)$$

In this equation, the logarithmic transformation makes the feature vectors approximately normally distributed (Blankertz et al., 2008); diag(·) returns the diagonal elements of a square matrix and trace(·) returns their sum. At the end of preprocessing and feature extraction, a labeled data set $\{(x_1, y_1), \dots, (x_T, y_T)\}$ is available.
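Putting equations (2.1)-(2.6) together, a compact numpy sketch of CSP filter learning and log-variance feature extraction. It solves the generalized eigenvalue problem via whitening (a standard, numerically equivalent route), and the trace normalization of per-trial covariances is a common practical addition not spelled out in the text:

```python
import numpy as np

def csp_filters(trials_0, trials_1, d=6):
    """Learn d CSP filters from two lists of (D, N) band-passed trials."""
    def avg_cov(trials):
        return np.mean([X @ X.T / np.trace(X @ X.T) for X in trials], axis=0)
    C0, C1 = avg_cov(trials_0), avg_cov(trials_1)
    # whiten the composite covariance, then diagonalize the whitened C0;
    # this solves the generalized problem C0 w = lambda (C0 + C1) w
    evals, evecs = np.linalg.eigh(C0 + C1)
    P = evecs @ np.diag(evals ** -0.5) @ evecs.T
    lam, B = np.linalg.eigh(P @ C0 @ P)     # eigenvalues sorted ascending
    W = P @ B                               # columns are spatial filters
    # keep d/2 filters from each end of the eigenvalue spectrum
    keep = np.r_[np.arange(d // 2), np.arange(W.shape[1] - d // 2, W.shape[1])]
    return W[:, keep]

def logvar_features(W, X):
    """Normalized log-variance features of trial X (equation 2.6)."""
    v = np.diag(W.T @ X @ X.T @ W)
    return np.log(v / v.sum())
```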

2.6.1.3 Classifier training
Based on the set of observations {(𝑥1, 𝑦1) … (𝑥𝑇, 𝑦𝑇)}, the aim of a classifier is to find a rule that assigns an observation 𝑥 to one of the two classes (left hand / right hand) based on its characteristics. This is called a supervised learning task because the classifier uses labels (supervisors) to find such a rule. There are many supervised learning techniques in the literature. The most widely used technique in BCIs is Linear Discriminant Analysis (LDA). LDA assumes that both classes are normally distributed with mean vectors 𝜇𝑐 ∈ ℝ^𝑑, 𝑐 ∈ {0,1} and a common covariance matrix 𝛴 ∈ ℝ^(𝑑×𝑑), which are learned using the labeled data set collected during calibration. Given a new feature vector 𝑥, the rule used to assign the appropriate label to it (called the discriminant function) is the following:

ℎ(𝑥) = { 0, if ([𝑏; 𝜃]′·[1; 𝑥]) < 0 ; 1, otherwise }    (2.7)

where 𝜃 = 𝛴⁻¹·(𝜇1 − 𝜇0) and 𝑏 = −½·𝜃′·(𝜇0 + 𝜇1) are the parameter vector and bias of the hyper-plane separating the two classes.
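A compact sketch of equation (2.7): the LDA parameters 𝜃 and 𝑏 are estimated from two sets of labeled feature vectors, then used to classify new points. The two-dimensional Gaussian data below are synthetic and purely illustrative:

```python
import numpy as np

def lda_fit(X0, X1):
    """Parameters of the LDA hyper-plane (Eq. 2.7).
    X0, X1 : (n_c, d) arrays of feature vectors for class 0 and class 1."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # pooled (common) covariance estimate
    Sigma = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
    theta = np.linalg.solve(Sigma, mu1 - mu0)   # theta = Sigma^{-1} (mu1 - mu0)
    b = -0.5 * theta @ (mu0 + mu1)              # b = -1/2 theta' (mu0 + mu1)
    return theta, b

def lda_predict(x, theta, b):
    """h(x) of Eq. (2.7): class 0 if b + theta'x < 0, else class 1."""
    return 0 if b + theta @ x < 0 else 1

rng = np.random.default_rng(1)
X0 = rng.standard_normal((100, 2)) + np.array([-2.0, 0.0])   # class 0 cloud
X1 = rng.standard_normal((100, 2)) + np.array([+2.0, 0.0])   # class 1 cloud
theta, b = lda_fit(X0, X1)
print(lda_predict(np.array([-2.0, 0.0]), theta, b))  # point in the class 0 region
print(lda_predict(np.array([+2.0, 0.0]), theta, b))  # point in the class 1 region
```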



Figure 2.13 illustrates the idea of LDA for two-dimensional feature space.

Figure 2.13: The geometric interpretation of Linear Discriminant Analysis (adapted from Alpaydin, 2010)

2.6.2 Feedback phase
After calibration, the BCI is ready for use in self-paced interaction mode (i.e., the user performs the different motor imagery tasks at his own will). The accuracy of the BCI in predicting the user's intents during this phase, called the feedback phase, depends on the capacity of the previously learned classifier to decode his brain signals. In online interaction settings, the signal preprocessing and feature extraction pipeline is the following:

- Sliding windows of EEG signals are extracted (e.g., windows of length 1 second extracted every 200 milliseconds).
- These windows are temporally filtered in the same frequency band as in the calibration phase.
- The spatial filtering matrix 𝑊 learned during calibration is used to project the signals into the subspace of neural activity sources.
- A logarithmic variance feature vector is calculated from each spatially filtered window of EEG signals.
- Each extracted feature vector is fed to the classifier, and the feedback given by the BCI application changes according to the predicted label.
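The online pipeline can be sketched as a sliding-window loop. Everything below is a stand-in (random "EEG", an identity band-pass filter, random spatial filters and classifier weights); only the windowing logic mirrors the steps listed above:

```python
import numpy as np

FS = 250                 # sampling rate in Hz -- assumed for this sketch
WIN = FS                 # 1-second windows
STEP = FS // 5           # slide every 200 ms

def bandpass(X):
    """Placeholder temporal filter (a real system would band-pass, e.g., 8-30 Hz)."""
    return X

def log_var(X, W):
    """Log-variance features of the spatially filtered window (cf. Eq. 2.6)."""
    S = W.T @ X @ X.T @ W
    return np.log(np.diag(S) / np.trace(S))

def classify(x, theta, b):
    """Linear classifier learned during calibration (cf. Eq. 2.7)."""
    return 0 if b + theta @ x < 0 else 1

rng = np.random.default_rng(2)
eeg = rng.standard_normal((8, 5 * FS))        # 5 s of fake 8-channel EEG
W = rng.standard_normal((8, 6))               # stand-in for learned CSP filters
theta, b = rng.standard_normal(6), 0.0        # stand-in classifier parameters

labels = []
for start in range(0, eeg.shape[1] - WIN + 1, STEP):
    X = bandpass(eeg[:, start:start + WIN])   # extract + temporally filter window
    labels.append(classify(log_var(X, W), theta, b))
print(len(labels))   # number of feedback updates produced for 5 s of signal
```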

For offline data analysis, true class labels are known for both calibration and feedback phases. EEG signals from the feedback phase are cut into intervals and temporally filtered in the same way as in the calibration phase. The spatial filtering matrix and the corresponding classifier learned during the calibration phase are applied to these preprocessed signals. The accuracy of the classifier is then estimated according to the correspondence between predicted and true labels.



Although this signal processing and pattern classification pipeline is widely used in MI-based BCIs, it presents many challenging problems that limit the use of this technology in out-of-the-lab applications. These problems will be discussed in the next chapter, and approaches for overcoming them will be reviewed.

2.7 Conclusion
Extensive research on brain-computer interfaces has been conducted over decades in order to make this technology available for daily-life use. Different neuroimaging techniques have been used, various signal processing and machine learning techniques have been applied, and many proof-of-concept applications have been explored. Despite all these efforts, this technology still suffers from many problems that limit its use in out-of-the-lab applications. One major problem that remains unresolved is the difficulty of reliably decoding brain signals and translating them into commands. This problem will be investigated in the next chapter. I will focus particularly on machine learning challenges in motor imagery-based BCIs, which are among the most promising in the field of BCI research but the most difficult to set up.


3 Supervised machine learning in motor imagery-based brain-computer interfaces


“All things flow, everything runs, as the waters of a river, which seem to be the same but in reality are never the same, as they are in a state of continuous flow.” [The doctrine of Heraclitus]

Machine learning consists of learning from a set of observations in order to extract useful knowledge or predict future events (Alpaydin, 2010). When the set of observations is composed of input vectors 𝑥𝑡 ∈ 𝑋, 𝑡 = 1 … 𝑇 and output values 𝑦𝑡 ∈ 𝑌, 𝑡 = 1 … 𝑇, we talk about supervised learning. In traditional supervised learning, the observations are assumed to be independent and identically distributed (i.i.d.) according to an unknown probability distribution 𝑝(𝑥, 𝑦). Although traditional supervised learning techniques have been successfully used in many applications, today's applications present challenging problems that require out-of-the-box machine learning techniques. The field of BCI research in general, and MI-based BCIs in particular, is one of the most challenging application fields for supervised machine learning. The non-stationary nature of brain signals on the one hand, and the lack of labeled data on the other hand, make it very difficult to conceive BCI systems that are both reliable and user-friendly. This chapter is organized as follows: the first section gives an introduction to supervised learning techniques, with an emphasis on techniques widely used in MI-based BCIs. In the second section, I highlight the challenging problems of supervised machine learning in MI-based BCIs and review techniques used to overcome them. In the third section, I present approaches used for evaluating supervised learning techniques. The last section concludes the chapter.


Chapter 3: Supervised machine learning in motor imagery-based brain-computer interfaces

3.1 The classical paradigm for solving supervised learning problems
3.1.1 Different steps of learning
There are two types of supervised learning techniques: regression techniques, in which the output values are continuous (𝑌 = ℝ), and classification techniques, in which the output values are discrete (𝑌 = {1, … , 𝐶}, where 𝐶 is the number of classes). Generally, supervised machine learning techniques suppose the existence of an unknown function 𝑓: 𝑋 → 𝑌 generating the observations and try to find a function ℎ, called a hypothesis, that gives the best estimation of 𝑓 (Hastie et al., 2001). The first step of learning consists of determining an adequate hypothesis space 𝐻 from which to choose ℎ. This choice is always performed by an expert who knows the structure of the data from which the sample of observations is extracted. In MI-based BCIs, the class of linear functions is usually considered as the hypothesis space (Lotte et al., 2007). To explain this choice, let us return to the example of a BCI that predicts whether the user is performing motor imagery of the left hand or of the right hand. Suppose that the set of input vectors {𝑥𝑡 = (𝑥𝑡¹, 𝑥𝑡²)ᵀ, 𝑡 = 1 … 𝑇} consists of band power features of signals recorded from electrodes C3 (𝑥¹) and C4 (𝑥²) over the sensorimotor cortex. As we know, movement imagination of the right hand produces a power increase in EEG signals recorded from electrode C3, and movement imagination of the left hand produces a power increase in EEG signals recorded from electrode C4. This can be translated into the following rules:

- If 𝑥¹ > 𝑎 and 𝑥² < 𝑏 then 𝑦 = 1 (right hand motor imagery).
- If 𝑥¹ < 𝑎 and 𝑥² > 𝑏 then 𝑦 = 0 (left hand motor imagery).

In a naïve approach, one can visualize the data and set the thresholds 𝑎 and 𝑏 manually. But when the dimension of the feature space is high, it is difficult to set many parameters simultaneously by hand. In the machine learning approach, the simplest way to solve this problem is to automatically learn the parameters of a line separating the feature vectors of the two classes (Figure 3.1-left). When the dimension of the feature space is higher than two, a hyper-plane can solve the problem. For multi-class classification, multiple hyper-planes can be used to separate data from the different classes (Figure 3.1-right). It is possible to choose a more complex function, but this choice will be discussed later in this chapter.



Figure 3.1: Example of linear hypothesis space

The second step of learning is to choose the function ℎ from the hypothesis space 𝐻 that is as close as possible to the unknown function 𝑓. This is performed by evaluating the expected risk associated with choosing a particular function ℎ, defined as follows (Vapnik, 2000):

𝑅(ℎ) = ∫ 𝑙(𝑦, ℎ(𝑥)) 𝑑𝑝(𝑥, 𝑦)    (3.1)

where 𝑙 is called the loss function, which measures the loss incurred by mapping a feature vector to a specific output. An example of a loss function commonly used when ℎ(𝑥) is discrete is the 0/1 loss:

𝑙(𝑦, ℎ(𝑥)) = { 0 if ℎ(𝑥) = 𝑦 ; 1 otherwise }    (3.2)

Another loss function, used when ℎ(𝑥) is continuous, is the squared error:

𝑙(𝑦, ℎ(𝑥)) = (𝑦 − ℎ(𝑥))²    (3.3)

In practice, the expected risk cannot be evaluated because the joint distribution 𝑝(𝑥, 𝑦) is unknown. Since the only information we have about 𝑝(𝑥, 𝑦) is the set of observations {(𝑥𝑡, 𝑦𝑡), 𝑡 = 1 … 𝑇}, the risk is estimated empirically in the following way:

𝑅̂(ℎ) = (1/𝑇) ∑𝑡=1…𝑇 𝑙(𝑦𝑡, ℎ(𝑥𝑡))    (3.4)

Finally, the function ℎ that gives the minimal empirical risk is chosen for predicting the outputs of feature vectors not contained in the given set of observations (training set).
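The 0/1 loss of equation (3.2) and the empirical risk of equation (3.4) fit in a few lines of Python. The one-dimensional hypothesis and the data set below are invented for illustration:

```python
def zero_one_loss(y, h_x):
    """0/1 loss of Eq. (3.2)."""
    return 0 if h_x == y else 1

def empirical_risk(h, data, loss):
    """R_hat(h) of Eq. (3.4): mean loss of h over the training sample."""
    return sum(loss(y, h(x)) for x, y in data) / len(data)

# Toy hypothesis: threshold a single feature at 0 (illustrative only)
h = lambda x: 1 if x > 0 else 0
data = [(-1.2, 0), (-0.3, 0), (0.4, 1), (2.0, 1), (-0.5, 1)]
print(empirical_risk(h, data, zero_one_loss))   # 1 misclassified out of 5 -> 0.2
```

Empirical risk minimization then amounts to searching 𝐻 for the hypothesis with the smallest such value.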



3.1.2 Probabilistic interpretation
Finding a hypothesis ℎ that performs a mapping between input feature vectors and output variables can be seen, from a probabilistic point of view, as finding an estimation 𝑝̂(𝑦/𝑥) of the conditional probability of the outputs given the inputs (Vapnik, 2000). Depending on the way they estimate 𝑝(𝑦/𝑥), supervised machine learning approaches can be classified into two categories:

- Discriminative approaches, which directly estimate 𝑝(𝑦/𝑥).
- Generative approaches, which estimate the joint distribution 𝑝(𝑦, 𝑥) from training data and use the Bayes rule to calculate the posterior distribution.

Supervised machine learning techniques can only provide estimates 𝑝̂(𝑦/𝑥) (according to the training sample) of the true posterior probabilities 𝑝(𝑦/𝑥), and these estimates are always accompanied by error (Alpaydin, 2010). The estimation error is considered as a random variable with mean 𝛽 (called bias) and variance 𝜎², which can be written as:

𝛽 = 𝐸(𝑝̂) − 𝑝    (3.5)

and

𝜎² = 𝐸[(𝑝̂ − 𝐸(𝑝̂))²]    (3.6)

where 𝐸 is the expectation, 𝑝 is short for 𝑝(𝑦/𝑥) and 𝑝̂ is short for 𝑝̂(𝑦/𝑥). As the estimation error increases, the generalization capacity of the prediction model decreases. The bias term is generally related to the choice of the hypothesis space and measures how much the expected value of the estimate deviates from the correct one. On the contrary, the variance, which measures how much on average the estimation varies around its expected value when the training data change, depends on both the complexity of the hypothesis space and the training data. Figure 3.2-left illustrates the relationship between these two measures and the complexity of the hypothesis space, while Figure 3.2-right shows their relationship with the size of the training set.
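Equations (3.5) and (3.6) can be illustrated with a small Monte-Carlo simulation: the bias and variance of a simple estimator (here the sample mean of 𝑇 draws, with true value 0.5) are measured over many resampled training sets. The setup is invented for illustration; it is not an EEG experiment:

```python
import random

random.seed(0)
TRUE_P = 0.5                # the true quantity being estimated
T = 20                      # small training set, as in short BCI calibration
estimates = []
for _ in range(5000):       # many independent "training sets"
    sample = [random.random() for _ in range(T)]
    estimates.append(sum(sample) / T)

mean_est = sum(estimates) / len(estimates)
bias = mean_est - TRUE_P                                             # Eq. (3.5)
var = sum((e - mean_est) ** 2 for e in estimates) / len(estimates)   # Eq. (3.6)
print(round(bias, 3), round(var, 4))
```

With 𝑇 = 20, the bias is close to zero while the variance is close to 1/(12·𝑇); halving 𝑇 roughly doubles the variance, which is exactly the effect shown in Figure 3.2-right.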



Figure 3.2: Decomposition of the generalization error in supervised learning. Left: relationship between generalization error and hypothesis space complexity. Right: relationship between generalization error and size of training set (adapted from Alpaydin, 2010)

Since a linear hypothesis space usually suffices to separate data from different classes in MI-based BCIs, choosing a more complex hypothesis space may increase the error: the variance increases in comparison to linear functions while the bias remains relatively similar. The variance of the error also increases when the size of the training set decreases. This is one of the major problems in MI-based BCIs, because more training data means longer calibration sessions, which is not practical in out-of-the-lab applications. In the next section, I describe a classical technique that reduces the variance of the estimation error in supervised machine learning. More advanced techniques will be reviewed in the second part of this chapter.

3.1.3 Variance reduction using regularization
The situation in which the variance of the estimation error of a supervised machine learning technique is high is called overfitting (Hastie et al., 2001). The first cause of overfitting is choosing a hypothesis space more complex than the real hypothesis space of the problem at hand. When we have enough knowledge about the data, we can choose the simplest hypothesis space that resolves the learning problem without the risk of overfitting. But in many cases we do not have prior knowledge about the learning problem, and choosing a too-simple supervised learning technique may not resolve the problem at hand, which is called underfitting. An efficient approach for trading off between model complexity and performance is called regularization. It consists of adding a regularization term to the empirical risk to be minimized, as follows:

𝑅̂(ℎ) = (1/𝑇) ∑𝑡=1…𝑇 𝑙(𝑦𝑡, ℎ(𝑥𝑡)) + 𝜆·𝛷(ℎ)    (3.7)

Chapter 3: Supervised machine learning in motor imagery-based brain-computer interfaces

The regularization function 𝛷: 𝐻 → ℝ controls model complexity. The regularization parameter 𝜆 ∈ [0, +∞) is generally chosen using a validation set not used for training. High values of 𝜆 may lead to underfitting and low values of 𝜆 may lead to overfitting. The most commonly used regularization functions are the L2 norm and the L1 norm of the model parameters 𝜃, ||𝜃||₂ and ||𝜃||₁ respectively (Hastie et al., 2001). The L2 norm generally leads to smooth solutions in which large deviations around the mean are penalized. The L1 norm leads to sparse solutions in which only the entries of the parameter vector corresponding to relevant features are non-zero. This is generally useful when the dimension of the feature space is high. Regularization is not only used for controlling model complexity; it can also be used with simple models such as linear classifiers. When the size of the training set is small, outliers may cause large deviations of the decision boundary of linear classifiers. Regularization reduces the variance of the estimation error in this case by penalizing large deviations caused by outliers. This is important for EEG signal classification, in which outliers are prevalent.
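With the squared-error loss and the L2 penalty 𝛷(ℎ) = ||𝜃||², equation (3.7) has a closed-form minimizer, the standard ridge-regression solution 𝜃 = (𝑋′𝑋 + 𝜆𝑇𝐼)⁻¹𝑋′𝑦 (the exact scaling of 𝜆 is a convention choice). The sketch below uses synthetic data with few samples and many features to show the shrinking effect of the penalty:

```python
import numpy as np

rng = np.random.default_rng(3)
T, d = 15, 10                       # few samples, many features -> overfitting risk
X = rng.standard_normal((T, d))
true_theta = np.zeros(d)
true_theta[0] = 1.0                 # only one relevant feature
y = X @ true_theta + 0.1 * rng.standard_normal(T)

def ridge(X, y, lam):
    """Minimizer of (1/T) sum (y_t - theta'x_t)^2 + lam * ||theta||_2^2."""
    T, d = X.shape
    return np.linalg.solve(X.T @ X + lam * T * np.eye(d), X.T @ y)

theta_0 = ridge(X, y, 0.0)          # unregularized least squares
theta_r = ridge(X, y, 1.0)          # L2-penalized solution, shrunk towards zero
print(np.linalg.norm(theta_0) > np.linalg.norm(theta_r))
```

Replacing the L2 penalty by the L1 norm has no closed form, but yields the sparse solutions mentioned above.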

3.1.4 Examples of supervised learning techniques used in MI-based BCIs
3.1.4.1 Support Vector Machines (SVM)
In section 2.6, I introduced the linear discriminant analysis (LDA) classifier, a supervised machine learning technique that classifies data using linear decision boundaries. In this section, I describe another commonly used classification technique in BCIs, called Support Vector Machines (SVM). I discuss both the linear and the nonlinear case.

3.1.4.1.1 Linear SVM
SVM is a machine learning technique that was introduced by Vladimir Vapnik (Vapnik, 2000). Like LDA, given a training set, linear SVM tries to separate instances from different classes using linear decision boundaries. Unlike LDA, the optimal decision boundary in linear SVM is the one that maximizes the margin between data points belonging to the two different classes (Figure 3.3). The data points closest to the margin are called support vectors. In order to find the optimal hyper-plane separating samples from different classes, a linear SVM algorithm solves the following optimization problem:

min𝜃,𝑏,𝜀 ½||𝜃||₂² + 𝐶·∑𝑡=1…𝑇 𝜀𝑡
s.t. 𝑦𝑡·(𝜃′·𝑥𝑡 + 𝑏) ≥ 1 − 𝜀𝑡, 𝜀𝑡 ≥ 0, 𝑡 = 1 … 𝑇    (3.8)

Chapter 3: Supervised machine learning in motor imagery-based brain-computer interfaces

where 𝜃 and 𝑏 are respectively the parameter vector and bias term of the hyper-plane separating the two classes, 𝜀𝑡, 𝑡 = 1 … 𝑇 are slack variables introduced to handle non-separable data, and 𝐶 is a parameter controlling the amount of constraint violation introduced by the slack variables. The optimal decision function (hypothesis) is then:

ℎ(𝑥) = { 0, if (𝜃′·𝑥 + 𝑏) < 0 ; 1, otherwise }    (3.9)

Figure 3.3: Optimal decision boundary for linear SVM
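A minimal subgradient-descent sketch of the soft-margin objective in equation (3.8), rewritten as hinge loss plus L2 penalty. Labels are in {−1, +1}, as in the constraints; the data are synthetic, and the averaged-subgradient update is a simplification of what off-the-shelf SVM solvers actually do:

```python
import numpy as np

rng = np.random.default_rng(4)
# Two synthetic Gaussian clouds, labels -1 and +1
X = np.vstack([rng.standard_normal((50, 2)) - 2,
               rng.standard_normal((50, 2)) + 2])
y = np.array([-1] * 50 + [1] * 50)

theta, b, C, lr = np.zeros(2), 0.0, 1.0, 0.01
for _ in range(500):
    margins = y * (X @ theta + b)
    active = margins < 1                      # points violating the margin
    if active.any():
        # subgradient of 1/2||theta||^2 + C * mean hinge loss over violators
        g_theta = theta - C * (y[active][:, None] * X[active]).mean(axis=0)
        g_b = -C * y[active].mean()
    else:
        g_theta, g_b = theta, 0.0
    theta -= lr * g_theta
    b -= lr * g_b

acc = np.mean(np.sign(X @ theta + b) == y)
print(acc)
```

On well-separated data like this, the learned hyper-plane classifies almost all training points correctly; the points with margins below 1 at convergence are the support vectors.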

3.1.4.1.2 Nonlinear SVM: the kernel trick
In the case where the training data are not linearly separable, SVM maps the data into a high-dimensional feature space in which they become linearly separable (Figure 3.4). Instead of explicitly mapping the data to a high-dimensional space, a kernel function is used to substitute the dot products in the original feature space, which implicitly maps the data into a high-dimensional feature space. In order to apply the kernel trick, the dual of equation (3.8) is used:

max𝛼 ∑𝑡=1…𝑇 𝛼𝑡 − ½·∑𝑡,𝑙=1…𝑇 𝛼𝑡·𝛼𝑙·𝑦𝑡·𝑦𝑙·(𝑥𝑡′·𝑥𝑙)
s.t. ∑𝑡=1…𝑇 𝛼𝑡·𝑦𝑡 = 0, 0 ≤ 𝛼𝑡 ≤ 𝐶, 𝑡 = 1 … 𝑇    (3.10)

A kernel function 𝐾: ℝ^𝑑 × ℝ^𝑑 → ℝ (𝑑 is the dimension of the feature space) replaces the dot product 𝑥𝑡′·𝑥𝑙 in equation (3.10), which amounts to performing the linear separation in a different space. An example of a kernel function is the Gaussian kernel:

𝐾(𝑥𝑡, 𝑥𝑙) = exp(−||𝑥𝑡 − 𝑥𝑙||² / (2𝜎²)), 𝜎 > 0    (3.11)
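A few lines suffice to compute the Gaussian kernel of equation (3.11) and the kernel matrix that replaces the dot products 𝑥𝑡′·𝑥𝑙 in the dual problem (3.10). The points and the bandwidth σ = 1 are arbitrary:

```python
import numpy as np

def rbf(x, z, sigma=1.0):
    """Gaussian kernel of Eq. (3.11)."""
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

X = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
# Kernel (Gram) matrix: K[t, l] = K(x_t, x_l), substituted into Eq. (3.10)
K = np.array([[rbf(a, b) for b in X] for a in X])
print(np.allclose(K, K.T), np.all(np.diag(K) == 1.0))
```

Any valid kernel matrix is symmetric with ones on the diagonal for the Gaussian kernel, since 𝐾(𝑥, 𝑥) = exp(0) = 1.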


Figure 3.4: Example in which data from different classes are not linearly separable in one-dimensional space but separable in two-dimensional space

3.1.4.2 Example of a discriminative classification approach: logistic regression
SVM is an example of a discriminative supervised learning approach because it directly computes the optimal decision boundary separating the different classes. Logistic regression is another discriminative approach, in which the probability of an output class 𝑦 given an input feature vector 𝑥 ∈ ℝ^𝑑 takes the form (Hastie et al., 2001):

𝑝̂(𝑦/𝑥) = 1 / (1 + e^(−𝜃′·[1;𝑥]))    (3.12)

where 𝜃 ∈ ℝ^(𝑑+1) is the weight vector (the first element 𝜃⁰ accounts for the bias term). Given a training set {(𝑥𝑡, 𝑦𝑡), 𝑡 = 1 … 𝑇}, the weight vector 𝜃 is adjusted by optimizing the following function, called the log-likelihood:

max𝜃 (− ∑𝑡=1…𝑇 log(1 + e^(−𝑦𝑡·𝜃′·[1;𝑥𝑡])))    (3.13)

For a new feature vector 𝑥 not contained in the training set, the output label is predicted as follows:

𝑦̂ = argmax𝑦 (𝑝̂(𝑦/𝑥))    (3.14)
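A gradient-ascent sketch of equations (3.12)-(3.14) in pure Python. Labels are in {−1, +1} to match the form of the log-likelihood (3.13), a constant 1 is prepended to each input for the bias term, and the one-dimensional data set is invented:

```python
import math

def sigmoid(z):
    """Logistic function of Eq. (3.12)."""
    return 1.0 / (1.0 + math.exp(-z))

# Tiny 1-D training set; labels in {-1, +1}
data = [([-2.0], -1), ([-1.0], -1), ([1.0], 1), ([2.5], 1)]
theta = [0.0, 0.0]                        # [bias, weight]
for _ in range(2000):
    grad = [0.0, 0.0]
    for x, y in data:
        xb = [1.0] + x                    # prepend 1 for the bias
        s = sigmoid(-y * sum(t * v for t, v in zip(theta, xb)))
        for j in range(2):
            grad[j] += y * xb[j] * s      # gradient of the log-likelihood (3.13)
    theta = [t + 0.1 * g for t, g in zip(theta, grad)]

# Eq. (3.14): predict the most probable class for x = 2.0
p = sigmoid(sum(t * v for t, v in zip(theta, [1.0, 2.0])))
print(p > 0.5)   # x = 2.0 is assigned to the positive class
```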

3.1.4.3 Example of a generative classification approach: linear discriminant analysis
Although it is called "discriminant", LDA is a generative machine learning approach that needs to learn the structure of the training data in order to find the decision boundaries separating the different classes (Hastie et al., 2001). In probabilistic settings, LDA calculates the class conditional probabilities indirectly, using Bayes' theorem:

𝑝(𝑦/𝑥) = 𝑝(𝑥/𝑦)·𝑝(𝑦) / ∑𝑦∈𝑌 𝑝(𝑥/𝑦)·𝑝(𝑦)    (3.15)

where 𝑝(𝑥/𝑦) is the class-conditional density and 𝑝(𝑦) is the prior probability of class 𝑦 (∑𝑦∈𝑌 𝑝(𝑦) = 1).


The class-conditional densities are assumed to be Gaussian, which allows estimating them as follows:

𝑝̂(𝑥/𝑦) = 1 / ((2𝜋)^(𝑑/2)·(det(𝛴))^(1/2)) · exp(−½·(𝑥 − 𝜇𝑦)′·𝛴⁻¹·(𝑥 − 𝜇𝑦))    (3.16)

where 𝑑 is the dimension of the feature space, 𝛴 is the common class covariance matrix, det(𝛴) is the determinant of this matrix and 𝜇𝑦 is the mean vector of class 𝑦. For a new feature vector 𝑥, the output label is predicted in the same way as in logistic regression:

𝑦̂ = argmax𝑦 (𝑝̂(𝑦/𝑥))    (3.17)

Since the denominator in equation (3.15) is the same for all classes, equation (3.17) becomes:

𝑦̂ = argmax𝑦 (𝑝̂(𝑥/𝑦)·𝑝̂(𝑦))    (3.18)
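A generative-LDA sketch of equations (3.15)-(3.18): class means, a shared covariance matrix and class priors are estimated from training data, then a new point is assigned to the class maximizing 𝑝̂(𝑥/𝑦)·𝑝̂(𝑦). The two Gaussian clouds are synthetic:

```python
import numpy as np

def fit(X0, X1):
    """Estimate class means, shared covariance and priors from two labeled sets."""
    mu = [X0.mean(axis=0), X1.mean(axis=0)]
    Sigma = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
    n = [len(X0), len(X1)]
    priors = [c / sum(n) for c in n]
    return mu, Sigma, priors

def gaussian(x, mu, Sigma):
    """Gaussian class-conditional density of Eq. (3.16)."""
    d = len(x)
    diff = x - mu
    norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
    return np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) / norm

def predict(x, mu, Sigma, priors):
    """Eq. (3.18): argmax over classes of p(x|y) p(y)."""
    scores = [gaussian(x, mu[y], Sigma) * priors[y] for y in (0, 1)]
    return int(np.argmax(scores))

rng = np.random.default_rng(5)
X0 = rng.standard_normal((80, 2)) + np.array([-3.0, 0.0])
X1 = rng.standard_normal((80, 2)) + np.array([+3.0, 0.0])
params = fit(X0, X1)
print(predict(np.array([-3.0, 0.0]), *params),
      predict(np.array([3.0, 0.0]), *params))
```

Because both densities share the covariance 𝛴, the resulting decision boundary is linear, which recovers the discriminant function of equation (2.7).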

3.1.5 Linear versus non-linear supervised learning techniques in MI-based BCIs
In the presence of strong noise and outliers, linear supervised learning techniques are generally preferable to non-linear ones because they are less prone to overfitting (Bashashati et al., 2007). For this reason, most BCI systems are based on linear classifiers such as linear discriminant analysis and linear support vector machines (Lotte et al., 2007). However, non-linear classification methods are more efficient when the patterns to be classified are not well understood, because non-linear transformations can make it easier to distinguish different patterns in the data. In BCI applications, methods such as kernel-based SVM and neural networks have been used to create such transformations (Bashashati et al., 2007). A recent study of the use of linear and non-linear classification methods in motor imagery-based BCIs has shown that the widespread view that linear methods are ideal for BCIs should be reconsidered (Steyrl et al., 2014). The supervised learning techniques that will be presented in the rest of this document are based on linear classification methods. Nevertheless, they can easily be extended to non-linear classification methods.



3.2 Challenges of supervised learning in MI-based BCIs
The classical supervised learning scheme described so far is based on the assumption that training and test data are independently and identically distributed (i.i.d.). In EEG-based BCIs this assumption does not hold, because of the non-stationary nature of EEG signals (Krauledat, 2008). As for most multivariate time series data, the non-stationarity of EEG signals consists of a change in their distribution over time:

𝑝(𝑋𝑡1, 𝑦𝑡1) ≠ 𝑝(𝑋𝑡2, 𝑦𝑡2) for 𝑡1 ≠ 𝑡2    (3.19)

where (𝑋𝑡1, 𝑦𝑡1) and (𝑋𝑡2, 𝑦𝑡2) are the labeled multichannel EEG measurements at time steps 𝑡1 and 𝑡2, respectively. Non-stationarity, also known as data shift in machine learning (Quionero-Candela et al., 2009; Cornuéjols, 2010), may occur in three cases:

- The marginal distribution 𝑝(𝑋𝑡) changes, which is called covariate shift in machine learning (Sugiyama et al., 2007).
- The posterior distribution 𝑝(𝑦𝑡/𝑋𝑡) changes, which is known in machine learning as concept drift (Gama et al., 2014).
- Both distributions change.

In MI-based BCIs, different types of data shift may occur during the same session, for many reasons (Krauledat, 2008; Samek, 2014). They can be related to artifacts such as loose electrodes, muscular artifacts, blinking, swallowing, etc. They can also be related to changes in the mental state of the BCI user, such as tiredness or attention deficits. Furthermore, a change in experimental paradigm is also a main cause of EEG signal non-stationarity. In fact, the change of interaction mode from "cue-based" during the calibration phase to "self-paced" during the feedback phase can significantly influence the mental state of the user and consequently lead to non-stationarity in the signals recorded from his cortex. The occurrence of errors during the feedback phase may also alter the user's mental state. Whatever its cause, non-stationarity in EEG signals may have a dramatic influence on both the feature extraction and classification stages of MI-based BCIs. On the one hand, spatial filtering techniques such as the CSP algorithm use a data-driven approach in which covariance matrices are estimated using labeled data recorded during calibration. The lack of labeled data and the high dimensionality of the feature space lead to poorly estimated covariance matrices that do not represent well the underlying data generating process. Poorly estimated covariance matrices are



highly sensitive to outliers and to abrupt changes in the mental state of the BCI user, which makes the feature extraction process inefficient (Samek, 2014). On the other hand, non-stationarity leads to a change in the decision boundary of the classification algorithm between training and testing, which deteriorates the generalization capacity of the classification model. For linear classifiers this means a rotation or translation of the hyper-plane separating the different classes (Krauledat, 2008) (Figure 3.5).

Figure 3.5: Example illustrating the influence of non-stationarity on generalization capacity of linear classifiers
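The effect sketched in Figure 3.5 can be reproduced numerically: an LDA-style linear boundary is fitted on synthetic "calibration" data and then evaluated on a test set whose marginal distribution has been translated (a covariate shift). All data below are artificial:

```python
import numpy as np

rng = np.random.default_rng(6)

def sample(shift):
    """Two Gaussian classes, both translated by `shift` (simulated data shift)."""
    X0 = rng.standard_normal((200, 2)) + np.array([-1.5, 0.0]) + shift
    X1 = rng.standard_normal((200, 2)) + np.array([+1.5, 0.0]) + shift
    return X0, X1

# Fit the LDA hyper-plane on stationary "calibration" data
X0, X1 = sample(np.zeros(2))
mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sigma = (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2
theta = np.linalg.solve(Sigma, mu1 - mu0)
b = -0.5 * theta @ (mu0 + mu1)

def accuracy(X0, X1):
    pred0 = np.mean(X0 @ theta + b < 0)     # class 0 points on the correct side
    pred1 = np.mean(X1 @ theta + b >= 0)    # class 1 points on the correct side
    return (pred0 + pred1) / 2

T0, T1 = sample(np.zeros(2))                # stationary test data
S0, S1 = sample(np.array([2.0, 0.0]))       # test data after a covariate shift
print(accuracy(T0, T1) > accuracy(S0, S1))
```

The fixed hyper-plane loses accuracy on the shifted data even though the class structure itself is unchanged, which is precisely the translation effect illustrated in the figure.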

The most straightforward way to reduce the effects of brain signal non-stationarity on the decoding performance of the BCI is to use a large training set for feature extraction and classification. However, increasing the size of the training set means increasing the duration of the calibration phase, which is not practical in daily-life applications. As an alternative to this naïve approach, two main approaches have been adopted in the literature: the use of feature extraction and classification techniques that are invariant to the change, and the use of techniques that adapt to the change. In the next subsections, I review both approaches. This review is limited to supervised learning; reinforcement learning techniques such as (Grizou et al., 2014; Roset et al., 2013) are excluded.

3.2.1 Using techniques that are invariant to the change
In this category, I focus particularly on approaches based on knowledge transfer, which have been actively explored in the BCI research community in recent years. Other approaches are listed at the end of this section.

3.2.1.1 Knowledge transfer
Knowledge transfer between different sessions and subjects has been shown to be very promising in BCI applications. It consists of incorporating labeled data recorded during different sessions of the same subject and/or from different subjects into the learning process of the current BCI user. When performed correctly, knowledge transfer captures information that generalizes across users and extends to new users. However, this type of learning is challenged by the high inter-session and inter-subject variability of EEG signals. Data recorded during one session may be very different from data recorded during other sessions of the same user (Liyanage et al., 2013). This may be related to a change in the user's mental state, a change in experimental paradigm or a change in electrode placement between sessions. Variability is even higher in data recorded from different subjects (Samek et al., 2013b). In addition to changes in mental state and experimental paradigm, anatomical differences between users, such as the size of the head, may have a significant effect on EEG signal variability. In order to alleviate the problem of brain signal variability, different machine learning techniques have been used for performing knowledge transfer in MI-based BCIs. Figure 3.6 gives an overview of the most commonly used techniques, which are reviewed in this section.

3.2.1.1.1 Regularization using data from other subjects and/or sessions
Regularization allows performing knowledge transfer by combining parameters of the feature extraction and/or classification algorithm learned using data from the current user with parameters learned using data from other users and/or sessions. In (Kang et al., 2009), a linear combination of class covariance matrices of calibration data recorded from the current BCI user and covariance matrices of data recorded from other subjects was proposed for reducing calibration time in MI-based BCIs. (Lotte and Guan, 2010) proposed another regularization framework for inter-subject classification, in which the class means and covariance matrices of the CSP algorithm and LDA classifier learned using a small calibration set were regularized using data from other subjects. A unifying theoretical framework for regularizing CSP filters, in which learning from multiple subjects is taken into consideration, was proposed in (Lotte and Guan, 2011).
3.2.1.1.2 Multi-task learning
Multi-task learning consists of simultaneously training multiple learning tasks in order to capture shared information and increase the generalization ability of each learner (Ben-David and Schuller, 2003). It is generally useful when the size of the training set of each learning task is small. Multi-task learning has been used to reduce calibration time in MI-based BCIs. (Devlaminck et al., 2011) proposed a multi-task learning framework for learning the CSP features of the current BCI user based on his own data in conjunction with data from other users. Bayesian multi-task learning was used for learning CSP filters and corresponding probabilistic classification models using data from different BCI users (Kang and Choi, 2011). (Alamgir et al., 2010) showed that combining data from different subjects with the calibration data of a new subject using multi-task learning outperforms a classifier learned using only subject-specific data.

3.2.1.1.3 Multiple kernel learning
Multiple kernel learning has been widely used for feature fusion in different applications such as computer vision. (Samek et al., 2013a) applied it to inter-subject classification in MI-based BCIs.


Instead of learning a single kernel as in classical SVM, multiple linear kernels are learned using data from different subjects. The contributions of the other subjects are weighted according to their importance in classifying data from the new subject.

3.2.1.1.4 Ensemble learning
Ensemble methods are among the most widely used techniques for learning classification models using data from different sources. They consist of combining the outputs of multiple learners in order to create prediction models that have better generalization abilities than any single learner (see chapter 4). In MI-based BCIs, ensemble learning has been used to address the problems of long calibration time and EEG signal non-stationarity. In (Fazli et al., 2009), a large database of EEG signals recorded from 83 subjects was used to create a subject-independent classification model. A regularized linear combination of classifiers learned using data recorded from different subjects and sessions achieved acceptable classification rates without using calibration data from the current BCI user. Using a small calibration set to adjust the classifiers' weights significantly increased classification accuracy. Two dynamically-weighted ensemble frameworks have been proposed to deal with the problem of non-stationarity in MI-based BCIs. Dynamic classifier weighting consists of recalculating the classifiers' weights in the ensemble for each incoming feature vector during the feedback phase, according to their accuracies in classifying samples in its neighborhood. The first framework aimed at choosing CSP filters that generalize across subjects and adapt to new subjects (Tu and Sun, 2012). The second framework consisted of learning multiple classifiers using clustered data from a previous session, which allows suppressing the calibration phase in a new session of the same subject (Liyanage et al., 2013).
3.2.1.1.5 Other techniques used for knowledge transfer in MI-based BCIs
Other techniques have been used for inter-session and/or inter-subject classification in MI-based BCIs. Feature space transformation using data from different subjects was attempted by (Heger et al., 2013) and (Samek et al., 2013b). (Krauledat et al., 2006; Krauledat et al., 2008) proposed to derive prototypical CSP filters that generalize across sessions, using a clustering technique based on the cosine distance. Last but not least, an adaptive scheme that starts with a subject-independent classifier learned using manually selected data sets from a large database of EEG signals recorded from different subjects was proposed in (Vidaurre et al., 2011c).



Figure 3.6: Categorization of machine learning techniques used for performing knowledge transfer in MI-based BCIs

3.2.1.2 Other techniques that are invariant to the change
Other techniques, such as finding a stationary subspace or robust covariance matrix estimation, have also been used to make prediction models invariant to the change and to reduce calibration time in MI-based BCIs (Samek, 2014). Using a priori physiological information about which features and channels are robust to the change has also been attempted in MI-based BCIs (Grosse-Wentrup et al., 2009; Ahn et al., 2011). (Lotte, 2015) recently proposed a method for artificial data generation using a small calibration set, which has been shown to be efficient in classifying EEG signals from motor imagery data sets.

3.2.2 Using techniques that adapt to the change

In this category, the learning of spatial filters and classifiers does not stop at the end of the calibration phase: online adaptation of their parameters continues during the feedback phase (Sun and Zhou, 2014). Different techniques have been attempted for online adaptation of spatial filters and classifiers in MI-based BCIs. They can be classified into three categories: supervised adaptation techniques, in which true class labels are available after the classification of each feature vector; unsupervised adaptation techniques, in which class labels are unknown; and semi-supervised adaptation techniques, in which class-specific information is provided but uncertain (Figure 3.7).

3.2.2.1 Supervised adaptation techniques

Several supervised adaptation techniques have been attempted for non-stationary data classification in MI-based BCIs. (Vidaurre et al., 2006) proposed supervised adaptation of the class covariance matrices of a quadratic discriminant analysis (QDA) classifier based on a technique for measuring the quality of the signals used for adaptation, called adaptive estimation of the information matrix (ADIM). The same authors proposed an online adaptation technique for the weight vector of a linear discriminant analysis (LDA) classifier based on Kalman filtering, called Kalman Adaptive LDA (KALDA) (Vidaurre et al., 2007). (Buttfield and Millán, 2006) performed supervised adaptation of a Gaussian classifier using labeled data during the training session in order to increase the learning capacity of the user. Different online adaptation techniques at the feature extraction and classification levels were attempted by (Shenoy et al., 2006) in order to study non-stationarity within the same session. (Sun et al., 2011) used stochastic gradient descent methods to update Bayesian classifiers with Gaussian mixture models in a supervised manner. Although supervised adaptation approaches were effective in classifying non-stationary EEG data, their use in self-paced interaction settings is impossible since true class labels are not provided.

3.2.2.2 Unsupervised adaptation techniques

The only unsupervised adaptive classification framework for MI-based BCIs that I am aware of was proposed by (Vidaurre et al., 2011a). In this framework, the common class covariance matrix and the pooled mean of an LDA classifier were updated using the feature vectors received during the feedback phase. Unsupervised approaches can be useful in self-paced interaction mode since they do not need class labels for adaptation. However, they only allow updating class-independent parameters of the classifiers, which makes them effective in the case of covariate shift but not for tracking concept drift.
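The class-independent update at the heart of this family of approaches can be sketched as follows. This is only a toy illustration of the idea (a pooled feature mean adapted without labels, re-biasing a fixed linear boundary); the class and parameter names are invented for the sketch and are not taken from (Vidaurre et al., 2011a):

```python
import numpy as np

class PooledMeanAdaptiveLDA:
    """Toy sketch of unsupervised adaptation: only a class-independent
    parameter (the pooled feature mean) is updated online, without labels."""

    def __init__(self, w, pooled_mean, eta=0.05):
        self.w = np.asarray(w, dtype=float)             # fixed discriminant direction
        self.mu = np.asarray(pooled_mean, dtype=float)  # adaptive pooled mean
        self.eta = eta                                  # adaptation rate

    def predict(self, x):
        # Decision rule: sign of w . (x - pooled mean); labels in {0, 1}.
        return int(self.w @ (np.asarray(x, dtype=float) - self.mu) > 0)

    def update(self, x):
        # Unsupervised update: no label is needed, only the pooled mean moves.
        self.mu = (1.0 - self.eta) * self.mu + self.eta * np.asarray(x, dtype=float)

# Toy illustration: a covariate shift (a common offset added to all trials)
# is absorbed by the pooled-mean update without any labels.
rng = np.random.default_rng(0)
clf = PooledMeanAdaptiveLDA(w=[1.0, 0.0], pooled_mean=[0.0, 0.0], eta=0.2)
shift = np.array([5.0, 0.0])
for _ in range(50):                        # stream of unlabeled, shifted samples
    clf.update(shift + rng.normal(scale=0.1, size=2))
```

After the unlabeled stream, the decision boundary sits near the shifted data, so trials on either side of it are separated again, even though no label was ever provided.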

3.2.2.3 Semi-supervised adaptation techniques

In the absence of true class labels during self-paced interaction mode, two approaches have been attempted in MI-based BCIs to estimate class-specific information. The first approach is based on the predictions of the previously trained classifier, which is called "naïve labelling" (Zeyl and Chau, 2014). (Blumberg et al., 2007) proposed an adaptive LDA classifier (ALDA) based on expectation maximization. In the expectation step, class posteriors are calculated for each incoming feature vector using the previously calculated class means and common covariance matrix. In the maximization step, the parameters of the LDA classifier are updated using these posterior probabilities. An extended expectation maximization algorithm was proposed by (Li and Guan, 2006) for online adaptation of CSP features and a Bayesian classifier using the classifier's predictions during the feedback phase. Both (Oskoei et al., 2009) and (Qin and Li, 2006) used naïve labelling for online adaptation of a SVM classifier. Naïve labelling approaches may perform well in some cases, but they are prone to error accumulation (Kuncheva et al., 2008). The second type of semi-supervised classifier adaptation techniques in MI-based BCIs is based on error-related potentials (ErrPs). These potentials are detected in the EEG signals of a BCI user just after the occurrence of an error (Chavarriaga et al., 2014). One particular type of ErrPs, called interaction ErrPs (iErrPs), seems very promising for conceiving adaptive classification algorithms in BCIs. iErrPs occur immediately after the user perceives that the feedback provided by the BCI contradicts his intent (Ferrez and Millán, 2008), which provides additional information about the predicted class labels. (Blumberg et al., 2007) proposed to include ErrPs in their ALDA algorithm by weighting class posterior probabilities by the reliability of the error signals. The new algorithm, called ALDA with Error Correction (ALDEC), was evaluated using simulated ErrPs with different levels of reliability. Interaction error-related potentials were used for the first time in realistic interaction settings by (Llera et al., 2011). An adaptive logistic regression classifier was used to classify magnetoencephalography (MEG) signals, in which adaptation was performed only when no iErrP was detected. (Artusi et al., 2011) used iErrPs to increase the decoding accuracy of a BCI system based on the classification of the speed of motor imagery tasks. Updating the training set with samples estimated as correct, based on the decoding of error potentials, increased the decoding performance of the task classifier. Recently, (Zeyl and Chau, 2014) showed that online adaptation of LDA and logistic regression classifiers based on iErrPs increases classification accuracy in comparison to a static classifier. They used a procedure for simulating iErrPs which allows assessing the influence of the accuracy of the iErrP classifier on the task classifier.
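The gating idea common to these iErrP-based schemes can be sketched with a deliberately simplified nearest-mean classifier. The helper below is hypothetical and only illustrates the control flow (adapt on the self-predicted label unless an error potential was detected); it does not reproduce any specific published algorithm:

```python
import numpy as np

def errp_gated_update(class_means, x, predicted_label, errp_detected, eta=0.1):
    """Hypothetical sketch of iErrP-gated naive labelling: the mean of the
    predicted class of a nearest-mean classifier moves toward x, using the
    classifier's own prediction as the label, but only when no interaction
    error potential followed the feedback. All names are illustrative."""
    if errp_detected:
        # Feedback contradicted the user's intent: the prediction is likely
        # wrong, so the sample is not used for adaptation.
        return class_means
    updated = [m.copy() for m in class_means]
    updated[predicted_label] = ((1.0 - eta) * updated[predicted_label]
                                + eta * np.asarray(x, dtype=float))
    return updated

# The mean of the predicted class moves only when no iErrP is seen:
means = [np.array([0.0]), np.array([1.0])]
means = errp_gated_update(means, [2.0], predicted_label=1,
                          errp_detected=False, eta=0.5)  # mean of class 1 moves
means = errp_gated_update(means, [9.0], predicted_label=1,
                          errp_detected=True)            # sample skipped
```

The gating step is what distinguishes this family from plain naïve labelling: samples whose predicted label is contradicted by an error potential never enter the adaptation, which limits error accumulation.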

Figure 3.7: A review of state-of-the-art online classifiers adaptation techniques in MI-based BCIs

3.2.3 Robustness versus adaptiveness of supervised machine learning techniques in MI-based BCIs

Although many out-of-the-box machine learning techniques have been attempted, the problem of reducing calibration time while maintaining good classification performance in motor imagery-based BCIs is still not resolved (Lotte, 2015). On the one hand, robust classification techniques in MI-based BCIs assume the existence of information that is invariant to brain signal non-stationarity and try to capture it in order to increase the generalization capacity of the system. For knowledge transfer-based BCI systems, this information is supposed to be shared across data recorded from different subjects and/or during multiple sessions. Although they have been shown to be successful in previous studies, these techniques may fail to increase the performance of BCI systems because the assumption on which they are based may not hold. This is mainly due to the high variability of brain signals between different subjects, between different sessions of the same subject, and between trials of the same session. On the other hand, adaptive classification techniques try to incorporate data recorded online into the learning process in order to track brain signal non-stationarity. The learning process starts with only a few labeled data samples from the target subject and does not allow taking advantage of the abundant labeled data from other subjects and/or previous sessions of the same subject. These findings highlight the need for classification algorithms that make a compromise between robustness and adaptiveness in order to deal with the problems of brain signal non-stationarity and long calibration time in BCI systems.

3.3 Performance evaluation in supervised machine learning

This section gives a short overview of the evaluation metrics for supervised machine learning techniques that will be used in the rest of the manuscript.

3.3.1 Evaluation in stationary environments

It is important to assess the generalization ability of a classification model $h$ before using it for predicting the outputs of future data samples. Depending on the size of the labeled data set $D = \{(x_t, y_t), t = 1 \ldots T\}$ at hand, offline evaluation of a prediction model can be performed in different ways. When the data sample is large, it is generally divided into a training set $D_{train}$ of size $N_{train}$ used for learning the hypothesis $h$ and a test set $D_{test}$ of size $T_{test}$ used for evaluating its generalization capacity. Let $h(\cdot/D_{train})$ be the prediction model learned using the training set and $l(y_t, h(x_t/D_{train}))$ be the error incurred by this model on the observation $(x_t, y_t)$ (e.g., $l$ is the mean squared error or the 0/1 loss). The error of the learned model in classifying the test data can be calculated as follows (Lemm et al., 2011):

$$err_{test} = \frac{1}{T_{test}} \sum_{t \in D_{test}} l(y_t, h(x_t/D_{train})) \qquad (3.20)$$

The test error is a value between 0 and 1. For binary classification with the 0/1 loss function, when it is close to 0 the classification model has good generalization capacity, and when it is close to or higher than 0.5 the classification model is no more accurate than a random classifier. Classification accuracy is inversely proportional to the test error: $acc_{test} = 1 - err_{test}$.


When the labeled data sample $D$ is small, cross-validation (CV) is used for offline assessment of the generalization abilities of classification models. It consists of randomly splitting the data set $D$ into $K$ disjoint subsets of equal sizes $D_1, \ldots, D_K$ and then evaluating the model $K$ times. Each time, the model is trained using all subsets except one, which is left out for testing. The generalization error is estimated by averaging the classification error over all folds:

$$err_{CV} = \frac{1}{T} \sum_{k=1}^{K} \sum_{t \in D_k} l(y_t, h(x_t/D \setminus D_k)) \qquad (3.21)$$

A particular case of cross-validation, called leave-one-out cross-validation (LOOCV), is used when the labeled set is very small. It corresponds to $K = T$, which means that in each validation step one sample $(x_t, y_t), t = 1 \ldots T$ is left out for testing and the remaining samples are used for classifier training. The LOOCV error is calculated as follows:

$$err_{LOOCV} = \frac{1}{T} \sum_{t=1}^{T} l(y_t, h(x_t/D \setminus \{(x_t, y_t)\})) \qquad (3.22)$$
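Equations (3.21) and (3.22) can be sketched with a generic `fit`/`predict` pair standing in for any learner; the nearest-class-mean classifier below is chosen only to keep the example self-contained, and shuffling before splitting is omitted for brevity:

```python
import numpy as np

def cv_error(X, y, fit, predict, K):
    """Sketch of K-fold cross-validation error, equation (3.21); with
    K = len(y) it reduces to the leave-one-out error of equation (3.22)."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    folds = np.array_split(np.arange(len(y)), K)   # K disjoint subsets
    errors = 0
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        model = fit(X[train_idx], y[train_idx])
        errors += np.sum(predict(model, X[test_idx]) != y[test_idx])
    return errors / len(y)                          # 0/1 losses summed over all folds

# Nearest-class-mean classifier as a stand-in learner:
fit = lambda X, y: {c: X[y == c].mean(axis=0) for c in np.unique(y)}
predict = lambda m, X: np.array(
    [min(m, key=lambda c: np.linalg.norm(x - m[c])) for x in X])

X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])
err_cv = cv_error(X, y, fit, predict, K=3)        # 3-fold CV error
err_loocv = cv_error(X, y, fit, predict, K=len(y))  # LOOCV error
```

On this well-separated toy data set every held-out sample is classified correctly, so both error estimates are zero.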

This general validation scheme can be used for classification models without hyper-parameters, such as linear discriminant analysis (LDA). However, many classification models use hyper-parameters that need to be adjusted, such as regularized LDA or a SVM classifier. In this case, the labeled set has to be split into three subsets:

- A training set $D_{train}$ used for learning the model parameters.

- A validation set $D_{validation}$ used for adjusting the model's hyper-parameters.

- A test set $D_{test}$ used for assessing the generalization capacity of the model.

Cross-validation can be used in two loops, where the inner loop serves for hyper-parameter adjustment and the outer loop for generalization capacity assessment (Lemm et al., 2011).

3.3.2 Evaluation in non-stationary environments

The classical model evaluation scheme described so far holds only when the samples in the labeled set $D$ are assumed to be independent and identically distributed. Because of brain signal non-stationarity in BCI systems, using this scheme for offline evaluation of a prediction model does not tell us whether the model will perform well in online interaction settings.

3.3.2.1 Stochastically-dependent data

In neuroscience in general, and in human brain research in particular, experiments usually have a block design in which experimental conditions vary from one block to another (Figure 3.8). Data recorded during the same block are generally supposed to be stochastically dependent, while data recorded during different blocks are supposed to be stochastically independent (Lemm et al., 2011). When we randomly split the data set for classifier evaluation, a trial in the test set is very likely to be classified correctly if trials from the same block are in the training set. Even if it performs well during evaluation, the classifier may not perform well in classifying new data samples. Thus, taking into consideration the stochastic dependence between data samples during evaluation is important for assessing the generalization capacity of classification models.

Figure 3.8: Schema of a block-design experiment where data in the same block are stochastically dependent and data from different blocks are stochastically independent (Lemm et al., 2011)

3.3.2.2 Data from different sources

Another situation that we often face in human brain research is when we have multiple data sets from different sources (e.g., brain signals recorded during different sessions and/or from different subjects). Pooling all the data sets together and randomly splitting them is not an appropriate way of evaluating classification models, because each data set is supposed to be distributed differently from the others. In this case, leave-one-dataset-out cross-validation is generally used for assessing the generalization capacity of prediction models. Given $K$ data sets, a prediction model is learned each time using $K - 1$ data sets and evaluated on the remaining data set, and the classification accuracy is then averaged over all iterations (Figure 3.9).

Figure 3.9: Schema of an experiment in which data are recorded from different individuals and/or during different sessions
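A minimal sketch of leave-one-dataset-out cross-validation, with a nearest-class-mean classifier standing in for the BCI pipeline (all names are illustrative):

```python
import numpy as np

def leave_one_dataset_out(datasets, fit, predict):
    """Sketch of leave-one-dataset-out cross-validation: each dataset (one
    subject or session) is held out in turn, a model is trained on the pooled
    remaining datasets, and the held-out accuracies are averaged."""
    accuracies = []
    for k in range(len(datasets)):
        X_test, y_test = datasets[k]
        X_train = np.vstack([X for i, (X, _) in enumerate(datasets) if i != k])
        y_train = np.concatenate([y for i, (_, y) in enumerate(datasets) if i != k])
        model = fit(X_train, y_train)
        accuracies.append(float(np.mean(predict(model, X_test) == y_test)))
    return float(np.mean(accuracies))

# Nearest-class-mean classifier as a stand-in learner:
fit = lambda X, y: {c: X[y == c].mean(axis=0) for c in np.unique(y)}
predict = lambda m, X: np.array(
    [min(m, key=lambda c: np.linalg.norm(x - m[c])) for x in X])

# Two toy "sessions" with slightly different but compatible distributions:
sessions = [(np.array([[0.0], [1.0]]), np.array([0, 1])),
            (np.array([[0.1], [0.9]]), np.array([0, 1]))]
acc = leave_one_dataset_out(sessions, fit, predict)
```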

3.4 Conclusion

In this chapter, supervised machine learning techniques in motor imagery-based BCIs were reviewed. In the first part of the chapter, I presented the classical paradigms for solving supervised learning problems, with emphasis on the supervised learning techniques used in BCI systems. In the second part, I highlighted the most challenging problems of supervised learning in MI-based BCIs. The first problem is the long calibration time before every use of the BCI. Reducing calibration time means less time for the user to learn to voluntarily modulate his sensorimotor rhythms, and less data for the system to create a reliable prediction model. The second problem is the non-stationary nature of brain signals. This is a challenging problem because it violates the independent and identically distributed data assumption on which most traditional machine learning techniques are based. Two categories of machine learning techniques have been attempted to address these problems, namely techniques that are invariant to non-stationarity and techniques that adapt to it. An extensive review of these techniques was conducted in this chapter, ending with a critical analysis of the two main categories of classification techniques used in MI-based BCIs. The last part of the chapter was dedicated to the performance evaluation of supervised learning techniques, with a distinction between stationary and non-stationary environments. In the rest of this manuscript, I will present the two main contributions of this thesis, which consist of classification algorithms combining both robustness and adaptiveness in order to deal with the problems of brain signal non-stationarity and long calibration time in BCI systems. In the first contribution, I present a knowledge transfer framework that deals with the problem of high variability of brain signals by combining learning from other subjects and/or sessions with learning from the target subject. The second contribution consists of a learning system that combines both knowledge transfer and online adaptation. Because these two contributions are based on ensemble methods, the next chapter starts with an overview of these methods and their use in motor imagery-based brain-computer interfaces.


4 Ensemble-based knowledge transfer in brain-computer interfaces: from theory to practice


“When plugged in, the least elaborate computer can be relied on to work to the fullest extent of its capacity. The greatest mind cannot be relied on for the simplest thing; its variability is its superiority.” [Jacques Barzun]

Among all the machine learning techniques presented in the previous chapter, ensemble methods were chosen in this thesis for two reasons: the first is their capacity to combine heterogeneous data from different sources; the second is their flexibility, since different types of combination techniques can be used. This chapter is devoted to the study of linear combinations of classifiers in knowledge transfer-based BCI systems. It starts with a background on ensemble methods in machine learning. Then, a Bayesian analytical model of linear combiners is presented, and the use of these techniques for knowledge transfer, with emphasis on MI-based BCIs, is discussed. After setting up the background, I propose a novel ensemble-based knowledge transfer framework for reducing calibration time in BCIs. In this framework, I investigate the effectiveness of weighting classifiers learned from different subjects and/or sessions according to their performances in classifying calibration data from a new subject. This allows creating a classification model that captures the information shared between different subjects and/or sessions while adapting to the specificities of the new subject. I also investigate situations in which knowledge transfer has a negative impact on the classification performance of the target learning task, and highlight the need to combine learning from other subjects and/or sessions with learning from the target subject in order to avoid such situations.

Chapter 4: Ensemble-based knowledge transfer in brain-computer interfaces: from theory to practice

4.1 Combining multiple learners

Combining multiple learners aims at conceiving prediction models that have better generalization abilities than any single learner (Kuncheva, 2004b). Although different learners may have similar training performances, they may have different generalization abilities, which makes combining their decisions a more judicious choice than just picking one learner (Figure 4.1).

Figure 4.1: Illustration of statistical reason for combining multiple learners (adapted from Kuncheva, 2004b)

Two main conditions should be satisfied in order to conceive multiple classifier systems (also called ensemble learners) with good generalization abilities: the combined classifiers (known as base classifiers) should each reach a minimum level of individual accuracy, and their outputs should be decorrelated so that they complement each other. To satisfy these two conditions, two points have to be addressed (Alpaydin, 2010):

- How to generate diverse classifiers that each reach a minimum level of individual accuracy?

- How to combine the base classifiers in order to achieve maximum ensemble accuracy?

4.1.1 Generating complementary learners

Diversity of base classifiers has been one of the hottest topics in the design of classifier ensembles in recent years (Didaci et al., 2013). Here I present practical approaches used for generating diverse ensembles. The theoretical foundations of diversity in multiple classifier systems will be discussed in the next section.

4.1.1.1 Using different classification techniques

Using different training algorithms may allow generating base learners with different outputs, which promotes ensemble diversity. As an example, generative and discriminative techniques use different approaches to learning, and combining both may be useful for generating accurate ensemble learners. Another example is using the same training algorithm with different parameters, such as a SVM with different types of kernels.

4.1.1.2 Using different feature representations

Training base classifiers using different feature representations may produce diverse ensembles because each one looks at the problem from a different perspective. This is generally useful for high-dimensional feature spaces or for multimodal data. The most common technique in this category is the random subspace ensemble method (Ho, 1998), which randomly splits the feature space into multiple subspaces and trains a single learner on each subspace.

4.1.1.3 Using different training sets

This is the most commonly used approach for generating diverse ensemble learners (Kuncheva, 2004b). Techniques such as bagging and boosting are widely used for generating different subsets of data, even when the training set is small. Bagging generates slightly different samples from the same data set by sampling with replacement. Boosting uses a different approach in which complementary base learners are generated by training each new learner on the mistakes of the previous one. When multiple training sets from different sources are available, using an ensemble method is a straightforward choice for conceiving efficient learners.
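The bagging procedure described above can be sketched for a toy one-dimensional binary problem; the midpoint-threshold base learner is chosen only to keep the example self-contained:

```python
import numpy as np

def bagging_predict(X_train, y_train, X_test, n_learners=11, seed=0):
    """Minimal bagging sketch for 1-D binary data: each base learner (a simple
    midpoint threshold) is trained on a bootstrap sample drawn with
    replacement, and predictions are combined by uniform majority vote."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    votes = np.zeros(len(X_test))
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)      # bootstrap: sample with replacement
        Xb, yb = X_train[idx], y_train[idx]
        if (yb == 0).any() and (yb == 1).any():
            thr = (Xb[yb == 0].mean() + Xb[yb == 1].mean()) / 2.0
        else:
            thr = X_train.mean()              # degenerate one-class bootstrap sample
        votes += (X_test > thr).astype(float)
    return (votes > n_learners / 2).astype(int)  # uniform majority vote

X_train = np.array([0.0, 0.1, 0.2, 0.8, 0.9, 1.0])
y_train = np.array([0, 0, 0, 1, 1, 1])
pred = bagging_predict(X_train, y_train, np.array([0.05, 0.95]))
```

Each bootstrap draw yields a slightly different threshold, which is exactly the "slightly different samples from the same data set" mechanism the text describes.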

4.1.2 Classifiers combination techniques

There are multiple ways of combining learners, which can be classified into two categories: classifier selection and classifier fusion (Kuncheva, 2004b).

4.1.2.1 Classifiers selection

Selection is a local approach to classifier combination in which each base classifier is supposed to be specialized in one region of the feature space and responsible for generating outputs for feature vectors in that region. The most common approach in this category is the mixture of experts, in which a gating model chooses the base learner responsible for predicting the output label of each incoming feature vector (Waterhouse, 1998). Cascading is a particular case of selection in which a classifier produces a prediction when an input feature vector $x$ is submitted; when its prediction is uncertain, $x$ is transmitted to the next classifier.

4.1.2.2 Classifiers fusion

When each base classifier has knowledge of the whole feature space, fusion is generally used for combining the classifiers' outputs. Different types of combiners are used for classifier fusion, but the most common ones are the linear combiners (Biggio et al., 2007).


Let $\{h_k, k = 1 \ldots K\}$ be a set of classifiers. Given a feature vector $x \in X$, each classifier outputs a value $h_y^k(x)$ for each class label $y \in Y$. These outputs can be linearly combined in order to predict the output label for $x$ as follows:

$$\hat{y} = \operatorname{argmax}_y \sum_{k=1}^{K} w_k \, h_y^k(x) \qquad (4.1)$$

where $w_k, k = 1 \ldots K$ are the non-negative weights assigned to the decisions of the base classifiers. We distinguish two cases according to the values of the classifiers' outputs: discrete outputs and continuous outputs. When the base classifiers produce binary outputs, the linear combination is simply a weighted majority vote:

$$h_y^k(x) = \begin{cases} 1 & \text{if classifier } h_k \text{ assigns label } y \text{ to } x \\ 0 & \text{otherwise} \end{cases} \qquad (4.2)$$

The uniform majority vote is a particular case in which all base classifiers are assigned the same weight: $w_k = \frac{1}{K}, k = 1 \ldots K$. Using this abstract level of outputs does not account for the degree of uncertainty of the classifiers' decisions. This can be dealt with in the weighted average combination technique, in which base classifiers produce continuous outputs that are generally empirical estimates of posterior probability distributions. In the next section, theoretical aspects of these techniques are presented in order to illustrate how combining the outputs of multiple learners reduces the estimation error of the posterior probabilities.
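A minimal sketch of the weighted majority vote of equations (4.1)-(4.2):

```python
import numpy as np

def weighted_majority_vote(crisp_outputs, weights):
    """Equations (4.1)-(4.2) for crisp base-classifier outputs: classifier h_k
    casts one vote, crisp_outputs[k], for the current feature vector, and the
    votes are combined with non-negative weights; the label with the largest
    weighted support wins."""
    crisp_outputs = np.asarray(crisp_outputs)
    weights = np.asarray(weights, dtype=float)
    support = {y: weights[crisp_outputs == y].sum()
               for y in np.unique(crisp_outputs)}
    return max(support, key=support.get)

# Three classifiers vote 0 and two vote 1; the weights overturn the raw majority:
label = weighted_majority_vote([0, 0, 0, 1, 1], [0.1, 0.1, 0.1, 0.4, 0.4])
```

With uniform weights ($w_k = 1/K$) the same function reduces to the plain majority vote.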

4.2 Theory of linear combiners

Of the various combining rules existing in the literature, linear combiners are the most widely used in multiple classifier systems (Biggio et al., 2009; Erdogan and Sen, 2010). Despite their wide use, the theoretical foundations of these combination rules are still debated. The Bayesian analytical model introduced by Tumer and Ghosh is one of the first works on this topic (Tumer and Ghosh, 1996; Tumer and Ghosh, 1995). It quantifies the advantage attainable by the simple average of classifiers in terms of the reduction of the estimation error of the posterior probabilities. Fumera and Roli (Fumera and Roli, 2003; Fumera and Roli, 2005; Biggio et al., 2007; Biggio et al., 2009) discussed the assumptions on which this model was based and extended it to the weighted average rule. This Bayesian analytical framework applies to classifiers that provide probabilistic outputs, such as neural networks and linear discriminant analysis. Let $\{h_k, k = 1 \ldots K\}$ be a set of classifiers. Given a feature vector $x \in X$, each classifier outputs an estimate of the posterior probability of each class label $y \in Y$ that can be written as:

$$h_y^k(x) = p(y/x) + \varepsilon_y^k(x), \quad k = 1 \ldots K \qquad (4.3)$$

where $\varepsilon_y^k(x)$ is the estimation error, considered as a random variable with mean (bias) $\beta_y^k$ and variance $(\sigma_y^k)^2$. The posterior probability of class $y$ given $x$ can also be estimated using a simple average (denoted "sa") of the set of classifiers in the following way:

$$h_y^{sa}(x) = \frac{1}{K} \sum_{k=1}^{K} h_y^k(x) = p(y/x) + \varepsilon_y^{sa}(x) \qquad (4.4)$$

where

$$\varepsilon_y^{sa}(x) = \frac{1}{K} \sum_{k=1}^{K} \varepsilon_y^k(x) \qquad (4.5)$$

The estimation error of the simple average ensemble is also a random variable, with bias $\beta_y^{sa}$ and variance $(\sigma_y^{sa})^2$. Tumer and Ghosh (Tumer and Ghosh, 1996) showed that they can be expressed as:

$$\beta_y^{sa} = \frac{1}{K} \sum_{k=1}^{K} \beta_y^k \qquad (4.6)$$

and

$$(\sigma_y^{sa})^2 = \frac{1}{K^2} \sum_{k=1}^{K} (\sigma_y^k)^2 + \frac{1}{K^2} \sum_{k=1}^{K} \sum_{l \neq k} \rho_y^{kl} \sigma_y^k \sigma_y^l \qquad (4.7)$$

where $\rho_y^{kl}$ is the correlation coefficient between $\varepsilon_y^k(x)$ and $\varepsilon_y^l(x)$, and $\sigma_y^k$ is the standard deviation of $\varepsilon_y^k(x)$. The simple average rule is a particular case of the weighted average rule in which base classifiers are assigned equal weights. Fumera and Roli (Fumera and Roli, 2005) analyzed the estimation error of the weighted average rule (denoted "wa"), which can be written as follows:


$$h_y^{wa}(x) = \sum_{k=1}^{K} w_k h_y^k(x) = p(y/x) + \varepsilon_y^{wa}(x) \qquad (4.8)$$

For the weighted average to express an estimate of a posterior probability, the base classifiers' weights should satisfy the following constraints:

$$\sum_{k=1}^{K} w_k > 0, \quad w_k \geq 0, \; k = 1 \ldots K \qquad (4.9)$$

For computational convenience, the first constraint can be replaced by $\sum_{k=1}^{K} w_k = 1$. The bias $\beta_y^{wa}$ and the variance $(\sigma_y^{wa})^2$ of the estimation error of the weighted average ensemble for class $y$ are given as follows:

$$\beta_y^{wa} = \sum_{k=1}^{K} w_k \beta_y^k \qquad (4.10)$$

and

$$(\sigma_y^{wa})^2 = \sum_{k=1}^{K} w_k^2 (\sigma_y^k)^2 + \sum_{k=1}^{K} \sum_{l \neq k} w_k w_l \, \rho_y^{kl} \sigma_y^k \sigma_y^l \qquad (4.11)$$

The bias-variance decomposition presented above is related to the added error for an individual class. Tumer and Ghosh (Tumer and Ghosh, 1996) presented a model for analyzing the added error of the simple average ensemble for binary classification. Their model is based on the assumption that the estimation error consists of a shift of the decision boundary between the two classes. Fumera and Roli (Fumera and Roli, 2005) used the same model for analyzing the added error of the weighted average ensemble in one dimensional space (Figure 4.2).
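The variance expressions of equations (4.7) and (4.11) can be checked numerically in the simplest special case of equal variances $\sigma^2$ and a common pairwise correlation $\rho$, where equation (4.7) reduces to $(\sigma^2/K)(1 + (K-1)\rho)$; the simulation below is only a sanity check of that closed form:

```python
import numpy as np

# Numerical sanity check of equation (4.7) in the special case of K estimation
# errors with a common variance sigma^2 and a common pairwise correlation rho:
# the variance of their simple average is (sigma^2 / K) * (1 + (K - 1) * rho).
rng = np.random.default_rng(0)
K, sigma, rho, n = 5, 1.0, 0.3, 200_000

# Errors built from a shared component, which gives pairwise correlation rho.
shared = rng.standard_normal(n)
eps = sigma * (np.sqrt(rho) * shared
               + np.sqrt(1.0 - rho) * rng.standard_normal((K, n)))

empirical = eps.mean(axis=0).var()                  # variance of the simple average
predicted = (sigma**2 / K) * (1.0 + (K - 1) * rho)  # closed form from (4.7)
```

With $\rho = 0$ the variance shrinks by the full factor $1/K$; as $\rho$ grows toward 1 the benefit of averaging vanishes, which is the quantitative content of the diversity argument.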


Figure 4.2: True posterior probabilities around the boundary x* between classes y0 and y1 (solid lines) and estimated posteriors leading to the boundary xb (dashed lines). Lightly and darkly shaded areas represent the contribution of this class boundary to Bayes error and to added error, respectively (Fumera and Roli, 2005).

The estimation error of the class posteriors is equivalent to the surface of the darkly shaded area in Figure 4.2, which can be calculated as follows:

$$\varepsilon_{add}^{wa} = \int_{x^*}^{x^*+b} [p(y_1/x) - p(y_0/x)] \, p(x) \, dx \qquad (4.12)$$

Using a first-order approximation, this added error becomes $\frac{p(x^*) \, t \, b^2}{2}$, where $t = p'(y_1/x^*) - p'(y_0/x^*)$. The expectation of the added estimation error with respect to $b$ can be written as:

$$E_{add}^{wa} = \frac{p(x^*) \, t}{2} \left[ (\beta_b^{wa})^2 + (\sigma_b^{wa})^2 \right] \qquad (4.13)$$

where $\beta_b^{wa} = \frac{\beta_{y_0}^{wa} - \beta_{y_1}^{wa}}{t}$ and $(\sigma_b^{wa})^2 = \frac{(\sigma_{y_0}^{wa})^2 + (\sigma_{y_1}^{wa})^2}{t^2}$.

The expected value of the added error of the weighted average ensemble can be written as a combination of the expected values of the added errors of the individual classifiers:

$$E_{add}^{wa} = \sum_{k=1}^{K} w_k^2 E_{add}^k \qquad (4.14)$$

where $E_{add}^k$ is the expected value of the added error of the base classifier $h_k, k = 1 \ldots K$.

Fumera and Roli (Fumera and Roli, 2005) have shown that the optimal base classifier weights, which achieve the minimal added error, are the following:

$$w_k = \frac{1/E_{add}^k}{\sum_{l=1}^{K} \left( 1/E_{add}^l \right)}, \quad k = 1 \ldots K \qquad (4.15)$$

Although this theoretical framework is based on the assumption that the estimation error consists of a shift in the decision boundary between two classes, it allows drawing two main conclusions about linear combinations of classifiers:

- The accuracy of the ensemble does not depend only on the individual accuracies of the base classifiers but also on the diversity of their predictions: the lower the correlation between the base classifiers' predictions, the higher the accuracy of the ensemble. In theory, the influence of the diversity of the base classifiers on the accuracy of the ensemble can be quantified, but in practice it is hard to make a compromise between diversity and accuracy of the base classifiers. This explains the extensive research work on diversity in multiple classifier systems in recent years (Didaci et al., 2013).

- Optimal weighting of the base classifiers is a crucial aspect of weighted average ensemble methods. The bias-variance decomposition of the estimation error shows that the base classifiers' weights are inversely proportional to their added errors. Since the variance of the estimation error is the only term that depends on the size of the training data, base classifiers with low variance are generally preferred for ensemble methods. Choosing the appropriate weights allows learning more complex decision boundaries without increasing the variance of the estimation error.
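The optimal weighting of equation (4.15), together with the ensemble added error of equation (4.14), can be sketched as follows:

```python
import numpy as np

def optimal_weights(added_errors):
    """Equation (4.15): base classifiers are weighted inversely to their
    expected added errors, normalized to sum to one."""
    inv = 1.0 / np.asarray(added_errors, dtype=float)
    return inv / inv.sum()

def ensemble_added_error(w, added_errors):
    """Equation (4.14): expected added error of the weighted average ensemble."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w**2 * np.asarray(added_errors, dtype=float)))

E = [0.1, 0.2, 0.4]          # expected added errors of three base classifiers
w_opt = optimal_weights(E)   # best classifier gets the largest weight
uniform = np.full(3, 1.0 / 3.0)
```

On this example, the optimal weights yield a strictly lower ensemble added error than the uniform (simple average) weights, which is the claim equation (4.15) formalizes.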

Chapter 4: Ensemble-based knowledge transfer in brain-computer interfaces: from theory to practice

4.3 Linear combiners in ensemble-based knowledge transfer: application to inter-subjects and/or inter-sessions classification in MI-based BCIs

Ensemble-based knowledge transfer frameworks applied to MI-based BCIs are generally based on linear combinations of classifiers, in which different classifier weighting techniques are used to overcome the problem of brain signal variability. In this section, the advantages and drawbacks of the weighting techniques proposed in the literature are discussed. Let $S^1, \dots, S^K$ be $K$ datasets (hereafter called source data) corresponding to EEG signals recorded from different subjects and/or during different sessions. Each dataset $S^k$ is used to train a spatial filtering matrix $W^k$ and a corresponding classifier $h^k$, $k = 1 \dots K$. Given an EEG measurement $X$ recorded from a new subject (hereafter called the target subject), a straightforward way to predict its class label $y$ is to combine the outputs of the classification models learned from the source data. These classification models should be weighted according to their individual contributions to the ensemble in order to create an accurate predictor. This can be expressed as follows:

$$\tilde{y} = \operatorname{argmax}_y \sum_{k=1}^{K} w_k \, h_y^k(x^k) \tag{4.16}$$

where $\tilde{y}$ is the predicted label, $w_k$ is the weight of the classification model $h^k$, $x^k$ is the feature vector extracted from the EEG measurement $X$ using the spatial filtering matrix $W^k$ as in equation (2.6), and $h_y^k(x^k)$ is the support of the $k$-th classification model for class $y$ given the EEG measurement $X$. In stationary environments, each spatial filter bank and its corresponding classifier can be weighted according to its training performance. In non-stationary environments, other techniques are needed to efficiently learn a weighted-average ensemble. According to the way classification models are weighted, ensemble-based knowledge transfer approaches can be classified into three categories: dynamic weighting approaches, in which weights are recalculated for each incoming feature vector; static weighting approaches using source data; and static weighting approaches using a few labeled samples from the target learning task (the target subject in our case).
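The weighted-average decision rule of equation (4.16) can be sketched in a few lines of numpy; the array layout and the toy numbers below are illustrative, not taken from the thesis:

```python
import numpy as np

def ensemble_predict(supports, weights):
    """Weighted-average ensemble prediction (cf. eq. 4.16).

    supports: array (K, n_classes) -- h^k_y(x^k), the support of each of
              the K base classifiers for every class label y.
    weights:  array (K,) -- one weight w_k per classifier.
    Returns the index of the class with maximal combined support.
    """
    supports = np.asarray(supports, dtype=float)
    weights = np.asarray(weights, dtype=float)
    combined = weights @ supports  # sum_k w_k * h^k_y(x^k), one value per class
    return int(np.argmax(combined))

# Toy example: 3 classifiers, 2 classes.
supports = [[0.9, 0.1],   # classifier 1 strongly supports class 0
            [0.4, 0.6],   # classifier 2 mildly supports class 1
            [0.2, 0.8]]   # classifier 3 strongly supports class 1
print(ensemble_predict(supports, [0.6, 0.2, 0.2]))  # → 0 (classifier 1 dominates)
print(ensemble_predict(supports, [0.2, 0.4, 0.4]))  # → 1
```

The same function covers all three weighting categories below; they differ only in how the `weights` vector is obtained.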

4.3.1 Dynamic weighting

Dynamic classifier weighting is the most commonly used strategy for combining multiple learners in changing environments (Kuncheva, 2004a). It consists of recalculating base classifiers' weights for each incoming feature vector: given an EEG measurement $X$, the classification models are weighted according to the position of $X$ in the feature space. I describe two examples of dynamic weighting methods used in MI-based BCIs, which will be studied in the next chapter. I call the first method Dynamic WEighting according to the Neighborhood (DWEN) and the second Dynamic WEighting according to Classes' centers (DWEC).

4.3.1.1 Dynamic weighting according to the neighborhood

This approach was used in the ensemble-based inter-subjects classification framework proposed in (Tu and Sun, 2012). In this approach, estimates of base classifiers' posterior probabilities are calculated according to their accuracies in classifying trials in the neighborhood of the incoming feature vector. Given an EEG measurement $X$, each classification model $h^k$ is weighted independently of the previously calculated weights as follows:

$$w_k(X) = \frac{N_+^k(x^k)}{|S^k|}, \quad k = 1 \dots K \tag{4.19}$$

where $N_+^k(x^k)$ is the maximum number of neighbors of $x^k$ belonging to the same class in the data set $S^k$ and $|S^k|$ is the size of the data set $S^k$. The neighborhood of $x^k$ is defined using the cosine distance and a threshold parameter $d$ as:

$$N^k(x^k) = \left\{ z^k \in S^k \;\middle/\; 1 - \frac{x^k \cdot z^k}{\sqrt{x^k \cdot x^k}\,\sqrt{z^k \cdot z^k}} < d \right\} \tag{4.20}$$
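The per-classifier weight of equations (4.19)-(4.20) can be sketched as follows; the toy data and the threshold value are illustrative assumptions:

```python
import numpy as np

def dwen_weight(x, S_features, S_labels, d=0.3):
    """Dynamic weight of one base classifier (DWEN, cf. eqs. 4.19-4.20).

    x:          feature vector of the incoming trial, projected with W^k.
    S_features: array (n, p) of training feature vectors of data set S^k.
    S_labels:   array (n,) of integer class labels of S^k.
    d:          cosine-distance threshold defining the neighborhood.
    The weight is the size of the largest same-class group among the
    neighbors of x, normalized by |S^k|.
    """
    x = np.asarray(x, dtype=float)
    Z = np.asarray(S_features, dtype=float)
    cos_dist = 1.0 - (Z @ x) / (np.linalg.norm(x) * np.linalg.norm(Z, axis=1))
    labels = np.asarray(S_labels)[cos_dist < d]  # labels of the neighbors of x
    if labels.size == 0:
        return 0.0
    return np.bincount(labels).max() / len(S_labels)

# x points in the same direction as the two class-0 samples:
w = dwen_weight([1.0, 0.05],
                [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]],
                [0, 0, 1, 1], d=0.1)
print(w)  # → 0.5 (2 same-class neighbors out of 4 training samples)
```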

4.3.1.2 Dynamic weighting according to classes' centers

This approach, originally called dynamically weighted ensemble classification (DWEC), was used for knowledge transfer from a previous session to a new session of the same subject (Liyanage et al., 2013). Multiple classification models are learned using clustered data from a previous session, and EEG measurements from the new session are dynamically classified using a majority vote ensemble. Base classifiers are weighted according to the distance of the incoming feature vector to the centers of the clusters in each class. Here I describe it in the case of inter-subjects classification, where each class in a data set from the source data corresponds to a cluster. Using a weighted majority vote, each EEG measurement $X$ is assigned the class label $y$ with maximal support $h_y^{wa}(X) = \sum_{k=1}^{K} w_k \, h_y^k(x^k)$, where the support of each classification model for class $y$ is binary:

$$h_y^k(x^k) = \begin{cases} 1, & \text{if } h^k \text{ labels } x^k \text{ in class } y \\ 0, & \text{otherwise} \end{cases} \tag{4.22}$$

It has been shown (Kuncheva, 2004b) that the optimal set of weights in weighted majority voting in a stationary environment is

$$w_k = \log\left(\frac{p_k}{1 - p_k}\right), \quad k = 1 \dots K \tag{4.23}$$

where $p_k$ is the training accuracy of the $k$-th classifier. Because of the high inter-subjects variability of EEG signals, optimality cannot be achieved using static weights. Since true class labels are not available during the feedback phase of the new BCI user, (Liyanage et al., 2013) proposed a dynamic approach for estimating classifiers' performances in binary classification tasks. Let $d_0^k(x^k)$ be the distance of the feature vector $x^k$ to the center of class 0 in the data set $S^k$ and $d_1^k(x^k)$ the distance of $x^k$ to the center of class 1 in the same data set. The proposed function to estimate the performance of the classifier $h^k$ in classifying the neighborhood of the EEG measurement $X$ is defined as:

$$p_k(X) = 1 - \exp\left(-\frac{1}{2\psi_k^2}\left(\log(d_0^k(x^k)) - \log(d_1^k(x^k))\right)^2\right) \tag{4.24}$$

$\psi_k$ is a parameter that can be tuned by optimizing the following objective function:

$$f(\psi_k) = \left[\frac{1}{|S^k|}\left(\sum_{t=1}^{|S^k|} p_k(X_t)\right) - p_k\right]^2 \tag{4.25}$$

where $|S^k|$ is the size of the data set $S^k$ and $p_k$ is the training performance of the classifier $h^k$. Classifiers are then dynamically weighted for each incoming EEG measurement $X_t$ by replacing equation (4.23) with

$$w_k(X_t) = \log\left(\frac{p_k(X_t)}{1 - p_k(X_t)}\right) \tag{4.26}$$
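Equations (4.24) and (4.26) can be sketched as follows. The clipping of $p_k$ away from 0 and 1 is my own addition to keep the logit finite (the original formulation does not specify how the degenerate cases are handled), and the centers and $\psi_k$ value are illustrative:

```python
import numpy as np

def dwec_weight(x, center0, center1, psi):
    """Dynamic weight of one base classifier (DWEC, cf. eqs. 4.24 and 4.26).

    x:                feature vector extracted from the incoming trial.
    center0, center1: class centers in the source data set S^k.
    psi:              spread parameter psi_k, tuned via eq. (4.25).
    """
    d0 = np.linalg.norm(np.asarray(x, float) - np.asarray(center0, float))
    d1 = np.linalg.norm(np.asarray(x, float) - np.asarray(center1, float))
    # Estimated accuracy: close to 1 when x is clearly nearer to one center.
    p = 1.0 - np.exp(-(np.log(d0) - np.log(d1)) ** 2 / (2.0 * psi ** 2))
    p = np.clip(p, 1e-6, 1 - 1e-6)  # assumption: keep log(p/(1-p)) finite
    return float(np.log(p / (1.0 - p)))

# A trial near the class-0 center gets a large positive weight; an
# ambiguous trial (equidistant from both centers) gets a negative one.
print(dwec_weight([0.1, 0.1], [0.0, 0.0], [4.0, 4.0], psi=1.0))  # large > 0
print(dwec_weight([2.0, 2.0], [0.0, 0.0], [4.0, 4.0], psi=1.0))  # large < 0
```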

Knowledge transfer based on dynamic classifier weighting can be useful in zero-calibration BCI systems, as it relies only on the positions of incoming samples in the feature space. However, these approaches do not take into consideration the stochastic dependence of time-contingent feature vectors and may perform poorly in the presence of outliers. They may also increase the complexity of the prediction model.

4.3.2 Static weighting using source data

In this case, base classifiers' weights are learned using source data. The classification models are assigned static weights that are proportional to their generalization capacities across data from different sessions and/or subjects. The framework proposed in (Fazli et al., 2009) is an example of a weighted-average ensemble in which base classifiers' weights are learned using source data. A large database of EEG signals recorded from 83 subjects during different sessions was used. EEG signals from each subject and each session were filtered in different frequency bands and used for learning CSP filters and corresponding classifiers. Quadratic regression with $l_1$ regularization was used for learning base classifiers' weights, which resulted in a sparse solution in which only classification models that generalize across all subjects are retained. Offline evaluation showed that this framework allows the use of a BCI without calibration at a moderate performance loss. In the absence of calibration data from the target BCI user, weighting base classifiers using source data may be a judicious choice. However, this weighting scheme requires a lot of labeled data from different subjects and sessions in order to achieve good classification performance.
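The sparsity-inducing effect of $l_1$ regularization can be illustrated with a small self-contained solver. This is a generic ISTA (iterative soft-thresholding) sketch for $l_1$-regularized least squares in numpy, not the exact quadratic-regression setup of Fazli et al.; the data layout and penalty value are illustrative:

```python
import numpy as np

def sparse_weights(H, y, lam=0.1, n_iter=500):
    """L1-regularized least-squares weights via ISTA.

    H:   (T, K) matrix of base-classifier outputs on labeled source trials.
    y:   (T,) targets encoded as +/-1.
    lam: l1 penalty; larger values zero out more classifiers.
    """
    T, K = H.shape
    L = np.linalg.norm(H, 2) ** 2 / T  # Lipschitz constant of the gradient
    w = np.zeros(K)
    for _ in range(n_iter):
        grad = H.T @ (H @ w - y) / T
        v = w - grad / L
        w = np.sign(v) * np.maximum(np.abs(v) - lam / L, 0.0)  # soft threshold
    return w

# Classifier 0 predicts the labels exactly; classifier 1 is uninformative
# noise orthogonal to y. The l1 penalty drives its weight to exactly zero.
H = np.array([[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]])
y = np.array([1.0, -1.0, 1.0, -1.0])
print(sparse_weights(H, y, lam=0.1))  # → [0.9 0. ]
```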


4.3.3 Static weighting using calibration data

Base classifiers' contributions to the ensemble can also be weighted using a set of labeled samples from the target learning task (Samdani and Yih, 2011). Although this approach is commonly used in ensemble-based knowledge transfer (Dai et al., 2007; Kamishima et al., 2009; Samdani and Yih, 2011), it has not been well investigated in BCI applications. In (Tu and Sun, 2012), calibration data was used to construct spatial filter banks that are adaptive to the target subject, but a dynamic classifier weighting approach was used during the feedback phase. Because subject-calibrated systems are often superior in classification accuracy to zero-calibration systems (Samek, 2014), weighting base classifiers using calibration data may be a more judicious choice for ensemble-based knowledge transfer in BCI applications. Subject-independent information is important for making classification models robust against outliers and non-stationarity, but subject-specific information is necessary for adapting classification models to the specific signature of each BCI user.

4.4 A novel ensemble-based knowledge transfer framework for reducing calibration time in brain-computer interfaces

4.4.1 Motivation

In the rest of this chapter, I present a novel ensemble-based inter-subjects classification framework for reducing calibration time in BCIs. In this framework, I study the effectiveness of weighting base classifiers according to their performances in classifying calibration data from the target subject, and investigate the relationship between the size of the calibration set and the accuracy of this weighting scheme. Learning base classifiers using enough data from different BCI users may allow capturing shared information that is invariant to non-stationarity, while learning classifiers' weights using calibration data from the target user may allow adapting the classification model to the specificities of his/her brain activity patterns. Through this framework, I also investigate to what extent knowledge transfer is useful in BCI applications. In some cases, knowledge transfer may have a negative impact on the performance of the target learning task (Pan and Yang, 2010). This situation, called "negative transfer", is likely to happen in BCI applications because of the high inter-subjects and inter-sessions variability of brain activity patterns: a classification model learned using data from one subject or a previous session does not generalize well to new subjects or new sessions. In the case of ensemble-based knowledge transfer, appropriate weighting of base classifiers is a necessary condition for efficiently learning from different sources of data. However, it is not a sufficient condition for avoiding negative transfer, because in some situations no linear combination of classifiers trained on data from different learning tasks performs well in classifying data from a target learning task. Thus, not only the generalization capacities of base classifiers should be assessed, but also the overall performance of the ensemble in classifying data from the target learning task.

4.4.2 Methods

In this section, the different steps of the proposed knowledge transfer framework for MI-based BCIs are described.

4.4.2.1 Learning CSP filters and corresponding classifiers from other subjects

Let $\{W^1, \dots, W^K\}$ and $\{h^1, \dots, h^K\}$ be the spatial filtering matrices (filter banks) and corresponding classifiers learned from different subjects that performed the same MI-based BCI experiment and/or from different sessions of the same subject. Let $L = \{(X_t, y_t),\; X_t \in \mathbb{R}^{D \times N},\; y_t \in Y,\; t = 1 \dots T\}$ be a small set of labeled multi-channel EEG measurements recorded during the calibration phase of a new subject, where $D$ is the number of channels and $N$ the number of samples. In order to create a prediction model that generalizes across subjects and extends to the new subject, a weighted-average ensemble is used, in which the spatial filter banks and corresponding classifiers learned from other subjects and/or sessions are weighted according to their contributions in classifying data from the new subject. Given a multi-channel EEG measurement $X \in \mathbb{R}^{D \times N}$, the support of the ensemble for class label $y \in Y$ can be written as follows:

$$h_y^{wa}(X) = \sum_{k=1}^{K} w_k^* \, h_y^k(x^k) \tag{4.29}$$

where $w_k^*$, $k = 1 \dots K$, are the weights learned using the calibration data of the new subject, $x^k$ is the log-variance feature vector corresponding to the EEG measurement $X$ filtered in the frequency band in which the filter bank $W^k$ was learned and projected into the subspace spanned by this filter bank, and $h_y^k(x^k)$ is the support of classifier $h^k$ for class $y$. Given a convex loss function $l$ (Boyd and Vandenberghe, 2004), the optimal weight vector $w^*$ can be learned by solving the following optimization problem:

$$w^* = \operatorname{argmin}_w \sum_{t=1}^{T} l\left(\sum_{k=1}^{K} w_k \, h_{y_t}^k(x_t^k),\; y_t\right) \quad \text{s.t.} \quad \sum_{k=1}^{K} w_k > 0,\quad w_k \geq 0,\; k = 1 \dots K \tag{4.30}$$

In the case where base classifiers produce continuous outputs representing the degrees of support for each class, the mean squared error can be used as a loss function and the previous formula becomes:

$$w^* = \operatorname{argmin}_w \sum_{t=1}^{T} \left(\sum_{k=1}^{K} w_k \, h_{y_t}^k(x_t^k) - 1\right)^2 \quad \text{s.t.} \quad \sum_{k=1}^{K} w_k = 1,\quad w_k \geq 0,\; k = 1 \dots K \tag{4.31}$$

where 1 is the maximal support of each classifier for the output label $y_t$; for computational convenience, base classifiers' weights are constrained to be non-negative and to sum to 1. This is a quadratic optimization problem that can be solved using gradient-based methods (Boyd and Vandenberghe, 2004). The prediction of the ensemble can then be written as:

$$\tilde{y} = \operatorname{argmax}_y \sum_{k=1}^{K} w_k^* \, h_y^k(x^k) \tag{4.32}$$

This ensemble-based inter-subjects classification approach can be effective when a linear combination of the classifiers learned from other subjects approximates the optimal hypothesis $h^*$ for classifying data from the new subject. However, when there is not enough data from other subjects and/or previous sessions, or when the calibration set is very noisy, negative transfer may occur.
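The simplex-constrained least-squares problem (4.31) can be sketched with projected gradient descent; the projection routine, step size and iteration count below are illustrative choices (any quadratic-programming solver would do):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) = 1}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def learn_weights(H, n_iter=2000, lr=0.05):
    """Learn simplex-constrained ensemble weights minimizing eq. (4.31).

    H: array (T, K) with H[t, k] = h^k_{y_t}(x^k_t), the support of
       classifier k for the TRUE label of calibration trial t.
    """
    T, K = H.shape
    w = np.full(K, 1.0 / K)  # start from uniform weights
    for _ in range(n_iter):
        grad = 2.0 * H.T @ (H @ w - 1.0) / T  # gradient of mean (H_t.w - 1)^2
        w = project_simplex(w - lr * grad)
    return w

# Classifier 0 always fully supports the true label; classifier 1 never
# does. The learned weights concentrate on classifier 0.
H = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
print(learn_weights(H))  # w[0] close to 1, w[1] close to 0
```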

4.4.2.2 Avoiding negative transfer

A mechanism for assessing the capacity of the knowledge transfer framework to extend to the new subject should be used to avoid situations in which negative transfer may occur. Since the small calibration set is the only information available about the new subject, it may be used to perform this assessment. Certainly it does not give a complete view of the generalization capacity of the classification model learned from other subjects, but it reduces uncertainty about it. Let $W^{K+1}$ and $h^{K+1}$ be the spatial filter bank and corresponding classifier learned using the calibration data of the new subject. In order to assess whether the linear combination of classifiers learned from other subjects performs well in classifying the calibration data of the new subject, it is compared to the classifier $h^{K+1}$. Since knowledge transfer is supposed to be useful only when the size of the calibration set from the target subject is small, leave-one-out cross-validation (LOOCV) is used in this comparison. In each iteration, the base classifiers' weights in the ensemble-based inter-subjects classification model and the single classifier are learned using $T - 1$ samples, and both models are tested on the remaining sample. Then their accuracies are compared and the prediction model that performs better is selected:

$$\begin{cases} \text{if } \dfrac{1}{T}\sum_{t=1}^{T} l\left(y_t,\; h^{wa}\big(X_t / L \setminus (X_t, y_t)\big)\right) \leq \dfrac{1}{T}\sum_{t=1}^{T} l\left(y_t,\; h^{K+1}\big(X_t / L \setminus (X_t, y_t)\big)\right), & \text{select } h^{wa} \\ \text{otherwise}, & \text{select } h^{K+1} \end{cases} \tag{4.33}$$

where $l$ is the 0/1 loss function and $h^{wa}(X_t / L \setminus (X_t, y_t)) = \operatorname{argmax}_y \left(h_y^{wa}(X_t / L \setminus (X_t, y_t))\right)$ is the prediction of the ensemble whose weights were learned on $L$ without the trial $(X_t, y_t)$. When the ensemble learned from other subjects is more accurate than the single classifier learned from the new subject in classifying the calibration data, it is more likely to perform better in classifying data recorded during the feedback phase of the same session. In the opposite case, the single model is used for prediction during the feedback phase (figure 4.3).

Figure 4.3: Synopsis of the proposed ensemble-based inter-subjects classification framework for motor imagery-based BCIs
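The selection rule (4.33) can be sketched as a generic outline; `fit_ensemble` and `fit_single` are placeholders standing for the two training procedures (weight learning and CSP+classifier training, respectively), not functions from the thesis:

```python
import numpy as np

def loocv_error(fit, X, y):
    """Leave-one-out 0/1 error of a model factory fit(X_train, y_train) -> predict."""
    errors = 0
    for t in range(len(y)):
        mask = np.arange(len(y)) != t          # leave trial t out
        predict = fit(X[mask], y[mask])        # train on the T-1 remaining trials
        errors += int(predict(X[t]) != y[t])   # 0/1 loss on the held-out trial
    return errors / len(y)

def select_model(fit_ensemble, fit_single, X_calib, y_calib):
    """Selection rule of eq. (4.33): keep the ensemble h_wa learned from
    other subjects unless the subject-specific classifier h^{K+1} achieves
    a strictly lower LOOCV error on the calibration set."""
    e_wa = loocv_error(fit_ensemble, X_calib, y_calib)
    e_single = loocv_error(fit_single, X_calib, y_calib)
    return "ensemble" if e_wa <= e_single else "single"

# Toy check with two dummy model factories: a perfect rule and a
# constant predictor.
X = np.array([[-1.0], [1.0], [-1.0], [1.0]])
y = np.array([0, 1, 0, 1])
good = lambda Xtr, ytr: (lambda x: int(x[0] > 0))
bad = lambda Xtr, ytr: (lambda x: 0)
print(select_model(good, bad, X, y))  # → ensemble
print(select_model(bad, good, X, y))  # → single
```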

It would be possible to integrate the classifier learned using the calibration data of the new subject into the ensemble instead of using a selection procedure. This could be performed using a stochastic gradient descent optimization method, which allows simultaneously assessing the performance of the additional classifier and adjusting base classifiers' weights using LOOCV. However, this may reduce the accuracy of the ensemble, because stochastic gradient descent methods require a large number of samples to converge to the solution found by batch methods (Boyd and Vandenberghe, 2004), which is not the case in this application, where the size of the calibration set is supposed to be small.


4.4.3 Experiments

In this section, the effectiveness of weighting base classifiers according to their performances in classifying calibration data from the target subject is assessed, and the relationship between the size of the calibration set and the accuracy of this weighting scheme is investigated. The issue of negative transfer in knowledge transfer-based BCI systems is also addressed. For these purposes, an empirical evaluation using a publicly available EEG data set was performed.

4.4.3.1 EEG data set

The proposed framework was evaluated using the publicly available EEG data set 2A from BCI competition IV, provided by the Graz group (Blankertz, 2008). This data set consists of EEG signals recorded using 22 Ag/AgCl electrodes from 9 subjects at a 250 Hz sampling rate. Subjects were asked to perform four different motor imagery tasks: left hand, right hand, both feet and tongue movement imagination. For each subject, a training set and a testing set were collected. Both sets comprise 72 trials of 7 s duration per class. At the beginning of a trial, a fixation point appeared on a computer screen. After two seconds, a cue appeared informing the subject which imagery task to perform. The subjects were asked to perform the task until the cue disappeared.

4.4.3.2 When is learning from other subjects useful and when is it harmful?

Before assessing the performance of the proposed framework, it is important to study inter-subjects variability and non-stationarity within the same session, in order to illustrate the importance of inter-subjects classification in BCI applications and to see to what extent it can improve decoding performance. Figure 4.4 shows an example from the previously described data set. Training data of subject 1 corresponding to left hand motor imagery (blue circles) and right hand motor imagery (blue dots) were used for learning CSP filters and a corresponding LDA classifier (blue line). Similarly, training data of subject 3 corresponding to left hand motor imagery (green circles) and right hand motor imagery (green dots) were used for learning CSP filters and a corresponding LDA classifier (green line). Finally, only the first five trials from each class in the training set of subject 1 (red circles and red dots, respectively) were used for learning CSP filters and a corresponding classifier (red line). In this example, EEG signals were band-pass filtered in the 8-30 Hz frequency band using a 5th-order Butterworth filter, and time segments from 3 to 5 s after the beginning of each trial were extracted. Classifiers were learned using log-variance feature vectors projected into the subspace spanned by the most discriminative CSP filter from each class. As we can see, the classification model learned using only the first five trials from each class in the training set of subject 1 cannot perform well in classifying trials recorded during the same session, because it has only a partial view of the brain activity patterns related to the two motor imagery tasks. However, the classification model learned using the whole training set of subject 3 generalizes well to subject 1, because it is learned using enough labeled data, which makes it robust against non-stationarity and outliers, and both subjects have similar brain activity patterns. Thus, learning from other subjects is necessary when the size of the calibration set of the new subject is small.
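The feature extraction just described (8-30 Hz Butterworth band-pass, 3-5 s segment after trial onset, log-variance of the CSP-projected signal) can be sketched as follows; `W` is assumed to be an already-learned CSP filter bank, and the function name is my own:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def logvar_features(trial, W, fs=250, band=(8.0, 30.0), seg=(3.0, 5.0)):
    """Log-variance CSP features of one EEG trial (a sketch of the
    preprocessing described in the text).

    trial: raw EEG array (channels D, samples N), sampled at fs Hz.
    W:     CSP spatial filter bank (D, m), one column per retained filter.
    """
    # 5th-order Butterworth band-pass, applied forward-backward (zero phase).
    b, a = butter(5, band, btype="band", fs=fs)
    filtered = filtfilt(b, a, trial, axis=1)
    # Keep the 3-5 s segment after trial onset.
    segment = filtered[:, int(seg[0] * fs):int(seg[1] * fs)]
    # Project onto the CSP subspace and take the log-variance per filter.
    projected = W.T @ segment                  # (m, samples)
    return np.log(np.var(projected, axis=1))   # (m,) feature vector

# Shape check on synthetic data: 22 channels, a 7 s trial, 6 CSP filters.
rng = np.random.default_rng(0)
features = logvar_features(rng.standard_normal((22, 7 * 250)),
                           rng.standard_normal((22, 6)))
print(features.shape)  # → (6,)
```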

[Figure: scatter plot of log-variance features; x-axis: most discriminative CSP filter for left hand motor imagery; y-axis: most discriminative CSP filter for right hand motor imagery; shown are the trials and LDA boundaries for subject 1 (all trials), subject 3 (all trials) and subject 1 (first 5 trials).]

Figure 4.4: Example illustrating the necessity of learning from other subjects in order to reduce the duration of the calibration phase in MI-based BCIs

Another example from the motor imagery data set used in this experiment is illustrated in figure 4.5. In this example, the training sets corresponding to the right hand and both feet motor imagery tasks of subjects 1, 2 and 3 are used to learn three classification models (red, green and blue lines, respectively). Data sets were preprocessed in the same way as in the previous example and classifiers were learned using feature vectors projected into the subspace spanned by the most discriminative CSP filter from each class. We can see that a classifier learned using data recorded from subject 1 can accurately classify data recorded from subject 3, but it dramatically deteriorates classification accuracy if it is used directly to classify data recorded from subject 2. We can also note that a linear combination of the classifiers learned using EEG signals recorded from subjects 2 and 3 can approximate a classifier learned using the whole training set of subject 1.


[Figure: scatter plot of log-variance features; x-axis: most discriminative CSP filter for right hand motor imagery; y-axis: most discriminative CSP filter for both feet motor imagery; shown are the trials and LDA boundaries for subjects 1, 2 and 3.]

Figure 4.5: Example illustrating inter-subjects variability of EEG signals

4.4.3.3 Classification accuracy

In order to assess the classification performance of the proposed ensemble-based transfer learning framework (hereafter called TL), it was compared to two standard classification paradigms:

- CALIB: spatial filters and corresponding classifiers are trained using calibration data of the current user. This is the traditional machine learning approach in brain-computer interfacing.
- POOL: data from all users are pooled together in order to learn spatial filters and a corresponding classifier. This highlights the high inter-subjects variability and the ineffectiveness of brute-force knowledge transfer.

Evaluation was performed offline using leave-one-subject-out cross-validation. In each iteration, the training sets of eight subjects are temporally filtered in the 8-30 Hz frequency band using a 5th-order Butterworth filter, and time segments 3-5 s after the beginning of each trial are used to learn spatial filters and classifiers in the POOL and TL approaches (the three most discriminative spatial filters from each class were used). For the ninth subject, the number of trials per class from his training set was gradually increased and, each time, spatial filters and a corresponding classifier were trained for the CALIB approach and classifiers' weights were adjusted for the TL approach. Comparison was made using the test set of the target subject, which was preprocessed in the same way as the training data. The comparison was restricted to binary classification tasks. The CSP algorithm was used for spatial filtering and LDA was used as a base classifier in all experiments. Data sets were processed using the Biosig open source software library for biomedical signal processing (Vidaurre et al., 2011b). Classification algorithms were implemented in Matlab (Coleman et al., 1999).

Figure 4.6 illustrates the classification performance of the approaches CALIB, POOL and TL averaged across all users. As we can see, the proposed inter-subjects classification framework outperformed the two traditional classification paradigms when the size of the calibration set is small. When only a few trials were available, CSP filters and the classifier were poorly estimated for the CALIB approach, which dramatically deteriorated classification performance. As more data are included in the training phase, CALIB performance increased rapidly. Regarding the POOL approach, classification performance does not depend on the size of the calibration set, as calibration data are not included in the training phase. The poor classification accuracy of this approach reflects the high inter-subjects variability of EEG signals and the ineffectiveness of brute-force knowledge transfer. The proposed approach significantly increased classification performance when the size of the calibration set is small (up to 11% for right hand vs. both feet movement imagination). Its performance converged to that of the CALIB approach as the number of trials increased. This shows that when there are enough labeled trials for the current user, learning from other users is not useful. Table 4.1 reports the results of a paired t-test over all subjects and all numbers of trials per class. I used the same test of significance as in (Lotte and Guan, 2010), which shows that the performance achieved by the proposed approach was statistically higher than that achieved by the other approaches (p < 0.05 compared to CALIB and p < 0.001 compared to POOL).


[Figure: six panels (left hand vs. right hand, left hand vs. both feet, left hand vs. tongue, right hand vs. both feet, right hand vs. tongue, both feet vs. tongue) plotting classification accuracy against the number of calibration trials per class for the CALIB, TL and POOL paradigms.]

Figure 4.6: Average classification performance of the learning paradigms CALIB, TL and POOL

Table 4.1: p-values of paired t-test over all subjects and all numbers of trials per class

                 Left vs. right   Left vs. feet   Left vs. tongue   Right vs. feet   Right vs. tongue   Feet vs. tongue
TL vs. CALIB     0.012            0.039           0.002             1.5×10⁻⁴         0.043              0.010
TL vs. POOL      3.6×10⁻⁷         1.0×10⁻¹²       2.1×10⁻¹⁸         2.1×10⁻¹¹        1.2×10⁻¹⁸          1.2×10⁻²⁸


Detailed results for all subjects and all sizes of the calibration set for each binary classification task are illustrated in figures 4.7 to 4.12. Bar graphs corresponding to the classification accuracy of the weighted-average ensemble learned from other subjects, the single classifier learned from the new subject and the overall inter-subjects classification framework show which model is selected each time. In most cases, the model that performs better in classifying the test set of the target subject is selected. When the size of the calibration set is relatively small, the ensemble learned from other subjects is often selected. As the size of the calibration set increases, the single classifier learned from the target subject becomes more robust to non-stationarity and outliers, and consequently its generalization capacity increases. As a consequence, this classification model is often selected when there are "enough" labeled trials from the target subject.

Figure 4.7: Detailed classification results for left hand vs. right hand motor imagery

[Figure: per-subject bar graphs (subjects 1-9) of classification accuracy against the number of calibration trials per class, for the ensemble learned from other subjects, the classifier learned from the target subject and the final prediction model.]

Figure 4.8: Detailed classification results for left hand vs. both feet motor imagery

[Figure: per-subject bar graphs (subjects 1-9) of classification accuracy against the number of calibration trials per class, for the ensemble learned from other subjects, the classifier learned from the target subject and the final prediction model.]

Figure 4.9: Detailed classification results for left hand vs. tongue motor imagery

[Figure: per-subject bar graphs (subjects 1-9) of classification accuracy against the number of calibration trials per class, for the ensemble learned from other subjects, the classifier learned from the target subject and the final prediction model.]

Figure 4.10: Detailed classification results for right hand vs. both feet motor imagery

[Figure: per-subject bar graphs (subjects 1-9) of classification accuracy against the number of calibration trials per class, for the ensemble learned from other subjects, the classifier learned from the target subject and the final prediction model.]

Figure 4.11: Detailed classification results for right hand vs. tongue motor imagery

[Figure: per-subject bar graphs (subjects 1-9) of classification accuracy against the number of calibration trials per class, for the ensemble learned from other subjects, the classifier learned from the target subject and the final prediction model.]

Figure 4.12: Detailed classification results for both feet vs. tongue motor imagery

4.5 Conclusion

Knowledge transfer between different subjects and/or sessions is a challenging task that requires advanced machine learning techniques able to learn efficiently from heterogeneous data. Ensemble methods are among the most successful machine learning techniques used for performing knowledge transfer in BCIs. In this chapter, a thorough investigation of these techniques was performed, with a particular focus on linear combinations of classifiers and their use in knowledge transfer-based BCI systems. In this direction, an ensemble-based inter-subjects classification framework for reducing calibration time in BCIs was proposed. This framework allowed studying a novel approach for linearly combining classifiers in knowledge transfer-based BCI systems, as well as the behavior of this weighting scheme with respect to the size of the calibration set from the target BCI user. Furthermore, inter-subject variability and its impact on the effectiveness of knowledge transfer in BCIs was investigated. Situations in which knowledge transfer has a negative impact on the classification performance of the target learning task (negative transfer) were studied. This study highlighted the need to take such situations into consideration when conceiving knowledge transfer-based BCIs.

5 On combining knowledge transfer and online adaptation in motor imagery-based brain-computer interfaces

“There is nothing so stable as change.” [Bob Dylan]

Knowledge transfer and online adaptation techniques have been actively explored in the brain-computer interfaces (BCIs) research community during the last years. However, few works have tried to conceive classification models that take advantage of both techniques. In this chapter, I try to highlight the advantages of combining both techniques in a single prediction model. To do so, I propose an online inter-subjects classification framework, called adaptive accuracy-weighted ensemble (AAWE), in which inter-subjects classification is performed using a weighted average ensemble of classifiers learned using data recorded from different subjects, and online adaptation is performed by updating base classifiers' weights in a semi-supervised way based on ensemble predictions reinforced by interaction error-related potentials (iErrPs).


Chapter 5: On combining knowledge transfer and online adaptation in motor imagery-based braincomputer interfaces

5.1 Motivation

On the one hand, most of the knowledge transfer frameworks proposed for MI-based BCIs are based on the assumption that there is a subject-independent data generating process. Capturing subject-independent information allows creating a "robust" classification model that generalizes across BCI users and minimizes the need for subject-specific information. However, this assumption may be too strong because of the high inter-subjects variability of brain activity patterns, especially for persons with severe motor disabilities (Samek et al., 2013b). On the other hand, online classification methods used in MI-based BCIs focus on adapting a "weak" classification model learned using a small labeled set recorded during the calibration phase of a new BCI user, and do not take advantage of abundant data recorded from other users and/or previous sessions of the current user. Combining knowledge transfer and online adaptation in a single framework may have many advantages. Knowledge transfer allows rapidly achieving good classification accuracy without the need for a long calibration time (good start). Online adaptation allows avoiding negative transfer in cases where brain activity patterns of the new BCI user are very different from those of other subjects, and ensures the efficiency of the prediction model. Few approaches have been attempted in this direction. (Vidaurre et al., 2011c) proposed a machine learning framework in which a subject-independent prediction model was initially learned using EEG signals selected from a large motor imagery data set of different subjects and then adapted using data from a new subject. Adaptation was performed in two stages: in the first stage, the prediction model was updated in a supervised way using calibration data; in the second stage, unsupervised adaptation was performed during the feedback phase. Although it was effectively used in a MI-based BCI, this framework uses a naïve approach for creating the subject-independent classification model.
The two ensemble frameworks based on dynamic classifiers weighting presented in the previous chapter can also be classified into the category of machine learning techniques that combine both knowledge transfer and online adaptation of the prediction model (Tu and Sun, 2012; Liyanage et al., 2013). As mentioned before, dynamic classifiers weighting does not take into consideration the stochastic dependence between time-contingent trials of the same session, which makes these methods sensitive to outliers. Their performance will be assessed in this chapter. To address this issue, I propose a framework called adaptive accuracy-weighted ensemble (AAWE) for online inter-subjects classification in MI-based BCIs. In this framework, I explore the idea of using the ensemble's predictions, reinforced by interaction error-related potentials (iErrPs), to update base classifiers' weights in a weighted average ensemble of classifiers. The idea of


using ensemble's predictions to update an ensemble classification model was proposed by (Plumpton, 2012; Plumpton, 2014) for online fMRI data classification. Because fMRI data suffer from a very high feature-to-instance ratio, the author used a random subspace ensemble in which "weak" classifiers, learned using few labeled data projected into different views, were updated using the ensemble's predictions in online settings. This approach, called "guided labeling", outperformed both a static random subspace ensemble and an adaptive random subspace ensemble in which base classifiers are updated using their own predictions (naïve labeling). However, the main drawback of this approach is that base classifiers converge to the same prediction model, which jeopardizes the diversity of the ensemble and consequently reduces its accuracy. In our case, the situation is different. Spatial filters and base classifiers are supposed to be "robust" and do not need to be updated because they are learned using enough labeled data from other subjects. However, their weights are subject to uncertainty, so these weights should be updated rather than the base classifiers' parameters. This allows tracking non-stationarity within the same session without compromising the diversity of the ensemble, since base classifiers are unchanged. Figure 5.1 illustrates the idea of base classifiers' weights adaptation using guided labeling. In this example, three linear classifiers are represented by dashed lines and the ensemble is represented by a solid line. Calibration data from two classes are shown as green and red dots. In the beginning (figure 5.1 (a)), the ensemble is close to the base classifier that is more accurate in predicting class labels of calibration data. As more feature vectors are processed (black dots in figure 5.1 (b)), the disagreement between this classifier and the ensemble increases while the disagreement of the other classifiers with the ensemble decreases.
Updating base classifiers' weights allows adjusting the contributions of base classifiers to the ensemble, and consequently its decision boundary (figure 5.1 (c)). As new feature vectors continue arriving, they are correctly classified by the updated ensemble, which in turn continues updating (figure 5.1 (d)).


Figure 5.1: Illustration of base classifiers' weights adaptation using guided labeling

As the predictions of the ensemble are subject to a high degree of uncertainty, using them to update base classifiers' weights may lead to error accumulation and consequently degrade the accuracy of the BCI. If an additional source of information is available, it should be used to reduce this uncertainty. In BCI applications, such information can come from error-related potentials (ErrPs). Error-related potentials in general, and interaction error-related potentials (iErrPs) in particular, can be used to increase BCI accuracy in two ways: they can trigger a corrective action, such as preventing the execution of the last BCI action, or they can be used as reinforcement signals to update the BCI classification model (see figure 5.2). In the proposed adaptive accuracy-weighted ensemble, iErrPs are integrated as reinforcement signals for assessing the reliability of the ensemble's predictions.


Figure 5.2: The two ways of using error-related potentials for improving BCIs performance. (left) They are used to trigger a corrective action such as preventing execution of the last BCI action. (right) They are used as reinforcement signals to update the BCI classification model (from Chavarriaga et al., 2014)

The rest of this chapter is divided into two sections:

- The first section targets the problem of minimizing calibration time in MI-based BCIs. In this section, I propose an adaptive accuracy-weighted ensemble method through which I try to prove that combining both inter-subjects classification and online adaptation techniques is better than using them separately.

- In the second section, I try to go a step further by investigating the case of completely suppressing calibration in MI-based BCIs. A slightly different approach for initializing and updating base classifiers' weights in the proposed adaptive ensemble method is used. This study has two goals: the first one is to see to which extent we can completely suppress calibration; the second one is to assess the performance of the proposed base classifiers' weights adaptation method in comparison to dynamic classifiers weighting approaches in linear combinations of classifiers.


5.2 An adaptive accuracy-weighted ensemble for reducing calibration time in motor imagery-based BCIs

5.2.1 Methods

In order to describe the different steps of the proposed method, binary classification and multi-class classification are treated separately.

5.2.1.1 Binary classification

5.2.1.1.1 Base classifiers' weights initialization

The accuracy-weighted ensemble (AWE) framework has been shown to be efficient for mining concept-drifting data streams (Wang et al., 2003). The weighting scheme used in this framework is adopted in the proposed online inter-subjects classification method because base classifiers' weights can be updated in an online fashion with low computational cost. Let $\{W^{sf}, s = 1 \dots S, f = 1 \dots F\}$ and $\{h^{sf}, s = 1 \dots S, f = 1 \dots F\}$ be the CSP filter banks and corresponding classifiers learned using data sets of EEG signals recorded from $S$ subjects and temporally filtered in $F$ frequency bands. Let $L = \{(X_t, y_t),\ X_t \in \mathbb{R}^{D \times N},\ y_t \in \{0,1\},\ t = 1 \dots T\}$ be a small set of labeled multi-channel EEG measurements recorded during the calibration phase of a new subject, where $D$ is the number of channels and $N$ the number of samples. The mean squared error of the classifier $h^{sf}$ in classifying calibration data recorded from the current BCI user is:

$$MSE^{sf} = \frac{1}{T} \sum_{t=1}^{T} \left(1 - h^{sf}_{y_t}(x^{sf}_t)\right)^2 \quad (5.1)$$

where:

- $x^{sf}_t$ is the logarithmic variance feature vector corresponding to the multi-channel EEG signal $X_t$ temporally filtered in the frequency band $f$ and spatially filtered using the CSP filter bank $W^{sf}$.
- $h^{sf}_{y_t}(x^{sf}_t) \in [0,1]$ is the probabilistic output corresponding to the support of the classifier $h^{sf}$ for class $y_t$ given the feature vector $x^{sf}_t$.

In order to create an inter-subjects classification framework, an accuracy-weighted ensemble $h_{wa} = \sum_{s=1}^{S} \sum_{f=1}^{F} w_{sf} \cdot h^{sf}$ is used. Base classifiers' weights in this accuracy-weighted ensemble are calculated as follows:

$$w_{sf} = \max\left(0,\ MSE^{r} - MSE^{sf}\right), \quad s = 1 \dots S,\ f = 1 \dots F \quad (5.2)$$


where $MSE^{r} = \sum_y p(y) \left(1 - p(y)\right)^2$ is the mean squared error of a random classifier (for binary classification with equal class priors, $MSE^{r} = 0.25$). This weighting scheme allows deleting from the ensemble classifiers performing no better than a random classifier, and assigning the remaining classifiers weights inversely proportional to their error in classifying calibration data of the current learning task. Algorithm 1 illustrates the different steps of base classifiers' weights initialization for binary classification in motor imagery-based BCIs.

Algorithm 1 Base classifiers' weights initialization
1: Given: CSP filter banks $\{W^{sf}, s = 1 \dots S, f = 1 \dots F\}$ and corresponding classifiers $\{h^{sf}, s = 1 \dots S, f = 1 \dots F\}$ learned using EEG signals from $S$ subjects, filtered in $F$ frequency bands; small labeled set of EEG signals recorded during the calibration phase of the current BCI user: $L = \{(X_1, y_1), \dots, (X_T, y_T)\}$; mean squared error of a random classifier: $MSE^{r}$.
2: for $s = 1:S$
3:   for $f = 1:F$
4:     $MSE^{sf} = 0$
5:     for $t = 1:T$
6:       $Z_t = filter(X_t, f)$*
7:       $x^{sf}_t = \log\left(\dfrac{diag\left((W^{sf} Z_t')(W^{sf} Z_t')'\right)}{trace\left((W^{sf} Z_t')(W^{sf} Z_t')'\right)}\right)$
8:       $MSE^{sf} = MSE^{sf} + \frac{1}{T}\left(1 - h^{sf}_{y_t}(x^{sf}_t)\right)^2$
9:     end for
10:    $w_{sf} = \max\left(0,\ MSE^{r} - MSE^{sf}\right)$
11:  end for
12: end for
13: Return: $\sum_{s=1}^{S} \sum_{f=1}^{F} w_{sf}\, h^{sf}$

* $filter(X_t, f)$ performs band-pass temporal filtering of $X_t$ in the frequency band $f$.
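The weighting loop above can be sketched in a few lines. This is a minimal illustration, not the thesis implementation: base classifiers are assumed to be given as callables returning the probabilistic support $h_y(x)$, feature extraction is assumed to be done beforehand, and all names are hypothetical.

```python
import numpy as np

def init_weights(classifiers, features, labels, mse_random=0.25):
    """Initialize ensemble weights following eq. (5.1)-(5.2).

    classifiers: list of callables h(x, y) -> probabilistic support in [0, 1]
    features:    one list of calibration feature vectors per classifier
    labels:      calibration labels in {0, 1}
    mse_random:  MSE of a random classifier (0.25 for balanced binary classes)
    """
    weights = []
    for h, X in zip(classifiers, features):
        # Mean squared error of this classifier on calibration data (eq. 5.1)
        mse = np.mean([(1.0 - h(x, y)) ** 2 for x, y in zip(X, labels)])
        # Classifiers performing no better than random get zero weight (eq. 5.2)
        weights.append(max(0.0, mse_random - mse))
    return np.array(weights)
```

A classifier with constant support 0.9 for the true label gets weight 0.25 - 0.01 = 0.24, while one with support 0.4 (MSE 0.36) is dropped from the ensemble.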

5.2.1.1.2 Base classifiers' weights adaptation using ensemble's predictions

In order to perform a gradual update of the ensemble, the errors of classifiers learned from other subjects should be updated for each incoming data sample from the new subject. Given the mean squared errors of base classifiers at time step $t$, $\{MSE^{sf}(t), s = 1 \dots S, f = 1 \dots F\}$, and a new labeled multichannel EEG measurement $(X_{t+1}, y_{t+1})$, the mean squared errors at time step $t + 1$ ($t$ starts from the size of the calibration set) can be simply calculated as follows:

$$MSE^{sf}(t+1) = \frac{1}{t+1}\left[t \cdot MSE^{sf}(t) + \left(1 - h^{sf}_{y_{t+1}}(x^{sf}_{t+1})\right)^2\right], \quad s = 1 \dots S,\ f = 1 \dots F \quad (5.3)$$

Base classifiers' weights can then be updated using the adaptive version of equation (5.2):

$$w_{sf}(t+1) = \max\left(0,\ MSE^{r} - MSE^{sf}(t+1)\right), \quad s = 1 \dots S,\ f = 1 \dots F \quad (5.4)$$
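Equation (5.3) is simply an incremental form of the batch mean in (5.1); a short sketch makes this explicit (illustrative only; the function name is mine):

```python
def update_mse(mse_t, t, support_true):
    """Recursive MSE update (eq. 5.3): fold one new labeled trial into the running mean.

    mse_t:        mean squared error after t trials
    t:            number of trials seen so far
    support_true: probabilistic support h_y(x) of the classifier for the true label
    """
    return (t * mse_t + (1.0 - support_true) ** 2) / (t + 1)
```

Applying it sequentially to supports 0.9, 0.7 and 0.6 reproduces the batch mean of 0.1², 0.3² and 0.4², so each step costs O(1) instead of re-averaging all past trials.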


Since true class labels are unknown to the prediction model during the feedback phase, the mean squared errors of base classifiers can be updated using the predictions of the ensemble:

$$MSE^{sf}(t+1) = \frac{1}{t+1}\left[t \cdot MSE^{sf}(t) + \left(1 - h^{sf}_{\tilde{y}_{t+1}}(x^{sf}_{t+1})\right)^2\right], \quad s = 1 \dots S,\ f = 1 \dots F \quad (5.5)$$

where $\tilde{y}_{t+1}$ is the prediction of the previously learned ensemble:

$$\tilde{y}_{t+1} = \arg\max_y \left(\sum_{s=1}^{S} \sum_{f=1}^{F} w_{sf}(t) \cdot h^{sf}_{y}(x^{sf}_{t+1})\right) \quad (5.6)$$

In order to take into consideration data shifts with different speeds, an update coefficient $UC \in [0,1]$ should be added to equation (5.5), which becomes:

$$MSE^{sf}(t+1) = \frac{1}{(1 - UC) \cdot t + UC}\left[(1 - UC) \cdot t \cdot MSE^{sf}(t) + UC \cdot \left(1 - h^{sf}_{\tilde{y}_{t+1}}(x^{sf}_{t+1})\right)^2\right] \quad (5.7)$$

For $UC = 0$ there is no update, for $UC = 1$ only the new data sample is used for calculating the error, and for $UC = 0.5$ we retrieve exactly the update equation (5.5).

5.2.1.1.3 Base classifiers' weights adaptation using ensemble's predictions reinforced by interaction error-related potentials (iErrPs)

Let $E \in \{0,1\}$ be the true absence or presence of an iErrP following the output of the BCI: $E = 0$ when the decision of the ensemble $\tilde{y}_{t+1}$ corresponds to the intent of the user $y_{t+1}$, and $E = 1$ in the opposite case. The iErrPs classifier outputs a value $\tilde{E} \in \{0,1\}$ which is a prediction of $E$. The predicted value $\tilde{E}$ may or may not correspond to the real value $E$, depending on the accuracy of the iErrPs classifier. This iErrPs classifier can be used to assess the reliability of the predicted labels as follows:

$$MSE^{sf}(t+1) = \frac{1}{(1 - UC) \cdot t + UC}\left[(1 - UC) \cdot t \cdot MSE^{sf}(t) + UC \cdot \left((1 - \tilde{E}) - h^{sf}_{\tilde{y}_{t+1}}(x^{sf}_{t+1})\right)^2\right] \quad (5.8)$$

When $\tilde{E} = 0$, the predicted label is considered correct and the update is the same as in equation (5.7). When $\tilde{E} = 1$, the opposite label is used for the update because $\left(h^{sf}_{\tilde{y}_{t+1}}(x^{sf}_{t+1})\right)^2 = \left(1 - h^{sf}_{1-\tilde{y}_{t+1}}(x^{sf}_{t+1})\right)^2$.
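The reinforced update of equations (5.7)/(5.8) can be sketched as a single function (an illustrative sketch with hypothetical names; the flag Ẽ is assumed to come from a separate iErrPs classifier):

```python
def update_mse_uc(mse_t, t, support_pred, err_detected, uc=0.5):
    """Guided-labeling MSE update reinforced by an iErrP flag (eq. 5.7/5.8).

    mse_t:        mean squared error after t trials
    support_pred: support h_y~(x) of the classifier for the ensemble's predicted label
    err_detected: E~ in {0, 1}; 1 means the iErrP classifier flagged the prediction
    uc:           update coefficient in [0, 1] controlling the adaptation speed
    """
    # When E~ = 1 the target flips to the opposite label: residual is (1 - E~) - h_y~(x)
    residual = (1 - err_detected) - support_pred
    return ((1 - uc) * t * mse_t + uc * residual ** 2) / ((1 - uc) * t + uc)
```

With UC = 0.5 and Ẽ = 0 this reduces to the supervised running mean of (5.3); UC = 0 freezes the error; UC = 1 keeps only the newest residual, whose square for Ẽ = 1 equals that of the opposite label, as noted above.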


Algorithm 2 illustrates the online adaptation of base classifiers' weights using ensemble's predictions reinforced by iErrPs for binary classification tasks in MI-based BCIs.

Algorithm 2 Online adaptation of base classifiers' weights using ensemble's predictions reinforced by interaction error-related potentials for binary classification tasks in MI-based BCIs
1: Given: CSP filter banks $\{W^{sf}, s = 1 \dots S, f = 1 \dots F\}$ and corresponding classifiers $\{h^{sf}, s = 1 \dots S, f = 1 \dots F\}$ learned from different subjects; mean squared errors of base classifiers at time step $t$ of the new session: $\{MSE^{sf}(t), s = 1 \dots S, f = 1 \dots F\}$; a new multichannel EEG measurement $X_{t+1}$; mean squared error of a random classifier: $MSE^{r}$; update coefficient: $UC$.
2: for $s = 1:S$
3:   for $f = 1:F$
4:     $Z_{t+1} = filter(X_{t+1}, f)$
5:     $x^{sf}_{t+1} = \log\left(\dfrac{diag\left((W^{sf} Z_{t+1}')(W^{sf} Z_{t+1}')'\right)}{trace\left((W^{sf} Z_{t+1}')(W^{sf} Z_{t+1}')'\right)}\right)$
6:     $w_{sf}(t) = \max\left(0,\ MSE^{r} - MSE^{sf}(t)\right)$
7:   end for
8: end for
9: $\tilde{y}_{t+1} = \arg\max_y \left(\sum_{s=1}^{S} \sum_{f=1}^{F} w_{sf}(t) \cdot h^{sf}_{y}(x^{sf}_{t+1})\right)$
10: for $s = 1:S$
11:   for $f = 1:F$
12:     $MSE^{sf}(t+1) = \frac{1}{(1-UC) \cdot t + UC}\left[(1-UC) \cdot t \cdot MSE^{sf}(t) + UC \cdot \left((1-\tilde{E}) - h^{sf}_{\tilde{y}_{t+1}}(x^{sf}_{t+1})\right)^2\right]$*
13:   end for
14: end for
15: Return: $\{MSE^{sf}(t+1), s = 1 \dots S, f = 1 \dots F\}$

* $\tilde{E}$ is given by the iErrPs classifier immediately after step 9.
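The prediction step (line 9 of Algorithm 2, equation (5.6)) amounts to a weighted sum of per-classifier class supports followed by an argmax; a minimal sketch (hypothetical names, base classifiers flattened into one axis):

```python
import numpy as np

def ensemble_predict(weights, supports):
    """Weighted-average ensemble prediction (eq. 5.6).

    weights:  base classifier weights, shape (n_classifiers,)
    supports: per-classifier class supports, shape (n_classifiers, n_classes)
    """
    scores = weights @ supports  # weighted sum of supports for each class
    return int(np.argmax(scores))
```

Zero-weighted classifiers (those pruned by eq. (5.2)/(5.4)) contribute nothing to the score, so they can be kept in the arrays without affecting the decision.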

5.2.1.2 Multi-class classification

The CSP algorithm is suited for binary classification (Blankertz et al., 2008). Some extensions have been proposed for multi-class classification, but it has been shown that dividing the problem into multiple binary classification tasks and applying the CSP algorithm to each task is more efficient than the one-step approaches (Lindig-León and Bougrain, 2014). For this reason, a hierarchical extension of the AAWE algorithm for MI-based BCIs that allows performing multi-class classification using the binary CSP algorithm is proposed. Figure 5.3 illustrates the structure of the hierarchical ensemble composed of two stages: in the first stage, an ensemble of classifiers is learned for each binary classification task, and in the second stage, a majority vote is used to predict the class output for new incoming feature vectors. The class with the highest number of votes is chosen. In cases where more than one class label has the highest number of votes, we look at the binary classification task in which the tied class labels compete directly and choose the label predicted by the ensemble for that binary classification task.


Figure 5.3: Structure of the hierarchical ensemble method for multi-class classification in MI-based BCIs

For online adaptation of base classifiers' weights, only the ensembles of classifiers learned for the binary classification tasks in which the predicted label exists are updated. These ensembles are updated using the predicted label when it is considered correct (i.e., $\tilde{E} = 0$). In the opposite case, the prediction model is not updated because we do not know which of the remaining class labels can be considered correct. The update equation of the mean squared errors of the base classifier learned from subject $s$, $s = 1 \dots S$, in the frequency band $f$, $f = 1 \dots F$, for the binary classification task $\{(y^k_0, y^k_1), k = 1 \dots K\}$ when the predicted label $\tilde{y}_{t+1} \in (y^k_0, y^k_1)$ is given as follows:

$$MSE^{sfk}(t^k+1) = \frac{1}{(1 - (1-\tilde{E}) \cdot UC) \cdot t^k + (1-\tilde{E}) \cdot UC} \times \left[(1 - (1-\tilde{E}) \cdot UC) \cdot t^k \cdot MSE^{sfk}(t^k) + (1-\tilde{E}) \cdot UC \cdot \left(1 - h^{sfk}_{\tilde{y}_{t+1}}(x^{sfk}_{t+1})\right)^2\right] \quad (5.9)$$

where:

- $t^k$ is the number of feature vectors belonging to the binary classification task $k$ at time step $t$.
- $\tilde{E}$ is the output of the iErrPs classifier.
- $UC$ is the update coefficient.
- $h^{sfk}$ is the classifier learned from subject $s$ in the frequency band $f$ for the binary classification task $k$.
- $x^{sfk}_{t+1}$ is the feature vector corresponding to the EEG measurement processed using the spatial filter bank learned from subject $s$ in the frequency band $f$ for the binary classification task $k$ at time step $t + 1$.

To better explain the idea, figure 5.4 illustrates an example in which the proposed adaptive accuracy-weighted ensemble framework is used for classifying three different motor imagery tasks (left hand, right hand and both feet). At the beginning, an ensemble of classifiers is learned for each binary classification task using data from other subjects and calibration data of the new subject. During the feedback phase, each incoming EEG measurement is assigned a label by each binary ensemble classifier. Then, the label with the highest number of votes is retained as the final prediction. Given the output of the error-related potentials classifier $\tilde{E}$, the final prediction is used to update the corresponding binary ensemble classifiers. In this example, the first two binary classification models are updated when $\tilde{E} = 0$, while the third one is not updated because the final prediction is left hand motor imagery.

Figure 5.4: Example illustrating the use of the proposed framework for multi-class classification


5.2.1.3 Summary

The aim of the proposed AAWE framework is to combine both knowledge transfer and online adaptation techniques in a single learning algorithm. The main features related to each technique are summarized in table 5.1.

Table 5.1: Main features of the proposed AAWE framework

Knowledge transfer:

- Managing inter-subjects variability: in order to manage inter-subjects variability in the spectral domain, EEG signals were filtered in different frequency bands. This allows increasing the probability of finding brain activity patterns that are common across subjects.
- Weighting classifiers learned from other subjects according to their accuracy in classifying calibration data from the target subject: this weighting scheme allows adapting the ensemble learned from other subjects to the specific signature of the target subject. Online adaptation of base classifiers' weights during the feedback phase will allow tracking changes during the same session without compromising the diversity of the ensemble.

Online adaptation:

- Implicit adaptation to change: tracking brain signals non-stationarity during the same session is performed implicitly, without using any concept change detection mechanism. This is important in BCI applications because different types of change may occur, which makes it difficult to use a change detection mechanism.
- Adaptation to data shifts with different speeds: adjusting the adaptation rate to the specific characteristics of the brain activity patterns of each subject is important to increase the performance of the BCI. In online settings, the adaptation rate has to be set up beforehand.
- Using a recursive online adaptation technique: updating base classifiers' weights recursively allows reducing computational costs and making the prediction model more robust to outliers.


5.2.2 Experiments

In order to assess the effectiveness of combining both inter-subjects classification and online adaptation techniques in the same learning algorithm, an experimental evaluation using real motor imagery data sets was performed.

5.2.2.1 EEG data sets

1. Data set 2A from BCI competition IV. This four-class motor imagery data set (i.e., six binary classification tasks) was described in the previous chapter.

2. Two-class motor imagery data set from the BNCI Horizon 2020 project. This data set is available online on the BNCI Horizon 2020 project's website. It was provided by the Graz BCI group (Steyrl et al., 2014). 14 subjects performed sustained kinesthetic motor imagery of the right hand and feet. 5 subjects had previously performed BCI experiments and 9 subjects were naïve to the task. Each subject performed a training phase composed of 50 trials per class and a validation phase composed of 30 trials per class. EEG signals were recorded using 15 Ag/AgCl electrodes at a 512 Hz sampling rate.

5.2.2.2 Preprocessing and classification procedure

EEG measurements were band-pass filtered using a 5th order Butterworth filter in frequency bands of 4 Hz width ranging from 8 Hz to 30 Hz with a step size of 2 Hz, plus an additional wide band from 8 Hz to 30 Hz. For data set 2A from BCI competition IV, time segments 3-5 s after the beginning of each trial were extracted. For the data set from the BNCI Horizon 2020 project, time segments of length 3 s starting 3 s after the beginning of each trial were used for the offline analysis. The three most discriminative CSP filters for each class were used in all experiments. The log-variance features were calculated using the preprocessed trials and assigned the corresponding class labels. It is important to note that no outlier removal technique was used in this experiment. In all experiments, evaluation was performed offline using leave-one-subject-out cross-validation. In each iteration, classifiers are learned using the training sets of N-1 subjects and evaluated using the test set of the Nth subject (N = 9 in the first data set and N = 14 in the second data set). Linear discriminant analysis (LDA) was used as the base classifier.
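The normalized log-variance feature used throughout this chapter (step 7 of Algorithm 1) can be sketched as follows; this is a minimal illustration with assumed array shapes, not the thesis code:

```python
import numpy as np

def log_variance_features(W, Z):
    """Normalized log-variance features of one band-pass filtered trial.

    W: CSP filter bank, shape (n_filters, n_channels)
    Z: filtered trial, shape (n_samples, n_channels)
    """
    S = W @ Z.T                     # spatially filtered signals, (n_filters, n_samples)
    var = np.sum(S * S, axis=1)     # diag(S S'): power of each filtered signal
    return np.log(var / var.sum())  # normalize by trace(S S') and take the log
```

Because the variances are normalized by their sum before the log, the exponentials of the resulting features always sum to one, which makes the feature vector invariant to the overall amplitude of the trial.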

5.2.2.3 iErrPs simulation procedure

Because the experimental paradigm used in the previously described data sets does not allow decoding error-related potentials, a simple procedure for simulating iErrPs was used to evaluate the proposed framework. This procedure was proposed by (Llera et al., 2011) in order to understand the relationship between the accuracy of the iErrPs classifier and the accuracy of the task classifier. It was also used in (Zeyl and Chau, 2014) for evaluating their adaptive classification method. Since the iErrPs decoder is not perfect, two types of errors may occur:

- False positive errors, which occur when correctly classified trials are considered wrong (i.e., $\tilde{y}_t = y_t$ but $\tilde{E} = 1$).
- False negative errors, which occur when wrongly classified trials are considered correct (i.e., $\tilde{y}_t \neq y_t$ but $\tilde{E} = 0$).

Let $\alpha_1$ and $\alpha_2$ be the false positive and false negative rates of the iErrPs classifier. Given the output of the AAWE classification model $\tilde{y}_t$ and the true class label $y_t$ at time step $t$, the iErrPs simulation for binary classification tasks is performed as follows:

- If $\tilde{y}_t = y_t$, we draw $\tilde{E} = 1$ with probability $\alpha_1$ and $\tilde{E} = 0$ with probability $1 - \alpha_1$, and apply the procedure illustrated in figure 5.5.
- If $\tilde{y}_t \neq y_t$, we draw $\tilde{E} = 1$ with probability $1 - \alpha_2$ and $\tilde{E} = 0$ with probability $\alpha_2$, and apply the procedure illustrated in figure 5.5.
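The two drawing rules above can be sketched directly (illustrative only; the function name is mine):

```python
import random

def simulate_ierrp(y_pred, y_true, alpha1, alpha2, rng=random):
    """Simulate the iErrP classifier output E~ given its error rates.

    alpha1: false positive rate (a correct trial is flagged as an error)
    alpha2: false negative rate (a wrong trial is accepted as correct)
    """
    if y_pred == y_true:
        # Correct prediction: flag an error with probability alpha1
        return 1 if rng.random() < alpha1 else 0
    # Wrong prediction: miss the error with probability alpha2
    return 0 if rng.random() < alpha2 else 1
```

Setting alpha1 = alpha2 = 0 yields a perfect iErrP detector, which is useful as an upper-bound condition when studying how detector accuracy affects the adaptive ensemble.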

Figure 5.5: Simulation procedure of iErrPs for binary classification tasks

For multi-class classification, the prediction model is not updated when an iErrP is detected. Thus, the ensembles of classifiers in which the predicted label exists are updated only in cases of true negative and false negative detections, as shown in figure 5.6.


Figure 5.6: Simulation procedure of iErrPs for multi-class classification

5.2.2.4 Investigation of EEG signals non-stationarity within the same session

To illustrate the effect of non-stationarity within the same session on classification accuracy, two examples from the previously described EEG data sets are shown in figures 5.7 and 5.8. Figure 5.7 illustrates an example of data shift between calibration and feedback phases of the same session. Training and test sets of subject 3 in the motor imagery data set from the BNCI Horizon 2020 project are represented in red and green, respectively. Circles correspond to right hand motor imagery while dots correspond to feet motor imagery. EEG signals were band-pass filtered in the 8-30 Hz frequency band and spatially filtered using the CSP algorithm. LDA classifiers were trained using data projected on the most discriminative CSP direction for each class. As we can see, data from both conditions shifted in the feature space, which caused a significant shift of the decision boundary of the classifier. Even though subject 3 is an experienced BCI user, the change of the experimental paradigm between calibration and online phases influenced his mental state, which caused a change in the properties of the EEG signals recorded from his brain.


[Figure: scatter plot of training and test trials of subject 3 projected on the most discriminative CSP filters for right hand (x-axis) and feet (y-axis) motor imagery, with the LDA boundaries learned on the training and test sets.]

Figure 5.7: Example illustrating EEG signals non-stationarity between calibration and feedback phases of the same session

Figure 5.8 illustrates another example in which the decision boundary of a classifier trained using the first half of the test set of subject 1 in data set 2A from BCI competition IV is completely different from that of a classifier trained using the second half of the same data set. Classifiers were trained using data processed in the same way as in the previous example. In this case, trials corresponding to left hand motor imagery remained in the same area of the feature space, while trials corresponding to right hand motor imagery shifted between the beginning and the end of the session, which resulted in a rotation of the decision boundary of the classification model. The use of enough training data (50 trials from each class in the first example and 36 trials from each class in the second example) to learn spatial filters and the corresponding classifiers did not make them "robust" against non-stationarity. This highlights the need for online adaptation of classification models during the same session.


[Figure: scatter plot of the first and second halves of the test set of subject 1 projected on the most discriminative CSP filters for left hand (x-axis) and right hand (y-axis) motor imagery, with the LDA boundaries learned on each half.]

Figure 5.8: Example illustrating EEG signals non-stationarity during feedback phase

5.2.2.5 Binary classification results

5.2.2.5.1 Classification performance of the static accuracy-weighted ensemble

Before evaluating the accuracy of the adaptive accuracy-weighted ensemble method (AAWE), it is important to assess how the static accuracy-weighted ensemble (AWE) performs inter-subjects classification. To do so, it is compared to a baseline classification paradigm in which an LDA classifier is learned using only the calibration data of the target subject. Figure 5.9 illustrates the average classification accuracies of the inter-subjects classification method and the baseline method for the different binary classification tasks in the first data set. The size of the calibration set from the target subject was varied between 10 and 140 trials in order to highlight the influence of the amount of training data on classification accuracy. As shown, the baseline method performs poorly when the calibration set is small (10 trials) and improves significantly as the calibration set grows. Results differ slightly from those in the previous chapter because the number of trials from each class may not be the same. In all tasks, inter-subjects classification increases classification accuracy when the calibration set is small. Increasing the size of the calibration set does little to improve the adjustment of base classifiers’ weights because of EEG signals non-stationarity between calibration and feedback phases. These weights should rather be adjusted during the feedback phase in order to track non-stationarity.
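As an illustration, the accuracy-weighted ensemble can be sketched as follows: each base classifier trained on a source subject receives a weight proportional to its accuracy on the target subject's calibration trials, and predictions are made by a weighted vote. This is a minimal sketch, not the thesis implementation; the `SignClassifier` toy class and the {-1, +1} label convention are illustrative assumptions.

```python
import numpy as np

class SignClassifier:
    """Toy stand-in for a base classifier trained on a source subject:
    predicts the sign of a linear projection of the CSP feature vector."""
    def __init__(self, w):
        self.w = np.asarray(w, dtype=float)

    def predict(self, X):
        return np.sign(np.asarray(X, dtype=float) @ self.w)

def awe_weights(base_classifiers, X_calib, y_calib):
    """Weight each base classifier by its accuracy on the target
    subject's calibration trials, then normalize the weights."""
    accs = np.array([np.mean(clf.predict(X_calib) == y_calib)
                     for clf in base_classifiers])
    return accs / accs.sum()

def awe_predict(base_classifiers, weights, X):
    """Weighted vote over the base classifiers' {-1, +1} predictions."""
    votes = np.array([clf.predict(X) for clf in base_classifiers])
    return np.sign(weights @ votes)
```

A base classifier that misclassifies most calibration trials thus contributes little to the ensemble's decision.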


[Figure: six panels (left hand vs. right hand, left hand vs. feet, left hand vs. tongue, right hand vs. feet, right hand vs. tongue, feet vs. tongue); x-axis: size of calibration set (10-140 trials); y-axis: mean classification accuracy (%); curves: Baseline and AWE]

Figure 5.9: Classification accuracies of the inter-subjects classification method and the baseline classification method for all binary classification tasks in data set 2A from BCI competition IV

Table 5.2 shows the detailed classification accuracies of the baseline and AWE methods when the size of the calibration set is equal to 10 trials. A paired t-test over all subjects and all binary classification tasks shows that inter-subjects classification significantly increases performance (p = 0.049) when the calibration set is small. However, in some cases inter-subjects classification deteriorates performance rather than improving it (negative transfer). Online adaptation of the inter-subjects classification model may prevent this situation.
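The reported significance test can be reproduced in principle as follows. The sketch below computes the paired t statistic by hand on the left hand vs. right hand column of table 5.2 only; the test in the text pools all subjects and all six tasks (54 pairs), so the statistic obtained here differs from the one behind p = 0.049.

```python
import math

def paired_t(a, b):
    """Paired t statistic: mean of the per-subject differences divided
    by the standard error of that mean (n - 1 degrees of freedom)."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    mean = sum(d) / n
    var = sum((x - mean) ** 2 for x in d) / (n - 1)   # unbiased variance
    return mean / math.sqrt(var / n)

# Left hand vs. right hand accuracies per subject (table 5.2)
baseline = [77.1, 49.3, 80.6, 60.8, 45.8, 50.7, 61.8, 88.9, 66.0]
awe      = [81.3, 51.4, 95.8, 65.0, 48.6, 58.3, 64.6, 89.6, 56.3]
t = paired_t(awe, baseline)   # positive: AWE is better on average
```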


Table 5.2: Classification accuracies of the inter-subjects classification method and baseline method when the size of calibration set is equal to 10 trials in data set 2A from BCI competition IV

Task           | Method   | S1   | S2   | S3   | S4   | S5   | S6   | S7   | S8   | S9
Left – Right   | Baseline | 77.1 | 49.3 | 80.6 | 60.8 | 45.8 | 50.7 | 61.8 | 88.9 | 66.0
               | AWE      | 81.3 | 51.4 | 95.8 | 65.0 | 48.6 | 58.3 | 64.6 | 89.6 | 56.3
Left – Feet    | Baseline | 86.7 | 57.3 | 79.7 | 56.6 | 56.6 | 53.8 | 91.6 | 57.3 | 72.7
               | AWE      | 89.5 | 69.2 | 75.5 | 51.0 | 50.3 | 61.5 | 92.3 | 65.0 | 82.5
Left – Tongue  | Baseline | 58.3 | 61.1 | 75.0 | 49.0 | 61.8 | 57.6 | 84.7 | 79.2 | 88.9
               | AWE      | 86.8 | 57.6 | 86.8 | 60.1 | 62.5 | 50.7 | 84.7 | 81.3 | 89.6
Right – Feet   | Baseline | 89.5 | 41.3 | 60.1 | 66.7 | 51.0 | 45.5 | 42.0 | 53.1 | 60.8
               | AWE      | 97.2 | 60.8 | 95.1 | 62.5 | 53.1 | 58.0 | 95.8 | 69.2 | 55.9
Right – Tongue | Baseline | 97.9 | 64.6 | 94.4 | 56.9 | 44.4 | 59.0 | 76.4 | 76.4 | 58.3
               | AWE      | 93.1 | 66.0 | 94.4 | 67.4 | 59.0 | 50.7 | 95.1 | 88.9 | 65.3
Feet – Tongue  | Baseline | 56.6 | 52.4 | 64.3 | 50.0 | 49.0 | 55.2 | 58.7 | 78.3 | 83.9
               | AWE      | 63.6 | 70.0 | 73.4 | 45.8 | 56.0 | 48.3 | 64.3 | 75.5 | 81.8

Table 5.3 shows the average classification performances of the baseline method and the inter-subjects classification method for different sizes of the calibration set from the target subject in the second data set. In this case, the increase in classification performance using inter-subjects classification is not significant in comparison to the baseline method. This consolidates the need for online adaptation of the inter-subjects classification method in order to increase performance.

Table 5.3: Average classification accuracies of the baseline method and the inter-subjects classification method for different sizes of calibration set in the data set from BNCI Horizon 2020 project

Size of calibration set | 10         | 20         | 30         | 40         | 50         | 60         | 70         | 80         | 90         | 100
Baseline                | 56.0 ±12.8 | 64.2 ±14.6 | 66.8 ±16.3 | 69.5 ±15.4 | 69.9 ±13.9 | 72.5 ±14.6 | 72.7 ±15.2 | 75.5 ±13.7 | 73.3 ±12.6 | 75.7 ±12.7
AWE                     | 59.1 ±13.4 | 65.7 ±13.8 | 66.2 ±15.7 | 68.9 ±15.7 | 66.5 ±16.1 | 68.6 ±16.3 | 66.5 ±16.8 | 68.7 ±17.2 | 68.2 ±17.3 | 67.5 ±16.4

5.2.2.5.2 Classification performance of the adaptive ensemble method using guided labeling

Classification accuracy of the adaptive ensemble method using only guided labeling was compared to that of the static ensemble method in order to assess whether guided labeling increases classification performance. The size of the calibration set from the target subject was set to 10 trials in all experiments. Figure 5.10 illustrates the classification accuracies of both methods for each subject in data set 1 when performing left hand vs. right hand motor imagery. As shown, online adaptation of base classifiers’ weights using only guided labeling leads to error accumulation in most cases and consequently deteriorates classification accuracy instead of increasing it. Results are similar for the remaining binary classification tasks because the optimal update coefficient UC is subject-specific. Table 5.4 shows the average classification accuracies for all binary classification tasks in data set 1: fixing the same value of the update coefficient for all subjects decreases performance, especially for high values of UC. Results on the second data set confirm that online adaptation of base classifiers’ weights using only guided labeling is not viable, since in most cases it deteriorates performance (figure 5.11). For some subjects online adaptation increases performance, but the optimal value of the update coefficient is subject-specific and cannot be fixed beforehand for online applications. Thus, an additional source of information for assessing the reliability of the ensemble’s predictions is necessary.

[Figure: nine panels, one per subject (subjects 1-9); x-axis: update coefficient UC (0-1); y-axis: classification accuracy (%); curves: static and guided]

Figure 5.10: Classification accuracies of the static ensemble method and the adaptive ensemble method using guided labeling for left hand vs. right hand motor imagery in data set 2A from BCI competition IV. The size of calibration set from target subject is 10 trials


Table 5.4: Average classification accuracies of the static ensemble method and the adaptive ensemble method based on guided labeling for different values of the update coefficient in data set 2A from BCI competition IV. The size of calibration set from target subject is 10 trials. The static ensemble does not depend on UC (first row, which coincides with the guided method at UC = 0); the remaining rows give the guided method’s accuracies.

UC     | Left – Right | Left – Feet | Left – Tongue | Right – Feet | Right – Tongue | Feet – Tongue
static | 67.9 ±17.0   | 70.8 ±15.4  | 73.3 ±15.3    | 72.0 ±18.6   | 75.5 ±17.2     | 64.3 ±12.3
0.1    | 67.2 ±17.5   | 67.2 ±17.5  | 71.6 ±17.5    | 70.7 ±20.0   | 74.9 ±18.9     | 62.3 ±13.6
0.2    | 67.3 ±16.4   | 66.1 ±17.2  | 72.6 ±18.6    | 69.3 ±20.7   | 74.1 ±19.7     | 62.0 ±12.3
0.3    | 64.5 ±18.2   | 61.1 ±15.2  | 71.8 ±18.6    | 67.6 ±21.5   | 69.1 ±20.9     | 58.7 ±10.5
0.4    | 63.1 ±18.7   | 59.3 ±15.5  | 70.8 ±19.6    | 62.1 ±19.7   | 66.9 ±21.5     | 56.4 ±9.4
0.5    | 64.5 ±18.7   | 58.0 ±15.2  | 67.2 ±18.2    | 62.2 ±19.8   | 61.3 ±19.6     | 54.8 ±8.9
0.6    | 57.7 ±14.8   | 53.6 ±9.8   | 61.0 ±16.5    | 56.1 ±16.0   | 51.3 ±2.7      | 53.5 ±8.1
0.7    | 56.1 ±13.3   | 52.9 ±8.2   | 55.1 ±11.8    | 56.1 ±15.7   | 51.2 ±2.6      | 53.0 ±7.5
0.8    | 55.4 ±12.3   | 52.8 ±7.3   | 54.3 ±11.2    | 50.2 ±0.3    | 51.4 ±2.6      | 51.4 ±2.7
0.9    | 55.4 ±12.3   | 51.9 ±5.7   | 53.7 ±10.8    | 50.3 ±0.1    | 50.3 ±0.7      | 50.5 ±0.9
1      | 55.4 ±8.0    | 51.4 ±3.2   | 53.0 ±8.8     | 51.8 ±2.4    | 51.1 ±3.2      | 51.7 ±3.4

[Figure: x-axis: update coefficient UC (0-1); y-axis: average classification accuracy (%); curves: static and guided]

Figure 5.11: Average classification accuracies of the static ensemble method and the adaptive ensemble method based on guided labeling for different values of update coefficient UC in the motor imagery data set from BNCI Horizon 2020 project


5.2.2.5.3 Classification performance of the adaptive ensemble method based on guided labeling reinforced by interaction error-related potentials

In order to investigate the utility of using interaction error-related potentials as reinforcement signals for the ensemble’s predictions, two scenarios of online adaptation of base classifiers’ weights were compared to the static accuracy-weighted ensemble method:

- Realistic iErrPs detection: adaptation is performed using the ensemble’s predictions reinforced by an iErrPs classifier with false positive rate α1 = 16.5% and false negative rate α2 = 20.8%, as found in (Ferrez and Millán, 2008).
- Perfect iErrPs detection: adaptation is performed using the ensemble’s predictions reinforced by a perfect iErrPs classifier (α1 = α2 = 0).

Figure 5.12 illustrates the comparative results for all binary classification tasks in data set 2A from BCI competition IV when the size of the calibration set is equal to 10 trials. The x-axis corresponds to different values of the update coefficient and the y-axis to the average classification accuracy over all subjects. Because the outputs of the iErrPs classifier are randomly generated according to the false positive and false negative rates, the results of the adaptive ensemble method using the realistic iErrPs classifier were averaged over 100 runs for each subject and each value of UC. As we can see, using iErrPs as reinforcement signals for the ensemble’s predictions prevents error accumulation and increases performance in most cases. The increase in classification accuracy depends on the value of the update coefficient UC. Using a perfect iErrPs classifier increased classification performance in all binary classification tasks, especially for values of the update coefficient around 0.5. Using a realistic iErrPs classifier increased classification performance except for feet vs. tongue motor imagery; the decrease in performance for this task may be related to its high number of outliers. Table 5.5 shows the results for each subject when the update coefficient is set to 0.5. In most cases, except for the last binary classification task, online adaptation of base classifiers’ weights based on the ensemble’s predictions reinforced by a realistic iErrPs classifier increased classification performance in comparison to the static accuracy-weighted ensemble method. This is important in online settings because the value of the update coefficient has to be fixed before the BCI is used.
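The random generation of detector outputs described above can be sketched as follows: for each feedback trial, a correct prediction is flagged as an error with probability α1 and an actual error is missed with probability α2. The function name and the use of NumPy's random generator are illustrative assumptions.

```python
import numpy as np

def simulate_ierrp(pred_correct, alpha1=0.165, alpha2=0.208, rng=None):
    """Randomly generate iErrPs detector outputs for a sequence of trials.

    pred_correct : boolean sequence, True where the task classifier was right.
    alpha1       : false positive rate (error flagged on a correct trial).
    alpha2       : false negative rate (error missed on an erroneous trial).
    Returns an array of detector outputs, 1 where an error is signaled."""
    rng = np.random.default_rng(rng)
    u = rng.random(len(pred_correct))
    # correct trial -> error flagged with probability alpha1
    # wrong trial   -> error flagged with probability 1 - alpha2
    flagged = np.where(np.asarray(pred_correct), u < alpha1, u < 1.0 - alpha2)
    return flagged.astype(int)
```

Setting alpha1 = alpha2 = 0 yields the perfect-detection scenario, where the detector output exactly mirrors the true correctness of each prediction.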


[Figure: six panels (left hand vs. right hand, left hand vs. feet, left hand vs. tongue, right hand vs. feet, right hand vs. tongue, feet vs. tongue); x-axis: update coefficient UC (0-1); y-axis: average classification accuracy (%); curves: static, realistic iErrP detection, perfect iErrP detection]

Figure 5.12: Comparative results for all binary classification tasks in data set 2A from BCI competition IV when the size of calibration set is equal to 10 trials. The x-axis corresponds to different values of the update coefficient and the y-axis to the average classification accuracy


Table 5.5: Comparative results for all binary classification tasks in data set 2A from BCI competition IV when the size of calibration set is equal to 10 trials and the update coefficient is equal to 0.5

Task           | Method    | S1   | S2   | S3   | S4   | S5   | S6   | S7   | S8   | S9   | Average
Left – Right   | static    | 81.2 | 51.4 | 95.8 | 65.0 | 48.6 | 58.3 | 64.6 | 89.6 | 56.3 | 67.9 ±17.0
               | realistic | 81.9 | 60.2 | 94.0 | 55.5 | 51.7 | 54.7 | 73.7 | 90.2 | 71.2 | 70.3 ±15.9
               | perfect   | 82.6 | 59.7 | 95.8 | 60.8 | 52.7 | 57.6 | 83.3 | 90.3 | 75.7 | 73.2 ±15.8
Left – Feet    | static    | 89.5 | 69.2 | 75.5 | 51.0 | 50.3 | 61.5 | 92.3 | 65.0 | 82.5 | 70.8 ±15.4
               | realistic | 93.5 | 69.2 | 88.3 | 73.0 | 50.8 | 53.1 | 88.2 | 65.8 | 84.3 | 74.0 ±15.6
               | perfect   | 94.4 | 71.3 | 90.2 | 77.6 | 49.7 | 51.7 | 92.3 | 67.1 | 85.3 | 75.5 ±16.9
Left – Tongue  | static    | 86.8 | 57.6 | 86.8 | 60.1 | 62.5 | 50.7 | 84.7 | 81.3 | 89.6 | 73.3 ±15.3
               | realistic | 93.2 | 54.3 | 90.6 | 68.1 | 52.2 | 54.1 | 88.6 | 91.3 | 91.5 | 76.0 ±18.4
               | perfect   | 95.1 | 53.5 | 93.8 | 77.6 | 56.9 | 55.6 | 92.4 | 93.1 | 89.6 | 78.6 ±18.2
Right – Feet   | static    | 97.2 | 60.8 | 95.1 | 62.5 | 53.1 | 58.0 | 95.8 | 69.2 | 55.9 | 72.0 ±18.6
               | realistic | 96.3 | 72.4 | 96.6 | 67.3 | 52.0 | 59.4 | 93.9 | 74.9 | 73.2 | 76.2 ±16.2
               | perfect   | 99.3 | 74.1 | 96.5 | 72.2 | 51.7 | 64.3 | 96.5 | 75.5 | 76.2 | 78.5 ±16.1
Right – Tongue | static    | 93.1 | 66.0 | 94.4 | 67.4 | 59.0 | 50.7 | 95.1 | 88.9 | 65.3 | 75.5 ±17.2
               | realistic | 94.8 | 58.4 | 95.2 | 68.1 | 52.3 | 52.8 | 93.9 | 86.4 | 73.5 | 75.1 ±18.1
               | perfect   | 96.5 | 63.2 | 95.8 | 75.7 | 52.1 | 51.4 | 96.5 | 88.9 | 78.5 | 77.6 ±18.5
Feet – Tongue  | static    | 63.6 | 69.9 | 73.4 | 45.8 | 55.9 | 48.3 | 64.3 | 75.5 | 81.8 | 64.3 ±12.3
               | realistic | 56.3 | 66.7 | 71.7 | 50.7 | 49.4 | 51.4 | 60.3 | 78.7 | 77.9 | 62.6 ±11.6
               | perfect   | 49.0 | 76.9 | 79.0 | 50.0 | 52.4 | 50.3 | 66.4 | 79.7 | 84.6 | 65.4 ±15.0

Results for the data set from the BNCI Horizon 2020 project are illustrated in figure 5.13 and table 5.6. Figure 5.13 shows the average performance of the static and adaptive ensemble methods for different values of the update coefficient UC. As in the first data set, using a realistic or a perfect iErrPs classifier to assess the reliability of the ensemble’s predictions increased the performance of the task classifier, especially for values of UC around 0.5. Table 5.6 shows the detailed results for all subjects when UC = 0.5. For most subjects, online adaptation of base classifiers’ weights increased classification accuracy, but many subjects in this data set performed poorly because they are naïve BCI users, which prevents drawing conclusions about them.


[Figure: x-axis: update coefficient UC (0-1); y-axis: average classification accuracy (%); curves: static, realistic iErrP detection, perfect iErrP detection]

Figure 5.13: Comparative results for the binary classification data set from BNCI Horizon 2020 project when the size of calibration set is equal to 10 trials. The x-axis corresponds to different values of the update coefficient and the y-axis to the average classification accuracy over all subjects.

Table 5.6: Comparative results for the binary classification data set from BNCI Horizon 2020 project when the size of calibration set is equal to 10 trials and the update coefficient is equal to 0.5.

Subject    | Static     | Realistic iErrP detection | Perfect iErrP detection
Subject 1  | 51.0       | 47.5                      | 50.7
Subject 2  | 78.3       | 72.9                      | 80.0
Subject 3  | 75.7       | 87.8                      | 91.6
Subject 4  | 51.7       | 64.8                      | 75.0
Subject 5  | 55.6       | 61.1                      | 63.3
Subject 6  | 55.0       | 69.9                      | 78.3
Subject 7  | 67.3       | 76.7                      | 85.0
Subject 8  | 51.6       | 53.3                      | 60.0
Subject 9  | 90.7       | 91.3                      | 93.3
Subject 10 | 47.7       | 46.4                      | 46.7
Subject 11 | 50.0       | 56.2                      | 58.3
Subject 12 | 50.3       | 54.5                      | 58.4
Subject 13 | 51.4       | 54.7                      | 63.3
Subject 14 | 51.7       | 59.2                      | 63.2
Average    | 59.1 ±13.4 | 64.0 ±14.0                | 69.0 ±14.8

For further investigation of the behavior of the adaptive ensemble method, figure 5.14 illustrates the evolution of base classifiers’ weights between the beginning and the end of the test session for two different cases in the left hand vs. right hand motor imagery task in data set 2A from BCI competition IV. Figures 5.14 (a) and 5.14 (b) show the normalized weights of the base classifiers for subject 3 at the beginning and the end of the test session, respectively. For this subject, the base classifier learned using EEG signals recorded from subject 4 and filtered in the 8-30Hz frequency band maintained the highest weight throughout the test set (“robust” classifier), which is reflected in the classification accuracy of the static weighted-average ensemble being equal to that of the adaptive ensemble using a perfect iErrPs classifier. Conversely, both the adaptive ensemble method using a realistic iErrPs classifier and the adaptive ensemble method using a perfect iErrPs classifier significantly increased classification accuracy for subject 7 in comparison to the static ensemble, which is related to the large change of base classifiers’ weights between the beginning of the test session (figure 5.14 (c)) and the end of it (figure 5.14 (d)).

[Figure: four panels (a)-(d) showing heat maps of normalized base classifier weights; rows: frequency bands (8-12Hz, 10-14Hz, 12-16Hz, 14-18Hz, 16-20Hz, 18-22Hz, 20-24Hz, 22-26Hz, 24-28Hz, 26-30Hz, 8-30Hz); columns: source subjects]

Figure 5.14: The evolution of base classifiers’ weights during the test session for two different subjects in data set 2A from BCI competition IV. (a) and (b) correspond to base classifiers’ weights at the beginning and the end of test session of subject 3. (c) and (d) correspond to base classifiers’ weights at the beginning and the end of test session of subject 7

5.2.2.5.4 Comparison with a state-of-the-art adaptive classification model

In order to assess its online predictive accuracy, the AAWE algorithm was compared to an adaptive prediction model in which the parameters of the CSP filters and the corresponding LDA classifier learned using the calibration data of the target subject are updated during the feedback phase based on iErrPs detection. Class means and covariance matrices are updated in the same way as in the “Adaptive LDA.learnAll” algorithm presented in (Zeyl and Chau, 2014) (see algorithm 3).


Algorithm 3: Adaptive CSP+LDA online update procedure
1: Given: new feature vector x_t; old CSP and LDA covariance matrices Σ^i_{t−1} and Σ^j_{t−1} (where i is the predicted class label and j is the opposite label); old LDA class means μ^i_{t−1} and μ^j_{t−1}; learning rates for class means and covariance matrices τ_μ and τ_Σ; iErrPs classifier output Ẽ
2: if Ẽ = 1
3:   μ^j_t = (1 − τ_μ) · μ^j_{t−1} + τ_μ · x_t
4:   Σ^j_t = (1 − τ_Σ) · Σ^j_{t−1} + τ_Σ · (x_t − μ^j_t)(x_t − μ^j_t)^T
5: else
6:   μ^i_t = (1 − τ_μ) · μ^i_{t−1} + τ_μ · x_t
7:   Σ^i_t = (1 − τ_Σ) · Σ^i_{t−1} + τ_Σ · (x_t − μ^i_t)(x_t − μ^i_t)^T
8: end
9: Return: μ^i_t; μ^j_t; Σ^i_t; Σ^j_t
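Under the assumption that class means and covariances are stored per label, the update procedure of algorithm 3 can be sketched in a few lines of NumPy (the function name and the default learning rates are illustrative choices, not values from the cited paper):

```python
import numpy as np

def adaptive_lda_update(x_t, mu, sigma, i, j, err, tau_mu=0.05, tau_sigma=0.05):
    """One step of the adaptive LDA update of algorithm 3.

    mu, sigma : dicts mapping class label -> mean vector / covariance matrix.
    i         : predicted class label; j : the opposite label.
    err       : iErrPs detector output (1 = the prediction was judged wrong).
    The trial is credited to class j when an error is detected, else to
    class i; only that class's statistics are updated."""
    c = j if err == 1 else i
    mu[c] = (1.0 - tau_mu) * mu[c] + tau_mu * x_t
    d = (x_t - mu[c]).reshape(-1, 1)
    sigma[c] = (1.0 - tau_sigma) * sigma[c] + tau_sigma * (d @ d.T)
    return mu, sigma
```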

Tables 5.7 and 5.8 report comparative results of the two methods using realistic iErrPs detection (α1 = 16.5% and α2 = 20.8%) for all binary classification tasks in data set 2A from BCI competition IV and for the data set from the BNCI Horizon 2020 project, respectively. On the first data set, the AAWE algorithm outperformed the adaptive CSP+LDA algorithm in most cases. On the second data set, there is no significant difference between the two methods: for some subjects the AAWE algorithm obtained better results, while for others the adaptive CSP+LDA algorithm was better. Furthermore, many subjects did not perform well, which prevents drawing conclusions from this data set.

Table 5.7: Comparative results of the AAWE method and a state-of-the-art adaptive classification method using realistic iErrPs detection for all binary classification tasks in data set 2A from BCI competition IV

Task           | Method           | S1   | S2   | S3   | S4   | S5   | S6   | S7   | S8   | S9   | Average
Left – Right   | AAWE             | 81.9 | 60.2 | 94.0 | 55.5 | 51.7 | 54.7 | 73.7 | 90.2 | 71.2 | 70.3 ±15.9
               | Adaptive CSP+LDA | 74.6 | 49.1 | 80.2 | 58.1 | 47.6 | 53.7 | 64.1 | 87.3 | 85.7 | 66.7 ±15.6
Left – Feet    | AAWE             | 93.5 | 69.2 | 88.3 | 73.0 | 50.8 | 53.1 | 88.2 | 65.8 | 84.3 | 74.0 ±15.6
               | Adaptive CSP+LDA | 87.9 | 64.6 | 79.5 | 69.2 | 51.9 | 54.0 | 85.9 | 74.5 | 90.1 | 73.1 ±14.2
Left – Tongue  | AAWE             | 93.2 | 54.3 | 90.6 | 68.1 | 52.2 | 54.1 | 88.6 | 91.3 | 91.5 | 76.0 ±18.4
               | Adaptive CSP+LDA | 88.1 | 56.4 | 79.5 | 69.7 | 54.3 | 55.3 | 82.3 | 83.3 | 91.4 | 73.4 ±14.8
Right – Feet   | AAWE             | 96.3 | 72.4 | 96.6 | 67.3 | 52.0 | 59.4 | 93.9 | 74.9 | 73.2 | 76.2 ±16.2
               | Adaptive CSP+LDA | 91.0 | 68.9 | 85.9 | 71.9 | 49.7 | 53.4 | 86.0 | 73.1 | 76.0 | 72.9 ±14.2
Right – Tongue | AAWE             | 94.8 | 58.4 | 95.2 | 68.1 | 52.3 | 52.8 | 93.9 | 86.4 | 73.5 | 75.1 ±18.1
               | Adaptive CSP+LDA | 92.9 | 59.9 | 91.9 | 63.8 | 56.3 | 54.6 | 85.6 | 75.9 | 74.7 | 72.9 ±15.0
Feet – Tongue  | AAWE             | 56.3 | 66.7 | 71.7 | 50.7 | 49.4 | 51.4 | 60.3 | 78.7 | 77.9 | 62.6 ±11.6
               | Adaptive CSP+LDA | 55.2 | 67.4 | 63.8 | 55.6 | 53.3 | 55.8 | 67.2 | 73.2 | 83.8 | 63.9 ±10.2


Table 5.8: Comparative results of the AAWE method and a state-of-the-art adaptive classification method using realistic iErrPs detection for the binary classification data set from BNCI Horizon 2020 project

Subject    | AAWE       | Adaptive CSP+LDA
Subject 1  | 47.5       | 48.4
Subject 2  | 72.9       | 65.9
Subject 3  | 87.8       | 91.1
Subject 4  | 64.8       | 74.7
Subject 5  | 61.1       | 53.9
Subject 6  | 69.9       | 60.4
Subject 7  | 76.7       | 67.9
Subject 8  | 53.3       | 67.6
Subject 9  | 91.3       | 81.4
Subject 10 | 46.4       | 49.0
Subject 11 | 56.2       | 59.5
Subject 12 | 54.5       | 56.8
Subject 13 | 54.7       | 50.3
Subject 14 | 59.2       | 53.1
Average    | 64.0 ±14.0 | 62.9 ±12.8

Figure 5.15 summarizes the idea of combining knowledge transfer and online adaptation techniques. It shows the online classification accuracy of the baseline classification method, the adaptive classification method and the proposed method using perfect error-related potentials detection for subject 2 in the first data set when performing the right hand vs. feet motor imagery task. As we can see, learning from other subjects allows starting with a well-performing prediction model, while online adaptation maintains the optimality of this prediction model over time.
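The curves in figure 5.15-style plots are cumulative accuracies over the feedback trials; computing such a curve is a one-liner (a sketch, assuming hard class labels):

```python
import numpy as np

def running_accuracy(pred, true):
    """Cumulative online accuracy after each feedback trial
    (the quantity plotted against the number of trials)."""
    correct = np.asarray(pred) == np.asarray(true)
    return np.cumsum(correct) / np.arange(1, len(correct) + 1)
```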

[Figure: x-axis: number of trials during feedback phase (20-140); y-axis: online classification accuracy (%); curves: AAWE, adaptive CSP+LDA, baseline]

Figure 5.15: Illustration of the advantage of combining inter-subjects classification and online adaptation in MI-based BCIs. The figure corresponds to comparative results for subject 2 in data set 2A from BCI competition IV when performing right hand vs. feet motor imagery task


5.2.2.6 Multi-class classification results

Data set 2A from BCI competition IV was used for assessing the performance of the adaptive accuracy-weighted ensemble for multi-class classification.

5.2.2.6.1 Three-class classification

Comparative results of the static accuracy-weighted ensemble and the baseline classifier learned using only the calibration data of the target subject are illustrated in figure 5.16. Results for each three-class classification task are shown in a separate graph. The x-axis represents different sizes of the calibration set from the target subject and the y-axis corresponds to the average classification performance over all subjects. As we can see, even if it is above the chance level (33% for three-class classification), the accuracy of the baseline classifier when the calibration set is small is relatively low compared to its accuracy for larger amounts of training data. Results of the accuracy-weighted ensemble show that learning from other subjects significantly increases classification performance in comparison to the baseline classifier (p = 0.006 for a paired t-test over all classification tasks and all subjects when the size of the calibration set is equal to 20 trials). As in binary classification, increasing the size of the calibration set has no effect on the accuracy of the inter-subjects classification model, which remains constant. Detailed results of the baseline model and the static inter-subjects classification model for all three-class classification tasks and all subjects when the size of the calibration set is equal to 20 trials are shown in table 5.9.


[Figure: four panels (left hand-right hand-feet, left hand-right hand-tongue, left hand-feet-tongue, right hand-feet-tongue); x-axis: size of calibration set (20-280 trials); y-axis: average classification accuracy (%); curves: baseline and AWE]

Figure 5.16: Comparative results of the baseline classification method and the static ensemble-based inter-subjects classification methods for all three-class classification tasks in data set 2A from BCI competition IV

Table 5.9: Detailed classification results of the baseline method and the static ensemble-based inter-subjects classification method for all three-class classification tasks in data set 2A from BCI competition IV when the size of calibration set is equal to 20 trials

Task                    | Method   | S1   | S2   | S3   | S4   | S5   | S6   | S7   | S8   | S9   | Average
Left, Right, Feet       | Baseline | 52.6 | 37.2 | 44.7 | 34.9 | 31.2 | 33.0 | 47.9 | 61.4 | 47.4 | 43.4 ±10.0
                        | AWE      | 82.8 | 43.3 | 83.3 | 43.2 | 35.8 | 40.5 | 82.3 | 64.2 | 52.1 | 58.6 ±19.8
Left, Right, Tongue     | Baseline | 49.5 | 42.6 | 62.0 | 37.2 | 32.4 | 42.1 | 75.9 | 65.7 | 42.6 | 50.0 ±14.6
                        | AWE      | 80.6 | 47.7 | 83.8 | 47.0 | 30.6 | 36.1 | 76.9 | 81.5 | 60.6 | 60.5 ±20.9
Left, Feet, Tongue      | Baseline | 45.1 | 39.1 | 54.0 | 37.2 | 34.4 | 40.0 | 61.4 | 43.7 | 83.3 | 48.7 ±15.5
                        | AWE      | 72.1 | 49.8 | 63.7 | 34.4 | 37.2 | 39.1 | 68.8 | 61.9 | 73.5 | 55.6 ±15.7
Right, Feet, Tongue     | Baseline | 52.1 | 42.3 | 54.0 | 34.7 | 37.2 | 32.6 | 70.2 | 51.6 | 62.8 | 48.6 ±12.9
                        | AWE      | 77.7 | 52.6 | 77.6 | 36.5 | 35.8 | 38.6 | 72.1 | 71.6 | 58.1 | 57.8 ±17.7


Figure 5.17 illustrates the average classification performances of the adaptive ensemble method using guided labeling and the adaptive ensemble method using guided labeling reinforced by interaction error-related potentials for different values of the update coefficient UC. Contrary to binary classification, neither weight-adaptation method increased the classification performance of the static inter-subjects classification model. Updating the weights of base classifiers only in the absence of iErrPs detection is not effective and can wrongly change the structure of the ensemble because of false negatives. Since static inter-subjects classification already significantly increases performance when the calibration set is small, it should not be updated for multi-class classification, in order to prevent error accumulation.

65

65

average classification accuracy (%)

average classification accuracy (%)

left hand - right hand - feet 70

60 55 50 45 40 static guided perfect iErrP detection

35 30

0

0.1

0.2

0.3

0.4

60 55 50 45 40 static guided perfect iErrP detection

35

0.5 UC

0.6

0.7

0.8

0.9

30

1

0

0.1

0.2

0.3

70

65

65

60 55 50 45 40 static guided perfect iErrP detection

35 30

0

0.1

0.2

0.3

0.4

0.6

0.6

0.7

0.8

0.9

1

0.7

0.8

0.9

1

60 55 50 45 40 static guided perfect iErrP detection

35

0.5 UC

0.5 UC

right hand - feet - tongue

70

average classification accuracy (%)

average classification accuracy (%)

left hand - feet - tongue

0.4

0.7

0.8

0.9

30

1

0

0.1

0.2

0.3

0.4

0.5 UC

0.6

Figure 5.17: Classification results of the different base classifiers' weights online adaptation techniques for all three-class classification tasks in data set 2A from BCI competition IV

5.2.2.6.2 Four-class classification results

For four-class classification, the chance level is equal to 25%. The results reported in figure 5.18 and table 5.10 confirm the conclusions drawn from the three-class classification study. Comparison with a baseline classifier learned using calibration data from the target subject shows that inter-subjects classification increases performance when the size of the calibration set is relatively small (see figure 5.18).


[Figure: average classification accuracy (%) of the baseline and AWE methods as a function of the size of the calibration set (20 to 280 trials).]

Figure 5.18: Comparative results of the baseline classification method and the static ensemble-based inter-subjects classification method for four-class classification in data set 2A from BCI competition IV

Table 5.10: Detailed classification results of the baseline method and the static ensemble-based inter-subjects classification method for four-class classification in data set 2A from BCI competition IV when the size of the calibration set is equal to 20 trials

            S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
Baseline   37.6  32.4  46.7  28.2  24.4  27.2  60.3  49.8  52.9  40.0  12.9
AWE        63.8  39.0  66.2  38.0  33.4  31.4  60.6  59.9  50.5  49.2  13.9

As in the three-class case, online adaptation of base classifiers' weights did not improve the performance of the inter-subjects classification model (figure 5.19). Increasing the number of classes increases the ambiguity about which class label should be considered correct when an iErrP is detected. Updating the ensemble only when the predicted class label is assumed to be correct is not an effective way of tracking the non-stationarity of EEG signals. The ensemble-based inter-subjects classification framework should therefore only be updated for binary classification tasks; for multi-class classification, other approaches should be investigated.


[Figure: average classification accuracy (%) as a function of UC (0 to 1) for the static, guided, and perfect iErrP detection variants in the four-class task.]

Figure 5.19: Classification results of the different base classifiers’ weights online adaptation techniques for four-class classification in data set 2A from BCI competition IV

5.3 Study of the case when there is no calibration data from target subject

In this section, the zero-calibration case is investigated. The same weighting scheme as in the previously described algorithm is used, but base classifiers' weights are initialized using a static weighting scheme based on source data, which removes the need for calibration data from the target subject. Since base classifiers' weights adaptation using ensemble predictions reinforced by interaction error-related potentials was not effective for multi-class classification, only binary classification is treated in this section.

5.3.1 Methods

5.3.1.1 Base classifiers' weights initialization

In the absence of labeled trials from the target subject, base classifiers' weights can be initialized using static weighting based on source data: each base classifier is weighted according to its capacity to generalize across subjects. Let $\{W^{sf}, s = 1 \ldots S, f = 1 \ldots F\}$ and $\{h^{sf}, s = 1 \ldots S, f = 1 \ldots F\}$ be the CSP filter banks and corresponding classifiers learned using the EEG data sets recorded from $S$ subjects and temporally filtered in $F$ frequency bands. The mean-squared error of the classifier $h^{sf}$ in classifying data from the other subjects can be written as:

$$MSE^{sf} = \sum_{\substack{s_k = 1 \\ s_k \neq s}}^{S} \frac{1}{T_{s_k}} \sum_{t=1}^{T_{s_k}} \left( 1 - h_{y_t}^{sf}(x_t^{sf}) \right)^2 \qquad (5.12)$$

where:
- $T_{s_k}$ is the size of the EEG data set recorded from subject $s_k$.
- $x_t^{sf}$ is the logarithmic variance feature vector corresponding to the multi-channel EEG measurement $X_t$, temporally filtered in the frequency band $f$ and spatially filtered using the CSP filter bank $W^{sf}$.
- $h_{y_t}^{sf}(x_t^{sf}) \in [0, 1]$ is the probabilistic output representing the support of the classifier $h^{sf}$ for class $y_t$ given the feature vector $x_t^{sf}$.

Given the mean-squared errors of all classifiers, their weights are initialized as:

$$w^{sf} = \max(0,\ MSE^{r} - MSE^{sf}), \quad s = 1 \ldots S,\ f = 1 \ldots F \qquad (5.13)$$

where $MSE^{r} = \sum_{y} p(y)\,(1 - p(y))^2$ is the mean-squared error of a random classifier. Initialization of the accuracy-weighted ensemble learned from different BCI users is illustrated in algorithm 4.

Algorithm 4 Base classifiers' weights initialization
1:  Given: CSP filter banks $\{W^{sf}, s = 1 \ldots S, f = 1 \ldots F\}$ and corresponding classifiers $\{h^{sf}, s = 1 \ldots S, f = 1 \ldots F\}$ learned using EEG signals from $S$ subjects, filtered in $F$ frequency bands; mean-squared error of a random classifier $MSE^{r}$.
2:  for $s = 1:S$
3:    for $f = 1:F$
4:      $MSE^{sf} = 0$
5:      for $s_k = 1:S$, $s_k \neq s$
6:        for $t = 1:T_{s_k}$
7:          $Z_t = filter(X_t, f)$*
8:          $x_t^{sf} = \log\left( \mathrm{diag}\big((W^{sf} Z_t')(W^{sf} Z_t')'\big) \,/\, \mathrm{trace}\big((W^{sf} Z_t')(W^{sf} Z_t')'\big) \right)$
9:          $MSE^{sf} = MSE^{sf} + \frac{1}{T_{s_k}} \left( 1 - h_{y_t}^{sf}(x_t^{sf}) \right)^2$
10:       end for
11:     end for
12:     $w^{sf} = \max(0, MSE^{r} - MSE^{sf})$
13:   end for
14: end for
15: Return: $\sum_{s=1}^{S} \sum_{f=1}^{F} w^{sf} h^{sf}$
* $filter(X_t, f)$ performs band-pass temporal filtering of $X_t$ in the frequency band $f$.
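A compact Python sketch may make this initialization concrete. This is my illustration, not code from the thesis: the container layout (per-subject lists of probabilistic supports for the true labels) and all function names are assumptions.

```python
import numpy as np

def mse_random(class_priors):
    """MSE of a random classifier: MSE^r = sum_y p(y) * (1 - p(y))^2.
    For two balanced classes this equals 2 * 0.5 * 0.25 = 0.25."""
    p = np.asarray(class_priors, dtype=float)
    return float(np.sum(p * (1.0 - p) ** 2))

def log_variance_features(W, Z):
    """Log-variance feature vector (Algorithm 4, line 8).
    W: CSP filter bank (filters x channels), Z: band-pass-filtered
    trial (channels x samples)."""
    P = W @ Z                      # spatially filtered signals
    var = np.diag(P @ P.T)         # variance of each CSP component
    return np.log(var / var.sum())

def init_weights(supports, mse_r):
    """Static weight initialization from source data (eqs 5.12-5.13).
    supports[s][f] is a list, over the other subjects s_k != s, of
    1-D arrays holding h^{sf}_{y_t}(x_t^{sf}), i.e. the probabilistic
    support of classifier h^{sf} for the TRUE label of each trial of
    subject s_k.  This data layout is an assumption of the sketch."""
    S, F = len(supports), len(supports[0])
    w = np.zeros((S, F))
    for s in range(S):
        for f in range(F):
            # eq. (5.12): sum over other subjects of per-subject MSE
            mse = sum(np.mean((1.0 - np.asarray(h)) ** 2)
                      for h in supports[s][f])
            w[s, f] = max(0.0, mse_r - mse)   # eq. (5.13)
    return w
```

With two balanced classes, `mse_random([0.5, 0.5])` is 0.25, so any base classifier whose cross-subject MSE exceeds 0.25 receives zero weight and is effectively pruned from the ensemble.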


5.3.1.2 Base classifiers' weights adaptation using ensemble's predictions reinforced by interaction error-related potentials (iErrPs)

During the self-paced interaction mode, the mean-squared errors of base classifiers are updated in a way slightly different from equation (5.8). Given the mean-squared errors of base classifiers at time step $t$, $\{MSE^{sf}(t), s = 1 \ldots S, f = 1 \ldots F\}$, and the output of the iErrPs classifier $\tilde{E} \in \{0, 1\}$ after predicting the class label $\tilde{y}_{t+1}$, the mean-squared errors at time step $t+1$ are calculated as:

$$MSE^{sf}(t+1) = (1 - UC) \cdot MSE^{sf}(t) + UC \cdot \left( (1 - \tilde{E}) - h_{\tilde{y}_{t+1}}^{sf}(x_{t+1}^{sf}) \right)^2 \qquad (5.14)$$

where $UC \in [0, 1]$ is the update coefficient. Equation (5.8) was not used for updating the mean-squared errors of base classifiers in order to avoid large changes in the ensemble's structure at the beginning of the feedback phase ($t$ starts from 1 instead of the size of the calibration set + 1). Algorithm 5 illustrates the online adaptation of base classifiers' weights using the ensemble's predictions reinforced by iErrPs.

Algorithm 5 Online adaptation of base classifiers' weights using ensemble's predictions reinforced by interaction error-related potentials for binary classification tasks in MI-based BCIs
1:  Given: CSP filter banks $\{W^{sf}, s = 1 \ldots S, f = 1 \ldots F\}$ and corresponding classifiers $\{h^{sf}, s = 1 \ldots S, f = 1 \ldots F\}$; mean-squared errors of base classifiers at time step $t$ of the new session $\{MSE^{sf}(t)\}$; a new multi-channel EEG measurement $X_{t+1}$; mean-squared error of a random classifier $MSE^{r}$; update coefficient $UC$.
2:  for $s = 1:S$
3:    for $f = 1:F$
4:      $Z_{t+1} = filter(X_{t+1}, f)$
5:      $x_{t+1}^{sf} = \log\left( \mathrm{diag}\big((W^{sf} Z_{t+1}')(W^{sf} Z_{t+1}')'\big) \,/\, \mathrm{trace}\big((W^{sf} Z_{t+1}')(W^{sf} Z_{t+1}')'\big) \right)$
6:      $w^{sf}(t) = \max(0, MSE^{r} - MSE^{sf}(t))$
7:    end for
8:  end for
9:  $\tilde{y}_{t+1} = \mathrm{argmax}_y \left( \sum_{s=1}^{S} \sum_{f=1}^{F} w^{sf}(t) \cdot h_y^{sf}(x_{t+1}^{sf}) \right)$
10: for $s = 1:S$
11:   for $f = 1:F$
12:     $MSE^{sf}(t+1) = (1 - UC) \cdot MSE^{sf}(t) + UC \cdot \left( (1 - \tilde{E}) - h_{\tilde{y}_{t+1}}^{sf}(x_{t+1}^{sf}) \right)^2$*
13:   end for
14: end for
15: Return: $\{MSE^{sf}(t+1), s = 1 \ldots S, f = 1 \ldots F\}$
* $\tilde{E}$ is given by the iErrPs classifier immediately after step 9.
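A hedged Python sketch of one self-paced step (ensemble prediction at line 9 of Algorithm 5, followed by the update of equation 5.14) is given below; the array shapes and names are my assumptions, not the thesis' implementation.

```python
import numpy as np

def predict_and_adapt(mse, probs, e_tilde, mse_r, uc):
    """One self-paced step of Algorithm 5 (sketch).
    mse    : (S, F) array of current mean-squared errors MSE^{sf}(t)
    probs  : (S, F, n_classes) outputs h^{sf}_y(x^{sf}_{t+1})
    e_tilde: iErrPs classifier output (1 = an error was detected)
    mse_r  : mean-squared error of a random classifier
    uc     : update coefficient in [0, 1]
    Returns the ensemble's predicted label and MSE^{sf}(t+1)."""
    w = np.maximum(0.0, mse_r - mse)           # line 6: weights
    scores = np.einsum('sf,sfy->y', w, probs)  # line 9: weighted vote
    y_pred = int(np.argmax(scores))
    # eq. (5.14): the regression target is 1 - E~, i.e. the predicted
    # label is treated as correct unless the iErrPs classifier flags
    # an error
    mse_next = (1.0 - uc) * mse + uc * ((1.0 - e_tilde)
                                        - probs[:, :, y_pred]) ** 2
    return y_pred, mse_next
```

With uc close to 1 the ensemble forgets its history almost entirely at each trial, while uc = 0 freezes it to the static initialization; the thesis uses uc = 0.05 in the zero-calibration experiments.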


5.3.2 Experiments

The six binary classification tasks in data set 2A from BCI competition IV and the binary classification data set from the BNCI Horizon 2020 project were used for experimental evaluation of the zero-calibration inter-subjects classification method. Data were processed in the same way as in the previous section, and evaluation was performed offline using leave-one-subject-out cross-validation.

5.3.2.1 Can we completely suppress the calibration phase?

As reported in the literature, subject-calibrated BCI systems are consistently more accurate than zero-calibration systems (Samek, 2014). In this section, I investigate whether combining knowledge transfer and online adaptation allows completely suppressing calibration without decreasing the classification accuracy of the BCI. Figures 5.20 and 5.21 show comparative results of the classification model learned using only 10 trials from the target subject (baseline), the static ensemble-based inter-subjects classification model (static), and the adaptive ensemble-based inter-subjects classification model using an iErrPs classifier with realistic and with perfect detection. An update coefficient of 0.05 was used for online adaptation of base classifiers' weights in the adaptive ensemble method. As shown, the static accuracy-weighted ensemble is outperformed by the baseline classification method in most binary classification tasks, because learning both base classifiers and their weights using data from other subjects captures only subject-independent information, which is not sufficient for generalizing to new subjects. Online adaptation of base classifiers' weights based on ensemble predictions reinforced by an iErrPs classifier with a realistic detection rate increases performance, especially for right hand vs. feet motor imagery in both data sets.


[Figure: average classification accuracy (%) of the baseline, static, realistic iErrP detection and perfect iErrP detection schemes for the tasks left hand - right hand, left hand - feet, left hand - tongue, right hand - feet, right hand - tongue and feet - tongue.]

Figure 5.20: Comparative results of different classification schemes for all binary classification tasks in data set 2A from BCI competition IV

[Figure: the same comparison for the binary classification data set from the BNCI Horizon 2020 project.]

Figure 5.21: Comparative results of different classification schemes for the binary classification data set from BNCI Horizon 2020 project

Figure 5.22 shows comparative results of the adaptive accuracy-weighted ensemble with a small calibration set from the target subject (10 trials) and without any calibration data, for all binary classification tasks in data set 2A from BCI competition IV. Realistic iErrPs detection was used in both cases (results were averaged over 100 tests). The update coefficient UC was set to 0.5 in the first case and to 0.05 in the second. Although the zero-calibration approach achieved acceptable classification performance in most binary tasks (the minimum acceptable classification rate for binary classification in BCIs is 70% (Thomas et al., 2013)), its performance remains significantly lower than that of the subject-calibrated approach (p = 0.032). This suggests that a short calibration phase is always useful for increasing the performance of the classification model, as reported in previous work.
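The thesis does not state which paired test produced p = 0.032; a Wilcoxon signed-rank test over per-subject accuracies is a common choice for this kind of comparison and could be computed as follows. The accuracy values below are invented for illustration only.

```python
from scipy.stats import wilcoxon

# Hypothetical per-subject accuracies (%) for the subject-calibrated
# and zero-calibration variants of the adaptive ensemble
# (illustrative numbers, not the thesis' results).
acc_calibrated = [78.5, 54.4, 89.4, 56.2, 50.0, 54.9, 69.4, 84.0, 75.1]
acc_zero_calib = [72.1, 51.0, 83.1, 52.3, 49.1, 52.0, 64.9, 80.2, 70.5]

# Paired, two-sided test on the per-subject accuracy differences
stat, p = wilcoxon(acc_calibrated, acc_zero_calib)
print(f"Wilcoxon signed-rank statistic = {stat}, p = {p:.4f}")
```

Here every subject favors the calibrated variant, so the test reports a small p-value; with real data the per-subject differences would of course vary in sign and magnitude.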


[Figure: average classification accuracy (%) of the minimizing-calibration and suppressing-calibration variants for each binary classification task: left hand - right hand, left hand - feet, left hand - tongue, right hand - feet, right hand - tongue, feet - tongue.]

Figure 5.22: Comparative results of the adaptive accuracy-weighted ensemble using a small calibration set from target subject (10 trials) and the adaptive accuracy-weighted ensemble without using calibration data from target subject for all binary classification tasks in data set 2A from BCI competition IV

5.3.2.2 Comparison with dynamic classifiers weighting approaches

One of the main motivations of the proposed online adaptation of base classifiers' weights in weighted-average ensemble methods is to provide an alternative to the dynamic classifiers weighting approaches that are often used for ensemble classification in non-stationary environments (Kuncheva, 2004a). In this section, the classification performances of the two previously described dynamic classifiers weighting approaches are assessed and compared to the proposed approach. Tables 5.11 to 5.16 report the comparative results of the following classification approaches for all binary classification tasks in data set 2A from BCI competition IV:

- Baseline: classification model learned using a small calibration set (10 trials) from the target subject.
- AWE: static accuracy-weighted ensemble method.
- AAWE - realistic: adaptive accuracy-weighted ensemble method using an iErrPs classifier with a realistic detection rate.
- AAWE - perfect: adaptive accuracy-weighted ensemble method using an iErrPs classifier with a perfect detection rate.
- DWEC: dynamic classifiers weighting according to classes' centers.
- DWEN: dynamic classifiers weighting according to the neighborhood.

Base classifiers in the two dynamic classifiers weighting approaches were learned in the same way as in the proposed adaptive accuracy-weighted ensemble method. The update coefficient UC in the proposed approach was set to 0.05 and the neighborhood parameter d in the DWEN approach was set to 0.1. Results show that the two dynamic classifiers weighting approaches perform poorly in classifying data from the target subject in most cases. Their performances are close to that of the static accuracy-weighted ensemble method, which is itself outperformed by the baseline classification method. This is mainly because dynamic classifiers weighting approaches do not take into account the stochastic dependence between trials, which makes them very sensitive to outliers. In the proposed approach, this dependence is taken into account by integrating the classification errors of past trials into the computation of base classifiers' weights for the present trial.

Table 5.11: Comparative results of different classification schemes for left hand vs. right hand motor imagery in data set 2A from BCI competition IV

                  S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
baseline         77.1  49.3  80.6  60.8  45.8  50.7  61.8  88.9  66.0  64.6  15.0
AWE              77.8  44.4  77.7  51.0  62.5  52.8  61.1  85.4  54.2  63.0  14.2
AAWE - realistic 78.5  54.4  89.4  56.2  50.0  54.9  69.4  84.0  75.1  68.0  14.6
AAWE - perfect   84.0  53.5  93.8  65.0  52.1  52.8  76.4  90.3  79.9  72.0  16.5
DWEC             73.6  54.2  63.2  51.0  51.4  68.1  70.1  85.4  54.9  63.5  11.8
DWEN             78.5  60.4  77.7  51.0  50.7  53.5  66.7  84.7  54.2  64.2  13.2

Table 5.12: Comparative results of different classification schemes for left hand vs. feet motor imagery in data set 2A from BCI competition IV

                  S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
baseline         86.7  57.3  79.7  56.6  56.6  53.8  91.6  57.3  72.7  68.1  14.8
AWE              86.0  49.0  54.5  55.9  50.3  50.3  74.1  65.0  69.9  61.7  12.9
AAWE - realistic 90.5  60.4  82.5  65.2  50.2  54.2  81.7  60.4  84.3  69.9  14.8
AAWE - perfect   95.1  71.3  87.4  68.5  49.7  57.3  88.1  64.3  86.7  74.3  15.7
DWEC             72.7  47.6  53.1  58.7  50.3  53.8  64.3  63.6  76.9  60.1  10.0
DWEN             84.6  49.0  53.1  56.6  50.3  51.7  74.1  65.7  69.9  61.7  12.5

Table 5.13: Comparative results of different classification schemes for left hand vs. tongue motor imagery in data set 2A from BCI competition IV

                  S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
baseline         58.3  61.1  75.0  49.0  61.8  57.6  84.7  79.2  88.9  68.4  13.9
AWE              62.5  51.4  59.0  69.2  50.7  54.2  67.4  85.4  84.0  64.9  13.0
AAWE - realistic 86.7  53.2  81.2  68.7  52.1  53.8  81.2  82.7  87.8  71.9  15.2
AAWE - perfect   91.7  56.3  88.9  75.5  59.7  54.9  91.7  91.0  89.6  77.7  16.4
DWEC             52.8  50.0  54.9  70.6  50.0  57.6  58.3  84.7  91.0  63.3  15.3
DWEN             62.5  50.7  58.3  68.5  50.0  55.6  67.4  84.0  84.1  64.6  12.8


Table 5.14: Comparative results of different classification schemes for right hand vs. feet motor imagery in data set 2A from BCI competition IV

                  S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
baseline         89.5  41.3  60.1  66.7  51.0  45.5  42.0  53.1  60.8  56.7  15.1
AWE              59.4  49.7  64.3  50.0  50.3  53.8  56.6  56.6  53.1  54.9   4.9
AAWE - realistic 95.3  62.9  91.0  67.8  49.7  57.3  82.2  68.7  66.9  71.3  15.3
AAWE - perfect   98.6  72.0  97.2  77.8  49.0  60.8  93.0  75.5  77.6  78.0  16.6
DWEC             60.8  50.3  65.7  52.8  50.3  59.4  55.2  58.0  56.6  56.6   5.1
DWEN             58.0  50.3  63.6  52.1  50.3  64.3  56.6  55.9  58.7  56.7   5.2

Table 5.15: Comparative results of different classification schemes for right hand vs. tongue motor imagery in data set 2A from BCI competition IV

                  S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
baseline         97.9  64.6  94.4  56.9  44.4  59.0  76.4  76.4  58.3  69.8  17.9
AWE              82.6  56.9  63.2  61.8  52.1  47.2  63.2  76.4  69.4  63.7  11.2
AAWE - realistic 92.8  55.8  84.4  69.3  50.0  51.0  86.9  82.0  69.4  71.3  16.2
AAWE - perfect   95.8  58.3  95.8  72.2  48.6  51.4  94.4  87.5  71.5  75.1  19.2
DWEC             51.4  50.0  60.4  54.9  50.0  59.7  54.2  70.8  63.2  57.2   7.0
DWEN             51.4  50.7  63.9  61.8  50.0  53.5  56.3  75.7  68.0  59.0   8.9

Table 5.16: Comparative results of different classification schemes for feet vs. tongue motor imagery in data set 2A from BCI competition IV

                  S1    S2    S3    S4    S5    S6    S7    S8    S9   Mean  Std.
baseline         56.6  52.4  64.3  50.0  49.0  55.2  58.7  78.3  83.9  61.0  12.4
AWE              56.6  51.0  60.1  49.3  49.7  51.0  56.6  76.9  74.1  58.4  10.4
AAWE - realistic 53.8  61.0  68.5  53.5  49.9  49.3  58.4  74.1  72.7  60.1   9.6
AAWE - perfect   49.7  66.4  80.4  53.5  49.0  46.2  68.5  80.4  81.1  63.9  14.7
DWEC             53.8  49.7  57.3  43.0  49.7  49.0  55.9  77.6  78.3  57.2  12.5
DWEN             56.6  50.3  60.1  49.3  49.7  49.7  56.6  76.2  74.1  58.1  10.5

Table 5.17 shows comparative results for the binary classification data set from the BNCI Horizon 2020 project. As for the other data sets, the dynamic classifiers weighting approaches did not achieve good classification performance. The adaptive accuracy-weighted ensemble method increased classification performance for many subjects, but no firm conclusions can be drawn from this data set because many subjects were naïve BCI users for whom none of the investigated classification approaches achieved an accuracy higher than chance level.


Table 5.17: Comparative results of different classification schemes for the binary classification data set from BNCI Horizon 2020 project

             baseline     AWE          AAWE - realistic  AAWE - perfect  DWEC         DWEN
Subject 1    38.3         53.3         52.7              53.3            50.0         53.3
Subject 2    56.7         70.0         69.5              75.0            73.3         70.0
Subject 3    96.6         68.3         83.2              91.7            61.7         58.3
Subject 4    56.7         51.7         61.8              63.3            50.0         51.7
Subject 5    51.7         66.7         67.8              80.0            51.6         66.7
Subject 6    60.0         81.7         78.7              81.6            80.0         80.0
Subject 7    56.6         66.6         66.6              70.0            78.3         66.6
Subject 8    50.0         50.0         54.3              61.8            58.3         50.0
Subject 9    56.5         83.3         86.0              93.3            80.0         83.3
Subject 10   48.3         50.0         53.3              51.7            50.0         50.0
Subject 11   56.6         50.0         53.5              58.3            50.0         50.0
Subject 12   50.0         50.1         54.5              53.3            50.0         50.0
Subject 13   53.3         55.0         53.0              50.0            55.0         55.0
Subject 14   53.3         50.0         56.7              60.0            51.7         50.0
Average      56.0 ± 12.8  60.5 ± 12.1  63.7 ± 11.9       67.4 ± 14.7     60.0 ± 12.4  59.6 ± 11.7

5.3.2.3 Analysis of transferability between subjects

In order to analyze the transferability of classification models learned using data from different subjects, the behavior of base classifiers' weights was investigated. Figure 5.23 illustrates the normalized base classifiers' weights at the beginning of the feedback phase for different target subjects in the left hand vs. right hand motor imagery data set from BCI competition IV. Several conclusions about the transferability of classification models between subjects can be drawn from this figure. First, most of the classification models that generalize well are learned using EEG signals filtered in the 8-30 Hz frequency band, which suggests that subject-independent information is more likely to be embedded in wide bands of the spectrum. Furthermore, classification models that generalize well are not necessarily learned using data from well-performing subjects. For example, although its test performance is not high (68.8%), the classification model trained on EEG signals recorded from subject 4 and filtered in the 8-30 Hz frequency band generalizes well across subjects. Conversely, the classification model learned using EEG signals recorded from subject 3 and filtered in the same frequency band does not generalize well across subjects, even though its test performance is higher than 90%.


[Figure: one panel per target subject (subjects 1-9); each panel maps the normalized weight of every base classifier by frequency band (8-12 Hz through 26-30 Hz, plus 8-30 Hz) and source subject (S1-S9, excluding the target), at the beginning of the feedback phase.]

Figure 5.23: Normalized base classifiers’ weights at the beginning of feedback phase for different target subjects in left hand vs. right hand motor imagery data set from BCI competition IV

Base classifiers' weights at the end of the feedback phase of each target subject in the same binary classification task are shown in figure 5.24. As expected, the structure of the ensemble changes considerably in most cases. Classification models that initially generalize well across subjects may disappear from the ensemble, while other classification models that perform well in classifying data from the target subject may appear. This is mainly due to changes in data distribution within the same session. Online adaptation of base classifiers' weights tracks these changes and maintains good classification performance throughout the session.


[Figure: one panel per target subject (subjects 1-9); each panel maps the normalized weight of every base classifier by frequency band and source subject at the end of the feedback phase.]

Figure 5.24: Normalized base classifiers’ weights at the end of feedback phase for different target subjects in left hand vs. right hand motor imagery data set from BCI competition IV

Finally, figure 5.25 illustrates the initial states of the accuracy-weighted ensemble for all binary classification tasks in data set 2A from BCI competition IV when subject 1 is considered as the target subject. As we can see, transferability between subjects is task-dependent, and no single subject generalizes well for all binary classification tasks.


[Figure: one panel per binary classification task (left hand - right hand, left hand - feet, left hand - tongue, right hand - feet, right hand - tongue, feet - tongue); each panel maps the initial normalized weight of every base classifier by frequency band and source subject (S2-S9) when subject 1 is the target.]

Figure 5.25: Initial states of the accuracy-weighted ensemble in all binary classification tasks in data set 2A from BCI competition IV when subject 1 is considered as target subject

5.4 Conclusion

In this chapter, an online inter-subjects classification framework for MI-based BCIs, called AAWE, was presented. It aims at taking advantage of both knowledge transfer and online adaptation techniques in order to minimize calibration time in BCIs without reducing their decoding performance. Inter-subjects classification ensures a good start for the prediction model; online adaptation avoids negative transfer in cases where the brain activity patterns of the new BCI user are very different from those of other subjects, and ensures the efficiency of the prediction model. Inter-subjects classification is performed using a weighted-average ensemble in which base classifiers are learned using EEG signals recorded from different subjects and weighted according to their accuracy in classifying the brain signals of the target BCI user. In order to manage inter-subjects variability, EEG signals were filtered in different frequency bands and multiple classification models were learned for each subject. Online adaptation was performed by updating base classifiers' weights in a semi-supervised way, based on ensemble predictions reinforced by interaction error-related potentials. Using iErrPs as reinforcement signals minimizes uncertainty about the ensemble's predictions and prevents error accumulation. The proposed online adaptation approach tracks concept changes without requiring a detection mechanism; it is also time-efficient and can be tuned to concept changes with different rates. Experimental evaluation highlighted the importance of combining online adaptation and inter-subjects classification techniques. It also allowed studying the behavior of linear combiners


when performing inter-subjects classification in BCIs. Last but not least, it highlighted the need for new classifier combination techniques for ensemble classification in changing environments, as alternatives to the commonly used dynamic classifiers weighting approaches. Although the proposed online adaptation technique proved efficient for binary classification, it was not effective for multi-class classification, because increasing the number of classes increases the uncertainty about which label can be considered correct when an error-related potential is detected. The use of error-related potentials as reinforcement signals in BCIs is still not a well-studied issue, and their use in multi-class BCI systems is worth investigating in future work.


6 Conclusion

In this chapter, I summarize the main contributions of this thesis along with their advantages and limitations. I then discuss some important topics related to pattern classification in BCIs, and finish the chapter by discussing the proposed methods in a general context beyond BCI applications, making a link with some areas of research in machine learning.

6.1 Summary of contributions

BCI technology is still challenged by many problems that limit its use in daily life. Reducing calibration time while maintaining good classification accuracy has been one of the most challenging problems in the BCI research community in recent years. This problem is particularly present in active BCIs, in which the user voluntarily modulates his or her brain rhythms in order to induce different brain states. Motor imagery-based BCIs are the most investigated in this category because they offer a natural way of interaction, especially for severely disabled users. Although many machine learning techniques have been attempted, reducing calibration time while maintaining good classification accuracy in BCIs is still not a solved problem (Lotte, 2015). In this thesis, a thorough investigation of supervised machine learning techniques in motor imagery-based BCIs was performed. Important points that have not been addressed in previous work, and that should be taken into consideration in future BCI systems, are highlighted, and new methods are proposed. In the first contribution of this thesis, an analysis of linear combinations of classifiers and their use in knowledge transfer-based BCI systems was performed. A novel ensemble-based knowledge transfer framework for reducing calibration time in MI-based BCIs was proposed. The effectiveness of the classifiers weighting scheme used in this framework was assessed using simulations and real EEG data. Issues such as the effects of the non-stationarity of EEG signals and of the size of the calibration set on classification performance were also investigated. Furthermore, situations in which knowledge transfer has a negative impact on the target learning task were studied, in order to determine to what extent inter-subjects and/or inter-sessions classification are useful in BCI applications.

Chapter 6: Conclusion

Other issues, such as assessing to what extent knowledge transfer-based BCI systems are "robust" to the non-stationarity of brain signals within the same session, should be investigated in future work.

The second contribution of this thesis aimed at highlighting the advantages of classification algorithms that combine both knowledge transfer and online adaptation in BCIs. Knowledge transfer allows good classification accuracy to be reached rapidly, without the need for a long calibration phase, while online adaptation avoids negative transfer and guarantees the accuracy of the classification model. An inter-subjects classification framework was proposed in which classification is performed by a weighted average ensemble of classifiers learned from data recorded from different subjects, and online adaptation is performed by updating the base classifiers' weights in a semi-supervised way, based on ensemble predictions reinforced by interaction error-related potentials. Experimental evaluation on several binary classification data sets showed the effectiveness of the proposed framework. For multi-class classification, the proposed online adaptation technique was not effective because it is ambiguous which class label should be considered correct when an error-related potential is detected.

The proposed framework presents some limitations that should be addressed in future work. The first is that no outlier detection and removal mechanism was used in this study. In realistic settings, handling outliers is important because they may have drastic effects on classification performance, especially for adaptive classification algorithms. The main limitation of this work is that all experiments were conducted offline, using benchmark EEG data sets that did not allow real interaction error-related potentials to be detected.
Online experiments with real iErrPs detection should be conducted in future work in order to assess the online performance of the proposed methods. In order to ensure that the data sample used for knowledge transfer is representative, the number of subjects recruited for the experiment should be as large as possible, and experimental conditions should be varied between subjects in order to mimic realistic conditions. For evaluating the online adaptation technique based on ensemble predictions reinforced by error-related potentials, the feedback phase should be long enough to assess the long-term accuracy of the adaptation. It was not possible to accomplish this task because this thesis was conducted in the context of a BCI project that is still in its early stages. As stated before, conceiving a BCI and conducting experiments with humans is a complex task that requires the efforts of a multidisciplinary team.
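The semi-supervised adaptation idea can be sketched as a multiplicative update of the base classifiers' weights after each feedback trial. The exponential update rule and the learning rate `eta` are illustrative choices rather than the exact rule used in this thesis; the sketch also makes explicit why the scheme only works unambiguously in the binary case.

```python
import numpy as np

def update_weights(weights, base_preds, ensemble_pred, errp_detected, eta=0.1):
    """Semi-supervised multiplicative weight update for a binary ensemble.

    base_preds:    per-classifier predictions in {-1, +1} for the last trial.
    ensemble_pred: the label output by the ensemble (shown as feedback).
    errp_detected: True if an interaction error-related potential followed
                   the feedback, i.e. the feedback is assumed wrong; the
                   inferred true label is then the opposite class (this
                   inference is only unambiguous with two classes).
    """
    inferred = -ensemble_pred if errp_detected else ensemble_pred
    agree = np.asarray(base_preds) == inferred
    # reward classifiers that agree with the inferred label, penalize the rest
    new_w = np.asarray(weights, dtype=float) * np.exp(eta * np.where(agree, 1.0, -1.0))
    return new_w / new_w.sum()
```

When no iErrP is detected, the ensemble's own output serves as a pseudo-label, so no true labels are ever required during the feedback phase.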

6.2 Discussions

Beyond the issues discussed in this thesis, some important points related to BCI research in general, and to pattern classification in BCIs specifically, should be discussed. Last but not least, a link between the proposed methods and some areas of machine learning research should be established.

6.2.1 Next generation brain-computer interfaces

6.2.1.1 Towards a new generation of pattern classification paradigms in BCIs

In EEG-based BCIs in general, and in motor imagery-based BCIs in particular, the learning problem is always divided into at least two stages. In the first stage, feature extraction is performed using methods such as common spatial patterns (CSP) or independent component analysis (ICA). In MI-based BCIs, the CSP algorithm is the gold standard for feature extraction, and all the work presented in this thesis is based on this method. In the second stage, the extracted features are fed to the classification algorithm in order to learn to assign different class labels to different brain activity patterns. This decomposition of the learning problem is not consistent with statistical learning theory, which advocates that "one should solve the classification problem directly and never solve a more general problem as an intermediate step" (Vapnik, 2000). In this spirit, new methods have started to be investigated in order to unify the preprocessing, feature extraction and classification steps in BCI systems. Tomioka and Müller (2010) proposed a framework for EEG signal analysis that unifies tasks such as feature extraction, feature selection, feature combination, and classification under a regularized empirical risk minimization problem. This framework was successfully applied to P300-based BCIs. A very promising classification framework based on recent advances in Riemannian geometry was proposed in (Congedo et al., 2013). Riemannian geometry allows working directly on sample covariance matrices in the space of symmetric positive-definite (SPD) matrices. The proposed classification algorithm is generic and applies to different types of BCIs, such as MI-based, SSVEP-based and P300-based BCIs. It is also a good candidate for the next generation of knowledge transfer-based BCI systems.
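To illustrate why the Riemannian framework is attractive, the following sketch implements a minimum-distance-to-mean (MDM) classifier operating directly on trial covariance matrices, with no separate spatial filtering or feature extraction step. The affine-invariant distance is standard; the log-Euclidean class mean is used here as a cheap surrogate for the true Riemannian geometric mean, which is computed iteratively in practice.

```python
import numpy as np

def riemann_dist(A, B):
    """Affine-invariant Riemannian distance between SPD matrices:
    d(A, B) = ||log(B^{-1/2} A B^{-1/2})||_F."""
    w, V = np.linalg.eigh(B)
    B_inv_sqrt = (V / np.sqrt(w)) @ V.T
    M = B_inv_sqrt @ A @ B_inv_sqrt
    return np.sqrt(np.sum(np.log(np.linalg.eigvalsh(M)) ** 2))

def log_euclidean_mean(covs):
    """Cheap surrogate for the Riemannian geometric mean of SPD matrices."""
    logs = []
    for C in covs:
        w, V = np.linalg.eigh(C)
        logs.append((V * np.log(w)) @ V.T)
    S = np.mean(logs, axis=0)
    w, V = np.linalg.eigh(S)
    return (V * np.exp(w)) @ V.T

def mdm_fit(trials, labels):
    """trials: list of (channels x samples) EEG epochs.
    Returns one mean covariance matrix per class."""
    covs = [np.cov(X) for X in trials]
    return {c: log_euclidean_mean([S for S, y in zip(covs, labels) if y == c])
            for c in sorted(set(labels))}

def mdm_predict(means, X):
    """Assign the class whose mean covariance is closest to the trial's."""
    S = np.cov(X)
    return min(means, key=lambda c: riemann_dist(S, means[c]))
```

Because the whole pipeline reduces to computing covariance matrices and distances between them, it transfers naturally across BCI paradigms, which is precisely the genericity argued for above.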

6.2.1.2 Towards a new generation of experimental paradigms in BCIs

Many open issues related to experimental paradigms in BCIs need to be investigated in order to bring this technology out of the lab and generalize its use across different categories of users. In fact, most of today's BCI experiments are performed in controlled environments that do not allow assessing to what extent this technology can be used in realistic interaction settings. In everyday life, a BCI must work while the user performs different tasks simultaneously and freely interacts with his environment (Lance et al., 2012). These constraints are not taken into consideration in today's BCI systems. Although some experiments in pseudo-realistic environments have been attempted (Brandl et al., 2015; Leeb et al., 2011), this issue is worth investigating because most machine learning techniques attempted so far may fail to reliably decode brain activity patterns in everyday life conditions. Another important issue is performing experiments with disabled persons. Although BCIs were originally meant to provide new means of communication to severely disabled persons, most of today's BCI experiments target able-bodied populations of users (Ang et al., 2011). Last but not least, BCI illiteracy is still the most challenging problem in today's BCI research (Vidaurre et al., 2011c). New experimental paradigms should be investigated in order to allow users to take control of their brain rhythms. Taking advantage of recent advances in human-computer interaction may be a promising direction (Tan, 2006).

6.2.1.3 Towards a new generation of neuroimaging techniques

Signal processing and machine learning techniques only aim at improving the quality of the recorded signals and extracting relevant information from them. If the recorded signals do not carry enough relevant information, even the most powerful techniques may fail to decode them. Thus, extracting "good" signals is the cornerstone of BCI technology. In order to be used in everyday life, BCI systems must rely on neuroimaging techniques that are portable, free of side effects and able to record good quality signals. None of today's neuroimaging techniques satisfies all of these conditions, but looking at recent achievements in neurotechnology, one may say such a feat is around the corner (Lance et al., 2012).

6.2.2 Next generation machine learning algorithms

In recent years, different trends in machine learning research have attracted much attention. Among them, transfer learning is considered one of the most important (Pan and Yang, 2010). In supervised learning, knowledge transfer allows data labelling efforts to be reduced by taking advantage of abundant labeled data from different domains that may be distributed differently. In order to successfully perform knowledge transfer between different learning tasks, ensemble methods have been extensively used in the literature (Dai et al., 2007; Samdani and Yih, 2011; Gao et al., 2008; Eaton and DesJardins, 2006; Acharya et al., 2012; Kamishima et al., 2009). Another important direction in today's machine learning research is online learning (Cornuéjols, 2010). Unlike traditional machine learning techniques, which assume that training data are available at the beginning of the learning task, online learning considers situations in which training data arrive sequentially and the learning process continues during the entire learning task. Due to their efficiency and scalability, online learning techniques allow computational costs to be reduced and the learning system to be adapted to changes in data distribution.
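The sequential nature of online learning can be sketched with a minimal linear classifier updated by stochastic gradient descent on the logistic loss: each labeled trial is processed once, as it arrives, so the model can track slow changes in the data distribution. The class name, learning rate and loss are illustrative choices, not a method from this thesis.

```python
import numpy as np

class OnlineLinearClassifier:
    """Minimal online learner: logistic-loss SGD, one trial at a time.
    Training never stops, so the model can follow slow drifts in the
    feature distribution (e.g. EEG non-stationarity)."""

    def __init__(self, n_features, lr=0.05):
        self.w = np.zeros(n_features)
        self.lr = lr

    def predict(self, x):
        return 1 if self.w @ x >= 0 else -1

    def update(self, x, y):  # y in {-1, +1}
        margin = y * (self.w @ x)
        # gradient of log(1 + exp(-y * w.x)) with respect to w
        grad = -y * x / (1.0 + np.exp(margin))
        self.w -= self.lr * grad
```

Contrast this with batch training: there is no stored training set and no retraining step, only a constant-time update per incoming trial.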

Although they have been successfully used in different application fields, knowledge transfer and online learning techniques have evolved separately. As stated in (Zhao and Hoi, 2010), "most existing work on transfer learning were often studied in an offline learning fashion, which has to assume that training data in the new domain is given a priori". Furthermore, even if it can be considered a form of adaptive transfer learning, because the distribution of data may change during the learning task (Jaber, 2013), online learning does not allow taking advantage of abundant labeled data from different sources and different domains. Today, many application fields of machine learning require learning systems that take advantage of both knowledge transfer across domains and tasks and online learning during the same task. BCI systems are among the application fields in which different machine learning techniques have to be combined in order to overcome challenging problems such as long calibration time and the non-stationarity of brain signals.

Bibliography

Abibullaev, B., An, J., Lee, S. H., & Moon, J. I. (2013). A Study on Stroke Rehabilitation through Task-Oriented Control of a Haptic Device via Near-Infrared Spectroscopy-Based BCI. arXiv preprint arXiv:1308.4017. Acharya, A., Hruschka, E. R., Ghosh, J., & Acharyya, S. (2012). Transfer Learning with Cluster Ensembles. ICML Unsupervised and Transfer Learning, 27, 123-132. Ahn, M., Cho, H., & Jun, S. C. (2011). Calibration time reduction through source imaging in brain computer interface (BCI). In HCI International 2011–Posters' Extended Abstracts (pp. 269-273). Springer Berlin Heidelberg. Alamgir, M., Grosse-Wentrup, M., & Altun, Y. (2010). Multitask learning for brain-computer interfaces. In International Conference on Artificial Intelligence and Statistics (pp. 17-24). Alpaydin, E. (2010). Introduction to Machine Learning (Second edition). Massachusetts: The MIT Press. Ang, K. K., Guan, C., Chua, K. S. G., Ang, B. T., Kuah, C. W. K., Wang, C., ... & Zhang, H. (2011). A large clinical study on the ability of stroke patients to use an EEG-based motor imagery brain-computer interface. Clinical EEG and Neuroscience, 42(4), 253-258. Artusi, X., Niazi, I. K., Lucas, M. F., & Farina, D. (2011). Performance of a simulated adaptive BCI based on experimental classification of movement-related and error potentials. IEEE Journal on Emerging and Selected Topics in Circuits and Systems, 1(4), 480-488. Baillet, S., Mosher, J. C., & Leahy, R. M. (2001). Electromagnetic brain mapping. IEEE Signal Processing Magazine, 18(6), 14-30. Bamdadian, A., Guan, C., Ang, K. K., & Xu, J. (2015). Towards improvement of MI-BCI performance of subjects with BCI deficiency. In 7th International IEEE/EMBS Conference on Neural Engineering (NER), 2015 (pp. 17-20). Başar, E., Başar-Eroglu, C., Karakaş, S., & Schürmann, M. (2001). Gamma, alpha, delta, and theta oscillations govern cognitive processes. International Journal of Psychophysiology, 39(2), 241-248.

Bashashati, A., Fatourechi, M., Ward, R. K., & Birch, G. E. (2007). A survey of signal processing algorithms in brain–computer interfaces based on electrical brain signals. Journal of Neural engineering, 4(2), R32. Ben-David, S., & Schuller, R. (2003). Exploiting task relatedness for multiple task learning. In Learning Theory and Kernel Machines (pp. 567-580). Springer Berlin Heidelberg. Berger, H. (1929). Über das elektrenkephalogramm des menschen. European Archives of Psychiatry and Clinical Neuroscience, 87(1), 527-570. Bießmann, F., Plis, S., Meinecke, F. C., Eichele, T., & Müller, K. R. (2011). Analysis of multimodal neuroimaging data. IEEE Reviews in Biomedical Engineering, 4, 26-58. Biggio, B., Fumera, G., & Roli, F. (2007). Bayesian analysis of linear combiners. In Multiple Classifier Systems (pp. 292-301). Springer Berlin Heidelberg. Biggio, B., Fumera, G., & Roli, F. (2009). Bayesian Linear Combination of Neural Networks. In Innovations in Neural Information Paradigms and Applications (pp. 201-230). Springer Berlin Heidelberg. Birbaumer, N., & Cohen, L. G. (2007). Brain–computer interfaces: communication and restoration of movement in paralysis. The Journal of physiology, 579(3), 621-636. Blankertz, B. (2008). BCI Competition IV website: http://www.bbci.de/competition/iv/ Blankertz, B., Dornhege, G., Krauledat, M., Müller, K. R., Kunzmann, V., Losch, F., & Curio, G. (2006). The Berlin Brain-Computer Interface: EEG-based communication without subject training. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2), 147-152. Blankertz, B., Lemm, S., Treder, M., Haufe, S., & Müller, K. R. (2011). Single-trial analysis and classification of ERP components-a tutorial. NeuroImage, 56(2), 814-825. Blankertz, B., Tomioka, R., Lemm, S., Kawanabe, M., & Muller, K. R. (2008). Optimizing spatial filters for robust EEG single-trial analysis. IEEE Signal Processing Magazine, 25(1), 41-56. Blum, A. S. and Rutkove, S. B (2007). 
The Clinical Neurophysiology Primer. Totowa, NJ: Humana Press. Blumberg, J., Rickert, J., Waldert, S., Schulze-Bonhage, A., Aertsen, A., & Mehring, C. (2007). Adaptive classification for brain computer interfaces. In 29th Annual International Conference of the IEEE EMBS Engineering in Medicine and Biology Society, 2007. (pp. 2536-2539). Boyd, S., & Vandenberghe, L. (2004). Convex optimization. Cambridge university press.

Brandl, S., Hohne, J., Muller, K. R., & Samek, W. (2015). Bringing BCI into everyday life: Motor imagery in a pseudo realistic environment. In 7th International IEEE/EMBS Conference on Neural Engineering (NER), 2015 (pp. 224-227). Buttfield, A., & Millán, J. D. R. (2006). Online classifier adaptation in brain-computer interfaces (No. EPFL-REPORT-85978). IDIAP. Chavarriaga, R., Sobolewski, A., & Millán, J. D. R. (2014). Errare machinale est: the use of error-related potentials in brain-machine interfaces. Frontiers in neuroscience, 8. Cheung, K. C. (2007). Implantable microscale neural interfaces. Biomedical microdevices, 9(6), 923-938. Cincotti, F., Mattia, D., Aloise, F., Bufalari, S., Schalk, G., Oriolo, G., ... & Babiloni, F. (2008). Non-invasive brain–computer interface system: towards its application as assistive technology. Brain research bulletin, 75(6), 796-803. Cohen, D. (1968). Magnetoencephalography: evidence of magnetic fields produced by alpha-rhythm currents. Science, 161(3843), 784-786. Coleman, T., Branch, M. A., & Grace, A. (1999). Optimization Toolbox for Use with MATLAB: User's Guide, Version 2. Math Works, Incorporated. Congedo, M., Barachant, A., & Andreev, A. (2013). A New Generation of Brain-Computer Interface Based on Riemannian Geometry. arXiv preprint arXiv:1310.8115. Cornuéjols, A. (2010). On-line learning: where are we so far?. In Ubiquitous knowledge discovery (pp. 129-147). Springer Berlin Heidelberg. Coyle, S., Ward, T., Markham, C., & McDarby, G. (2004). On the suitability of near-infrared (NIR) systems for next-generation brain–computer interfaces. Physiological measurement, 25(4), 815. Cui, X., Bray, S., Bryant, D. M., Glover, G. H., & Reiss, A. L. (2011). A quantitative comparison of NIRS and fMRI across multiple cognitive tasks. Neuroimage, 54(4), 2808-2821. Dai, W., Yang, Q., Xue, G. R., & Yu, Y. (2007, June). Boosting for transfer learning. In Proceedings of the 24th international conference on Machine learning (pp. 193-200). Daly, J. J., & Wolpaw, J. R. (2008). Brain–computer interfaces in neurological rehabilitation. The Lancet Neurology, 7(11), 1032-1043.

Derosière, G. (2014). Disentangling the neural correlates of attention: from cognitive neuroscience to cognitive engineering. (Doctoral dissertation, University of Montpellier 1, France). Devlaminck, D., Wyns, B., Grosse-Wentrup, M., Otte, G., & Santens, P. (2011). Multisubject learning for common spatial patterns in motor-imagery BCI. Computational intelligence and neuroscience, 2011, 8. Didaci, L., Fumera, G., & Roli, F. (2013). Diversity in classifier ensembles: Fertile concept or dead end?. In Multiple Classifier Systems (pp. 37-48). Springer Berlin Heidelberg. Douglas Fields, R. (2009). The Other Brain: The Scientific and Medical Breakthroughs That Will Heal Our Brains and Revolutionize Our Healths. New York: Simon & Schuster. Eason, R. G. (1981). Visual evoked potential correlates of early neural filtering during selective attention. Bulletin of the Psychonomic Society, 18(4), 203-206. Eaton, E., & DesJardins, M. (2006). Knowledge transfer with a multiresolution ensemble of classifiers. In ICML Workshop on Structural Knowledge Transfer for Machine Learning. Erdogan, H., & Sen, M. U. (2010). A unifying framework for learning the linear combiners for classifier ensembles. In 20th International Conference on Pattern Recognition (ICPR), 2010 (pp. 2985-2988). Farwell, L. A., & Donchin, E. (1988). Talking off the top of your head: toward a mental prosthesis utilizing event-related brain potentials. Electroencephalography and clinical Neurophysiology, 70(6), 510-523. Fazli, S., Grozea, C., Danóczy, M., Blankertz, B., Popescu, F., & Müller, K. R. (2009). Subject independent EEG-based BCI decoding. In Advances in Neural Information Processing Systems (pp. 513-521). Fazli, S., Mehnert, J., Steinbrink, J., Curio, G., Villringer, A., Müller, K. R., & Blankertz, B. (2012).

Enhanced performance by a hybrid NIRS–EEG brain computer interface. Neuroimage, 59(1), 519-529. Ferrez, P. W., & Millán, J. D. R. (2008). Error-related EEG potentials generated during simulated brain–computer interaction. IEEE Transactions on Biomedical Engineering, 55(3), 923-929. Fumera, G., & Roli, F. (2003). Linear combiners for classifier fusion: Some theoretical and experimental results. In Multiple Classifier Systems (pp. 74-83). Springer Berlin Heidelberg.

Fumera, G., & Roli, F. (2005). A theoretical and experimental analysis of linear combiners for multiple

classifier systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6), 942-956. Galán, F., Nuttin, M., Lew, E., Ferrez, P. W., Vanacker, G., Philips, J., & Millán, J. D. R. (2008). A brain-actuated wheelchair: asynchronous and non-invasive brain–computer interfaces for continuous control of robots. Clinical Neurophysiology, 119(9), 2159-2169. Gama, J., Žliobaitė, I., Bifet, A., Pechenizkiy, M., & Bouchachia, A. (2014). A survey on concept drift adaptation. ACM Computing Surveys (CSUR), 46(4), 44. Gao, J., Fan, W., Jiang, J., & Han, J. (2008). Knowledge transfer via multiple model local structure mapping. In Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 283-291). Grizou, J., Iturrate, I., Montesano, L., Oudeyer, P. Y., & Lopes, M. (2014). Calibration-free BCI based control. In Twenty-Eighth AAAI Conference on Artificial Intelligence (pp. 1-8). Grosse-Wentrup, M., Liefhold, C., Gramann, K., & Buss, M. (2009). Beamforming in noninvasive brain–computer interfaces. IEEE Transactions on Biomedical Engineering, 56(4), 1209-1219. Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning: data mining, inference and prediction. New York: Springer-Verlag, 1(8), 371-406. Heger, D., Putze, F., Herff, C., & Schultz, T. (2013). Subject-to-subject transfer for CSP based BCIs: Feature space transformation and decision-level fusion. In 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2013 (pp. 5614-5617). Herrmann, C. S. (2001). Human EEG responses to 1–100 Hz flicker: resonance phenomena in visual cortex and their potential correlation to cognitive phenomena. Experimental brain research, 137(3-4), 346-353. Hinterberger, T., Schmidt, S., Neumann, N., Mellinger, J., Blankertz, B., Curio, G., & Birbaumer, N. (2004). Brain-computer communication and slow cortical potentials. IEEE Transactions on Biomedical Engineering, 51(6), 1011-1018. Ho, T. K. (1998).
The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8), 832-844.

Hochberg, L. R., Serruya, M. D., Friehs, G. M., Mukand, J. A., Saleh, M., Caplan, A. H., ... & Donoghue, J. P. (2006). Neuronal ensemble control of prosthetic devices by a human with tetraplegia. Nature, 442(7099), 164-171. Jaber, G. (2013). An approach for online learning in the presence of concept change (Doctoral dissertation, Université Paris Sud 11). Jasper, H., & Penfield, W. (1949). Zur Deutung des normalen Elektrencephalogramms und selner Veriinderungen. Electroeorticograms in man: effect of voluntary movement upon the electrical activity of the preentral gyrus. Arch. Psychiatr. Z. Neurol, 174, 163-174. Jobsis, F. F. (1977). Noninvasive, infrared monitoring of cerebral and myocardial oxygen sufficiency and circulatory parameters. Science, 198(4323), 1264-1267. Kamishima, T., Hamasaki, M., & Akaho, S. (2009). TrBagg: A simple transfer learning method and its application to personalization in collaborative tagging. In Ninth IEEE International Conference on Data Mining, 2009. ICDM'09. (pp. 219-228). Kang, H., & Choi, S. (2011). Bayesian multi-task learning for common spatial patterns. In International Workshop on Pattern Recognition in NeuroImaging (PRNI), 2011 (pp. 61-64). Kang, H., Nam, Y., & Choi, S. (2009). Composite common spatial pattern for subject-to-subject transfer. IEEE Signal Processing Letters, 16(8), 683-686. Krauledat, M. (2008). Analysis of nonstationarities in EEG signals for improving brain-computer interface performance (Doctoral dissertation, Berlin Institute of Technology). Krauledat, M., Schröder, M., Blankertz, B., & Müller, K. R. (2006). Reducing calibration time for brain-computer interfaces: A clustering approach. In Advances in Neural Information Processing Systems (pp. 753-760). Krauledat, M., Tangermann, M., Blankertz, B., & Müller, K. R. (2008). Towards zero training for brain-computer interfacing. Plos One. DOI: 10.1371/journal.pone.0002967 Kuncheva, L. I. (2004a). Classifier ensembles for changing environments. 
In Multiple classifier systems (pp. 1-15). Springer Berlin Heidelberg. Kuncheva, L. I. (2004b). Combining pattern classifiers: methods and algorithms. John Wiley & Sons.

Kuncheva, L. I., Whitaker, C. J., & Narasimhamurthy, A. (2008). A case-study on naïve labelling for the nearest mean and the linear discriminant classifiers. Pattern Recognition, 41(10), 3010-3020. Lal, T. N., Schröder, M., Hill, N. J., Preissl, H., Hinterberger, T., Mellinger, J., ... & Schölkopf, B. (2005). A brain computer interface with online feedback based on magnetoencephalography. In Proceedings of the 22nd international conference on Machine learning (pp. 465-472). Lance, B. J., Kerick, S. E., Ries, A. J., Oie, K. S., & McDowell, K. (2012). Brain–Computer interface technologies in the coming decades. Proceedings of the IEEE, 100(Special Centennial Issue), 1585-1599. Leeb, R., Al-Khodairy, A., Biasiucci, A., Perdikis, S., Tavella, M., Tonin, L., ... & J. d. R. Millán. (2011). Are we ready? Issues in transferring BCI technology from experts to users. na. Leinweber, M., Zmarz, P., Buchmann, P., Argast, P., Hübener, M., Bonhoeffer, T., & Keller, G. B. (2014). Two-photon Calcium Imaging in Mice Navigating a Virtual Reality
Environment. JoVE (Journal of Visualized Experiments), (84), e50885-e50885. Lemm, S., Blankertz, B., Dickhaus, T., & Müller, K. R. (2011). Introduction to machine learning for brain imaging. Neuroimage, 56(2), 387-399. Leuthardt, E. C., Miller, K. J., Schalk, G., Rao, R. P., & Ojemann, J. G. (2006). Electrocorticography-based brain computer interface-the Seattle experience. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 14(2), 194-198. Li, Y., & Guan, C. (2006). An extended EM algorithm for joint feature extraction and classification in brain-computer interfaces. Neural Computation, 18(11), 2730-2761. Lindig-León, C., & Bougrain, L. (2014). A comparison between different Multiclass Common Spatial Pattern approaches for identification of motor imagery tasks. In 6th International Braincomputer interface conference 2014. Liyanage, S. R., Guan, C., Zhang, H., Ang, K. K., Xu, J., & Lee, T. H. (2013). Dynamically weighted ensemble classification for non-stationary EEG processing. Journal of neural engineering, 10(3), 036007. Llera, A., van Gerven, M. A., Gómez, V., Jensen, O., & Kappen, H. J. (2011). On the use of interaction error potentials for adaptive brain computer interfaces. Neural Networks, 24(10), 1120-1127.

Lotte, F. (2008). Study of electroencephalographic signal processing and classification techniques towards the use of brain-computer interfaces in virtual reality applications (Doctoral dissertation, INSA de Rennes). Lotte, F. (2014). A Tutorial on EEG Signal-processing Techniques for Mental-state Recognition in Brain–Computer Interfaces. In Guide to Brain-Computer Music Interfacing (pp. 133-161). Springer London. Lotte, F. (2015). Signal Processing Approaches to Minimize or Suppress Calibration Time in Oscillatory Activity-Based Brain–Computer Interfaces. In Proceedings of the IEEE, 103(6). Lotte, F., & Guan, C. (2010). Learning from other subjects helps reducing brain-computer interface calibration time. In IEEE International Conference on Acoustics Speech and Signal Processing (ICASSP), 2010 (pp. 614-617). Lotte, F., & Guan, C. (2011). Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms. IEEE Transactions on Biomedical Engineering, 58(2), 355362. Lotte, F., Congedo, M., Lécuyer, A., & Lamarche, F. (2007). A review of classification algorithms for EEG-based brain–computer interfaces. Journal of neural engineering, 4. Machado, S., Almada, L. F., & Annavarapu, R. N. (2013). Progress and Prospects in EEG-Based Brain-Computer Interface: Clinical Applications in Neurorehabilitation. Journal of Rehabilitation Robotics, 1(1), 28-41. Margalit, E., Weiland, J. D., Clatterbuck, R. E., Fujii, G. Y., Maia, M., Tameesh, M., ... & Humayun, M. S. (2003). Visual and electrical evoked response recorded from subdural electrodes implanted above the visual cortex in normal dogs under two methods of anesthesia. Journal of neuroscience methods, 123(2), 129-137. Millán, J. D. R., Rupp, R., Müller-Putz, G. R., Murray-Smith, R., Giugliemma, C., Tangermann, M., ... & Mattia, D. (2010). Combining brain–computer interfaces and assistive technologies: state-of-the-art and challenges. Frontiers in neuroscience, 4. Nicolas-Alonso,

L. F., & Gomez-Gil, J. (2012). Brain computer interfaces, a review. Sensors, 12(2), 1211-1279. Obermaier, B., Muller, G. R., & Pfurtscheller, G. (2003). "Virtual keyboard" controlled by spontaneous EEG activity. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11(4), 422-426.

Oskoei, M. A., Gan, J. Q., & Hu, H. (2009). Adaptive schemes applied to online SVM for BCI data classification. In 31st Annual International Conference of the IEEE EMBS (pp. 2600-2603). Pan, S. J., & Yang, Q. (2010). A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10), 1345-1359. Pfurtscheller, G., & Neuper, C. (2001). Motor imagery and direct brain-computer communication. Proceedings of the IEEE, 89(7), 1123-1134. Pfurtscheller, G., Brunner, C., Schlögl, A., & Da Silva, F. L. (2006). Mu rhythm (de) synchronization

and EEG single-trial classification of different motor imagery
tasks. Neuroimage, 31(1), 153-159. Pfurtscheller, G., Müller, G. R., Pfurtscheller, J., Gerner, H. J., & Rupp, R. (2003). ‘Thought’– control of functional electrical stimulation to restore hand grasp in a patient with tetraplegia. Neuroscience letters, 351(1), 33-36. Plumpton, C. O. (2012). Online semi-supervised ensemble updates for fmri data. In Partially Supervised Learning (pp. 8-18). Springer Berlin Heidelberg. Plumpton, C. O. (2014). Semi-supervised ensemble update strategies for on-line classification of fMRI data. Pattern Recognition Letters, 37, 172-177. Qin, J., & Li, Y. (2006). An improved semi-supervised support vector machine based translation algorithm for BCI systems. In 18th International Conference on Pattern Recognition, 2006. ICPR 2006. (Vol. 1, pp. 1240-1243). Quionero-Candela, J., Sugiyama, M., Schwaighofer, A., & Lawrence, N. D. (2009). Dataset shift in machine learning. The MIT Press. Roset, S. A., Gonzalez, H. F., & Sanchez, J. C. (2013). Development of an EEG based reinforcement learning Brain-Computer Interface system for rehabilitation. In Annual International Conference of the IEEE Engineering in Medicine and Biology Society (Vol. 2013, pp. 1563-1566). Samdani, R., & Yih, W. T. (2011). Domain adaptation with ensemble of feature groups. In International Joint Conference on Artificial Intelligence. IJCAI 2011. (Vol. 22, No. 1, p. 1458). Samek, W. (2014). On robust spatial filtering of EEG in nonstationary environments (Doctoral dissertation, Berlin Institute of Technology).

Samek, W., Binder, A., & Muller, K. R. (2013a). Multiple kernel learning for brain-computer interfacing. In 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2013 (pp. 7048-7051). Samek, W., Meinecke, F. C., & Muller, K. R. (2013b). Transferring Subspaces Between Subjects in Brain–Computer Interfacing. IEEE Transactions on Biomedical Engineering, 60(8), 2289-2298. Shenoy, P., Krauledat, M., Blankertz, B., Rao, R. P., & Müller, K. R. (2006). Towards adaptive classification for BCI. Journal of neural engineering, 3(1), R13. Sitaram, R., Caria, A., & Birbaumer, N. (2009). Hemodynamic brain–computer interfaces for communication and rehabilitation. Neural Networks, 22(9), 1320-1328. Smith, E., & Delargy, M. (2005). Locked-in syndrome. BMJ: British Medical Journal, 330(7488), 406. Steyrl, D., Scherer, R., Förstner, O., & Müller-Putz, G. R. (2014). Motor Imagery Brain-Computer Interfaces: Random Forests vs Regularized LDA - Non-linear Beats Linear. In 6th International Brain-computer interface conference 2014. Sugiyama, M., Krauledat, M., & Müller, K. R. (2007). Covariate shift adaptation by importance weighted cross validation. The Journal of Machine Learning Research, 8, 985-1005. Sui, J., Adali, T., Yu, Q., Chen, J., & Calhoun, V. D. (2012). A review of multivariate methods for multimodal fusion of brain imaging data. Journal of neuroscience methods, 204(1), 68-81. Sun, S., & Zhou, J. (2014). A review of adaptive feature extraction and classification methods for EEG-based brain-computer interfaces. In International Joint Conference on Neural Networks (IJCNN), 2014 (pp. 1746-1753). Sun, S., Lu, Y., & Chen, Y. (2011). The stochastic approximation method for adaptive Bayesian classifiers: towards online brain–computer interfaces. Neural Computing and Applications, 20(1), 31-40. Tan, D. S. (2006). Brain-Computer Interfaces: applying our minds to human-computer interaction.
Informal proceedings "What is the Next Generation of Human-Computer Interaction?". In Workshop at CHI 2006. Thomas, E., Dyson, M., & Clerc, M. (2013). An analysis of performance evaluation for motor-imagery based BCI. Journal of neural engineering, 10(3), 031001.


Tomioka, R., & Müller, K. R. (2010). A regularized discriminative framework for EEG analysis with application to brain–computer interface. NeuroImage, 49(1), 415-432.
Trans Cranial Technologies ltd. (2012). Cortical Functions Reference. Retrieved January 12, 2015, from https://www.trans-cranial.com/local/manuals/cortical_functions_ref_v1_0_pdf.pdf
Tu, W., & Sun, S. (2012). A subject transfer framework for EEG classification. Neurocomputing, 82, 109-116.
Tumer, K., & Ghosh, J. (1995). Theoretical foundations of linear and order statistics combiners for neural pattern classifiers. IEEE Transactions on Neural Networks.
Tumer, K., & Ghosh, J. (1996). Analysis of decision boundaries in linearly combined neural classifiers. Pattern Recognition, 29(2), 341-348.
Vapnik, V. N. (2000). The nature of statistical learning theory. Statistics for Engineering and Information Science. Springer-Verlag, New York.
Vidal, J. J. (1973). Toward direct brain-computer communication. Annual Review of Biophysics and Bioengineering, 2(1), 157-180.
Vidaurre, C., Cabeza, R., Scherer, R., & Pfurtscheller, G. (2007). Study of on-line adaptive discriminant analysis for EEG-based brain computer interfaces. IEEE Transactions on Biomedical Engineering, 54(3), 550-556.
Vidaurre, C., Kawanabe, M., Von Bunau, P., Blankertz, B., & Muller, K. R. (2011a). Toward unsupervised adaptation of LDA for brain–computer interfaces. IEEE Transactions on Biomedical Engineering, 58(3), 587-597.
Vidaurre, C., Sander, T. H., & Schlögl, A. (2011b). BioSig: the free and open source software library for biomedical signal processing. Computational Intelligence and Neuroscience, 2011.
Vidaurre, C., Sannelli, C., Müller, K. R., & Blankertz, B. (2011c). Machine-learning-based co-adaptive calibration for brain-computer interfaces. Neural Computation, 23(3), 791-816.
Vidaurre, C., Schlogl, A., Cabeza, R., Scherer, R., & Pfurtscheller, G. (2006). A fully on-line adaptive BCI. IEEE Transactions on Biomedical Engineering, 53(6), 1214-1219.
Wang, H., Fan, W., Yu, P. S., & Han, J. (2003). Mining concept-drifting data streams using ensemble classifiers. In Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 226-235).


Waterhouse, S. R. (1998). Classification and regression using mixtures of experts (Doctoral dissertation, University of Cambridge).
Weiskopf, N., Mathiak, K., Bock, S. W., Scharnowski, F., Veit, R., Grodd, W., ... & Birbaumer, N. (2004). Principles of a brain-computer interface (BCI) based on real-time functional magnetic resonance imaging (fMRI). IEEE Transactions on Biomedical Engineering, 51(6), 966-970.
Wessberg, J., Stambaugh, C. R., Kralik, J. D., Beck, P. D., Laubach, M., Chapin, J. K., ... & Nicolelis, M. A. (2000). Real-time prediction of hand trajectory by ensembles of cortical neurons in primates. Nature, 408(6810), 361-365.
Wolpaw, J. R., Birbaumer, N., McFarland, D. J., Pfurtscheller, G., & Vaughan, T. M. (2002). Brain–computer interfaces for communication and control. Clinical Neurophysiology, 113(6), 767-791.
Wolpaw, J. R., McFarland, D. J., & Vaughan, T. M. (2000). Brain-computer interface research at the Wadsworth Center. IEEE Transactions on Rehabilitation Engineering, 8(2), 222-226.
Yu, J., Ang, K. K., Guan, C., & Wang, C. (2013). A multimodal fNIRS and EEG-based BCI study on motor imagery and passive movement. In 6th International IEEE/EMBS Conference on Neural Engineering (NER), 2013 (pp. 5-8).
Zeyl, T. J., & Chau, T. (2014). A case study of linear classifiers adapted using imperfect labels derived from human event-related potentials. Pattern Recognition Letters, 37, 54-62.
Zhao, P., & Hoi, S. C. (2010). OTL: A framework of online transfer learning. In Proceedings of the 27th International Conference on Machine Learning (ICML) (pp. 1231-1238).
Zilles, K., & Amunts, K. (2010). Centenary of Brodmann's map—conception and fate. Nature Reviews Neuroscience, 11(2), 139-145.

Summary

This summary provides a general overview of the work carried out during this thesis, entitled "On pattern classification in motor imagery-based brain-computer interfaces". We first introduce the general framework of the thesis and the scientific positioning of our work. We then briefly present its main contributions. Finally, we close with a discussion of the remaining scientific challenges and the perspectives opened by our work. All of these elements are detailed in the English part of this manuscript.

1 General framework

1.1 Interfacing the human brain: a long-standing quest

The brain remains one of the most complex objects of study. Humans have long sought to unravel its mystery and understand its workings. This quest began with classical invasive surgery techniques, which taught us a great deal about the structure of the brain but little about how it functions. The advent of modern neuroimaging techniques has given us a better understanding of the mechanisms of the brain and of the link between these mechanisms and certain sensorimotor tasks of the human body.

According to current knowledge, the brain contains roughly 10^10 neurons and 10^11 glial cells (Douglas Fields, 2009). Neurons are the information-processing units of the brain, while glial cells ensure that information is properly routed between neurons. In simplified terms, a neuron receives information in the form of electrochemical signals from other neurons or from sensory cells, processes these signals and, depending on their amplitude, decides whether or not to forward them to other neurons. This neuronal activity causes changes in the extracellular electromagnetic fields that can be measured with invasive neuroimaging techniques such as microelectrodes (Cheung, 2007). More importantly, the synchronized electrochemical activity of a large population of neurons close to the scalp surface (in the cerebral cortex) causes changes in the electromagnetic fields that can be measured with non-invasive neuroimaging techniques such as electroencephalography (EEG) (Berger, 1929) or magnetoencephalography (MEG) (Cohen, 1968).

To supply the energy required by this neuronal activity, a dense vascular network covers all the cells of the brain. A particular type of glial cell, the astrocytes, maintains the brain's metabolism by transporting oxygen and glucose from the blood vessels to the neurons. An increase in glucose and oxygen consumption in a brain area is accompanied by an increase in blood flow in that area. The amount of oxygen delivered is generally an order of magnitude larger than the amount consumed, which increases the level of oxygenated hemoglobin and decreases the level of deoxygenated hemoglobin in the active area. These changes in the brain's hemodynamic activity can be measured with non-invasive neuroimaging techniques such as near-infrared spectroscopy (Jobsis, 1977), providing indirect information about cerebral activity.

The use of neuroimaging techniques in medical diagnosis and clinical research has established many links between different patterns of brain activity and different sensorimotor and cognitive functions (Trans Cranial Technologies, 2012). The study of the electrical activity of large populations of neurons has made it possible to precisely localize the brain areas involved in somatosensory and motor functions (Krauledat, 2008). These areas, called the somatosensory cortex and the motor cortex respectively, play a very important role in how we perceive and interact with the outside world. The electrical activity generated in the motor cortex is routed through the central and peripheral nervous systems to the various muscles of the body in order to perform movements.
Neurodegenerative diseases such as amyotrophic lateral sclerosis (ALS) can severely damage nerve tissue and break the communication channel between the cerebral cortex and the muscles. People suffering from these diseases may completely lose their motor abilities and remain locked inside their own bodies (locked-in syndrome) (Douglas Fields, 2009). Because these diseases are in most cases incurable, scientists have long tried to find new ways of providing these individuals with a minimum level of communication and autonomy. It was only with the advent of modern digital signal processing tools that the idea emerged of directly interfacing the brain with external devices in order to provide new means of communication. The term brain-computer interface (BCI) thus appeared for the first time in the work of Jacques J. Vidal in 1973 (Vidal, 1973).


1.2 Brain-computer interfaces: towards a new approach to communication

A brain-computer interface (BCI) is a system that establishes direct communication between the brain and an external device (Wolpaw et al., 2002). This general definition covers both systems that feed signals into the brain and systems that measure signals from the brain. The first category includes cochlear implants, which restore the ability to hear; it is not the subject of this thesis. In the remainder of this summary, the term brain-computer interface refers to any system that measures signals from the brain in order to communicate with external devices.

Decoding signals measured from the brain in order to distinguish different cognitive states is not a simple task. To extract useful information from very noisy signals, these signals must go through several processing and decoding stages. A BCI is generally composed of the following six stages (see Figure 1):

- Signal acquisition: the electrical or hemodynamic activity of the brain is measured using various neuroimaging techniques. In this thesis, we focus on electroencephalography (EEG), which measures the electrical activity of the cerebral cortex through electrodes placed on the scalp (Baillet et al., 2001).
- Signal processing: the signal-to-noise ratio is increased using filtering techniques.
- Feature extraction: the features that allow different mental states to be distinguished are extracted from the preprocessed signals.
- Classification: a function (called a classifier) is built that assigns different labels to signals associated with different cognitive states. In statistical learning, this task is called supervised learning. The classifier is generally built using signals whose labels are already known.
- Command: once the classifier is established, the labels it assigns to each signal sequence are translated into commands. These commands depend on the application used to communicate with external devices.
- Feedback: visual, auditory or tactile information is returned to the user. This stage is important for the user to learn to master the system.


Figure 1: General architecture of a brain-computer interface
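The six-stage chain above can be sketched end to end in a few lines. The following minimal Python sketch is purely illustrative: the sampling rate, the log band-power features, the linear rule and all function names are assumptions for the example, not part of any BCI library or of the thesis's actual pipeline.

```python
import numpy as np

FS = 250.0  # assumed EEG sampling rate in Hz (illustrative)

def bandpass(signal, low=8.0, high=30.0):
    """Signal processing: crude band-limiting by FFT masking
    (a placeholder for a real band-pass filter)."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / FS)
    spectrum[(freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=signal.size)

def extract_features(epoch):
    """Feature extraction: log band-power per channel (epoch: channels x samples)."""
    return np.log(np.mean(epoch ** 2, axis=1) + 1e-12)

def classify(features, weights, bias):
    """Classification: linear decision rule (LDA-style)."""
    return 1 if features @ weights + bias > 0 else 0

def to_command(label):
    """Command: map the predicted class to a device command."""
    return {0: "move_left", 1: "move_right"}[label]

# Acquisition is simulated by a two-channel epoch with unequal band power.
t = np.arange(250) / FS
epoch = np.vstack([np.sin(2 * np.pi * 10 * t), 0.1 * np.sin(2 * np.pi * 10 * t)])
filtered = np.vstack([bandpass(ch) for ch in epoch])
command = to_command(classify(extract_features(filtered), np.array([1.0, -1.0]), 0.0))
# → "move_right"
```

The feedback stage would then present `command` (or its effect) back to the user; it is omitted here.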

Depending on the nature of the measured signals, brain-computer interfaces can be classified into three categories: active, reactive and passive BCIs (Tan, 2006). In active BCIs, the user voluntarily tries to change his or her brain activity patterns by performing different cognitive tasks. The best-known example is motor imagery-based BCIs, in which the user imagines moving different limbs (e.g., left hand and right hand) in order to generate different brain activity patterns that the system can distinguish (Pfurtscheller, 2001). Reactive BCIs measure changes in brain activity in response to visual, auditory or tactile stimuli. Finally, passive BCIs measure changes in brain activity that occur passively during the execution of a cognitive or motor task. In this thesis, we focus on motor imagery-based BCIs, which are the most widely used in applications intended for severely paralyzed people.

Initially, brain-computer interface research aimed to provide new means of communication for people suffering from locked-in syndrome. Over time, several BCI prototypes have been developed for people with less severe disabilities and even for healthy individuals (Wolpaw et al., 2002; Nicolas-Alonso and Gomez-Gil, 2012). Today, BCIs are used in many applications such as communication and control, locomotion, therapy and video games.


1.3 The classical approach to brain signal classification in motor imagery-based BCIs: a trade-off between usability and reliability

With the first generations of motor imagery-based brain-computer interfaces, the user had to undergo weeks or even months of training to learn to master his or her sensorimotor rhythms before being able to use a BCI. This was due not only to the fact that self-regulation of brain activity through motor imagery is a difficult task, but also to the fact that signal decoding methods were not very effective. The introduction of modern machine learning methods into the BCI signal decoding process shifted the paradigm from "let the user learn" to "let the machine learn" (Samek, 2014). With this second generation, machine learning methods tailored to the characteristics of each user made it possible to use a BCI from the very first session. A session is generally composed of two phases: a calibration phase and a feedback phase (Figure 2).

1.3.1 Calibration phase

Before the system can predict the user's intention from his or her brain activity, it must learn the spatio-temporal characteristics of the signals that distinguish different motor imagery tasks (e.g., moving the left hand vs. moving the right hand). To this end, a calibration phase is required, during which the user interacts with the system in a guided way. During this phase, a cue appears repeatedly on a computer screen, asking the user to perform the two motor imagery tasks during well-defined time intervals. The signals measured during these intervals are then labeled. After processing, the features extracted from these signals are used to train a classification model. In order to correctly label new signals measured during the same session, classical learning methods require a calibration time of 20 to 30 minutes (Fazli et al., 2009).
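As a toy illustration of what this calibration phase produces, the sketch below fits a two-class LDA classifier from labeled calibration features. The data are synthetic and the regularization constant is an arbitrary choice; this is not the thesis's actual pipeline.

```python
import numpy as np

def fit_lda(X, y):
    """Fit a two-class LDA classifier on labeled calibration features.
    X: (n_trials, n_features), y: labels in {0, 1}."""
    mu0, mu1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
    # pooled within-class covariance, lightly regularized for stability
    centered = np.vstack([X[y == 0] - mu0, X[y == 1] - mu1])
    cov = np.cov(centered.T) + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(cov, mu1 - mu0)   # discriminant direction
    b = -w @ (mu0 + mu1) / 2.0            # threshold halfway between class means
    return w, b

def predict(w, b, X):
    return (X @ w + b > 0).astype(int)

# Simulated calibration run: 40 labeled trials per motor imagery class.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.3, size=(40, 2)),
               rng.normal(1.0, 0.3, size=(40, 2))])
y = np.array([0] * 40 + [1] * 40)
w, b = fit_lda(X, y)
```

In a real session the rows of `X` would be features extracted from the cued, labeled EEG epochs rather than Gaussian draws.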

1.3.2 Feedback phase

After the calibration phase, the system can operate in asynchronous mode. The user is no longer required to follow the system's instructions and can interact freely. Many users succeed in using a motor imagery-based BCI from the first session, but some never manage to master their sensorimotor rhythms. This phenomenon, called "BCI illiteracy", is one of the main problems in motor imagery-based BCIs.

Figure 2: EEG signal classification process in motor imagery-based brain-computer interfaces

Machine learning methods tailored to each user's characteristics have considerably reduced the training time of motor imagery-based BCIs, but they have limitations of their own. Indeed, this learning approach involves a trade-off between the usability and the reliability of brain-computer interfaces. On the one hand, 20 to 30 minutes of calibration before each use of the system is not viable for everyday life, especially for disabled people who may have limited concentration capacities. On the other hand, reducing the calibration time means less time for the user to generate stable brain activity patterns, and less labeled data for the classifier to decode brain signals effectively. Even with a long calibration phase, brain signals are never stationary, which poses a serious problem for classical supervised learning methods, which rely on the assumption that data are independent and identically distributed (i.i.d.). The non-stationarity of signals in BCIs can stem from several factors (Krauledat, 2008). It can be caused by artifacts such as electrode movement, eye blinks or swallowing. It can also be associated with changes in the user's mental state, such as fatigue or attention deficits. The change of experimental paradigm between the calibration phase and the feedback phase is also one of the main sources of non-stationarity of brain signals in BCIs. For these reasons, the BCI research community has in recent years focused on designing learning methods that yield systems that are both usable and reliable.

2 Contributions of this thesis

2.1 An in-depth study of the supervised learning problem in motor imagery-based BCIs

To design BCIs that can reliably classify EEG signals using little calibration data, several supervised learning approaches have been explored in recent years. An overall view of these approaches allowed us to identify two main categories: approaches that build learning models that are robust to the non-stationarity of EEG signals, and approaches that build learning models that adapt to this non-stationarity.

2.1.1 Learning models robust to non-stationarity

In this category, several approaches have been explored, such as the use of prior physiological information (Grosse-Wentrup et al., 2009; Ahn et al., 2011) or the generation of artificial data from a small sample of calibration data (Lotte, 2015). The most widely explored approach, however, is transfer learning. Transfer learning consists in incorporating the knowledge acquired during previous learning tasks into the learning of a new task, in order to minimize costs and/or increase performance (Pan and Yang, 2010). In brain-computer interfaces, transfer learning consists in incorporating data measured from several subjects, and/or from several sessions of the same subject, into the calibration phase of the system for a new subject, in order to shorten this phase without degrading system performance (Tu and Sun, 2012). When performed correctly, transfer learning captures the information that generalizes across users. However, this kind of learning is not easy to carry out in BCI applications, because of the high inter-subject and inter-session variability of brain signals. To overcome this problem, various learning techniques suited to heterogeneous data have been used, including regularization (Lotte and Guan, 2010), multi-task learning (Alamgir et al., 2010) and ensemble methods (Fazli et al., 2009; Liyanage et al., 2013).

2.1.2 Learning models that adapt to non-stationarity

In this category, learning does not stop at the calibration phase but continues during the feedback phase. This reduces the calibration time and adapts the system to changes in the data distribution (Sun and Zhou, 2014). Different techniques have been used to adapt the parameters of the learning model in motor imagery-based BCIs. They can be classified into three categories: unsupervised adaptation, supervised adaptation and semi-supervised adaptation. Unsupervised adaptation updates the class-independent parameters of the classification model, which circumvents the absence of true labels during asynchronous interaction (Vidaurre et al., 2011a). However, unsupervised adaptation cannot track class-specific changes in the data distribution. Supervised adaptation, on the other hand, can track any kind of change in the data distribution (Buttfield and Millán, 2006). This type of adaptation is nevertheless not feasible in the asynchronous interaction mode most often used in motor imagery-based BCIs, because the true labels are unknown. In the end, the most widely explored adaptation method remains semi-supervised adaptation. To adapt the class-related parameters without requiring true labels, two semi-supervised approaches have been explored. The first uses the predictions of the classification model itself to perform the update (Blumberg et al., 2007). This approach can be effective in some cases, but it is prone to error accumulation. The most promising approach for building classification models that adapt to non-stationarity appears to be adaptation based on error-related potentials (Llera et al., 2011). These potentials are detected in the EEG signals after an error made by the system or by the user (Chavarriaga and Millán, 2014).
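As an illustration of class-independent (unsupervised) adaptation, the sketch below tracks the global feature mean online and re-centers an LDA-style rule with it, in the spirit of the unsupervised LDA adaptation of Vidaurre et al. (2011a). The class name, the exponential update rule and the learning rate `eta` are illustrative assumptions, not the published algorithm.

```python
import numpy as np

class AdaptiveBiasLDA:
    """Linear classifier whose bias tracks the class-independent global
    feature mean online; no labels are needed for the update."""

    def __init__(self, w, global_mean, eta=0.1):
        self.w = np.asarray(w, dtype=float)
        self.mean = np.asarray(global_mean, dtype=float)
        self.eta = eta  # illustrative adaptation rate

    def update(self, x):
        # exponentially weighted running mean of incoming features
        self.mean = (1 - self.eta) * self.mean + self.eta * np.asarray(x, dtype=float)

    def predict(self, x):
        # re-center by the tracked global mean before projecting
        return int(self.w @ (np.asarray(x, dtype=float) - self.mean) > 0)

# Example: the feature distribution drifts towards +5 during the session.
clf = AdaptiveBiasLDA(w=np.array([1.0]), global_mean=np.array([0.0]))
for _ in range(50):
    clf.update(np.array([5.0]))
```

Because only the shared mean is updated, the decision boundary follows a global drift of the features, but class-specific distribution changes remain untracked, which is exactly the limitation noted above.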
Several studies have shown that using these potentials for the online adaptation of classification models can significantly increase the performance of motor imagery-based BCIs (Blumberg et al., 2007; Llera et al., 2011; Artusi et al., 2011; Zeyl and Chau, 2014). Despite this large body of work, reducing the calibration time without degrading the performance of motor imagery-based BCIs remains an open problem (Lotte, 2015). An in-depth study of existing learning methods and the investigation of new ones are necessary if this technology is ever to be used in everyday life. The objective of this thesis is to highlight the problems associated with existing learning approaches and to propose new approaches that overcome the problems specific to each category.


2.2 A study of ensemble-based transfer learning: application to inter-subject transfer in motor imagery-based BCIs

We carry out an in-depth study of transfer learning based on linear combinations of classifiers. This study focuses on the application to inter-subject transfer in BCIs, but it can be generalized to any other application. In this context, we propose a learning algorithm based on linear combinations of classifiers, whose objective is to shorten the calibration phase of motor imagery-based BCIs without degrading their performance. We evaluate the effectiveness of the classifier combination scheme used in this algorithm and study its behavior as a function of the calibration time of the BCI. In this algorithm, classifiers built from EEG signals collected from several subjects are given weights proportional to their accuracy in classifying the EEG signals of the new BCI user. This allows the learning model to capture the brain activity patterns shared across subjects while adapting to the specificities of the new subject.

Through the proposed method, we also study the extent to which transfer learning is useful in BCI design. In some cases, transfer learning has a negative impact on the performance of the new learning task (Pan and Yang, 2010). This situation, called "negative transfer", is likely to occur in BCIs because of the high inter-subject and inter-session variability of brain signals. Owing to this variability, a learning model trained on the EEG signals of other subjects may fail to classify the signals of the new subject effectively. In ensemble-based inter-subject classification, an appropriate weighting of the classifiers is a necessary condition for building an effective learning model. It is not, however, a sufficient condition, because in some cases no linear combination of classifiers trained on the signals of other subjects can classify the signals of the new subject well. Negative transfer must therefore be taken seriously into account when performing inter-subject learning in BCIs.
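The accuracy-based weighting at the heart of this kind of algorithm can be sketched as follows. The classifier interface and the toy "source subjects" are hypothetical, and only the weighting and voting steps are shown, not the full algorithm.

```python
import numpy as np

def accuracy_weights(classifiers, X_calib, y_calib):
    """Weight each source-subject classifier by its accuracy on the new
    subject's small calibration set. `classifiers` are callables mapping a
    feature matrix to a vector of 0/1 labels (illustrative interface)."""
    accs = np.array([np.mean(clf(X_calib) == y_calib) for clf in classifiers])
    accs = np.clip(accs, 1e-6, None)  # avoid an all-zero weight vector
    return accs / accs.sum()

def ensemble_predict(classifiers, weights, X):
    """Weighted vote: combine classifier outputs (mapped to +/-1) with their weights."""
    votes = np.array([np.where(clf(X) == 1, 1.0, -1.0) for clf in classifiers])
    return (weights @ votes > 0).astype(int)

# Toy source-subject classifiers: one transfers well to the new subject, one does not.
subject_a = lambda X: (X[:, 0] > 0).astype(int)
subject_b = lambda X: (X[:, 0] <= 0).astype(int)
X_calib = np.array([[-1.0], [1.0], [-0.5], [0.5]])  # new subject's few labeled trials
y_calib = np.array([0, 1, 0, 1])
weights = accuracy_weights([subject_a, subject_b], X_calib, y_calib)
```

When no source classifier scores above chance on the calibration set, no weighting rescues the ensemble: this is the negative-transfer situation discussed above.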


2.3 A new learning method combining transfer learning and online adaptation for motor imagery-based BCIs

On the one hand, transfer learning approaches in motor imagery-based BCIs aim to capture brain activity patterns shared across subjects, making it possible to build learning models that generalize to other subjects without requiring much calibration data. As explained above, this assumption that brain activity patterns are shared across all subjects may not hold, because of the high variability of brain signals. On the other hand, adaptive methods aim to adapt online a learning model trained from a small amount of labeled data, without taking advantage of the large volumes of data measured from other subjects and/or sessions. Combining the two approaches can bring several advantages: transfer learning ensures a good starting point for the brain-computer interface, while online adaptation implicitly avoids negative transfer and improves the reliability of the learning model. In this thesis, we propose a learning method that combines transfer learning and online adaptation. In this method, called "adaptive accuracy-weighted ensemble" (AAWE), transfer learning is performed through a linear combination of classifiers trained on EEG signals measured from several subjects. Online adaptation is performed by updating the classifier weights using the ensemble predictions reinforced by error-related potentials.

The novelty of this approach lies in the fact that the weight adaptation takes into account the stochastic dependence between data points, in contrast to dynamic approaches that use only the current point to compute the weights (Tu and Sun, 2012; Liyanage et al., 2013). An experimental study showed the benefits of combining the two learning approaches in a single algorithm. It also allowed us to study the behavior of linear combinations of classifiers for inter-subject classification in motor imagery-based BCIs.
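A minimal sketch of this idea, under two strong simplifying assumptions (a binary task and a perfectly reliable error-potential detector): the ensemble prediction, flipped whenever an error potential is detected, serves as a pseudo-label, and each classifier's weight is a discounted running estimate of its agreement with those pseudo-labels, so that past trials keep influencing the weights. All names and the forgetting factor are illustrative, not the exact AAWE formulation.

```python
import numpy as np

class AdaptiveWeightedEnsemble:
    """Sketch of an adaptive accuracy-weighted ensemble over source-subject
    classifiers; each classifier is a callable mapping one feature vector
    to a 0/1 label (illustrative interface)."""

    def __init__(self, classifiers, init_weights, forget=0.9):
        self.classifiers = classifiers
        self.weights = np.asarray(init_weights, dtype=float)
        self.scores = np.asarray(init_weights, dtype=float).copy()
        self.forget = forget  # < 1: past trials still count (stochastic dependence)

    def predict(self, x):
        votes = np.array([1.0 if clf(x) == 1 else -1.0 for clf in self.classifiers])
        return 1 if self.weights @ votes > 0 else 0

    def update(self, x, predicted, error_potential_detected):
        # pseudo-label: keep the prediction unless an error potential flags it as wrong
        pseudo = 1 - predicted if error_potential_detected else predicted
        agree = np.array([1.0 if clf(x) == pseudo else 0.0 for clf in self.classifiers])
        # discounted running agreement over the whole interaction, not just this trial
        self.scores = self.forget * self.scores + (1 - self.forget) * agree
        self.weights = self.scores / self.scores.sum()
```

In use, the loop alternates `predict` on each incoming trial with `update` once the error-potential detector has (or has not) fired, so the ensemble keeps learning throughout the feedback phase.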


3 Synthesis and perspectives

Minimizing the calibration time without degrading classification performance remains one of the main obstacles to the use of brain-computer interface technology in everyday life. Although several machine learning techniques, such as transfer learning and online learning, have been tried, this problem is still unsolved. In this thesis, we carried out an in-depth study of this problem in the case of motor imagery-based BCIs, which are the most promising but also the most difficult to use. We identified several important points that should be taken into account in future BCIs and proposed new learning methods that address them. Future brain-computer interfaces need learning techniques that encompass all aspects of human learning, namely knowledge transfer, retention and consolidation. Among the other important machine learning issues in BCIs that were not addressed in this thesis is the unification of the processing, feature extraction and classification stages. Most BCI studies treat these stages separately, which contradicts statistical learning theory (Vapnik, 2000). Approaches such as the use of Riemannian geometry (Congedo, 2013) should be further explored in the future.