New Margin- and Evidence-Based Approaches for EEG Signal Classification
N. Jeremy Hill and Jason Farquhar
Max Planck Institute for Biological Cybernetics, Tübingen
Brain-Computer Interface researchers at the Department of Empirical Inference for Machine Learning and Perception:
Felix Bießmann, Cornelius Raths, Suzanne Martens, Jason Farquhar, Jeremy Hill, Bernhard Schölkopf
Jeremy Hill
Cambridge, February 2007
slide 1 of 25
Goals
Develop systems which completely paralysed people (such as late-stage sufferers of Amyotrophic Lateral Sclerosis, ALS) can use to communicate, without relying on:
• muscles
• peripheral nerves
• (vision)
• (motor cortex)
BCI projects
• Attention shifts to auditory stimuli.
• Attention shifts to tactile stimuli. 5-class paradigm in MEG (incl. NIC). 50–85% correct, avg. 70% across 9 subjects. (Cornelius Raths, MSc awarded 2007)
• Improvement of visual “speller” paradigms
  – Manipulation of stimulus type.
  – Optimization of stimulus code according to information-theoretic and psychophysiological factors. (Felix Bießmann, MSc project in progress)
• Ongoing work with Prof. Niels Birbaumer’s group in Tübingen to analyse EEG/ECoG data from ALS patients.
• Algorithm development.
Role of Machine Learning in BCI
• Get results quickly: shift the burden of learning from the patient to the computer. Hours to recognize the relevant features, rather than weeks/months training a patient to modulate pre-specified features.
• Let the system run itself: no intervention from experts.
Example BCI setting
Event-Related Desynchronization in motor imagery: classify imagined left-hand movement vs. imagined right-hand movement based on α-band power of estimated pre-motor cortex sources in the left and right hemispheres.
[Figure: scatter of trials in the space of log bandpower, left hemisphere vs. right hemisphere]
Spatial Filtering
Given a multichannel time-series X (m channels × t samples), we want appropriately spatially filtered time-series S = FX that contain only task-relevant information. The rows f_1^T, ..., f_n^T of the filter matrix F each map the m channels to one source; conversely X = AS, where the columns a_1, ..., a_n of the mixing matrix A give each source's spatial pattern.
e.g.
• Independent Component Analysis (ICA)
• Common Spatial Patterns (CSP) (Koles 1991).
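As an illustrative sketch only (not the authors' implementation, and all names here are hypothetical), CSP can be computed from the two classes' average spatial covariances via a generalized eigendecomposition; the eigenvectors at the two ends of the spectrum give the filters whose output variance differs most between classes:

```python
import numpy as np
from scipy.linalg import eigh

def csp_filters(trials_a, trials_b, n_filters=2):
    """Common Spatial Patterns: return rows f_k^T such that S = F X has
    maximally different variance between classes.
    trials_a, trials_b: lists of (channels x samples) arrays, one per trial."""
    def mean_cov(trials):
        # average trace-normalized spatial covariance of one class
        return sum((X @ X.T) / np.trace(X @ X.T) for X in trials) / len(trials)
    Ca, Cb = mean_cov(trials_a), mean_cov(trials_b)
    # Generalized eigenproblem Ca f = lambda (Ca + Cb) f; eigenvectors at the
    # extremes of the eigenvalue spectrum maximize the between-class variance ratio.
    evals, evecs = eigh(Ca, Ca + Cb)
    order = np.argsort(evals)
    k = n_filters // 2
    pick = np.r_[order[:n_filters - k], order[len(order) - k:]]
    return evecs[:, pick].T   # F: each row is one spatial filter
```

With `n_filters=2` this returns one filter from each end of the spectrum, matching the two-filter setup used later in the talk.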
EEG Example
Amplitude spectra of raw EEG signals:
[Figure: discriminability (AUC, 0–1) of amplitude spectra by electrode (1–64) and frequency (6.25–50 Hz); extreme values annotated: 0.78 and 0.39]
EEG Example
Amplitude spectra of sources estimated by Independent Component Analysis:
[Figure: discriminability (AUC, 0–1) by independent component (1–64) and frequency (6.25–50 Hz); extreme values annotated: 0.93 and 0.31]
Whitening and rotation
[Figure sequence: a two-source example shown as time-courses and scatter plots at successive stages: ground-truth sources, mixed sensor signals, whitened signals, and whitened-and-rotated signals (the recovered independent components).]
Cheap supervised rotation with CSP
[Figure sequence: animation illustrating CSP as a cheap supervised rotation.]
CSP: outlier- (artifact-) sensitivity
[Figure sequence: illustration of how outlying (artifact) trials distort the CSP solution.]
CSP: overfitting
[Figure: channel × frequency AUC maps after CSP, for 80 training trials (annotated extremes 0.97 and 0.15) vs. 80 test trials (annotated extremes 0.95 and 0.29), illustrating overfitting.]
Problems with CSP
• Outlier-sensitivity and overfitting (due to a poor objective function)
• How to pick which components to use?
• Sensitivity to initial assumptions:
  – Which frequency band?
  – Which time window?
The exact frequency of sensory-motor rhythms varies between individuals. In practice, component / band / time-window selection is often best performed by hand.
The ideal BCI algorithm would be a “black box” requiring no such intervention.
Get a better objective (I)
Approach #1: Margin Maximization (à la Support Vector Machine)
Maximize the margin in the space of log-bandpower features ψ(X; F):
ψ(X_i; F) = log diag(F X_i X_i^T F^T)
Given time-series X_i and class labels y_i, simultaneously optimize
• spatial filtering coefficients F
• classifier weight-vector w in log-bandpower space
• classifier bias b in log-bandpower space
to minimize the SVM-like objective function:
λ w^T w + Σ_i max(0, 1 − y_i (ψ(X_i; F)^T w + b))
The regularization parameter λ can be found by cross-validation.
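As a concrete sketch (hypothetical helper names; the joint optimization over F, w and b, e.g. by gradient descent, is omitted), the feature map and SVM-like objective can be written as:

```python
import numpy as np

def psi(X, F):
    # log band-power features: psi(X; F) = log diag(F X X^T F^T)
    S = F @ X                              # spatially filtered time-series
    return np.log(np.sum(S * S, axis=1))   # per-filter log variance (band power)

def svm_objective(F, w, b, trials, labels, lam=1.0):
    # lambda w^T w + sum_i max(0, 1 - y_i (psi(X_i; F)^T w + b))
    margins = np.array([y * (psi(X, F) @ w + b)
                        for X, y in zip(trials, labels)])
    return lam * (w @ w) + np.maximum(0.0, 1.0 - margins).sum()
```

Note that diag(F X X^T F^T) is just the per-row sum of squares of FX, which is what the sketch exploits.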
Get a better objective (II)
Approach #2: “Evidence” Maximization (using Gaussian Process classifiers)
The marginal likelihood or evidence of a probabilistic model with hyperparameters F is given by integrating the lower-level parameters (e.g. a classifier’s weight vector w) out of the likelihood for data D:
P(D | F) = ∫ P(D | w, F) P(w | F) dw
It is a probability density function, so it normalizes over the space of possible datasets. Maximizing evidence can be an effective means of complexity control and hence model selection:
[Figure: evidence P(D | F) plotted over data space; a model that is too simple or too complex assigns less evidence to dataset D than a model just complex enough for D.]
• Define a covariance function in the log-bandpower space, e.g. a linear covariance function k(X_i, X_j) = 1 + ψ(X_i; F)^T ψ(X_j; F), or some other function of ψ for non-linear classification.
• Plug this into a Gaussian Process classifier (using a probit likelihood, and the Expectation Propagation algorithm to approximate it; see Kuss & Rasmussen 2005, Journal of Machine Learning Research 6).
• The Gaussian Process framework yields an expression for the evidence, which is easily differentiable with respect to F.
• So optimize F by conjugate gradient descent.
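The linear covariance function above can be sketched as a Gram matrix over trials (an illustrative helper, not the authors' code; the EP-based GP classifier itself is not shown):

```python
import numpy as np

def linear_kernel_matrix(trials, F):
    # k(Xi, Xj) = 1 + psi(Xi; F)^T psi(Xj; F) in log-bandpower space,
    # where psi(X; F) = log diag(F X X^T F^T)
    Psi = np.array([np.log(np.sum((F @ X) ** 2, axis=1)) for X in trials])
    return 1.0 + Psi @ Psi.T   # n_trials x n_trials Gram matrix
```

Because the kernel depends on F only through ψ, its gradient with respect to F (needed for the conjugate-gradient step) follows by the chain rule.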
Experiments
Both methods were tested on motor-imagery EEG data from 15 subjects:
• 9 from BCI competitions (Comp 2: IIa; Comp 3: IVa, IVc)
• 6 recorded at the MPI (Lal et al. 2004, IEEE Trans. Biomed. Eng. 51)
Preprocessing:
• select time-windows 0.5–4 s after stimulus presentation
• band-pass filter in the broad 8–25 Hz band.
Design:
• Two spatial filters were optimized in each case.
• Performance was assessed as a function of training set size, n_train ∈ {50, 100, 150, 200}.
• Each assessment was repeated using 8 random training subsets.
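The preprocessing steps might be sketched as below; the sampling rate and filter order are assumptions for illustration, not taken from the slides:

```python
import numpy as np
from scipy.signal import butter, filtfilt

def preprocess(X, fs=250.0, band=(8.0, 25.0), window=(0.5, 4.0)):
    """X: (channels x samples) trial, starting at stimulus presentation."""
    # zero-phase Butterworth band-pass in the broad 8-25 Hz band
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="bandpass")
    Xf = filtfilt(b, a, X, axis=1)
    # keep the 0.5-4 s window after stimulus presentation
    i0, i1 = int(window[0] * fs), int(window[1] * fs)
    return Xf[:, i0:i1]
```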
Results (I)
[Figure: scatter plots of binary classification error rates, margin-maximized approach vs. CSP + SVM, one panel per training-set size n_train = 50, 100, 150, 200; both axes 0–0.5.]
Binary classification error rates: new approach vs. traditional two-stage CSP + classifier approach.
Note the consistent improvement, most markedly when we have:
• poor subject performance;
• small numbers of training trials.
This is encouraging from a clinical viewpoint.
Results (II)
[Figure: scatter plots of binary classification error rates, evidence-maximized approach vs. CSP + GPC, one panel per training-set size n_train = 50, 100, 150, 200; both axes 0–0.5.]
Note the consistent improvement, most markedly when we have:
• poor subject performance;
• small numbers of training trials.
This is encouraging from a clinical viewpoint.
Spatial Effect of Optimization
Example of spatial patterns “fixed” by evidence-maximization:
[Figure: scalp maps of spatial patterns]
Spatial, temporal, spectral. . .
Ideally we want to optimize automatically over space, over time, and over frequency.
Spatial, temporal, spectral. . .
Weightings over time or frequency can be incorporated into our feature mapping:
ψ(X; F) = log diag(F X X^T F^T)
ψ(X; F, G) = log diag(F X G X^T F^T)
ψ(X; F, H) = log diag(F X̃ H X̃^† F^T)
(G: a weighting over time samples; H: a weighting over frequencies, applied to the Fourier-transformed signal X̃, with † the conjugate transpose.)
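For instance, with a diagonal temporal weighting G = diag(g), the time-weighted mapping reduces to a weighted per-filter variance (an illustrative sketch under that diagonal assumption, not the authors' code):

```python
import numpy as np

def psi_weighted(X, F, g):
    # psi(X; F, G) = log diag(F X G X^T F^T), with G = diag(g), g >= 0 a
    # weighting over time samples; g = ones recovers plain log band power.
    S = F @ X
    return np.log(np.sum(S * S * g, axis=1))   # diag(S G S^T) for diagonal G
```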
Spatial, temporal, spectral. . . Preliminary experiments by Jason Farquhar show that iterated optimization of F, then G, then H. . . can yield sensible results with flat initialization over time and frequency, i.e. without requiring domain knowledge.
Linear or non-linear?
[Figure: decision boundary of a classifier in a two-dimensional feature space.]
Linear covariance function:
k(X_i, X_j) = v [ 1 + Σ_{k=1}^{m} σ_k^-2 log(f_k^T X_i X_i^T f_k) log(f_k^T X_j X_j^T f_k) ]
Linear or non-linear?
[Figure: decision boundary of a non-linear classifier in a two-dimensional feature space.]
RBF covariance function:
k(X_i, X_j) = v exp( −(1/2) Σ_{k=1}^{m} σ_k^-2 d_k^2 )
where
d_k = log(f_k^T X_i X_i^T f_k) − log(f_k^T X_j X_j^T f_k)
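The two covariance functions might be sketched as follows (hypothetical helper names; v and σ are the kernel hyperparameters from the slides):

```python
import numpy as np

def log_bandpower(X, F):
    # log(f_k^T X X^T f_k) for each spatial filter row f_k^T of F
    return np.log(np.sum((F @ X) ** 2, axis=1))

def k_linear(Xi, Xj, F, sigma, v=1.0):
    # v [1 + sum_k sigma_k^-2 log-bandpower product]
    pi, pj = log_bandpower(Xi, F), log_bandpower(Xj, F)
    return v * (1.0 + np.sum(pi * pj / sigma ** 2))

def k_rbf(Xi, Xj, F, sigma, v=1.0):
    # v exp(-0.5 sum_k sigma_k^-2 d_k^2), d_k the log-bandpower difference
    d = log_bandpower(Xi, F) - log_bandpower(Xj, F)
    return v * np.exp(-0.5 * np.sum(d ** 2 / sigma ** 2))
```

Both kernels act on the same log-bandpower features; only the functional form, and hence the linearity of the resulting classifier, differs.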
Linear or non-linear?
[Figure: average generalization error (n_train = 100, × 8 folds, × 15 subjects) vs. number of spatial filters (2, 4, 8, 12, 16, 20, 30, all), linear covariance function in log-variance space; curves for CSP+GPC and Optimized, y-axis roughly 0.1–0.25.]
Linear or non-linear?
[Figure: average generalization error (n_train = 100, × 8 folds, × 15 subjects) vs. number of spatial filters (2, 4, 8, 12, 16, 20, 30, all), RBF covariance function in log-variance space; curves for CSP+GPC and Optimized, y-axis roughly 0.1–0.25.]
Linear or non-linear?
Use of the classifier’s criterion to optimize preprocessing parameters means:
• projection into higher-dimensional feature spaces via a non-linear kernel can help;
• not just “any classifier will do”.
See also: Tomioka et al. (NIPS 2006): logistic regression on (non-logged) variance features.
Model selection: how many filters?
[Figure: average generalization error (n_train = 100, × 8 folds, × 15 subjects) vs. number of spatial filters, RBF covariance function in log-variance space; curves for CSP+GPC, Optimized, −log evidence (CSP+GPC) and −log evidence (after optimization).]
Model selection: how many filters?
[Figure: scatter plot of error when the number of filters is chosen by evidence vs. chosen randomly, CSP + GPC with RBF covariance function; both axes 0–0.5.]
Model selection: how many filters?
[Figure: scatter plot of error when the number of filters is chosen by evidence vs. chosen randomly, Optimized GPC with RBF covariance function; both axes 0–0.5.]
Model selection: how many filters?
[Figure: scatter plot of error under evidence-based selection of the number of filters, Optimized GPC (RBF) vs. CSP + GPC (RBF); both axes 0–0.5.]
Further goals
• Beyond optimization: MCMC sampling and prediction averaging within the GPC framework.
• Adapting the system to cope with shifts in background activity between training and test sessions.
  – Hill, Farquhar & Schölkopf 2006 (Proc. 3rd Intl. BCI Workshop)
  – Tomioka, Hill, Blankertz & Aihara 2006 (IBIS, Osaka)
• Constraining spatial filters to use a small number of electrodes (for convenience of setup).
  – Farquhar, Hill & Schölkopf 2006 (Proc. 3rd Intl. BCI Workshop)
• Application of the same approach to features in the time domain (and automatic combination of time-domain & band-power features).
Conclusion
• Significant improvements can be made by applying the principles of margin- and evidence-maximization to the automatic extraction of bandpower features in EEG (and other signal sources?).
• Benefits are greatest in the difficult cases: high noise and/or small amounts of data.
• Simultaneous optimization of filters and classifier weights eliminates the need to select filters by hand.
• Early indications are that interpretable, optimal weightings across space, time and frequency can be obtained
  – simultaneously;
  – without being very sensitive to prior assumptions.
Thank you for your attention.