EEG Signal Classification Based On A Riemannian Distance Measure

Yili Li, K.M. Wong, and H. deBruin
Electrical and Computer Engineering, McMaster University
Email: [email protected]
Abstract—We propose a k-nearest neighbor EEG signal classification algorithm using a dissimilarity measure defined with a Riemannian distance. The EEG signals are characterized by curves on the manifold of power spectral density matrices. By endowing the manifold with a Riemannian metric, we obtain the Riemannian distance between two points on the manifold, from which the measure of dissimilarity is then defined. To best facilitate the classification of similar and dissimilar EEG signal sets, we obtain an optimally weighted Riemannian distance that aims to render signals in different classes more separable and those in the same class more compact. The motivation of the algorithm design and the verification method are also provided. Experimental results show the superior performance of the new metric in comparison to the k-nearest neighbor EEG signal classification algorithm using the commonly used Kullback-Leibler (KL) dissimilarity measure.
I. INTRODUCTION

EEG signal classification is of paramount importance in applications such as Brain Computer Interface (BCI) and sleep staging. Classification techniques can be divided into supervised and unsupervised learning methods [7]. In supervised learning, the classes are known in advance and the goal is to establish a rule whereby we can classify a new observation into one of the existing classes. In unsupervised learning, the aim is to establish the existence of classes based on a given set of observations. In this paper, we focus on a particular problem of supervised classification of EEG signals.

EEG signals occupy the frequency range of 0-60 Hz, which is roughly divided into five constituent physiological EEG subbands: δ (0-4 Hz), θ (4-7 Hz), α (8-12 Hz), β (13-30 Hz), and γ (30-60 Hz). They represent the effects of the superimposition of diverse processes in the brain. EEG signals are often contaminated by noise and artifacts due to eye blinking or other muscular activities. To carry out EEG signal classification, filtering is first applied to extract the portion of the signals of interest (features) and to remove interfering components. This is then followed by a signal classifier in which the signals are classified based on the extracted features characterizing the EEG signals. A typical classification scheme is shown in Figure 1.

Feature extraction in EEG signals is a process to obtain those signal features which are suitable for the purpose of classification. There is a variety of features that can be employed for classification, as well as various methods to extract them.
Fig. 1. A typical classification scheme: EEG signal (input) → filtering → feature extraction → classification → decision (output).
For example, the power spectral densities (PSD), the autoregression coefficients, or the time-frequency distributions of the EEG signals could be the chosen features. Then, different methods, such as a neural network feature selector or fuzzy entropy-based feature ranking, could be used for feature extraction and selection. However, feature extraction is not a trivial problem in the sense that there is no way to guarantee that features extracted from EEG observations are good for the particular problem of classification which, in turn, may be carried out by a variety of algorithms categorized as linear and nonlinear (or combinations of both) classifiers [15]. Most classification algorithms require low-dimensional extracted feature vectors to avoid high computational complexity. However, such extracted feature vectors may retain only part of the signal information and may have different performance implications on different classification algorithms. Indeed, for the same extracted feature space, different classifiers often yield very different classification performance [12].

The k-nearest neighbor classifier is one algorithm which makes use of the total information contained in the signals without having to consider a restriction on the dimensionality of the features. The idea is to assign a feature vector to a class according to the majority of its k nearest neighbors. The neighbors can be the feature vectors from the training sets, or the prototypes of the different classes, once a distance measure is defined [3], [5]. For an infinitely large sample size, the performance of the k-nearest neighbor classifier has been shown to approach that of the optimum Bayesian classifier as k → ∞ and k/N → 0 (N being the sample size) [6]. An automatic EEG signal classification using the k-nearest neighbor rule together with the Kullback-Leibler (KL) measure has been proposed [11]. However, the KL measure is not a true distance in the sense that the triangle inequality does not hold [1]. Furthermore, the explicit formula of the KL measure used in [11] assumes the data are Gaussian distributed.
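To make the classification rule concrete, the following is a minimal sketch of the k-nearest neighbor rule with a pluggable dissimilarity measure; the function names and data layout are illustrative assumptions, not the implementation used in this paper.

    import numpy as np
    from collections import Counter

    def knn_classify(query, train_feats, train_labels, dissim, k=5):
        # Dissimilarity from the query feature to every training feature.
        d = np.array([dissim(query, f) for f in train_feats])
        nearest = np.argsort(d)[:k]        # indices of the k smallest dissimilarities
        votes = Counter(train_labels[i] for i in nearest)
        return votes.most_common(1)[0][0]  # majority vote among the k neighbors

Any dissimilarity can be plugged in for dissim, e.g., the KL measure of [11] or the Riemannian curve dissimilarity developed below.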
In this paper, we also employ the k-nearest neighbor approach for EEG signal classification. However, we develop a dissimilarity measure based on a Riemannian distance and apply it to the features. We choose Riemannian geometry as the starting point because the features chosen here are the power spectral densities of the measured multi-channel EEG signals. These features are functions of frequency which, together, describe different curves in the feature space. Thus, dissimilarity here is a measure of the distance between curves, and the Euclidean distance between vectors is no longer appropriate.

The paper is structured as follows: Section II introduces EEG signal characterization and defines the dissimilarity measure. In Section III, we introduce metric learning to find the optimum Riemannian distance. In Section IV, the k-nearest neighbor classification algorithm using this dissimilarity measure is presented. Section V presents the experimental results, and conclusions are given in Section VI.
II. EEG SIGNAL FEATURES AND MEASURES

We measure the EEG data of the ith patient for the determination of sleep stages through M channels. These collected EEG signals may last for hours and are divided into T-second epochs. In general, the signals collected are non-stationary. However, it is widely accepted [17] that if T ≤ 30 sec, then each epoch of the measured EEG data represents a wide-sense stationary signal. Hence, in this paper, the data are all in 30-second segments. For the nth epoch, at time instant t, the signal from the mth channel is {s_nm^(i)(t), m = 1, ..., M}.

A. The EEG signal

Collecting all the M channel measurements for the ith patient, we can represent the nth epoch of these multi-channel data at t as a vector:

  s_n^(i)(t) = [s_n1^(i)(t), ..., s_nM^(i)(t)]^T,  t = 0, ..., T − 1   (1)

Thus, the nth epoch data matrix for the ith patient is given by

  S_n^(i) = [s_n^(i)(0), ..., s_n^(i)(T − 1)],  n = 1, ..., N^(i)   (2)

Each epoch of measured EEG data is carefully inspected by clinical experts and is labeled to indicate that it represents a particular state of sleep for the patient. Thus, for the ith patient, we have the following labeled sample EEG signals:

  D^(i) = { [S_1^(i); ℓ_1^(i)], ..., [S_n^(i); ℓ_n^(i)], ..., [S_{N^(i)}^(i); ℓ_{N^(i)}^(i)] }   (3)

where

  ℓ_n^(i) ∈ L = {1, 2, ..., L}   (4)

denotes the label of the nth epoch of the EEG signal belonging to any of the L states of sleep. A library of these labeled sample EEG signals from several patients is stored such that D = ∪_i D^(i), having a total of N = Σ_i N^(i) epochs of data. The idea is that, when there is a new measured EEG signal S_0, we want to derive a classification method to automatically determine its label ℓ_0 without any expert judgement.
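As a concrete, purely illustrative picture of this library, the labeled epochs of Eq. (3) can be held in a structure like the following; the channel count, sampling rate, and field names are assumptions, not the authors' storage format.

    import numpy as np

    M, T = 8, 30 * 128                        # assumed: 8 channels, 30 s at 128 Hz
    library = [                                # D = union over patients i of D^(i)
        {"S": np.zeros((M, T)), "label": 1},   # one labeled epoch [S_n; l_n]
        {"S": np.zeros((M, T)), "label": 3},   # labels drawn from {1, ..., L}
    ]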
B. EEG signal feature

Since clinical experts usually determine the patient's state of sleep by examining the power contents in the subbands of the EEG signal, it is appropriate to use the power spectral density of the signal as the feature for the classification of sleep stages. Here, we examine the cross power spectral density matrix of the recorded EEG data. The nth epoch of the multi-channel data measured at different time instants, s_n(t) = [s_n1(t), ..., s_nM(t)]^T, t = 0, ..., T − 1, can be considered as a wide-sense stationary M-channel EEG signal vector. Therefore, its ensemble mean and covariance are independent of the measurement instant and are defined respectively as

  μ_n = E[s_n]   (5)

and

  Γ_n(τ) = E[{s_n(t + τ) − μ_n}{s_n(t) − μ_n}^T]   (6)

It can be shown [4] that the covariance matrix Γ_n(τ) is positive semi-definite and that the information contained in the covariance can be expressed equivalently in terms of the power spectral density matrix of the signal

  P_n(ω) = (1/2π) Σ_τ Γ_n(τ) e^{−jωτ}   (7)

which is positive semi-definite if

  Σ_τ ||Γ_n(τ)|| < ∞   (8)

where ||Γ_n(τ)|| denotes the ℓ1-norm of the matrix. The fact that P_n(ω) is Hermitian follows from Equation (7) and [Γ_n(τ)]^T = Γ_n(−τ). We can, therefore, characterize the nth epoch multi-channel EEG signal matrix S_n by its positive definite power spectral density matrix P_n(ω) in a frequency range [ω_min, ω_max].

To obtain the power spectral density matrix from finite measured data, it is necessary to estimate the covariance matrix Γ_n(τ), the unbiased estimate of which may fail to be positive definite. Positive definiteness can be guaranteed if biased estimators are used, but the spectral resolution may suffer and the bias may become too severe if the data record is short. The Nuttall-Strand algorithm [18], [20] of power spectral density matrix estimation overcomes such difficulties and is employed in this paper. The algorithm uses forward-backward linear prediction to iteratively estimate the residual covariance matrix, arriving at an accurate positive semi-definite estimate of the power spectral density matrix with high frequency resolution. This estimate is used as the power spectral density feature to characterize the EEG signals.

Each power spectral density matrix (at frequency ω) can be regarded as a point in the manifold M of the space of all positive definite matrices. Thus, for the nth epoch of a multi-channel time series S_n characterized by its positive definite power spectral density matrix P_n(ω) in a frequency range [ω_min, ω_max], we can represent it as a series of points forming a curve on M parameterized by the frequency variable ω, i.e.,

  P_n(ω): [ω_min, ω_max] → M   (9)

This concept is shown in Figure 2.

Fig. 2. EEG signal representation as curves in the manifold M.
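For illustration, the sketch below estimates a cross power spectral density matrix from one M-channel epoch. SciPy does not provide the Nuttall-Strand algorithm, so a Welch-type estimator (scipy.signal.csd) stands in here; it is a substitute for, not a rendition of, the estimator used in this paper.

    import numpy as np
    from scipy.signal import csd

    def psd_matrix(S, fs, nperseg=256):
        # S: (M, T) epoch matrix. Returns the frequency grid and a (F, M, M)
        # stack of cross-PSD matrices, one Hermitian matrix P(w) per bin.
        M = S.shape[0]
        freqs, _ = csd(S[0], S[0], fs=fs, nperseg=nperseg)
        P = np.empty((len(freqs), M, M), dtype=complex)
        for a in range(M):
            for b in range(M):
                _, P[:, a, b] = csd(S[a], S[b], fs=fs, nperseg=nperseg)
        return freqs, P

Each P[i] is Hermitian and positive semi-definite up to estimation error, matching the characterization of Eqs. (7)-(9).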
C. The similarity/dissimilarity measure

To facilitate the process of EEG signal classification, we have to establish a measure which yields a short distance between similar power spectral densities (i.e., EEG signals of the same state of sleep) and a large distance between dissimilar power spectral densities (i.e., EEG signals of different states of sleep). The commonly used distance in a linear vector space is the Euclidean distance. However, since for each frequency the power spectral density matrix is a point on the Riemannian manifold M, the Euclidean measure is not appropriate. Rather, we should establish a geodesic distance function on the manifold M to measure the Riemannian distance between two points. (A Riemannian distance has been successfully applied in image analysis [10]. However, that Riemannian distance is invariant to weighting matrices, which renders it inappropriate for our purpose of signal classification.) Furthermore, since each power spectral density matrix is a function of ω and, as ω varies, describes a curve on the Riemannian manifold M, we must establish the similarity/dissimilarity between two power spectral density matrices as a distance between two curves.

Suppose we have established the geodesic distance function d_G for the manifold M. For the two curves on the manifold described by two power spectral density matrices P_1(ω) and P_2(ω) as ω varies, d_G can be thought of as a non-negative real-valued function of ω measuring the distance between the two curves, i.e.,

  d_G(ω) = d_G(P_1(ω), P_2(ω))   (10)

At each frequency ω = ω_i it measures the dissimilarity between two points, P_1(ω_i) and P_2(ω_i). As the frequency ω varies, we can define the total distance between the curves P_1(ω) and P_2(ω) as the integral of d_G with respect to ω, i.e.,

  d(P_1(ω), P_2(ω)) = ∫_{ω_min}^{ω_max} d_G(ω) dω   (11)

It is easy to show that this Riemann integral satisfies the axioms of a distance function, and it can be approximated as

  d(P_1(ω), P_2(ω)) ≈ Σ_i d_G(ω_i) Δω_i = Σ_i d_G(P_1(ω_i), P_2(ω_i)) Δω_i   (12)

If equal frequency increments are used, i.e., Δω_i = c, a constant, then without loss of generality we can define the dissimilarity between the given two curves as

  d(P_1(ω), P_2(ω)) = Σ_i d_G(P_1(ω_i), P_2(ω_i))   (13)
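A sketch of the curve dissimilarity of Eq. (13) follows. Since the paper's own geodesic d_G is only derived in Section III, the widely used affine-invariant distance between positive definite matrices serves here purely as a placeholder.

    import numpy as np
    from scipy.linalg import eigvalsh

    def d_affine_invariant(P1, P2):
        # ||log(P1^{-1/2} P2 P1^{-1/2})||_F via the generalized eigenvalues
        # of the pencil (P2, P1); a stand-in geodesic, not the paper's d_G.
        lam = eigvalsh(P2, P1)             # solves P2 v = lam P1 v
        return np.sqrt(np.sum(np.log(lam) ** 2))

    def curve_dissimilarity(P_curve1, P_curve2, d_G=d_affine_invariant):
        # Eq. (13): sum per-frequency geodesic distances between two
        # PSD-matrix curves sampled on a common frequency grid (F, M, M).
        return sum(d_G(p1, p2) for p1, p2 in zip(P_curve1, P_curve2))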
III. A NOVEL RIEMANNIAN DISTANCE FOR METRIC-BASED LEARNING

The measure of dissimilarity in the above section necessitates the establishment of a geodesic distance function d_G. In this section, we attempt to establish a Riemannian distance to measure the similarity/dissimilarity between different power spectral density matrices so that classification of the EEG signals can be facilitated.

To develop a Riemannian distance on the differentiable manifold M, the space of positive definite matrices, we must first endow it with a Riemannian metric [13]. Such a Riemannian metric g is defined as an inner product on the tangent space, T_P M, of each point P on M. Thus, we obtain a Riemannian manifold (M, g). Since there are infinitely many possible Riemannian metrics on a differentiable manifold, a suitable one for our purpose of signal classification has to be chosen.

For a point P ∈ M, there exists a representation of P, denoted P̃, in a Hilbert space H such that P = P̃P̃^H. Let H_P̃ be the horizontal subspace of the tangent space T_P̃ H, which is the tangent space of H at P̃. For Ã, B̃ ∈ H_P̃, we can endow H_P̃ with an inner product

  ⟨Ã, B̃⟩ = (1/2) Tr(Ã^H B̃ + B̃^H Ã)   (14)

By establishing an isomorphism between T_P M and H_P̃, we can let the inner product defined on T_P M equal ⟨Ã, B̃⟩ in Eq. (14). Expressing Ã, B̃ in terms of A, B ∈ T_P M, we obtain the following natural Riemannian metric on M:

  g_P(A, B) = (1/2) Tr(AK)   (15)

where P ∈ M, A, B ∈ T_P M, and K is a Hermitian matrix such that

  KP + PK = B   (16)

The Riemannian metric in Equation (15) represents the squared distance between two points (say P and P + ΔP) in an infinitesimal region on M, i.e., d²(P, P + ΔP) = g_P(A, B).
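As a side remark not spelled out above, Eq. (16) is a Lyapunov (Sylvester-type) equation whose solution K has a standard closed form once P is diagonalized. Writing P = UΛU^H with eigenvalues λ_1, ..., λ_M > 0, and setting K̂ = U^H K U and B̂ = U^H B U, Eq. (16) decouples entrywise:

  K̂_ij = B̂_ij / (λ_i + λ_j),   K = U K̂ U^H

Since P is positive definite, λ_i + λ_j > 0, so K always exists and is unique, and it is Hermitian whenever B is.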
To establish the global Riemannian distance between two points on the manifold M from g_P(A, B), we proceed as follows. The global geodesic distance between two points P_1 and P_2 on the Riemannian manifold (M, g) is defined as the shortest length of the curves connecting the two points, i.e.,

  d_G(P_1, P_2) ≜ min{ L(P) | P(θ): [θ_1, θ_2] → M }   (17)

where P(θ) is a piecewise smooth curve joining P_1 to P_2 such that P(θ_1) = P_1 and P(θ_2) = P_2, with θ being the parameter defining the curve. At each point on the curve P(θ), we can establish a co-ordinate system {x_1(θ), ..., x_{M_0}(θ)} of its tangent space, where M_0 is the total dimension defining the power spectral density matrix. Then the components of the Riemannian metric as given by Eq. (15) are:

  g_ij(P(θ)) = g_P(θ)(x_i(θ), x_j(θ)),  i, j = 1, ..., n   (18)

Hence, the length of a curve P(θ) from P(θ_1) to P(θ_2) can be written as

  L(P) = ∫_{θ_1}^{θ_2} ( Σ_{i,j=1}^{n} g_ij(P(θ)) ẋ_i(θ) ẋ_j(θ) )^{1/2} dθ   (19)

where ẋ_i(θ) ≜ d(x_i(θ))/dθ. The geodesic distance may then be obtained by minimizing L(P) with respect to P(θ), leading to a system of nonlinear ordinary differential equations which may be difficult to solve. To overcome this difficulty, we lift the curve P(θ): [θ_1, θ_2] → M horizontally to a curve P̃(θ): [θ_1, θ_2] → H with P̃(θ_1) = P̃_1 and P̃(θ_2) = P̃_2, such that P̃(θ) is uniquely determined. The length of P̃(θ) then depends only on the metric associated with the horizontal subspace H_P̃. Since T_P(θ) M and H_P̃(θ) are isomorphic, we must have L(P) = L(P̃). Thus, the minimum of L(P) can be achieved by finding the minimum of L(P̃). Now, the shortest curve connecting the two points P̃_1 and P̃_2 in a Hilbert space H is the straight line, i.e., min L(γ̃) = ||P̃_1 − P̃_2||; hence we have the following theorem:

Theorem 1: If V∗ is the solution of

  max ...