32nd Annual International Conference of the IEEE EMBS Buenos Aires, Argentina, August 31 - September 4, 2010
Shrinkage Approach for EEG Covariance Matrix Estimation
Leandro Beltrachini, Nicolás von Ellenrieder, and Carlos H. Muravchik
Laboratorio de Electrónica Industrial, Control e Instrumentación, Facultad de Ingeniería, Universidad Nacional de La Plata, Argentina
Abstract— We present a shrinkage estimator for the EEG spatial covariance matrix of the background activity. We show that such an estimator has advantages over the maximum likelihood and sample covariance estimators when the amount of data available for the estimation is small. We find sufficient conditions for the consistency of shrinkage estimators, and results concerning their numerical stability. We compare several shrinkage schemes and show how to improve the estimator by incorporating known structure of the covariance matrix.
I. INTRODUCTION

Covariance matrix estimation plays a key role in many problems arising in different areas, such as economics, biology, genomics, or array signal processing. The electroencephalography (EEG) inverse problem consists in finding the neural activity sources given the electric potential measured at electrodes placed on the scalp. As this problem cannot be solved uniquely [1], numerical and statistical tools must be used to reach a solution under some optimality criterion. Most of these methods require knowledge of the covariance matrix, usually obtained via the maximum likelihood estimator (MLE) [2] or the sample covariance estimator, among other possible choices. When the number of trials is small with respect to the number of space-time entries of the measurements, these estimators may suffer from unacceptable numerical problems. To overcome these problems we propose the use of shrinkage estimators.

The shrinkage approach to covariance matrix estimation has been studied previously (e.g. [3]-[5]) in different situations. Its basic idea is to represent the covariance matrix as a convex combination of two covariance matrix estimators: one with less variance than bias and the other with less bias than variance. The estimation is completed by finding the optimum value of the coefficient of the convex combination. While the proposed estimator is easy to obtain, we want it to be both numerically and statistically well behaved, i.e. stable and consistent. We find sufficient conditions for consistency and a result regarding the numerical stability of shrinkage estimators.

In the next section we present the problem formulation and show how traditional estimators may fail when the number of trials is small. In section III we review shrinkage covariance matrix estimation and give conditions for it to be well behaved. In section IV we evaluate the performance
of several shrinkage schemes and show how the inclusion of known covariance matrix structure leads to better results. In the last section we present conclusions and future work.

II. PROBLEM FORMULATION

In this section we present a frequently adopted EEG signal model and the problems that the usual covariance matrix estimators exhibit under it. We consider, as in [2], that we measure different trials of the same brain response, contaminated with electronic noise and spatiotemporally correlated background activity. Let P be the number of sensors placed on the EEG cap, N the number of temporal samples measured in each trial, and K the number of trials. Let Y^k be the P × N matrix of the k-th trial measurements. We assume a matrix normal distribution for the measurements [2], independent between trials, with mean M associated with the signal term, temporal covariance Σ, and spatial covariance Ω, i.e. Y ∼ N_{P×N}(M, Σ, Ω), where we choose Ω = E{(Y − M)(Y − M)^T} and Σ = E{(Y − M)^T(Y − M)}/c, with c a constant that depends on Ω and ensures appropriate power normalization. Under these assumptions, the problem consists in finding well behaved estimators of Σ and Ω. Solving the EEG inverse problem once the covariance matrix is known can be done with different algorithms, but is not the aim of this work.

One option is to find the maximum likelihood estimator (MLE), as in [2]. The MLE of the spatial covariance matrix is given by

    \hat{\Omega}_{ML} = \frac{1}{NK} \sum_{k=1}^{K} (Y^k - \bar{Y}) \hat{\Sigma}_{ML}^{-1} (Y^k - \bar{Y})^T,     (1)

where \bar{Y} = \frac{1}{K}\sum_{k=1}^{K} Y^k is the sample mean (the MLE of M) and \hat{\Sigma}_{ML} is the MLE of the temporal covariance matrix. It can easily be shown that rank(\hat{\Omega}_{ML}) ≤ K min(P, N). It is then easily verified that \hat{\Omega}_{ML} has problems if K min(P, N) ≤ P. In a typical situation with P = 150 and N = 15, this implies that \hat{\Omega}_{ML} has problems if K ≤ 10, which is a common situation for certain experimental setups. In such cases the MLE is not an appropriate estimator.

Another option is the sample covariance estimator. For K trials, the biased sample spatial covariance matrix estimator is given by

    \hat{\Omega}_K = \frac{1}{K} \sum_{k=1}^{K} (Y^k - \bar{Y})(Y^k - \bar{Y})^T.     (2)

This work was supported by ANPCyT PICT 2007-00535, CONICET, CICpBA, and UNLP. e-mail: {lbeltra,ellenrie,carlosm}@ing.unlp.edu.ar

978-1-4244-4124-2/10/$25.00 ©2010 IEEE
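The rank argument above can be illustrated with a short numerical sketch (ours, not from the paper; white data are used only for simplicity). With P = 150, N = 15 and K = 5 trials, the estimator of eq. (2) is necessarily rank deficient:

```python
import numpy as np

rng = np.random.default_rng(0)
P, N, K = 150, 15, 5  # sensors, time samples, trials: K * min(P, N) < P

# K trials of zero-mean activity (white data, for illustration only)
Y = rng.standard_normal((K, P, N))
Ybar = Y.mean(axis=0)  # sample mean over trials

# biased sample spatial covariance estimator, eq. (2)
Omega_K = sum((Yk - Ybar) @ (Yk - Ybar).T for Yk in Y) / K

# the P x P estimate is built from at most K*N column vectors,
# so its rank cannot exceed K*N = 75 < P = 150: it is singular
print(np.linalg.matrix_rank(Omega_K), P)
```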
Again, it is easily shown that (2) has problems in the same situations as \hat{\Omega}_{ML}, so \hat{\Omega}_K is not a well behaved estimator either. Similar conclusions can be obtained for the covariance matrix estimator of the generalized least squares (GLS) approach [6].

Note that a similar analysis can be done for the temporal covariance matrix, but it does not lead to the same conclusions. Indeed, it can be shown that \hat{\Sigma}_{ML} and \hat{\Sigma}_K do not have numerical problems for a small number of trials. So, from now on we only deal with the problem of estimating Ω.

III. SHRINKAGE APPROACH

Shrinkage covariance estimation consists in finding a biased estimator that minimizes the mean square error (MSE) under a certain norm [4]. That is, we look for a structured estimator \hat{\Omega}_S of the real covariance matrix Ω that minimizes E{\|\hat{\Omega}_S - Ω\|^2} for some matrix norm. In the present work we always consider, without loss of generality, the Frobenius norm. In order to minimize the MSE, a convex combination of two covariance matrix estimators is proposed in [5]: the first, S, with less bias than norm variance, and the second, T, with less norm variance than bias,

    \hat{\Omega}_S = \alpha T + (1 - \alpha) S,     (3)
where α ∈ [0, 1] is called the shrinkage parameter. The next step is to find the shrinkage parameter that gives the minimum MSE. Using existing results [5], we find the optimal shrinkage parameter as

    \alpha_o = \frac{Var\{S\} - C\{S, T\} + tr\{E\{S - T\} B(S)\}}{E\{\|S - T\|_F^2\}},     (4)
where Var\{S\} = E\{\|S - E\{S\}\|_F^2\} is the norm variance of S under the Frobenius norm, C(S, T) = tr\{E\{(T - E\{T\})(S - E\{S\})^T\}\} is the covariance between S and T under the same norm, and B(S) = E\{S\} - Ω is the bias matrix of S.

We choose S as the unbiased sample estimator based on K trials, i.e. S = \frac{K}{K-1} \hat{\Omega}_K. All the results presented from now on are valid for this choice of S. First, since S is unbiased, the optimal shrinkage parameter reduces to

    \alpha_o = \frac{Var\{S\} - C\{S, T\}}{E\{\|S - T\|_F^2\}},     (5)
where the dependence on the trial number K is implicit. Next, we show that the MSE of the shrinkage estimator is indeed smaller than that of the sample covariance estimator, and that if the eighth central moment of the measurements is finite, the estimator is consistent.

Theorem 3.1: Let \hat{\Omega}_S be the shrinkage estimator defined in equations (3) and (5). Then, the relation

    E\{\|\hat{\Omega}_S - Ω\|_F^2\} ≤ E\{\|S - Ω\|_F^2\}     (6)

always holds. Moreover, if there exists a constant K_1 independent of K such that E\{(Y^k_{pn} - M_{pn})^8\} ≤ K_1 for all p = 1, ..., P and n = 1, ..., N, then \hat{\Omega}_S is consistent under the Frobenius norm.
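Before the proof, inequality (6) can be illustrated with a small Monte Carlo sketch (ours; the diagonal Ω, the dimensions, and the scheme 2 target are arbitrary choices, with a single time sample per trial to keep the normalization trivial):

```python
import numpy as np

rng = np.random.default_rng(1)
P, K, runs = 20, 4, 500
Omega = np.diag(np.linspace(1.0, 4.0, P))  # assumed "true" spatial covariance
L = np.sqrt(Omega)                         # matrix square root (Omega is diagonal)

alphas = np.linspace(0.0, 1.0, 21)
mse = np.zeros(len(alphas))
for _ in range(runs):
    y = L @ rng.standard_normal((P, K))        # K trials, one sample each (N = 1)
    ybar = y.mean(axis=1, keepdims=True)
    S = (y - ybar) @ (y - ybar).T / (K - 1)    # unbiased sample estimator
    T = (np.trace(S) / P) * np.eye(P)          # scheme 2 target (Ledoit & Wolf)
    for i, a in enumerate(alphas):
        mse[i] += np.linalg.norm(a * T + (1 - a) * S - Omega, 'fro') ** 2
mse /= runs

# mse[0] is the Monte Carlo MSE of S itself; some alpha > 0 does strictly
# better, in agreement with eq. (6)
print(alphas[mse.argmin()], mse.min() < mse[0])
```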
Proof: First, we have to prove that E\{\|\hat{\Omega}_S - Ω\|_F^2\} ≤ E\{\|S - Ω\|_F^2\}. Successively applying the Frobenius norm property \|A + B\|_F^2 = \|A\|_F^2 + \|B\|_F^2 + 2 tr\{A B^T\} for square matrices A and B [7], and using the previous definitions, we arrive at

    E\{\|\hat{\Omega}_S - Ω\|_F^2\} = E\{\|S - Ω\|_F^2\} - \alpha_o^2 E\{\|T - S\|_F^2\},

which is obviously less than or equal to E\{\|S - Ω\|_F^2\}. Also note that equality holds if \alpha_o = 0, and that larger \alpha_o values imply a larger improvement of the shrinkage estimator over the unbiased sample estimator.

Second, we have to prove that E\{\|\hat{\Omega}_S - Ω\|_F^2\} → 0 as K → ∞. From the first part of the theorem, it is sufficient to prove that, under the hypothesis, S is consistent under the Frobenius norm. We define X^k_{pn} = Y^k_{pn} - \bar{Y}_{pn} for notational convenience. Also, we define ω_{ij} and s_{ij}, i, j = 1, ..., P, as the elements of the matrices Ω and S, respectively. It is easily seen that s_{ij} = \frac{1}{K-1} \sum_{k=1}^{K} \sum_{n=1}^{N} X^k_{in} X^k_{jn}. Then,

    E\{\|S - Ω\|_F^2\} = \sum_{i,j=1}^{P} E\{(s_{ij} - ω_{ij})^2\}
                       = \sum_{i,j=1}^{P} E\left\{\left[\frac{1}{K-1} \sum_{k=1}^{K} \sum_{n=1}^{N} X^k_{in} X^k_{jn} - ω_{ij}\right]^2\right\}
                       = \frac{K}{(K-1)^2} \sum_{i,j=1}^{P} E\left\{\left(\sum_{n=1}^{N} X^k_{in} X^k_{jn}\right)^2\right\} - \frac{2K+1}{K(K-1)} \|Ω\|_F^2,

where we use the fact that different trials are independent. Using the inequality of arithmetic and geometric means and the Cauchy inequality, it is easy to prove that

    \sum_{i,j=1}^{P} E\left\{\left(\sum_{n=1}^{N} X^k_{in} X^k_{jn}\right)^2\right\} ≤ K_2 \sqrt{\sum_{i=1}^{P} \sum_{n=1}^{N} E\{(X^k_{in})^8\}} ≤ K_3,

where K_2 and K_3 are constants depending only on N, P and K_1, but not on K. Therefore,

    E\{\|S - Ω\|_F^2\} ≤ \frac{K}{(K-1)^2} K_3 - \frac{2K+1}{K(K-1)} \|Ω\|_F^2,
which tends to zero as K goes to infinity.

Although Theorem 3.1 states an important property of shrinkage estimators, it says nothing about how to improve the numerical stability of the covariance matrix estimator. The following result shows that a well behaved biased estimator T helps in that sense.

Theorem 3.2: Let λ_i(C) be the i-th eigenvalue of the matrix C, such that λ_m(C) = λ_1(C) ≤ λ_2(C) ≤ ... ≤ λ_P(C) = λ_M(C). Then, the eigenvalues of \hat{\Omega}_S are bounded by

    α λ_p(T) + (1 - α) λ_m(S) ≤ λ_p(\hat{\Omega}_S) ≤ α λ_p(T) + (1 - α) λ_M(S),     (7)

for p = 1, ..., P.

Proof: By the Weyl inequality [7] we know that if A and B are two P × P Hermitian matrices, then λ_p(A) + λ_m(B) ≤ λ_p(A + B) ≤ λ_p(A) + λ_M(B) holds for p = 1, ..., P. Replacing A = αT and B = (1 - α)S, and noting that A + B = \hat{\Omega}_S, the theorem is proved.
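Theorem 3.2 is easy to check numerically. The following sketch (ours, with an arbitrary rank-deficient S and a diagonal target T) verifies the bounds of eq. (7) for every p:

```python
import numpy as np

rng = np.random.default_rng(2)
P, alpha = 40, 0.3
A = rng.standard_normal((P, 5))
S = A @ A.T                             # rank-deficient sample-type matrix
T = np.diag(rng.uniform(1.0, 2.0, P))   # well-conditioned target
Omega_S = alpha * T + (1 - alpha) * S   # shrinkage estimator, eq. (3)

lam = np.linalg.eigvalsh(Omega_S)   # eigenvalues in ascending order
lam_T = np.linalg.eigvalsh(T)
lam_S = np.linalg.eigvalsh(S)

lower = alpha * lam_T + (1 - alpha) * lam_S[0]    # left side of eq. (7)
upper = alpha * lam_T + (1 - alpha) * lam_S[-1]   # right side of eq. (7)
assert np.all(lower - 1e-9 <= lam) and np.all(lam <= upper + 1e-9)

# in particular, the smallest eigenvalue is at least alpha * lam_T[0] > 0,
# so the shrinkage estimator is invertible even though S is singular
print(lam[0] > 0)
```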
We see then that, to obtain a well behaved shrinkage estimator, the matrix T should be well conditioned, according to Theorem 3.2. Also, it should have a structure similar to Ω if possible, since from Theorem 3.1 we see that as \alpha_o approaches 1, the shrinkage estimator improves over the sample covariance. As Var\{S\} is data dependent (i.e. it cannot be changed), we must make C(S, T) and E\{\|T - S\|_F^2\} as small as possible. This can be done by reducing the randomness of T and by making T as close to Ω as possible.

In some situations there is no knowledge regarding the structure of the covariance matrix, so T must be based on the sample covariance estimator S [4], [5]. Otherwise, if we have some a priori information regarding the structure, a better biased estimator can be obtained, as done in [3] for knowledge aided space-time adaptive processing (KASTAP) applications. In the next section some schemes are studied for EEG covariance matrix estimation.

IV. SIMULATIONS AND RESULTS

In this section we show the benefits of using the shrinkage covariance matrix estimator. We simulate EEG signals using a real shape head model with 120 electrodes, placed according to an extension of the 10-20 standard. We simulate a dipolar source on the cortex with amplitude function a(n) = 5 exp\{-(n - 7)^2 / 3^2\} - exp\{-(n - 12)^2 / 4^2\}, and we consider 20 time samples per trial. Also, we assume there is additive Gaussian white noise, corresponding to the electronic noise of sensors and amplifiers, and spatiotemporally correlated background activity, with the models given in [6] and [8].

The first task is to estimate Var\{S\} and C(S, T) from the measurements to find the shrinkage estimator. As in [5], we define the matrices W^k with elements w^k_{ij} = (Y^k_i - \bar{Y}_i)(Y^k_j - \bar{Y}_j)^T, where Y^k_i denotes the i-th row of Y^k, and \bar{W} = \frac{1}{K} \sum_{k=1}^{K} W^k. Then, it can be shown [5] that

    \widehat{Var}\{S\} = \frac{K}{(K-1)^3} \sum_{k=1}^{K} \|W^k - \bar{W}\|_F^2.     (8)
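Equation (8) translates directly into code. A sketch (our naming), following the definitions of W^k and W̄ above:

```python
import numpy as np

def var_hat_S(Y):
    """Plug-in estimate of Var{S} from eq. (8).

    Y is a (K, P, N) array holding K trials of P x N measurements.
    W^k = (Y^k - Ybar)(Y^k - Ybar)^T, so that S = (1/(K-1)) * sum_k W^k.
    """
    K = Y.shape[0]
    Ybar = Y.mean(axis=0)
    W = np.array([(Yk - Ybar) @ (Yk - Ybar).T for Yk in Y])
    Wbar = W.mean(axis=0)
    # K/(K-1)^3 * sum_k ||W^k - Wbar||_F^2
    return K / (K - 1) ** 3 * np.sum((W - Wbar) ** 2)
```

The plug-in estimate of eq. (5) then follows as (var_hat_S(Y) - Ĉ{S, T}) / ||S - T||_F^2 once Ĉ{S, T} is computed for the chosen scheme.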
To estimate C(S, T ) we must chose a specific matrix T for the estimator. In Table I we state the five schemes considered in this paper. The first three are general schemes provided by the bibliography, where no a priori knowledge is supposed. Therefore, they are based on the sample covariance matrix S. The last two are our proposed schemes for this particular problem, where some a priori information is used. The first one consists in choosing T = IP , where IP is the P × P identity matrix. As shown in [5], this is the well known Tikhonov regularization scheme. The second one is the scheme proposed by Ledoit and Wolf [4] where T = (tr{S}/P )IP . The third corresponds to the scheme proposed by Sch¨afer and Strimmer [5], where T = diag(S). The fourth and fifth schemes are based on the knowledge of the background activity spatial correlation structure. Let D be the matrix such that each element is dij = exp{−(eij /γ)}, where eij is the distance between electrodes i and j normalized by the head radius and γ = 3. It is clear that D is
TABLE I S CHEMES CONSIDERED No 1 2 3 4 5
Shrinkage scheme T = IP T = (tr{S}/P ) IP T = diag(S) T = σ2 IP + (tr{S}/P − σ2 )D T = diag(S) + κD
Observations Thikonov regularization Ledoit & Wolf [4] Sch¨afer & Strimmer [5] Ours (1) Ours (2)
only a simplification of the spatial covariance model stated in [6]. Scheme four is based on the fact that the electronic power noise σ 2 is known. Then, a reasonable choice for the bias matrix is T = σ 2 IP + (tr{S}/P − σ 2 )D. In the last scheme we propose a linear combination between the scheme 3 and matrix D such that the higher the background activity is, the larger the weight is given to D. Then, we set T = diag(S) + κD, where κ is the mean value of the off-diagonal terms of a matrix Sn such that S = Sn ⊙ D, where ⊙ denotes the element to element or Hadamard matrix multiplication. Scheme 1 has a deterministic matrix T so C(S, T ) = 0. For the remaining schemes, the computation of C(S, T ) needed to obtain the value of the convex coefficient αo is straightforward. In fact, we derived efficient formulae to compute C(S, T ) for each of the above schemes, which we omit here for brevity. We analyze the results for different background activity to electronic noise power ratio (BNR); 0dB and 10dB. In both cases, we consider the signal to electronic noise power ratio fixed at 10dB. Also, to improve the reliability of the analysis, the reported results are an average of 30 independent simulations. In Figure 1 we show the condition number (cn) of the presented schemes as a function of the number of trials considered, for BN R = 10dB. The results do not vary too much for BER = 0dB. It is evident that only schemes 2, 3 and 5 have an acceptable cn value, with scheme 5, based on the true covariance structure the best performance. Also, note that the sample covariance matrix has an unacceptable cn value for K ≤ P/N = 6, as predicted in section II. In Figures 2 and 3 we show the distance (in Frobenius norm) between the shrinkage estimators and the real covariance matrix, as a function of the number of trials, for BN R=0 and BN R=10dB, respectively. 
The true covariance matrix was obtained as the unbiased sample covariance estimator with 6000 trials, which is consistent, as shown in section III. We see that scheme 5, proposed in this paper and based on the known structure of the covariance, again has better performance than the rest. Also, we note that all schemes are closer to the real covariance matrix than the sample covariance estimator, as predicted by Theorem 3.1. It is important to note that the schemes we propose are based on a certain prior knowledge of the covariance matrix structure, but the results are quite robust, since changing the matrix D by adopting γ values between 1 and 6 produces similar results.
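The five targets of Table I can be sketched as follows (our code, not the authors'; `D` and `sigma2` follow the definitions in the text, and for scheme 5 we recover S_n = S / D elementwise, which is valid because d_{ij} = exp(-e_{ij}/γ) is never zero):

```python
import numpy as np

def shrink_target(S, scheme, D=None, sigma2=None):
    """Bias targets T of Table I; D and sigma2 are needed only for schemes 4-5."""
    P = S.shape[0]
    if scheme == 1:
        return np.eye(P)                              # Tikhonov regularization
    if scheme == 2:
        return (np.trace(S) / P) * np.eye(P)          # Ledoit & Wolf
    if scheme == 3:
        return np.diag(np.diag(S))                    # Schafer & Strimmer
    if scheme == 4:                                   # known electronic noise power
        return sigma2 * np.eye(P) + (np.trace(S) / P - sigma2) * D
    if scheme == 5:                                   # diag(S) + kappa * D
        Sn = S / D                                    # S = Sn (Hadamard product) D
        kappa = Sn[~np.eye(P, dtype=bool)].mean()     # mean off-diagonal term of Sn
        return np.diag(np.diag(S)) + kappa * D
    raise ValueError("scheme must be in 1..5")
```

The shrinkage estimate is then formed as αT + (1 - α)S, following eq. (3).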
Fig. 1. Condition number variation (log10(cn)) versus number of trials (K) for schemes 1-5 and the sample covariance estimator, BNR = 10 dB.

Fig. 3. Distance (under Frobenius norm) from the shrinkage estimators to the true covariance matrix versus the number of trials (K), BNR = 10 dB. The results correspond to the different schemes listed in Table I.
Fig. 2. Distance (under Frobenius norm) from the shrinkage estimators to the true covariance matrix versus the number of trials (K), BNR = 0 dB. The results correspond to the different schemes listed in Table I.

V. CONCLUSIONS AND FUTURE WORK

In this paper we presented the shrinkage approach for covariance matrix estimation in the EEG problem. After showing the problems of classical estimators, such as the MLE or the sample covariance matrix, we showed some of the benefits of the proposed estimator, such as a lower mean square error. We established properties leading to a well behaved estimator and proposed particular schemes for the shrinkage estimator, considering simulated EEG signals over a real head shape model contaminated with electronic noise and spatiotemporally correlated background activity. As expected, the schemes built using the known structure of the covariance matrix perform better than those based only on the sample covariance matrix, especially scheme 5. This result holds even if the true structure of the covariance matrix is not exactly the same as the assumed one. A performance analysis with real EEG measurements is the necessary next step to validate our results. As future work we plan to adapt the shrinkage estimator to also estimate the temporal covariance matrix. The method could also be adapted for use in the GLS formulation.

REFERENCES
[1] S. Baillet, J.C. Mosher and R.M. Leahy, "Electromagnetic Brain Mapping," IEEE Signal Processing Magazine, Nov. 2001, pp. 14-30.
[2] J.C. de Munck, H.M. Huizenga, L.J. Waldorp and R.M. Heethaar, "Estimating Stationary Dipoles From MEG/EEG Data Contaminated With Spatially and Temporally Correlated Background Noise," IEEE Trans. on Signal Process., vol. 50, 2002, pp. 1565-1572.
[3] P. Stoica, J. Li, X. Zhu and J.R. Guerci, "On Using a priori Knowledge in Space-Time Adaptive Processing," IEEE Trans. on Signal Process., vol. 56, 2008, pp. 2598-2602.
[4] O. Ledoit and M. Wolf, "A well-conditioned estimator for large-dimensional covariance matrices," J. Multivar. Anal., vol. 88, 2004, pp. 365-411.
[5] J. Schäfer and K. Strimmer, "A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics," Stat. Appl. Genet. Molec. Biol., vol. 4, 2005.
[6] H.M. Huizenga, J.C. de Munck, L.J. Waldorp and R.P.P.P. Grasman, "Spatiotemporal EEG/MEG Source Analysis Based on a Parametric Noise Covariance Model," IEEE Trans. on Biomed. Eng., vol. 49, 2002, pp. 533-539.
[7] R.A. Horn and C.R. Johnson, Matrix Analysis, Cambridge, U.K.: Cambridge Univ. Press, 1985.
[8] F. Bijma, J.C. de Munck, H.M. Huizenga, and R.M. Heethaar, "A Mathematical Approach to the Temporal Stationarity of Background Noise in MEG/EEG Measurements," Neuroimage, vol. 20, 2003, pp. 233-243.