Constructing Robust Neural Decoders Using Limited Training Data


Shamim Nemati, Nicholas G. Hatsopoulos, Lee E. Miller, Andrew H. Fagg, Member, IEEE

Abstract

One of the essential components of a neuromotor prosthetic device is a neural decoder that translates the activity of a set of neurons into an estimate of the intended movement of the prosthetic limb. Wiener-filter-style approaches model this transformation as a linear function of the numbers of spikes observed from a set of neurons over a range of distinct time bins. More recently, researchers have employed recursive Bayesian estimation techniques, such as Kalman filters, and have reported substantially better performance than with the Wiener filter. It has been argued that this improvement in performance is due to the compact nature of these Bayesian models. Our results show that the poor performance of the Wiener filter is restricted to cases in which small training data sets are used, leading to substantial model overfitting. However, when training data sets are larger, we show that the Wiener filter is able to make appropriate use of the additional degrees of freedom to consistently outperform the Kalman filter. Finally, we suggest an alternative to the standard pseudo-inverse approach to solving for the Wiener filter parameters. The resulting algorithm almost always outperforms both of the previous approaches, independent of the data set size.

Index Terms

Brain-machine interface, Kalman filter, Wiener filter, ridge regression

S. Nemati is with the Department of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, Cambridge, MA 02139 (the work described in this document was performed while at the University of Oklahoma), [email protected]

N. G. Hatsopoulos is with the Department of Organismal Biology and Anatomy, University of Chicago, Chicago, IL 60637, [email protected]

L. E. Miller is with the Departments of Physiology and Biomedical Engineering, Northwestern University, Chicago, IL 60611, [email protected]

A. H. Fagg is with the School of Computer Science at the University of Oklahoma, Norman, OK 73072, [email protected]


I. INTRODUCTION

A neuromotor prosthesis capable of restoring partial motor control requires accurate and computationally efficient neural decoders to translate the activation level of a small subset of cells into a specification of motor commands for the prosthetic device. Early work demonstrated that neurons in the primary motor cortex are broadly tuned to the direction of hand movement [1], [2]. In subsequent work, a linear population vector model was employed to make off-line reconstructions of the velocity of the hand given the activity of a large number of sequentially recorded neurons [3], [4], [5]. Taylor et al. demonstrated that this approach could be used to reconstruct hand velocities from simultaneously recorded cells, which, in turn, could be used to control the motion of a robotic arm [6]. Introduction of a more general linear filtering framework, consisting of tapped delay lines, allowed for inclusion of a longer history of neuronal firing rates (typically on the order of one second) within the decoding process [7], [8], [9]. More involved decoding methodologies based on the Bayesian estimation framework were proposed by Brown et al. [10] and Wu et al. [11]. Under the Bayesian estimation framework, the goal is to estimate the state, x_i (in our case, position and velocity of the hand), of a dynamical system at a given time, i, by computing the posterior probability distribution of the state, conditioned on all available measurements z_i and the previous states of the hand, i.e., P(x_i | z_i, x_{1:i-1}). Thus, the fundamental premise of the Kalman filter (or other recursive Bayesian estimators such as particle filters) is that the conditional probability density function (pdf) of the state of a stochastic process contains all the necessary information to describe the evolution of the process. In particular, if we assume that all the system statistics are Gaussian, the process is completely described by its conditional mean and covariance. In a series of experiments, Wu et al. [11], [12], [13] made a comparative study of the relative performance of the Kalman filter versus the Wiener filter, and reported superior decoding performance by the Kalman filter. In this work, we will show that the Kalman filter only outperforms the Wiener filter under limited conditions. While the added structure of the Kalman filter benefits the decoding performance for very small training data sets, this structure imposes severe limitations on what can be learned from larger training sets. To support our assertions, we present the results of our studies on the relative performance of these algorithms on two different data sets from each of three different monkeys (a total of six data sets). We demonstrate that our results are in fact consistent with Wu et al. under the experimental conditions adopted by them (several minutes of training data). Next, we demonstrate the shortcomings of the Kalman filter when a larger training data set is utilized. Finally, we make recommendations on the design of robust Wiener and Kalman filters.


Section II gives an overview of the experimental paradigm adopted in this work, a summary of the theoretical foundations of the Wiener filter and the Kalman filter, notes on the numerical issues that might arise when utilizing these algorithms in practice, and recommendations on mitigating such problems. Section III describes the decoding performance of these algorithms on the various data sets, followed by a summary and a discussion of future work in Section IV.

II. METHODS

A. Statement of the Problem

We are seeking a decoder that maps the recent history of firing rates of some c neurons into a set of k external correlates of the neuronal activities (descriptors of arm movement). Such a decoder must be

computationally efficient for online implementation, and numerically stable and robust when only limited data are available for training purposes (small n). One reason for emphasizing the latter requirement is that from a typical recording session, only successful trials are selected for training purposes. In clinical applications, daily calibration sessions lasting more than 5-10 minutes might not be tolerated well. Thus, only a few minutes of useful training data might be available from which to construct a decoder. In this work, we will focus on two of the most widely used decoders, namely, the Kalman filter [11], [12], [13] and the Wiener filter [7], [8], [14], and we will make a comparative study of their relative performance as a function of training data set size. We show that special attention must be paid to theoretical and numerical issues that arise from employing an impoverished training data set.

B. Experimental Setup and Subjects

Three macaque monkeys (Macaca mulatta) were operantly trained to perform a random-target pursuit task by moving a cursor to targets using the arm contralateral to the implanted cortex. The monkey's arm rested on cushioned troughs secured to links of a two-joint robotic arm (KINARM system, Scott, 1999 [15]). The monkey's upper arm was abducted 90 degrees such that shoulder and elbow flexion and extension movements were made in the horizontal plane. The robotic arm was not powered in this experiment, but allowed the position of the monkey's arm to be measured. A cursor (coincident with the handle position of the robotic arm) and a sequence of targets were projected onto a horizontal screen immediately above the monkey's arm. At the beginning of a trial, a single target appeared at a random location in the workspace and the monkey was required to move to it. As soon as the cursor reached the target, the target disappeared and was replaced by a new one in a random location. After reaching the


seventh target, the monkey was rewarded with a drop of water. The monkeys typically executed 400 to 800 successful trials in the course of a 1 to 1.5 hour recording session. Silicon-based electrode arrays (Cyberkinetics Neurotechnology Systems, Inc., MA) composed of 100 electrodes (1.0 mm electrode length; 400 µm inter-electrode separation) were implanted in the arm area on the precentral gyrus of the primary motor cortex (MI) of each monkey. During a recording session, signals from up to 96 electrodes were amplified (gain, 5000) and digitized (14-bit) at 30 kHz per channel using a Cerebus acquisition system (Cyberkinetics Neurotechnology Systems, Inc., MA). Only waveforms (1.6 ms in duration, resulting in 48 time samples per waveform) that crossed a threshold were stored and spike-sorted using Offline Sorter (Plexon, Inc., Dallas, TX). Inter-spike interval histograms were computed to verify single-unit isolation by ensuring that less than 0.05% of waveforms possessed an inter-spike interval less than 1.6 ms. Signal-to-noise ratios were defined as the difference in mean peak-to-trough voltage of the waveforms divided by the mean (over all 48 time samples of the waveform) standard deviation of the waveforms. All isolated single units used in this study possessed signal-to-noise ratios of 4:1 or higher. Two data sets were analyzed from each of three animals, RJ, BO, and RS (see Table I). A data set is defined as all simultaneously recorded neural and kinematic data collected in one recording session. Each data set contained between 31 and 99 simultaneously recorded units from MI. The ensembles consisted of "randomly" selected units from MI except for a possible bias toward neurons with large cell bodies that would generate higher signal-to-noise ratios. All of the surgical and behavioral procedures were approved by the University of Chicago's IACUC and conform to the principles outlined in the Guide for the Care and Use of Laboratory Animals (NIH publication no. 86-23, revised 1985).

C. Notation

We use the following notation and terminology throughout this paper:

c = the number of recorded neurons used for prediction.

k = the number of simultaneously recorded external correlates of neural activities. Here, k = 6 (Cartesian position, velocity, and acceleration in the x and y directions).

∆t = the sampling period (also called the bin size). In this study we fix the bin size at 100 ms to be more consistent with the work of Kim et al. [16], who used a 100 ms bin size, and Wu et al. [13], who employed 50 ms and 75 ms bin sizes, depending on the task, and reported further improvement in decoding for slightly larger bin sizes.

b = the number of time bins.


L = the length of the observation window used at each step of prediction (also called the window size). This quantity is equal to the number of time bins multiplied by the sampling period: L = b ∆t. In the case of the Kalman filter, L is typically set to 100 ms (containing only a single bin), and for the Wiener filter L is equal to 1 s (containing 10 non-overlapping bins).

n = the number of samples (also the size of the training data set). For example, one minute of recording with ∆t = 100 ms results in n = 600 discrete measurements of the neural firing rates and the kinematic variables under study.

F = the (m × n) matrix consisting of n observations of the firing rates of the c neurons. Each observation vector (a column of F) describes the number of spikes that have occurred in each of the b time bins for each of the c neurons. For example, with c = 100 neurons and b = 10 bins per neuron, m = 1000. Note that in this case, nine out of ten bins are identical for each neuron from one observation to the next. This results in a nontrivial dependence between the columns of the F matrix.

f̄ = the (m × 1) vector containing the arithmetic mean of the rows of the F matrix.

Z = the mean-subtracted version of the F matrix: the ith column of Z is given by Z_i = F_i − f̄ (for i = 1, . . . , n).

S = the (k × n) matrix of the k simultaneously recorded predictable quantities; for example, the state of the hand (Cartesian position, velocity, and acceleration). Note that the predictable quantities are also discretized at time intervals of ∆t = 100 ms.

s̄ = the (k × 1) vector containing the arithmetic mean of the rows of the S matrix. s̄_k is the kth element of the vector.

X = the mean-subtracted version of the S matrix. The ith column of X is given by X_i = S_i − s̄ (for i = 1, . . . , n).

R = the (m × m) sample correlation matrix for the firing rate observation vectors: R = F F^T.

W = the (k × m) matrix of regression coefficients, linearly relating firing rates to hand kinematics.

λ = an eigenvalue of a correlation matrix (e.g., the input correlation matrix R). The largest and the smallest eigenvalues are denoted by λ_max and λ_min, respectively.

||·|| = the Euclidean 2-norm, unless otherwise stated.

E[·] = the expected value of a random variable.

δ(·) = the Kronecker delta; δ(0) = 1, and δ(y ≠ 0) = 0.

A = the (k × k) Kalman filter state transition matrix.


U = the (k × k) state noise covariance matrix.

H = the (m × k) Kalman filter observation matrix, relating the state of the hand to neural activities.

Q = the (m × m) observation noise covariance matrix.

G = the (k × m) matrix of estimated Kalman gains.

P = the (k × k) Kalman filter estimation error covariance matrix.

FVAF_k = the model performance index (the fraction of variance accounted for) for predicted quantity k, defined as

FVAF_k = 1 − [ Σ_{i=1}^{n} (s_{ki}^{act} − s_{ki}^{est})^2 ] / [ Σ_{i=1}^{n} (s_{ki}^{act} − s̄_k^{act})^2 ] ,   (1)

where s_{ki}^{act} is the actual value of the quantity to be predicted at the ith time step of prediction (i = 1 · · · n), s̄_k^{act} is its overall arithmetic mean, and s_{ki}^{est} is the predicted value at the corresponding step.

This measure is similar to the coefficient of determination, except that it requires a perfect match between the actual and predicted quantities (rather than just a perfect correlation) in order to achieve a maximum measure of one. In addition, when measuring the performance of a model with respect to an independent test set, FVAF_k can take on values less than zero. These cases occur when the model has inappropriately overfit the training data set. In practice, the FVAF measures from a set of N experiments generally were not distributed normally, precluding the use of a t-test for detecting significant differences in mean model performance. These situations were detected using a Shapiro-Wilk test [17]. When applicable, bootstrap sampling methods were used to estimate the sampling distribution: the shift method was used for paired tests, and randomization was performed for two-sample tests [18].
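As a concrete illustration (ours, not the authors' code), the FVAF measure of Eq. (1) for a single predicted kinematic variable can be computed as sketched below; the array names are assumptions made for the example.

import numpy as np

def fvaf(s_act, s_est):
    # Fraction of variance accounted for (Eq. 1): 1.0 for a perfect match;
    # values below zero indicate predictions worse than the test-set mean.
    s_act = np.asarray(s_act, dtype=float)
    s_est = np.asarray(s_est, dtype=float)
    sse = np.sum((s_act - s_est) ** 2)         # residual sum of squares
    sst = np.sum((s_act - s_act.mean()) ** 2)  # total sum of squares about the mean
    return 1.0 - sse / sst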

D. The Wiener Filter

The Wiener filter assumes that the predicted quantity at time i is linearly related to the entire past history of observations. In practice, we often assume that only the past one second of neuronal activity is relevant for a prediction at time i [8], [9], [14]. We are seeking a set of linear coefficients, denoted by W, such that:

X = W F .   (2)


However, since the actual relationship between F and X could be non-linear or noisy, WF can only produce an estimate of X, which we denote as X̂. In this case, our goal is to find a set of coefficients W such that the cost function associated with the squared norm of the error is minimized:

J_LS(W) = ||X̂ − X||^2 = ||W F − X||^2 .   (3)

The optimal set of parameters can be found by differentiating J_LS with respect to each of the parameters and setting each expression equal to zero. The solution is then given by [19]:

W = X F^T (F F^T)^{-1} = X F^T R^{-1} .   (4)

Given the matrix W and a new set of spike count observations (f_new), an estimate of the state of the hand (s_est) can be obtained. Note that in previous Wiener filter approaches, a "1" is typically appended to each observation vector, f, corresponding to the offset term of the linear function (e.g., [7], [8]). However, in practice this approach can lead to numerical issues. These issues are avoided by predicting the mean-subtracted quantity, X̂, and then adding the mean of the predicted quantity:

s_est = W f_new + s̄ .   (5)
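The following sketch (our illustration, not the authors' implementation) shows one way to assemble the lagged observation matrix F from binned spike counts and to apply Eqs. (4) and (5); the variable names and the helper build_lagged are assumptions made for the example.

import numpy as np

def build_lagged(spike_counts, b):
    # spike_counts: (c, T) array of spike counts per ∆t bin.
    # Returns F with m = c*b rows; column t stacks the most recent b bins of
    # every neuron, so F has T - b + 1 columns.
    c, T = spike_counts.shape
    cols = [spike_counts[:, t - b + 1:t + 1].ravel() for t in range(b - 1, T)]
    return np.column_stack(cols)

def fit_wiener_pinv(F, S):
    # F: (m, n) firing-rate observations; S: (k, n) kinematic variables aligned
    # with the columns of F (the kinematic sample at the end of each window).
    s_bar = S.mean(axis=1, keepdims=True)
    X = S - s_bar                       # mean-subtracted kinematics
    R = F @ F.T                         # (m, m) sample correlation matrix
    W = X @ F.T @ np.linalg.pinv(R)     # Eq. (4)
    return W, s_bar

def predict(W, s_bar, f_new):
    # Eq. (5): add the training mean back to the mean-subtracted prediction.
    return W @ f_new + s_bar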

Note that we are assuming that s̄ does not change across the training data and the testing data (i.e., mean stationarity is assumed).

E. Singularity Reduction and Regularization

The PINV solution to the above filtering problem breaks down if the sample correlation matrix R is ill-conditioned (singular) [19], which is almost always the case for small training data sets. This breakdown is due to the inverse operation that is involved in the solution provided by Eq. (4). In numerical least squares problems, this ill-conditioning often results in unstable solutions and overfitting. In fact, in many cases ill-conditioning and overfitting are not disjoint phenomena and are just indications of an impoverished training data set. A common measure of the stability of a matrix under the inverse operator is its condition number, which is the ratio of the largest to the smallest eigenvalue of the matrix:

χ(R) = λ_max / λ_min .   (6)
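For illustration (ours, not from the paper), the condition number in Eq. (6) can be checked directly before attempting the inversion in Eq. (4):

import numpy as np

def condition_number(R):
    # R is symmetric positive semi-definite, so its eigenvalues are real and
    # eigvalsh returns them in ascending order; a near-zero smallest eigenvalue
    # produces a very large ratio. np.linalg.cond(R) gives the same 2-norm value.
    eigenvalues = np.linalg.eigvalsh(R)
    return eigenvalues[-1] / eigenvalues[0]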

For an ill-conditioned matrix, the few smallest eigenvalues (including λ_min, which shows up in the denominator of Eq. (6)) are close to zero, and as a result, the condition number is large. Note that a large condition number essentially means there is very little variation in certain dimensions of the input space. As such, a training algorithm with a large number of model parameters will essentially memorize a mapping restricted to a limited subspace of the joint input/output space, resulting in overfitting of the training data. Because the algorithm is free to arbitrarily choose the magnitude of the linear model gains (the components of W), it is not uncommon for those corresponding to dimensions of little variation to be chosen with very high magnitude. When an independent sample (e.g., f_new) varies a small degree beyond what is observed in the training set, this can lead to large errors in the predicted quantity. One possible solution to the ill-conditioning problem is to modify the cost function so as to penalize these high-magnitude gain parameters. In some fields, this approach is also known as ridge regression, which essentially provides a way to explicitly address the bias-variance tradeoff [19], [20]. Within the context of ridge regression, the objective is to find a set of regression coefficients, W, that minimize the following cost function:

J_ridge(W) = ||W F − X||^2 + α ||W||^2 ,   (7)

where α is a regularization constant to be determined. Note that minimizing the first term in Eq. (7) will force the fitting of the training data (reducing bias on the training data, but possibly increasing the prediction variance on a testing data set), while minimizing the second term imposes smoothness on the solution (increasing the bias on the training data, which might result in reduced prediction variance on a testing data set). The modified solution resolves the ill-conditioning problem, but also addresses the overfitting issues associated with the PINV solution. Taking the derivative of the modified cost function in Eq. (7) with respect to W, setting it equal to zero, and solving for W results in:

W = X F^T (R + α I)^{-1} .   (8)
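A minimal sketch (ours, not the authors' code) of the regularized solution in Eq. (8); the regularization constant here follows the α = λ_max β / n rule introduced as Eq. (9) below, with the tuning factor β supplied by the caller (the experiments reported later use β = 2).

import numpy as np

def fit_wiener_sr_pinv(F, S, beta=2.0):
    # F: (m, n) firing-rate observations; S: (k, n) kinematic variables;
    # beta: tuning factor chosen by cross-validation.
    m, n = F.shape
    s_bar = S.mean(axis=1, keepdims=True)
    X = S - s_bar
    R = F @ F.T
    lam_max = np.linalg.eigvalsh(R)[-1]          # largest eigenvalue of R
    alpha = lam_max * beta / n                   # Eq. (9)
    # Eq. (8): solve (R + alpha*I) W^T = F X^T rather than forming an explicit inverse.
    W = np.linalg.solve(R + alpha * np.eye(m), F @ X.T).T
    return W, s_bar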

Here I is the (m × m) identity matrix with one on the main diagonal and zero elsewhere. We refer to this approach as the Singularity Robust Pseudo Inverse (SR-PINV) solution to the Wiener filter. There are a number of analytical and heuristic methods available in the literature for estimating the optimal value of α in specific contexts (e.g., [19], [16]). Kim et al. [16] adopted an iterative procedure, known as weight decay, to reduce the degrees of freedom of a linear model with a larger number of


model parameters and reported an improvement in performance. Here we devised an empirical formula for calculating the regularization constant, given by:

α = λ_max β / n ,   (9)

where n is introduced to divide out the effects of data set size on λ_max, and β is a tuning factor to be determined empirically using an N-fold cross-validation approach. In general, scaling λ_max by n results in an α that remains essentially steady across varying training data set sizes. This is because the individual eigenvalues of R increase linearly in magnitude as we utilize progressively larger training data sets. The resulting effect is a regularization scheme that diminishes in extent for larger data sets, where the associated matrices no longer suffer from the same numerical issues. In section III-C we study the effects of this regularization scheme as a function of training data set size.

F. The Kalman Filter

In this section, we will briefly summarize the basic tenets and assumptions of the Kalman filter, starting with a description of the filter and model parameter fitting. In what follows, x denotes a hidden state random variable. In addition, we define the observable random variable z to denote a noisy process which is linearly related to x via the observation model:

z_i = H x_i + q_i ,   (10)

where H is the (m × k) observation matrix and q_i ∼ N(0, Q), Q ∈ R^{m×m}. Note that the noise terms {q_i}_{i=0}^{n} are assumed to have zero mean and to be mutually uncorrelated:

E[q_j q_i^T] = Q δ(i − j) .   (11)

In the context of decoders for neuromotor prosthetic devices, we normally use x to represent the state of the hand and z to capture the corresponding neuronal firing patterns [11]. Thus, we take the ith column of the X and Z matrices as the ith realization of the random variables x and z (i.e., x_i and z_i), respectively. Also, note that at the ith time step, we are assuming that the firing rates z_i are a result of the particular state of the hand x_i, which is the converse of the modeling approach taken with the Wiener filter. Here, to ensure that the observation noise is in fact zero mean, the mean firing rates are subtracted from each column of F to produce a new data matrix, which we shall refer to as the Z matrix. Wu et al. take an additional preprocessing step by simply replacing individual firing rates by their square roots before


subtracting the mean from the firing rates. The square-root transform has the effect of making the firing rate’s pdf more Gaussian-like. However, in actuality, the transformed data still remains non-Gaussian, thus violating the Kalman filter normality assumption. In practice, the square-root transform does not result in a significant change in decoding performance. This has also been noted in the later reports by Wu et al. [13]. Within the Kalman filtering framework, it is assumed that the state of the system evolves according to a first order Markov chain model corrupted by a Gaussian noise process, that is:

x_{i+1} = A x_i + u_i ,   (12)

where A is the (k × k) state transition matrix and u_i ∼ N(0, U), U ∈ R^{k×k}, is a zero-mean, Gaussian, mutually uncorrelated noise process, having zero cross-correlation with the q_i sequence, i.e., E[u_j q_i^T] = 0 for all j and i. Here, the {x_i}_{i=0}^{n} are ensured to have zero mean by following the same procedure as in the case of the Wiener filter approach. The optimal Kalman estimator x̂_i at the ith time step is given by [21]:

x̂_i = x̂_{i|i−1} + G_i (z_i − H x̂_{i|i−1}) ,   (13)

where x̂_{i|i−1} is the model prediction, (z_i − H x̂_{i|i−1}) is called the innovation or measurement residual, and G_i is the Kalman gain given by (see appendix):

G_i = P_i H^T Q^{-1} .   (14)

Note that G_i and Q (the uncertainty associated with the neuronal firing rate observations) are inversely related. That is, when observations are not very reliable, Q is large and consequently G_i is small. As a result, a lower weight is assigned to the observations. See the appendix for details on our implementation of the Kalman filter.

III. RESULTS

Throughout this study, all the reported results are based on N-fold cross-validation (N = 20), in which N − 2 folds are used for training the model, one fold is used for validating the model (to find parameters),

and one fold is used for testing purposes (to assess model performance) [22], [23]. Since each experiment is repeated N times, a mean and standard deviation are associated with each experimental configuration (data set and experimental parameters). However, we only show the mean performance to enhance readability of the figures.
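As an illustration only, one way to realize this split is sketched below; the assignment of contiguous folds and the rotation of the validation and test folds are our assumptions, not a description of the authors' exact procedure.

import numpy as np

def fold_splits(n_samples, n_folds=20):
    # Partition the sample indices into n_folds contiguous folds; on each of the
    # n_folds rotations, one fold is held out for validation, the next for testing,
    # and the remaining n_folds - 2 folds are used for training.
    folds = np.array_split(np.arange(n_samples), n_folds)
    for i in range(n_folds):
        val, test = folds[i], folds[(i + 1) % n_folds]
        train = np.concatenate([folds[j] for j in range(n_folds)
                                if j != i and j != (i + 1) % n_folds])
        yield train, val, test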


In what follows, we only show the decoding results for the Cartesian x-position of the hand. Nevertheless, all the observations made here also hold for predicting the Cartesian y-position, as well as velocity in both the x and y directions. One reason for this choice of focus is to facilitate comparison of the presented results with those of other authors [11], [12], [13].

A. PINV Solution to the Wiener Filter

We computed the parameters of the Wiener filter using the pseudo-inverse approach described in Eq. (4). Here, we set the window size equal to one second (L = 1 second) with 10 non-overlapping time bins per neuron (∆t = 100 ms). Fig. 1 shows the mean decoding performance of each model as a function of available training time on six different data sets. For small training data sets (less than 3-6 minutes of recording), the FVAF performance measures exhibit a low mean (dropping below zero in some cases) and a high standard deviation. In contrast, the corresponding performance on the training data sets (not shown here) is close to the maximum performance of one, indicating severe overfitting of the training data by the PINV algorithm. As the amount of available training data is increased, performance increases, asymptoting at 15-20 minutes of training data. In these cases, the standard deviation of the performance metric is substantially lower than with the small training data sets. The asymptotic performance of the six data sets is determined in part by the number of neurons that have been recorded from: the RS data sets include the largest numbers of neurons, whereas the BO data sets include the smallest numbers (see Table I). The overfitting effect that is observed for the models with the smallest amount of training time also manifests itself in the condition number of the associated sample correlation matrices (Fig. 2). For smaller training data sets, the corresponding condition numbers are very large for at least four out of six data sets. Note that the two data sets with the smallest condition numbers (BO1 and BO2) also have the smallest number of recorded neurons, and thus the smallest number of model parameters. As the amount of training time increases, the condition numbers drop, asymptoting by 5-7 minutes of training data. Note also that the condition number for RS2 does not asymptote until approximately 11 minutes of training data are available. However, the model only catastrophically overfits the training data when less than five minutes of data are available. This demonstrates that the condition number only indicates the potential for overfitting, but does not absolutely predict it. In particular, when the distribution of the test data set is well represented by the training data set, performance on the test data set can be high regardless of the condition number.


B. Kalman Filter

As already discussed, a proper Kalman filter formulation requires that each prediction only be made as a function of firing rates within a single time bin and the previously estimated state of the system. This formulation is the same as the uniform lag Kalman filter adopted by Wu et al. [13] and Kim et al. [16] (we will refer to this as the proper Kalman filter). Here, consistent with the results reported by Wu et al. [13], we find that approximately 100 ms to 150 ms of lag results in the best prediction (neural observations precede hand movement by the amount of lag). Note that Wu et al. [13] also consider a non-uniform lag Kalman filter and report a slight improvement in performance. However, their final analysis of the relative performance of various algorithms is based on the uniform lag configuration. Fig. 3 shows the mean performance of the Kalman filter as a function of available training time. For all six data sets, the Kalman filter performs consistently regardless of the training set size (with only small reductions in performance for the smallest training set sizes). Comparing this figure with Fig. 1 reveals that the proper Kalman filter does better than the PINV solution to the Wiener filter for small data sets (1-5 minutes of training time). For instance, with approximately four minutes of training data, the Kalman filter significantly outperforms the PINV solution by an average FVAF of 0.03 (paired bootstrap test, p < 0.005). However, for larger data sets (greater than 6 minutes) the PINV solution outperforms the proper Kalman filter. At 20 minutes of training time, the PINV solution performs on average 0.2 better, which is significant according to a paired bootstrap test (p < 10^-5). The differences in performance as a function of available training time are due to the fact that the proper Kalman formulation has many fewer model parameters than the Wiener formulation. Fewer parameters means that it is harder to overfit the model with the small data sets. However, the smaller number of parameters also means that the expressive power of the Kalman filter is less than that of the Wiener filter. With the larger data sets, the PINV solution to the Wiener filter is able to use this additional expressive power to its advantage.

C. Singularity Robust Solution to the Wiener Filter

We have already demonstrated that with smaller training data sets, the PINV solution to the Wiener filter suffers from ill-conditioning of the input correlation matrix (R). This, in turn, results in overfitting of the training data. It is highly desirable to design a decoder that is robust for any reasonable training data set size. To address this objective, we apply the singularity reduction scheme described in section II-E to the input correlation matrix R. One issue that must be addressed is how to select the β parameter, which determines the balance between model bias and variance. For the RJ1 data set, we construct models


while varying both β and the size of the training set. For each combination of these parameters, we used 20-fold cross-validation to construct 20 different models; each model is evaluated using the validation data set. For each training set size, mean model performance is shown in Fig. 4 as a function of β . Note that setting β = 0 corresponds to using the PINV solution to the Wiener filter. In this case, performance is dramatically affected for the smaller training set sizes. As β is increased to 1-3, the performance for all training set sizes increases. However, further increases in β result in a reduction in performance. Note also that as one increases the number of data folds for training (and hence the amount of training data), the performance curve flattens, indicating a lower sensitivity to regularization. It is worth emphasizing that even though these performance curves correspond to decoding performance in the Cartesian x-position, the same conclusion holds for decoding other variables that describe hand movement. This is the case because the tuning factor β is only a function of the input correlation matrix, R, and not the output of the linear map. Based on the above results, we chose β = 2 for our subsequent experiments. Fig. 5 shows the mean condition number for the regularized input matrix (R + α I, where α is chosen according to Eq. (9)) for all six data sets as a function of the available training time. The figure provides empirical evidence for the effectiveness of the singularity reduction technique employed in this work. In particular, the mean condition numbers for small training set sizes are small relative to larger training set sizes. Also, the condition numbers for small training sets are substantially smaller than those of the PINV solution (compare with Fig. 2), but these differences fade as the training set size increases. These results indicate that the regularization term has a large effect on the conditioning of the matrix for small training sets, but that this effect is muted as the training set size increases. Fig. 6 shows the mean performance for each data set as a function of training set size. Similar to the Kalman filter, the approach performs robustly regardless of the amount of training data. However, with 4 minutes of training data, the SR-PINV Wiener filter outperforms the Kalman filter by an average of 0.17, which is significant according to a paired bootstrap test (p < 10−5 ). With 20 minutes of training

data, the SR-PINV approach outperforms PINV by 0.01 and outperforms the Kalman filter by an average of 0.21 (both differences are significant; p < 10^-5).

D. Modified Kalman Filter

A possible approach to addressing the limited power of the Kalman filter to capture complex models is to provide each step of the prediction with an entire one second of neural activity. In our approach, this corresponds to a 1 second window size (L = 1 second), consisting of 10 non-overlapping time bins


per neuron (∆t = 100 ms). However, with this structure, the observation noise covariance matrix (Q) is ill-conditioned for the small training data sets. Because Q is estimated from the same firing-rate data as R, the condition number for Q follows the same pattern as is observed for R (see Fig. 2). Following the procedure outlined in Sec. II-E, we add a constant term to the elements in the main diagonal of Q to obtain: Q_reg = Q + α I, where α is given by Eq. (9). A 20-fold cross-validation study yields β = 50 as the best parameter choice based on the validation data set performance. The resulting algorithm performs better than the proper Kalman filter design (L = 100 ms and ∆t = 100 ms), but the overall performance is lower than the SR-PINV solution to the Wiener filter. Moreover, we find that setting the state transition matrix (A) equal to zero results in an improvement in performance. In fact, this manipulation is equivalent to discarding the a priori estimate x̂_{i|i−1} and only relying on neuronal firing rates to make an estimate at the ith time step (thus reducing the complexity of the model). The resulting estimator is of the form x̂_i = G_i z_i, which is similar to the Wiener approach with the regression coefficients (W) replaced by the Kalman gain G_i. The performance of the resulting algorithm is significantly less than SR-PINV at 4 and 20 minutes of training data (p < 0.02); however, the mean difference in performance is not substantial (less than 0.007).

IV. DISCUSSION AND FUTURE WORK

The motivation behind this research is to ultimately predict (decode) intended hand movement in paralyzed patients from motor cortical activity and to utilize the decoded signals to control a prosthetic device (e.g., a computer cursor, a robotic hand, or a wheelchair). To that end, we studied two of the most widely used neural decoders, namely, the Wiener filter and the Kalman filter. We showed that when using the Wiener filter, special attention should be paid to the input correlation matrix to ensure stability of the solution across varying training data set sizes. In particular, small training set sizes can result in an ill-conditioned input correlation matrix, which in turn can lead to dramatic overfitting of the training data set. As a result, prediction performance on independent test data sets can suffer substantially. On the other hand, because the Kalman filter makes use of a much smaller parameter set, such overfitting is not observed in practice. We believe that it is these factors that explain the differences in prediction performance between the Kalman and the Wiener filters that have been observed by Wu et al., and that have led to their conclusion of the superiority of the Kalman filter over the Wiener filter [11], [12], [13]. In contrast, when the training data sets are large, the PINV solution to the Wiener filter does not suffer from the overfitting problems exhibited with the smaller data sets. Furthermore, by virtue of the larger number of parameters, the Wiener filter is capable of representing a richer class of decoders than the


Kalman filter. With the larger training data sets, the PINV solution is able to take advantage of this richness and substantially outperform the Kalman filter models on independent data sets. These results demonstrate the importance of taking steps to detect and avoid model overfitting. The problem of ill-conditioning in the input correlation matrix is an issue whether one is explicitly inverting this matrix to solve for model parameters directly (e.g., the PINV approach), or employing a gradient descent method to incrementally search for a set of parameters. We have shown that the condition number of the correlation matrix is a useful tool for anticipating the potential for overfitting. In addition, employing a regularization scheme that penalizes high magnitude model parameters can be used to address this ill-conditioning explicitly. In essence, the addition of such a penalty removes degrees of freedom from the model that do not substantially add to the model's ability to explain the training data. We have demonstrated empirically that by adding a penalty for high magnitude model coefficients, we can substantially improve the performance of the Wiener filter solution in the case of impoverished training data sets. We also obtained an empirical value for the tuning parameter β that performed well in an N-fold cross-validation sense for our data set. We showed that the resulting algorithm generally performs better than the Kalman filter with a 100 ms window size and may be more appropriate in BMI applications where little is known about the underlying distributions of the neural firing rates and the kinematic variables. We also demonstrated that when the Kalman filter is designed with a 1 s window size, the estimated observation noise covariance matrix can also benefit from regularization. Even though such a design violates the Kalman filter's observation noise independence assumption, the performance of the resulting algorithm is substantially higher than the 100 ms window size Kalman filter and is comparable to that of SR-PINV. As the amount of training data is increased, we have demonstrated an asymptotic improvement in model performance. This result is consistent with those reported by Serruya et al. [8] and Wu et al. [13]. However, while we observed the plateau effect for training data sets larger than 15 minutes, Serruya et al. made the observation for training data sets larger than 3-4 minutes. This is likely due to the smaller number of recorded neurons employed in their study (4-18 neurons). The decoders studied in this paper belong to the class of linear models. Recently, Kim et al. [24] reported an improvement in performance over the linear filters using a class of nonlinear models known as support vector machines (with polynomial kernels). Our preliminary results indicate that the added advantage of support vector regression is largely due to the regularization that takes place as a part of the algorithm, rather than the nonlinear transformation of the neural data. However, a comprehensive study of a wider class of learning kernels is required to make any definite statement regarding the


effectiveness of this class of nonlinear models (for instance, see Shpigelman et al. [25]). In addition, nonlinear models might prove more valuable as we allow for experiments under more natural conditions with less constrained movements. Our future work includes integrating information over multiple brain regions pertinent to planning, movement execution, and proprioceptive feedback, as well as the use of externally available information sources such as head-mounted cameras and oculomotor-derived signals. The use of external cues has already been suggested by Yu et al. [26] for the purpose of decoding movement end points, from which goal-directed movement trajectories can be constructed. The theoretical foundations of such goal-directed trajectory decoding models have previously been studied by Srinivasan et al. [27] and tested using simulated data. The feasibility of online implementation of such methods remains an open area of study.

APPENDIX A: KALMAN FILTER FORMULATION

A. Estimating the Kalman Model Parameters

One can utilize the least squares estimation technique to estimate the Kalman model parameters A, H, U, and Q [13]:

A = argmin_A Σ_{i=1}^{n−1} ||X_{i+1} − A X_i||^2 = X_2 X_1^T (X_1 X_1^T)^{-1} ,   (15)

H = argmin_H Σ_{i=1}^{n} ||Z_i − H X_i||^2 = Z X^T (X X^T)^{-1} ,   (16)

U = (X_2 − A X_1)(X_2 − A X_1)^T / (M − 1) ,   (17)

Q = (Z − H X)(Z − H X)^T / M ,   (18)

where X_1 and X_2 are copies of the X matrix with their last and first columns removed, respectively, and M is the number of training samples (here, M = n). Notice that we have dropped the subscripts from these parameters, thus assuming that the system statistics and dynamics remain unchanged over the time course of the experiment.

B. Optimal Kalman Estimator

The optimal Kalman estimator x̂_i at the ith time step is given by [21]:

x̂_i = x̂_{i|i−1} + G_i (z_i − H x̂_{i|i−1}) ,   (19)

where the a priori estimate x̂_{i|i−1} is calculated as:

x̂_{i|i−1} = A x̂_{i−1} .   (20)

Furthermore, the optimal blending factor G_i is:

G_i = P_{i|i−1} H^T (H P_{i|i−1} H^T + Q)^{-1} ,   (21)

where P_{i|i−1} is the error covariance matrix associated with the a priori estimate x̂_{i|i−1}:

P_{i|i−1} = E[ e_{i|i−1} (e_{i|i−1})^T ] = E[ (x_i − x̂_{i|i−1})(x_i − x̂_{i|i−1})^T ] = A P_{i−1} A^T + U ,   (22)

and the error covariance matrix associated with the updated (a posteriori) estimate is given by:

P_i = E[ e_i (e_i)^T ] = E[ (x_i − x̂_i)(x_i − x̂_i)^T ] = (I − G_i H) P_{i|i−1} .   (23)

Note that the bulk of the computation time is spent calculating the inverse in Eq. (21), which is proportional to the size of the observation noise covariance Q. In practice, for smaller data sets this matrix also suffers from ill-conditioning. Here, we address both the numerical and computational issues associated with this inversion by employing a mathematically equivalent form for Pi :

P_i = [ (P_{i|i−1})^{-1} + H^T Q^{-1} H ]^{-1} .   (24)

Similarly, one can derive the following equivalent form for the optimal blending factor:

G_i = P_i H^T Q^{-1} .   (25)

Note that under this new formulation, one needs to calculate the inverse of Q only once. In addition, in the case when Q is close to being singular, a regularization step similar to the procedure outlined in section II-E could be taken to address the ill-conditioning problem (see section III-D for more details). Together, equations (19), (20), (22), (24), and (25) constitute the complete solution to the estimation problem described by Eqs. (10) and (12). Pseudo code for our implementation of the Kalman filter algorithm is presented next.


C. Pseudo Code for the Kalman Filter

The optimal Kalman gain factor G_i and the error covariance matrix P_i are calculated offline and converge to fixed values within a few iterations:

1) Start by setting P_{1|0}^{-1} = 0 (infinite uncertainty).
2) Let i = 0.
3) While (||G_i − G_{i−1}|| > ε and ||P_i − P_{i−1}|| > ε):
   Set i = i + 1.
   Set P_i according to Eq. (24).
   Set G_i according to Eq. (25).
   Set P_{i+1|i} according to Eq. (22).
   End
4) Set G = G_i and P = P_i.

Here, ||·|| is the L2 norm and ε is a small number (10^-10). Given A, H, and G, the recursive estimation loop is as follows:

1) Start by setting x̂_{1|0} = 0.
2) Let i = 0.
3) While (a new observation z_i is available):
   Set i = i + 1.
   Calculate x̂_i according to Eq. (19).
   Calculate x̂_{i+1|i} according to Eq. (20).
   End
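For illustration, a compact NumPy sketch of the parameter fits in Eqs. (15)-(18) and of the two loops above is given below; it is our rendering, not the authors' code, and the function and variable names are assumptions. X is the (k × n) mean-subtracted kinematic matrix and Z is the (m × n) mean-subtracted firing-rate matrix.

import numpy as np

def fit_kalman_params(X, Z):
    # Least-squares estimates of A, H, U, and Q (Eqs. 15-18), taking M to be
    # the number of training samples n.
    X1, X2 = X[:, :-1], X[:, 1:]
    n = X.shape[1]
    A = X2 @ X1.T @ np.linalg.inv(X1 @ X1.T)
    H = Z @ X.T @ np.linalg.inv(X @ X.T)
    U = (X2 - A @ X1) @ (X2 - A @ X1).T / (n - 1)
    Q = (Z - H @ X) @ (Z - H @ X).T / n
    return A, H, U, Q

def converge_gain(A, H, U, Q, eps=1e-10, max_iter=1000):
    # Offline iteration of Eqs. (22), (24), and (25); for brevity the convergence
    # test is on G only. If Q is near-singular, Q + alpha*I can be used instead.
    k, m = A.shape[0], H.shape[0]
    Q_inv = np.linalg.inv(Q)                 # Q is inverted only once (Eq. 25)
    P_prior_inv = np.zeros((k, k))           # P_{1|0}^{-1} = 0: infinite uncertainty
    G_prev = np.full((k, m), np.inf)
    for _ in range(max_iter):
        P = np.linalg.inv(P_prior_inv + H.T @ Q_inv @ H)   # Eq. (24)
        G = P @ H.T @ Q_inv                                # Eq. (25)
        if np.linalg.norm(G - G_prev) < eps:
            break
        G_prev = G
        P_prior = A @ P @ A.T + U                          # Eq. (22)
        P_prior_inv = np.linalg.inv(P_prior)
    return G

def decode(A, H, G, Z):
    # Recursive estimation loop: Eqs. (19) and (20), using the converged gain G.
    x_prior = np.zeros(A.shape[0])
    estimates = []
    for z in Z.T:                                # one observation vector per time step
        x_hat = x_prior + G @ (z - H @ x_prior)  # Eq. (19)
        estimates.append(x_hat)
        x_prior = A @ x_hat                      # Eq. (20)
    return np.column_stack(estimates)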

D. Notes on the Recursive Properties of the Kalman Filter

As mentioned earlier, a proper Kalman filter formulation requires that each prediction only be made as a function of a single bin per neuron (L = 100 ms) and of the previously estimated state of the system. Thus, at the ith step of prediction, the observation vector z_i has c elements (one number per neuron). However, the Kalman filter recursively incorporates information from previous time steps to make a prediction at the current time step. To see this, let us rearrange the terms in Eq. (19) and replace x̂_{i|i−1} by A x̂_{i−1}, which results in [21], [13]:

x̂_i ≈ (A − G H A) x̂_{i−1} + G z_i = M x̂_{i−1} + G z_i   (26)
    = M^2 x̂_{i−2} + M G z_{i−1} + G z_i = . . . = M^{i−1} x̂_1 + Σ_{j=0}^{i−2} M^j G z_{i−j}   (27)
    ≈ Σ_{j=0}^{i−2} M^j G z_{i−j} ,   (28)

where M = (A − G H A) and Eq. (27) is obtained by recursively substituting for x̂_{i−1}, x̂_{i−2}, . . . , x̂_2 using Eq. (26). Furthermore, for moderately large i (e.g., i > 15), M^{i−1} is essentially zero; thus, we can safely make the approximation in Eq. (28). The reason for the validity of this approximation is that if the Kalman filter is stable, the eigenvalues of the M matrix (these are the roots of the characteristic polynomial for the Kalman filter) are all less than one in absolute value [21]. In fact, the closer to zero these eigenvalues are, the less temporal information is used in the prediction process. This is evident from Eq. (28) by observing that the older observations are necessarily weighted by progressively smaller weights, since ||M^0|| > ||M^1|| > . . . > ||M^{i−2}||, where ||·|| refers to the spectral norm, also known as the matrix norm [19]. In fact, in neural decoding applications, where the true time lag between the neuronal activities and the external correlates of these activities is not known a priori, the Kalman filter's observation weighting procedure is not guaranteed to be optimal. However, in the case of the Wiener filter with L = 1 second (i.e., 10 observation bins per neuron), the pseudo-inverse algorithm has the flexibility to learn the optimal lag from the training data and assign higher weights to the appropriate time bins.

ACKNOWLEDGMENT

We thank Matthew Fellows, Elise Gunderson, Zach Haga, Dawn Paulsen, Dennis Tkach, and Jake Reimer for their help with the surgical implantation of the arrays, training of monkeys, and data collection. Support was provided in part by the National Institutes of Health (Grants #RO1-NSO-48845 and #RO1NS0-45853) and by the University of Oklahoma.


REFERENCES

[1] W. T. Thach, "Correlation of neural discharge with pattern and force of muscular activity, joint position, and direction of intended next movement in motor cortex and cerebellum," Journal of Neurophysiology, vol. 41, pp. 654–676, 1978.
[2] A. P. Georgopoulos, J. F. Kalaska, R. Caminiti, and J. T. Massey, "On the relations between the direction of two-dimensional arm movements and cell discharge in primate motor cortex," Journal of Neuroscience, vol. 2, pp. 1527–1537, 1982.
[3] A. P. Georgopoulos, R. Caminiti, J. F. Kalaska, and J. T. Massey, "Spatial coding of movement: a hypothesis concerning the coding of movement direction by motor cortical populations," Experimental Brain Research, Suppl., vol. 7, pp. 327–336, 1983.
[4] A. P. Georgopoulos, A. B. Schwartz, and R. E. Kettner, "Neuronal population coding of movement direction," Science, vol. 233, pp. 1416–1419, September 1986.
[5] G. A. Reina, D. Moran, and A. Schwartz, "On the relationship between joint angular velocity and motor cortical discharge during reaching," Journal of Neurophysiology, vol. 85, pp. 2576–2589, 2001.
[6] D. M. Taylor, S. I. Tillery, and A. B. Schwartz, "Direct cortical control of 3D neuroprosthetic devices," Science, vol. 296, pp. 1829–1832, 2002.
[7] D. Warland, P. Reinagel, and M. Meister, "Decoding visual information from a population of retinal ganglion cells," Journal of Neurophysiology, vol. 78, no. 5, pp. 2336–2350, 1997.
[8] M. Serruya, N. Hatsopoulos, M. Fellows, L. Paninski, and J. Donoghue, "Robustness of neuroprosthetic decoding algorithms," Biological Cybernetics, vol. 88, no. 3, pp. 201–209, 2003.
[9] L. Paninski, M. Fellows, N. Hatsopoulos, and J. P. Donoghue, "Spatiotemporal tuning of motor cortical neurons for hand position and velocity," Journal of Neurophysiology, vol. 91, pp. 515–532, 2004.
[10] E. Brown, L. M. Frank, D. Tang, M. C. Quirk, and M. A. Wilson, "A statistical paradigm for neural spike train decoding applied to position prediction from ensemble firing patterns of rat hippocampal place cells," Journal of Neuroscience, vol. 18, pp. 7411–7425, 1998.
[11] W. Wu, M. Black, Y. Gao, E. Bienenstock, M. Serruya, and J. Donoghue, "Inferring hand motion from multi-cell recordings in motor cortex using a Kalman filter," SAB Workshop on Motor Control in Humans and Robots: On the Interplay of Real Brains and Artificial Devices, Edinburgh, Scotland (UK), pp. 66–73, 2002.
[12] W. Wu, M. Black, Y. Gao, E. Bienenstock, M. Serruya, A. Shaikhouni, and J. Donoghue, "Neural decoding of cursor motion using a Kalman filter," Advances in Neural Information Processing Systems 15, MIT Press, pp. 117–124, 2003.
[13] W. Wu, Y. Gao, E. Bienenstock, J. Donoghue, and M. Black, "Bayesian population decoding of motor cortical activity using a Kalman filter," Neural Computation, vol. 18, no. 1, pp. 80–118, 2006.
[14] A. H. Fagg, G. Ojakangas, L. Miller, and N. Hatsopoulos, "Kinetic trajectory decoding using motor cortical ensembles," 2007, submitted.
[15] S. H. Scott, "Apparatus for measuring and perturbing shoulder and elbow joint positions and torques during reaching," Journal of Neuroscience Methods, vol. 89, pp. 119–127, 1999.
[16] S. Kim, J. Sanchez, Y. Rao, D. Erdogmus, J. Principe, J. Carmena, M. Lebedev, and M. Nicolelis, "A comparison of optimal MIMO linear and nonlinear models for brain-machine interfaces," Journal of Neural Engineering, pp. 145–161, 2006.
[17] S. S. Shapiro and M. B. Wilk, "An analysis of variance test for normality (complete samples)," Biometrika, vol. 52, pp. 591–611, 1965.
[18] P. R. Cohen, Empirical Methods for Artificial Intelligence. Cambridge, MA: MIT Press, 1995.


[19] A. Björck, "Numerical methods for least squares problems," Society for Industrial and Applied Mathematics, p. 49, 1996.
[20] J. A. Suykens, T. V. Gestel, J. De Brabanter, B. De Moor, and J. Vandewalle, Least Squares Support Vector Machines. World Scientific, 2002.
[21] R. G. Brown and P. Y. C. Hwang, "The discrete Kalman filter," in Introduction to Random Signals and Applied Kalman Filtering. John Wiley and Sons Inc., Third Edition, 1997, pp. 214–276.
[22] M. Stone, "Cross-validatory choice and assessment of statistical predictions," Journal of the Royal Statistical Society, Series B (Methodological), vol. 36, pp. 111–147, 1974.
[23] M. W. Browne, "Cross-validation methods," Journal of Mathematical Psychology, vol. 44, pp. 108–132, 2002.
[24] K. H. Kim, S. S. Kim, and S. J. Kim, "Superiority of nonlinear mapping in decoding multiple single-unit neuronal spike trains: A simulation study," Journal of Neuroscience Methods, vol. 150, no. 2, pp. 202–211, 2006.
[25] L. Shpigelman, Y. Singer, R. Paz, and E. Vaadia, "Spikernels: Predicting arm movements by embedding population spike rate patterns in inner-product spaces," Neural Computation, vol. 17, no. 3, pp. 671–690, 2005.
[26] B. Yu, C. Kemere, G. Santhanam, A. Afshar, S. Ryu, T. Meng, M. Sahani, and K. Shenoy, "Mixture of trajectory models for neural decoding of goal-directed movements," Journal of Neurophysiology, vol. 97, pp. 3763–3780, February 2007.
[27] L. Srinivasan, U. T. Eden, A. S. Willsky, and E. N. Brown, "A state-space analysis for reconstruction of goal-directed movements using neural signals," Neural Computation, vol. 18, no. 10, pp. 2465–2494, 2006.


TABLE AND FIGURE CAPTIONS

Table I: Number of recorded neurons per data set.

Fig. 1: Performance of the PINV solution to the Wiener filter as a function of available training time on six different data sets. Each point is the mean test set performance (as measured by Fraction of Variance Accounted For) for 20 cross-validated experiments. Each curve summarizes the performance for one of the six data sets. FVAF values below zero indicate substantial model overfitting (and are not shown).

Fig. 2: Condition number of the R auto-correlation matrix as a function of available training time for six different data sets. Each point is the mean condition number across 20 cross-validated experiments.

Fig. 3: Performance of the Kalman filter (with 100 ms window size) as a function of available training time on six different data sets. Each point is the mean performance across 20 cross-validated experiments.

Fig. 4: Performance of the SR-PINV solution of the Wiener filter as a function of the regularization parameter, β, on the RJ1 data set. Each point is the mean performance across 20 cross-validated validation data sets. Each curve corresponds to a different choice of training set size, in which one fold is approximately one minute of training data.

Fig. 5: Condition number of the regularized auto-correlation matrix (R + αI) as a function of available training time for six different data sets.

Fig. 6: Performance of the SR-PINV solution to the Wiener filter as a function of available training time on six different data sets.


TABLE I

Data Set          RJ1   RJ2   BO1   BO2   RS1   RS2
Number of Cells    48    61    36    31    99    86


[Fig. 1: Fraction of Variance Accounted For vs. Average Available Training Time (min) for data sets RJ1, RJ2, BO1, BO2, RS1, and RS2.]

[Fig. 2: Condition number χ(R) (log scale) vs. Average Available Training Time (min) for the six data sets.]

[Fig. 3: Fraction of Variance Accounted For vs. Average Available Training Time (min) for the six data sets.]

[Fig. 4: Fraction of Variance Accounted For vs. Tuning factor β for 1, 3, 5, 10, and 18 training folds (RJ1 data set).]

[Fig. 5: Condition number χ(R) (log scale) vs. Average Available Training Time (min) for the six data sets.]

[Fig. 6: Fraction of Variance Accounted For vs. Average Available Training Time (min) for the six data sets.]
