Relaxation-Based Feature Selection for Single-Trial ... - IEEE Xplore

5 downloads 0 Views 516KB Size Report
Zhisong Wang, Member, IEEE, Alexander Maier, Nikos K. Logothetis, and Hualou ... Z. Wang was with the School of Health Information Sciences, University of.
IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 56, NO. 1, JANUARY 2009

101

Relaxation-Based Feature Selection for Single-Trial Decoding of Bistable Perception Zhisong Wang, Member, IEEE, Alexander Maier, Nikos K. Logothetis, and Hualou Liang∗ , Senior Member, IEEE

Abstract—Bistable perception refers to the phenomenon of spontaneously alternating percepts while viewing the same stimulus continuously. Bistable stimuli allow dissociation between stimuli and perception, and thus, provide a unique opportunity for understanding the neural basis of visual perception. In this paper, we focus on a relaxation (RELAX) based algorithm to select features from the multitaper spectral estimates of the multichannel intracortical local field potential (LFP), simultaneously collected from the middle temporal visual cortex of a macaque monkey, for decoding its bistable structure-from-motion (SFM) perception. We demonstrate that RELAX surpasses the conventional sequential forward selection (SFS) by offering the flexibility of modifying selected features. We propose a redundancy reduction preprocessing technique to significantly reduce the computational load for both SFS and RELAX. We exploit the support vector machines classifier based on the selected features for single-trial decoding the reported perception. Our results demonstrate the excellent performance of the RELAX feature selection algorithm. Furthermore, we find that the features in the gamma frequency band (30–100 Hz) of LFP are most relevant to bistable SFM perception. This finding is novel in awake monkey studies and suggests that gamma oscillations carry the most discriminative information for bistable perception of SFM stimuli. Index Terms—Bistable stimuli, feature selection, local field potential (LFP), middle temporal (MT), multitaper spectral analysis, perception, redundancy, relaxation (RELAX), relevance, sequential forward selection (SFS), single-trial decoding, structure-frommotion (SFM), support vector machines (SVMs).

I. INTRODUCTION HE STUDY of the neuronal correlates of perceptual reports elicited by ambiguous visual stimuli is promising for understanding the neural basis of visual perception [1]. Recently, such investigations have been conducted in humans using single-trial EEG analysis [2], [3]. Considerable efforts have also been made to link perception to neural activity in the middle

T

Manuscript received January 25, 2008; revised April 27, 2008. First published August 12, 2008; current version published February 13, 2009. This work was supported in part by the National Institutes of Health (NIH) under Grant R01 MH072034 and in part by the Max Planck Society. Asterisk indicates corresponding author. Z. Wang was with the School of Health Information Sciences, University of Texas Health Science Center at Houston, Houston, TX 77030 USA. He is now with the Microsoft Corporation, Bellevue, WA 98007 USA. A. Maier is with the Unit on Cognitive Neurophysiology and Imaging, National Institute of Mental Health, Bethesda, MD 20892 USA (e-mail: [email protected]). N. K. Logothetis is with the Max Planck Institute for Biological Cybernetics, T¨ubingen 72076, Germany (e-mail: [email protected]). ∗ H. Liang was with the School of Health Information Sciences, University of Texas Health Science Center at Houston, Houston, TX 77030 USA. He is now with the School of Biomedical Engineering, Drexel University, Philadelphia, PA 19104 USA (e-mail: [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TBME.2008.2003260

temporal (MT) visual area of macaque monkeys using intracortical recordings [4]–[8]. Although the neural signal used in most monkey studies [4], [5], [8] was spiking activity, i.e., the simultaneous activity of spiking neurons, recently local field potentials (LFPs) have also been used to predict these perceptual judgments [6], [7]. Compared with spiking activity, LFPs, which predominantly reflect the synaptic activities of local populations of neurons [9], are easier to collect and can maintain a more stable recording over very long periods of time, and thus, hold the potential for the development of reliable neural prostheses. In addition, much evidence [10]–[13] has shown that the temporal structure of the LFP is modulated by various cognitive processes and may therefore reflect neural processing at a population level, suggesting that LFP provides additional information to spiking activity. As such, the investigation of correlations between perceptual reports and LFP during physically identical but perceptually different conditions may shed new lights on the mechanism of neural information processing and the neural basis of visual perception. The temporal structure of the LFP can be characterized by spectral analysis. Given the nonstationary and time-varying nature of a typical single-trial LFP time series, it is generally preferred to convert the LFP time series into a time–frequency spectral representation (i.e., spectrogram). This, in turn, has given rise to signal processing challenges. The availability of simultaneous multichannel recordings further complicates data analysis and interpretation. To exploit the information from the space, time and frequency domains, there is a need to construct an LFP feature set that is of high dimensionality. Yet, the number of instances for each feature is often small. This can lead to the so-called curse of dimensionality [14]. In addition, there may be many irrelevant or redundant features in a typical LFP feature set. The irrelevant features seldom show the correlation between brain activity and a given percept, and hence, give very little discriminative information, while the redundant features give no additional information to other features. By eliminating both irrelevant and redundant features, more accurate classification and faster computation can be attained [15], [16]. There are two major dimensionality reduction processes, namely feature selection and feature extraction. Feature selection is the process of selecting a subset of features from a given feature set, whereas feature extraction is a process of generating new features by transforming or combining features in a given feature set. The difference between feature selection and feature extraction is that the former does not create new features but only eliminates features. Hence, its resulting features preserve the physical meaning of the features in the given feature

0018-9294/$25.00 © 2009 IEEE

102

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 56, NO. 1, JANUARY 2009

set. Consequently, feature selection has been widely used in diverse fields ranging from bioinformatics, statistics, pattern recognition, machine learning, data mining, text processing, Web applications, speech processing to computer security [17]–[22]. Feature selection algorithms are commonly divided into three categories: wrappers, filters, and hybridbased methods [22]. The wrapper-based approaches use the performance of a predetermined classifier to evaluate the feature subset [23]. The filter-based approaches do not require any classifiers but use independent measures to evaluate a subset of features [16], [24]. The hybrid-based approaches use both independent measures and classifiers for feature subset evaluation [25], [26]. Compared with wrapper-based and hybrid-based methods, the filter-based approaches have the advantages of being not only computationally efficient but also able to yield classifier-independent feature subsets. In this paper, we focus on filter-based algorithms to select features from the LFP collected in the MT visual area of a macaque monkey for decoding its perception of bistable structure-frommotion (SFM) on a single-trial basis. Because of the nonstationarity and variability of the LFP signals, we perform multitaper spectral analysis to obtain the spectral estimates of LFP with a moving window approach to achieve a good tradeoff between spectral resolution (bias) and stability (variance). We propose a relaxation (RELAX) based feature selection algorithm that surpasses the conventional sequential forward selection (SFS) algorithm by offering the flexibility of both adding new features and altering already selected features. We also propose a redundancy reduction preprocessing (RRP) technique to significantly reduce the computational load of both the SFS and RELAX feature selection. We exploit the support vector machines (SVMs) classifier based on the selected features to decode the reported perception. Our results demonstrate the excellent performance of the RELAX feature selection algorithm and suggest that the features in the gamma frequency band (30–100 Hz) carry the most discriminative information for bistable perception. The major contributions of the paper are twofold. First, we present a RELAX-based algorithm to select features from the multitaper spectral estimate of LFPs. The novelty lies in the use of RELAX to select the features from a high-dimensional feature set, yet with significant reduced computational load. Second, we find that the gamma oscillations are most relevant to bistable perception of SFM. This is a novel finding that is previously unknown in awake monkey studies, and has direct implications that synchronized gamma oscillations carry the most discriminative information for bistable perception of SFM stimulus. The rest of the paper is organized as follows. In Section II, we first present the experimental paradigm and then introduce the multitaper spectral analysis, the filter-based feature selection, and the SVM classifier. In Section III, we explore the application of the feature selection algorithms for decoding the bistable SFM perception. In Section IV, we present the discussions on the contributions of the paper, the criterion function and computational complexity of the RELAX algorithm, and another well-known classifier: the linear programming machine (LPM). Finally, Section V contains the conclusions.

II. MATERIALS AND METHODS A. Subjects and Neurophysiological Recordings Electrophysiological recordings were performed in a healthy adult male rhesus monkey. After behavioral training was complete, an occipital recording chamber was implanted and a craniotomy was made. Intracortical recordings were conducted with a multielectrode array while the monkey was viewing SFM stimuli, which consisted of an orthographic projection of a transparent sphere that was covered with randomly distributed dots on its entire surface. SFM is the perception of 3-D shape from motion cues. When in motion, this stimulus gives the striking percept of a 3-D sphere spinning in one of the two possible directions: clockwise or counterclockwise. Trials initiated with a short period of fixation (300–500 ms), followed by the presentation of the SFM stimulus for 2000– 3000 ms with an average interstimulus interval of 2000 ms. The stimulus rotated for the entire period of presentation, giving the appearance of a 3-D structure. The monkey was well trained and required to indicate the choice of rotation direction (clockwise or counterclockwise) by pushing one of two levers. The stimuli in the task were randomly intermixed with both the disparitybiased trials and ambiguous (bistable) trials. Correct responses for disparity-defined stimuli were acknowledged with application of a fluid reward. In the case of fully ambiguous (bistable) stimuli, where the stimuli can be perceived in one of two possible ways, the monkey was rewarded by chance. Only trials corresponding to bistable stimuli are analyzed in the paper. The recording site was the MT area of the monkey’s visual cortex, which is commonly associated with visual motion processing. Multielectrode multiunit and LFP recordings (nine electrodes) were collected in MT. Only LFP was used in this study to demonstrate the proposed RELAX feature selection method. LFP was sampled at 200 Hz.

B. Multitaper Spectral Analysis The multitaper method is employed to perform single-trial spectral estimation due to its excellent performance [27]–[29]. In contrast to the unitaper method [30], in which a single taper (window) such as boxcar or Hanning window is applied to the data, the multitaper method involves the use of multiple data windows for spectral estimation. The optimal family of windows are the Slepian functions or discrete prolate spheroidal sequences, which form a set of orthogonal functions [31]. The Slepian functions are characterized by their half-bandwidth W in frequency and their length L in time. For a given W and L, there are 2LW − 1 tapers that are well concentrated within the frequency band [−W, W ] and suitable for use in spectral estimation. It is suggested to use M = 2LW − 3 to M = 2LW − 1 tapers in [27]. The procedure to compute the spectral estimate is as follows. 1) Computing the Slepian functions

wm (t),

t = 1, . . . , L,

m = 1, . . . , M.

(1)

WANG et al.: RELAXATION-BASED FEATURE SELECTION FOR SINGLE-TRIAL DECODING OF BISTABLE PERCEPTION

These are parameterized by their length in time L and bandwidth in frequency W , and M tapers. The routine called dpss is available in MATLAB. 2) Computing the tapered Fourier transforms of the length-L data x(t) for each taper wm (t), m = 1, . . . , M x ˜m (f ) =

L 

x(t)wm (t) e−j 2π f t .

(2)

t=1

3) Computing the multitaper spectral estimate Sˆx (f ) that is obtained as the average over individual windowed spectral estimates M 1  Sˆx (f ) = |˜ xm (f )|2 . M m =1

(3)

The power spectrum can be computed with a moving window to obtain a spectrogram that provides a time–frequency representation of the LFP data. C. Feature Selection In this section, we focus on both the efficiency and effectiveness of feature selection. First, we choose the filter-based method instead of the wrapper-based and hybrid-based methods due to the properties of computational efficiency and generalization (classifier-independence) of the filter-based method. Second, we propose in this paper a RELAX-based approach to improve the effectiveness of feature selection. Third, we propose an RRP technique to significantly reduce the computational load for the SFS and RELAX feature selection. 1) Criterion Function: We employ classifier-independent measures called relevance and redundancy to construct the criterion function for feature selection [16], [21]. For a given feature subset, relevance measures correlations between instances of features with class labels, whereas redundancy measures correlations between instances of all feature pairs. An ideal feature subset should have high relevance and low redundancy. Assume there are K features in a feature subset. The relevance of the feature subset is defined as relevance =

K 1  rk K

(4)

k =1

where rk denotes the correlation coefficient between the kth feature and the class label. And the redundancy of the feature subset is defined as redundancy =

K K 1  ri,j K 2 i=1 j =1

(5)

where ri,j denotes the correlation coefficient between the ith feature and the jth feature. For feature selection, we maximize the following criterion function to strike a balance between relevance and redundancy criterion function = relevance − redundancy.

(6)

The optimal feature set can be found by employing an exhaustive search, but it is too computationally expensive for most

103

applications. In this paper, we consider more practically relevant sequential suboptimal methods for feature selection. 2) SFS and RELAX-Based Feature Selection: The sequential search can be done by forward selection or backward elimination [20]. In forward selection, features are progressively added into the selected feature subset, while in backward elimination, features are progressively eliminated from the original feature set. A well-known and commonly used method for selecting features from a high-dimensional feature set is SFS, which chooses the features incrementally by adding one feature at each step as follows. Assume the number of features to be selected is J. In practice, J is often set by trials and errors if no assumptions of the data distributions are used [21]. The SFS Feature Selection Algorithm Step 1: Assume K = 1. Choose the feature that has highest relevance among all features in the feature subset. Since only one feature is chosen, no redundancy can be computed. Step 2: Assume K = 2. Choose the feature from the remainder of the feature pool so that its combination with feature 1 yields the maximum criterion function. Step 3: Assume K = 3. Choose the feature from the remainder of the feature pool so that its combination with feature 1–2 yields the maximum criterion function. [Remaining steps]: Continue similarly until K is equal to J. The shortcoming of SFS is that there is no way to change the previously selected features in the latter steps. In this paper, we propose a RELAX-based feature selection algorithm that offers the flexibility of both adding new features and correcting already selected features. In our previous paper [8], we employed RELAX for weighted combination of signals from all the available channels to optimize the area under the receiver operating characteristic (ROC) curve. To the best of our knowledge, the method described here is the first time that RELAX has been used for feature selection. The RELAX feature selection algorithm consists of the following steps. The RELAX Feature Selection Algorithm Step 1: Assume K = 1. Choose the feature with the highest relevance among all features in the feature subset. Since only one feature is chosen, no redundancy can be computed. Step 2: Assume K = 2. Choose the feature from the remainder of the feature pool so that its combination with feature 1 yields the maximum criterion function. Save the maximum criterion function. Take the feature out from the feature pool and name it as feature 2. Next, we do the following K substeps. Choose the feature from the remainder of the feature pool so that its combination with feature 2 yields the maximum criterion function. If the new maximum criterion function is higher than the saved maximum criterion function, update the maximum criterion function and swap feature 1 and the newly selected feature. Choose the feature from the remainder of the feature pool so that its combination with feature 1 yields the maximum criterion function. If the new maximum criterion function is higher than the updated maximum criterion

104

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 56, NO. 1, JANUARY 2009

function, update the maximum criterion function and swap feature 2 and the newly selected feature. Then, iterate the previous K substeps until the criterion function stops increasing. Step 3: Assume K = 3. Choose the feature from the remainder of the feature pool so that its combination with feature 1–2 yields the maximum criterion function. Save the maximum criterion function. Take the feature out from the feature pool and name it as feature 3. Next, we do the following K substeps. Choose the feature from the remainder of the feature pool so that its combination with features 2–3 yields the maximum criterion function. If the new maximum criterion function is higher than the saved maximum criterion function, update the maximum criterion function and swap feature 1 and the newly selected feature. Choose the feature from the remainder of the feature pool so that its combination with features 1–3 yields the maximum criterion function. If the new maximum criterion function is higher than the updated maximum criterion function, update the maximum criterion function and swap feature 2 and the newly selected feature. Choose the feature from the remainder of the feature pool so that its combination with features 1–2 yields the maximum criterion function. If the new maximum criterion function is higher than the updated maximum criterion function, update the maximum criterion function and swap feature 3 and the newly selected feature. Then, iterate the previous K substeps until the criterion function stops increasing. [Remaining steps]: Continue similarly until K is equal to J. 3) Redundancy Reduction Preprocessing: The computational cost of RELAX and SFS can be high if the feature set is large. To alleviate this problem, we propose an RRP technique that reduces the feature set by eliminating some redundant features. Our intuition stems from the observation that the features from different moving windows and channels corresponding to the same frequency bin are generally correlated with each other. In other words, there are many redundant features at each frequency bin that do not provide additional information. Hence, we can choose representative features for all frequency bins, i.e., features whose instances are most correlated with the class labels. As such, we obtain a reduced feature set for subsequent RELAX and SFS feature selection. Since RRP greatly reduce the redundancy and the size of the feature set to be searched, it becomes invulnerable to the curse of dimensionality problem. Note that the features for different frequency bins in the reduced feature set may be derived from distinct channels and moving windows, and hence, the reduced feature set still contains diverse information for feature selection. It is noteworthy that RRP can be done rather quickly even for a high-dimensional feature set due to the fact that at each frequency bin, only the calculation of the correlation coefficient between each feature and the class label as well as the search for the feature with the maximum correlation coefficient are performed. In addition, the reduced feature set generally has a

much smaller number of features than the original feature set and hence leads to significant reduction in computational cost for feature selection as shown in the experimental results later on. D. SVMs Classifier SVM is a popular classifier that distinguishes different classes of data by minimizing the empirical classification error and maximizing the margin [32], [33]. Since SVM is robust to outliers and has a good generalization capability, it has been used in a wide range of applications [34]. Assume that xi , i = 1, . . . , I are the I training feature vectors for decoding and the class labels are yi ∈ {−1, +1}, then SVM solves the following optimization problem: minw2 + C

I 

ξi

i=1

subject to yi (w xi + b) ≥ 1 − ξi ξi ≥ 0

(7)

where w is the weight vector, C > 0 is the penalty (or regularization) parameter of the error term chosen by cross-validation, ξi is the slack variable, and b is the bias term. It turns out that the margin of the two classes is inversely proportional to w2 . Therefore, the first term in the objective function of SVM is used to maximize the margin. The second term in the objective function is the regularization term that allows for training errors for the inseparable case. The Lagrange multiplier method can be used to find the optimal solution for w and b in the earlier optimization problem. Assume that t is the testing feature vector. Testing is done as follows. If w t + b ≥ 0, the label of t is classified as +1, otherwise the label is classified as −1. SVM can also be used as a kernel-based method when the feature vectors are mapped into a higher dimensional space [32], [34]. III. EXPERIMENTAL RESULTS In this section, we provide experimental examples to demonstrate the performances of the proposed feature selection approaches for predicting perceptual decisions from the neuronal data. In total, 96 trials and nine channels of LFP were used in this study. For each trial, multitaper spectral analysis is performed on a 500-ms-long nonoverlapping sliding analysis window during the initial time period of 1500 ms after stimulus onset. In each window, the power spectrum was estimated with 50 frequency bins. We choose the bandwidth LW = 3 and use the first three Slepian sequences as tapers. This yields a large feature set of 1350 features (9 channels × 3 time windows × 50 frequency bins), from which we select discriminative features. We employ the linear SVM classifier from the LIBSVM package and linearly scale each feature to the range of [0, 1] [35]. The regularization parameter in SVM is chosen by crossvalidation [35]. After feature selection, we employ fivefold cross-validation based on the selected features of the training data using a number of regularization parameters, and choose the

WANG et al.: RELAXATION-BASED FEATURE SELECTION FOR SINGLE-TRIAL DECODING OF BISTABLE PERCEPTION

105

Fig. 1. Comparison of decoding accuracy based on SFS and RELAX feature selection from the multitaper spectral estimates. (a) No RRP is used. (b) RRP is used. The horizontal dotted lines in the figure denote the decoding accuracies obtained without feature selection.

regularization parameter with the best cross-validation performance. We use decoding accuracy as the performance measure, calculated via leave-one-out cross-validation (LOOCV). In particular, for a dataset with N trials, we choose N − 1 trials for feature selection and training and use the remaining 1 trial for testing. This is repeated for N times with each trial used for testing once. The decoding accuracy is obtained as the ratio of the number of correctly decoded trials over N . Fig. 1 shows the decoding accuracy based on the SFS and RELAX feature selection from the multitaper spectral estimates as the number of selected features K increases from 2 to 10. In Fig. 1(a), the RRP technique is not used and SFS and RELAX are applied to the original feature set. In Fig. 1(b), RRP is used and SFS and RELAX are applied to the reduced feature set. The horizontal dotted lines in Fig. 1 denote the decoding accuracies obtained without feature selection. Note that feature selection can improve the decoding performance and that RELAX outperforms SFS. In fact, we find that in both Fig. 1(a) and (b), the decoding accuracy obtained via RELAX is significantly better than that obtained via SFS (p < 0.05, t-test). In order to have a closer look at which features are selected by the feature selection methods, we examined the frequency features chosen by our methods by plotting the histograms of the selected frequency features across all LOOCVs. For example, when K = 3, each LOOCV selects three features corresponding to three frequency bins, and a histogram can be plotted given the 3N frequency bins selected in N LOOCVs. Fig. 2 shows the histograms of the frequency features selected across all LOOCVs based on the SFS and RELAX feature selection from the multitaper spectral estimates. The five subgraphs from top to bottom correspond to the selection of two to six features, respectively. Note that when K = 2, the selected features are located near 56 and 96 Hz. As the number of selected features increases, more features at other frequencies are selected and the histograms become more spread across the frequency range. Still, almost all the selected features are in the gamma frequency band. Furthermore, the histograms of SFS and RELAX have no pronounced differences when K is no more than 3. As K increases beyond 3, the selected features by RELAX become more scattered than those by SFS. This shows that the features selected by RELAX are less redundant than those by SFS, which is probably the

reason why RELAX outperforms SFS in terms of decoding accuracy. Fig. 3 shows the histograms of the frequency features selected across all LOOCVs based on the RELAX feature selection from the multitaper spectral estimates. The five subgraphs from top to bottom correspond to the selection of seven to ten features, respectively. Note from Fig. 3(b) that when K increases beyond 6, besides the features in the gamma frequency band, the feature corresponding to 22 Hz within the beta frequency band (15–30 Hz) are selected by RRP+RELAX across most of LOOCVs. This may lead to the degradation of the decoding accuracy obtained via RRP+RELAX at K = 7, as seen in Fig. 1(b). On the other hand, as can be seen from Fig. 3(a), still almost all the features selected by RELAX are in the gamma frequency band when K increases beyond 6. Taken together, our results suggest that the gamma band features carry the most discriminative information for bistable perception. Table I shows the decoding accuracy and average computational time over all LOOCVs [elapsed time in seconds using MATLAB in a PC (Pentium (R), 3.20 GHz)] based on the SFS and RELAX feature selection from the multitaper spectral estimates when the number of selected features K is 6. Note that RELAX has better decoding accuracy than SFS at the expense of more computation. In addition, the SFS and RELAX feature selection with RRP is much more computationally efficient than their counterparts without RRP. The former has less than one-twentieth elapsed time of that of the latter. Hence, RRP can significantly reduce the computational burden for both SFS and RELAX feature selection. Table II shows the average and best decoding accuracy based on the SFS and RELAX feature selection from the multitaper spectral estimates with the number of features ranging from 2 to 6. Here, the average decoding accuracy is defined as the average of the decoding accuracies computed based on the selected feature sets with the number of features ranging from 2 to 6, and the best decoding accuracy is defined as the best of the decoding accuracies computed based on the selected feature sets with the number of features ranging from 2 to 6. Table III shows the average and best decoding accuracy based on the SFS and RELAX feature selection from the multitaper spectral estimates with the number of features ranging from 2 to 10. It is clear from Tables II and III that RELAX outperforms SFS in terms of

106

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 56, NO. 1, JANUARY 2009

Fig. 2. Comparison of histograms of frequency features selected across all LOOCVs based on the (a) SFS, (b) RELAX, (c) RRP+SFS, and (d) RRP+RELAX feature selection from the multitaper spectral estimates. Subgraphs from top to bottom correspond to the selection of two to six features, respectively. Note that when K = 2, the selected features are located near 56 and 96 Hz. As the number of selected features increases, more features at other frequencies are selected and the histograms become more spread out across frequencies. Still, almost all selected features are in the gamma frequency range (30–100 Hz). Furthermore, the histograms of SFS and RELAX have no pronounced differences when K is no more than 3. As K increases beyond 3, the selected features by RELAX become more scattered than those by SFS.

decoding accuracy and can select more discriminative features for bistable perception. Given its good decoding accuracy and modest computational load, RRP+RELAX might be a good choice in practice. We have tried the unitaper spectral estimates with different tapers as well and found that the decoding accuracies calculated based on the selected features from the multitaper spectral estimates outperform the decoding accuracies calculated based on the selected features from the unitaper spectral estimates. This is likely due to the fact that the multitaper spectral estimates have a better tradeoff between spectral resolution and stability than the unitaper spectral estimates. IV. DISCUSSIONS In this paper, we focus on a RELAX-based algorithm to select features from the moving window multitaper spectral estimates of the multichannel intracortical LFP signals for decoding bistable SFM perception. We further propose an RRP technique to significantly reduce the computational load. Our results sug-

gest that the features in the gamma frequency band (30–100 Hz) of LFP carry the most discriminative information for decoding bistable perception. The use of bistable stimuli is a major strength of the study since it allows dissociation between stimuli and perception, and thus, provides a unique opportunity for studying the neural correlate of perception. The major contributions of this paper include the following. First, we present a RELAX-based algorithm to select features from the multitaper spectral estimate of LFPs. The novelty lies in the use of RELAX to select the features from a high-dimensional feature set, yet with significant reduced computational load. Second, we find that the gamma oscillations are most relevant to bistable perception of SFM. This is a novel finding that is previously unknown in awake monkey studies, and has direct implications that synchronized gamma oscillations carry the most discriminative information for bistable perception of SFM stimulus. Our findings are thus different, yet complementary, to previous human EEG studies

WANG et al.: RELAXATION-BASED FEATURE SELECTION FOR SINGLE-TRIAL DECODING OF BISTABLE PERCEPTION

107

Fig. 3. Comparison of histograms of frequency features selected across all LOOCVs based on the (a) RELAX and (b) RRP+RELAX feature selection from the multitaper spectral estimates. Subgraphs from top to bottom correspond to the selection of seven to ten features, respectively. Note that when K increases beyond 6, besides the features in the gamma frequency band, the feature corresponding to 22 Hz within the beta frequency band (15–30 Hz) are selected by RRP+RELAX across most of LOOCVs. This may contribute to the degradation of the decoding accuracy obtained via RRP+RELAX at K = 7, as seen in Fig. 1(b). On the other hand, still almost all the features selected by RELAX are in the gamma frequency band when K increases beyond 6.

in perception [36]–[38], as well as to the recent monkey studies in attention [10] and working memory [11]. The LFP, a focus of this study, has recently become of increasing interest. Compared with single-unit spiking activity, LFP has several different characteristics: 1) LFP is a continuous process while spiking activity is a point process; 2) LFP arises largely from dendritic activity of a large number of neurons while spiking activity arises mainly from the axon and soma [39]; 3) LFP is dominated by the synaptic inputs to a larger cortical area as well as local processing while spiking activity represents the output of a small number of neurons in a brain area [39]; and 4) LFPs appear to be more correlated with the fMRI BOLD signal that spikes [40]. Finally, the LFP is easier to record and maintains a steadier flow of information over time, thus has the potential of practical use for the development of reliable brain–machine interfaces. However, LFPs collected by nearby electrodes are highly correlated, and therefore, the number of independent measurements from a given brain area is relatively low. In terms of the usage and implementation of the RELAX algorithm, this contribution has a number of distinct differences compared with our previous work [8]. First, we focus on LFP in this paper instead of spikes in [8], where the average firing rate was used as a feature. For the LFP analysis, we perform multitaper spectral analysis to obtain the spectral estimates with a moving window approach to obtain the spectral features with a good tradeoff between spectral resolution (bias) and stability (variance). As a result, the dimensionality of the spectral features is much higher than that of neural-firing-rate-based features. Second, we used RELAX in our previous paper [8] to find the weights for all the available channels to maximize the choice probability, whereas in this paper, we use RELAX to select discriminative features from a high-dimensional feature set with more than 1000 features. Third, in our previous paper [8], a 1-D grid search is required for every pairwise linear

TABLE I COMPARISON OF DECODING ACCURACY AND AVERAGE COMPUTATIONAL TIME OVER ALL LOOCVS [ELAPSED TIME IN SECONDS USING MATLAB IN A PC (PENTIUM (R), 3.20 GHZ)] BASED ON THE SFS AND RELAX FEATURE SELECTION FROM THE MULTITAPER SPECTRAL ESTIMATES WHEN THE NUMBER OF SELECTED FEATURES K IS 6

combination when combining two channels. The computational load is extremely high with a large number of channels. It is, therefore, computationally prohibitive and unrealistic to use the algorithm in [8] to find the weights for all features in the highdimensional feature set, and then, do the feature selection based on the weights. In this paper, no such 1-D grid search is required. Instead, we directly tradeoff relevance and redundancy to select the features. As a result, the computational load is greatly reduced. Fourth, we proposed an RRP technique to mitigate feature redundancy while further significantly reducing the computational load. In the criterion function (6) for the RELAX algorithm, we tradeoff relevance and redundancy equally. It is possible to apply different weights to the relevance and redundancy in the criterion function. In this case, however, cross-validation is required to choose the weights and the feature selection then becomes wrapper based. In this paper, we prefer to use a filterbased method to reduce the computational burden. It is also possible to use relevance/redundancy as the criterion function as suggested in [21]. In practice, we find that RELAX-based feature selection often converges within several iterations. However, it is hard to estimate the computational complexity of the RELAX algorithm if the iterations of substeps of RELAX continue until the criterion function stops increasing. On the other hand, we can use “practical convergence” in the iterations of the algorithm by checking

108

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 56, NO. 1, JANUARY 2009

TABLE II COMPARISON OF AVERAGE AND BEST DECODING ACCURACY BASED ON SFS AND RELAX FEATURE SELECTION FROM THE MULTITAPER SPECTRAL ESTIMATES WITH NUMBER OF FEATURES RANGING FROM 2 TO 6

TABLE III COMPARISON OF AVERAGE AND BEST DECODING ACCURACY BASED ON SFS AND RELAX FEATURE SELECTION FROM THE MULTITAPER SPECTRAL ESTIMATES WITH NUMBER OF FEATURES RANGING FROM 2 TO 10

if the relative changes of the criterion function are sufficiently small or if a preset maximum number of iterations is reached. By checking whether the preset maximum number of iterations is reached, it can be shown that the computational complexity of RELAX is O(k 2 n) if k features are to be selected from a total of n features. In this study, we employ the SVM for decoding bistable perception due to its popularity and its excellent performance. LPM [41]–[43] is yet another well-known classifier that can also be used for our purpose. The formulation of LPM is the same as that of SVM except that LPM uses the L1 -norm of the weight vector in the objective function. As a result, LPM yields a sparse weight vector. It is worth noting that the sparseness of the weight vector for LPM can be exploited for feature selection. However, LPM requires cross-validation to choose the regularization parameter. Therefore, LPM for feature selection is essentially a wrapper-based method, and consequently, the computational load may be high. The proposed RELAX method, on the other hand, is filter based and can directly select a number of features without cross-validation.

V. CONCLUSION In this paper, we have shown how to use a RELAX-based algorithm to select features from the LFP in the MT area of a macaque monkey performing a bistable SFM task. We have shown that unlike the conventional SFS algorithm, RELAX offers the flexibility of modifying already selected features and that it can effectively select features without prior knowledge of the discriminative frequency bins, temporal moving windows, and channels. We have presented an RRP technique to reduce the computational load for both the SFS and RELAX feature selection and exploited the SVMs classifier based on the selected features to decode the reported perception on a single-trial basis. Experimental results have shown the excellent performance of the RELAX feature selection algorithm. Using these techniques, we have demonstrated that the features in the gamma

frequency band carry the most discriminative information for bistable perception. Our results confirm that MT area is associated with visual motion processing. The good decoding performances obtained based on only several selected features suggest that it may not always be necessary to use a multielectrode array with a large number of elements for neurophysiological recordings when it comes to decoding population activity. The excellent performance of the RELAX-based feature selection algorithm may be due to the following reasons. First, the multitaper spectral estimates have a good tradeoff between spectral resolution (bias) and stability (variance). Second, both the spatiotemporal correlations and the nonstationary nature of the single-trial time series have been taken advantage of by using a comprehensive feature set for feature selection. Third, the chosen criterion function strikes a good balance between relevance and redundancy. Finally, RELAX offers the flexibility of modifying already selected features. In addition, for both the SFS and RELAX feature selection, the RRP technique can be incorporated to not only mitigate feature redundancy, but also significantly reduce the computational load. The proposed feature selection approaches in this paper may have potential for a plethora of applications involving nonstationary multivariable time series such as brain– computer interfaces (BCIs). ACKNOWLEDGMENT The authors would like to thank the associate editor and reviewers for their valuable comments that have greatly improved the paper. REFERENCES [1] R. Blake and N. K. Logothetis, “Visual competition,” Nat. Rev. Neurosci., vol. 3, pp. 13–21, 2002. [2] M. G. Philiastides and P. Sajda, “Temporal characterization of the neural correlates of perceptual decision making in the human brain,” Cereb. Cortex, vol. 16, pp. 509–518, Apr. 2006. [3] M. G. Philiastides, R. Ratcliff, and P. Sajda, “Neural representation of task difficulty and decision making during perceptual categorization: A timing diagram,” J. Neurosci., vol. 26, pp. 8965–8975, Aug. 2006.

WANG et al.: RELAXATION-BASED FEATURE SELECTION FOR SINGLE-TRIAL DECODING OF BISTABLE PERCEPTION

[4] K. H. Britten, W. T. Newsome, M. N. Shadlen, S. Celebrini, and J. A. Movshon, “A relationship between behavioral choice and the visual responses of neurons in macaque MT,” Vis. Neurosci., vol. 13, pp. 87–100, 1996. [5] J. V. Dodd, K. Krug, B. G. Cumming, and A. J. Parker, “Perceptually bistable three-dimensional figures evoke high choice probabilities in cortical area MT,” J. Neurosci., vol. 21, pp. 4809–4821, 2001. [6] A. Maier, N. K. Logothetis, and D. A. Leopold, “Percept-related fluctuations of MT local field potentials,” Soc. Neurosci. Abstr., vol. 509.9, 2005. [7] H. Liang, X. Wang, Z. Wang, N. K. Logothetis, D. A. Leopold, and A. Maier, “A comparison of local field potentials and spiking activity to predict perceptual report during bistable visual stimulation,” Soc. Neurosci. Abstr., vol. 801.1, 2006. [8] Z. Wang, A. Maier, D. A. Leopold, and H. Liang, “Relaxation based multichannel signal combination (RELAX-MUSIC) for ROC analysis of percept-related neuronal activity,” IEEE Trans. Biomed. Eng., vol. 53, no. 12, pp. 2615–2618, Dec. 2006. [9] U. Mitzdorf, “Current source-density method and application in cat cerebral cortex: Investigation of evoked potentials and EEG phenomena,” Physiol. Rev., vol. 65, pp. 37–99, 1985. [10] P. Fries, J. H. Reynolds, A. E. Rorie, and R. Desimone, “Modulation of oscillatory neuronal synchronization by selective visual attention,” Science, vol. 291, pp. 1560–1563, 2001. [11] B. Pesaran, J. S. Pezaris, M. Sahani, P. P. Mitra, and R. A. Andersen, “Temporal structure in neuronal activity during working memory in macaque parietal cortex,” Nat. Neurosci., vol. 5, pp. 802–811, 2002. [12] H. Scherberger, M. R. Jarvis, and R. A. Andersen, “Cortical local field potential encodes movement intentions in the posterior parietal cortex,” Neuron, vol. 46, pp. 347–354, 2005. [13] H. Liang, S. L. Bressler, E. A. Buffalo, R. Desimone, and P. Fries, “Empirical mode decomposition of local field potentials from macaque V4 in visual spatial attention,” Biol. Cybern., vol. 92, no. 6, pp. 380–392, 2005. [14] O. R. Duda, P. E. Hart, and D. G. Stork, Pattern Classification.. New York: Wiley, 2001. [15] D. Koller and M. Sahami, “Toward optimal feature selection,” in Proc. Int. Conf. Mach. Learning, 1996, pp. 284–292. [16] M. A. Hall, “Correlation-based feature selection for discrete and numeric class machine learning,” in Proc. 17th Int. Conf. Mach. Learning, 2000, pp. 359–366. [17] P. M. Narendra and K. Fukunaga, “A branch and bound algorithm for feature subset selection,” IEEE Trans. Comput., vol. 26, no. 9, pp. 917– 922, Sep. 1977. [18] J. Doak, “An evaluation of feature selection methods and their applications to computer security,” Univ. California, Davis, Tech. Rep., 1992. [19] A. Jain and D. Zongker, “Feature selection: Evaluation, application, and small sample performance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153–158, Feb. 1997. [20] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” J. Mach. Learning Res., vol. 3, pp. 1157–1182, 2003. [21] C. Ding and H. Peng, “Minimum redundancy feature selection from microarray gene expression data,” J. Bioinf. Comput. Biol., vol. 3, no. 2, pp. 185–205, 2005. [22] H. Liu and L. Yu, “Toward integrating feature selection algorithms for classification and clustering,” IEEE Trans. Knowl. Data Eng., vol. 17, no. 4, pp. 491–502, Apr. 2005. [23] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artif. Intell., vol. 97, pp. 273–324, 1997. [24] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in Proc. 20th Int. Conf. Mach. Learning, 2003, pp. 856–863. [25] S. Das, “Filters, wrappers and a boosting-based hybrid for feature selection,” in Proc. 18th Int. Conf. Mach. Learning, 2001, pp. 74–81. [26] E. P. Xing, M. I. Jordan, and R. M. Karp, “Feature selection for highdimensional genomic microarray data,” in Proc. 18th Int. Conf. Mach. Learning, 2001, pp. 601–608. [27] D. J. Thomson, “Spectrum estimation and harmonic analysis,” Proc. IEEE, vol. 70, no. 9, pp. 1055–1096, Sep. 1982. [28] D. B. Percival and A. T. Walden, Spectral Analysis for Physical Applications: Multitaper and Conventional Univariate Techniques. Cambridge, U.K.: Cambridge Univ. Press, 1993. [29] P. Mitra and B. Pesaran, “Analysis of dynamic brain imaging data,” Biophys. J., vol. 76, pp. 691–708, 1999.

109

[30] A. V. Oppenheim and R. W. Schafe, Digital Signal Processing. Englewood Cliffs, NJ: Prentice-Hall, 1989. [31] D. Slepian, “Prolate spheroidal wave functions, Fourier analysis, and uncertainty - V: The discrete case,” Bell Syst. Tech. J., vol. 57, pp. 1371–1430, 1978. [32] V. N. Vapnik, The Nature of Statisical Learning Theory. New York: Springer-Verlag, 1995. [33] C. Cortes and V. N. Vapnik, “Support-vector networks,” Mach. Learning, vol. 20, pp. 273–297, 1995. [34] B. Sch¨olkopf, C. J. C. Burges, and A. J. Smola, Eds., Advances in Kernel Methods: Support Vector Learning. Cambridge, MA: MIT Press, 1999. [35] C.-C. Chang and C.-J. Lin. (2001). LIBSVM: A library for support vector machines [Online]. Available: http://www.csie.ntu.edu.tw/∼cjlin/libsvm. [36] A. Keil, M. M. M¨uller, W. J. Ray, T. Gruber, and T. Elbert, “Human gamma band activity and perception of a gestalt,” J. Neurosci., vol. 19, pp. 7152–7161, 1999. [37] G. Csibra, G. Davis, M. W. Spratling, and M. H. Johnson, “Gamma oscillations and object processing in the infant brain,” Science, vol. 290, pp. 1582–1585, 2000. [38] V. Goffaux, A. Mouraux, S. Desmet, and B. Rossion, “Human nonphase-locked gamma oscillations in experience-based perception of visual scenes,” Neurosci. Lett., vol. 354, pp. 14–17, 2004. [39] U. Mitzdorf, “Properties of the evoked potential generators: Current source-density analysis of visually evoked potentials in the cat cortex,” Int. J. Neurosci., vol. 33, pp. 33–59, 1987. [40] N. Logothetis, J. Pauls, M. Augath, T. Trinath, and A. Oeltermann, “Neurophysiological investigation of the basis of the fMRI signal,” Nature, vol. 412, pp. 150–157, 2001. [41] K. Bennett and O. Mangasarian, “Robust linear programming discrimination of two linearly inseparable sets,” Optim. Methods Softw., vol. 1, pp. 23–34, 1992. [42] C. Campbell and K. Bennett, “A linear programming approach to novelty detection,” Adv. Neural Process. Syst., vol. 13, pp. 395–401, 2001. [43] K.-R. M¨uller, S. Mika, G. Ratsch, K. Tsuda, and B. Sch¨olkopf, “An introduction to kernel-based learning algorithms,” IEEE Trans. Neural Netw., vol. 12, no. 2, pp. 181–201, Mar. 2001.

Zhisong Wang (S’02–M’05) received the B.E. degree in electrical engineering from Tongji University, Shanghai, China, in 1998, and the M.Sc. and Ph.D. degrees in electrical engineering from the University of Florida, Gainesville, in 2003 and 2005, respectively. From 2001 to 2005, he was a Research Assistant in the Department of Electrical and Computer Engineering, University of Florida. From 2005 to 2008, he was a Postdoctoral Fellow in the School of Health Information Sciences, University of Texas Health Science Center at Houston, Houston. His current research interests include signal processing, time series analysis, machine learning, biomedical engineering, and cognitive and computational neuroscience. Dr. Wang is a member of the Sigma Xi and the Phi Kappa Phi.

Alexander Maier received the Ph.D. degree from the University of T¨ubingen, T¨ubingen, Germany, for work at the nearby Max Planck Institute of Biological Cybernetics. He studied neurobiology at the University of Munich (Germany). He is currently a Visiting Fellow at the National Institute of Mental Health, Bethesda, MD.

110

IEEE TRANSACTIONS ON BIOMEDICAL ENGINEERING, VOL. 56, NO. 1, JANUARY 2009

Nikos K. Logothetis received the B.S. degree in mathematics from the University of Athens, Athens, Greece, the B.S. degree in biology from the University of Thessaloniki, Thessaloniki, Greece, and the Ph.D. degree in human neurobiology from Ludwig Maximilian University, Munich, Germany. In 1985, he joined the Brain and Cognitive Sciences Department, Massachusetts Institute of Technology. In 1990, he joined the Faculty of the Division of Neuroscience, Baylor College of Medicine, Houston, TX, where he has been an Adjunct Professor of ophthalmology since 1995. In 1997, he moved to the Max Planck Institute for Biological Cybernetics, T¨ubingen, Germany, where he worked on the physiological mechanisms that underly visual perception and object recognition, and is currently the Director in the Neurophysiology Division. Since 1992, he has also been an Adjunct Professor of Neurobiology at the Salk Institute, San Diego, CA. His current research interests include the application of functional imaging techniques to monkeys, and measurement of how the functional magnetic resonance imaging signal relates to neural activity.

Hualou Liang (M’00–SM’01) received the Ph.D. degree in physics from Chinese Academy of Sciences, Beijing, China. He studied signal processing at Dalian University of Technology, Dalian, China. He was a Postdoctoral Researcher at Tel-Aviv University, Tel-Aviv, Israel, the Max-Planck-Institute for Biological Cybernetics, T¨ubingen, Germany, and the Center for Complex Systems and Brain Sciences, Florida Atlantic University, Boca Raton. He is currently an Associate Professor at the University of Texas Health Science Center at Houston, Houston. His current research interest include biomedical signal processing and cognitive and computational neuroscience. He teaches graduate interdisciplinary courses including “biomedical signal processing” and “computational biomedicine.”

Suggest Documents