Comparison of Classification and Dimensionality Reduction Methods Used in fMRI Decoding

Nasim T. Alamdari
School of Biomedical Engineering, Science and Research Branch, Islamic Azad University, Tehran, Iran
[email protected]

Emad Fatemizadeh
School of Electrical Engineering, Sharif University of Technology, Tehran, Iran
[email protected]

Abstract—In the last few years there has been growing interest in the use of functional Magnetic Resonance Imaging (fMRI) for brain mapping. Decoding brain patterns in fMRI data requires reliable and accurate classifiers. Toward this goal, we compared the performance of eleven popular pattern recognition methods. Because applying dimensionality reduction before pattern recognition can improve classification performance, we also compared seven dimensionality reduction methods over regions of interest (ROIs) to answer the following question: which dimensionality reduction procedure performs best? In both tasks, in addition to measuring prediction accuracy, we estimated the standard deviation of the accuracies to identify the more reliable methods. Based on all results, we suggest using support vector machines with a linear kernel (C-SVM and ν-SVM), or a random forest classifier, on low-dimensional subsets prepared by the Active or maxDis feature selection method to classify brain activity patterns more efficiently.

Keywords—Brain image analysis; Classification; Dimensionality reduction; Functional MRI.

I. INTRODUCTION

Functional Magnetic Resonance Imaging (fMRI) has enabled scientists to look into the active human brain: it provides a sequence of 3D brain images whose intensities represent blood oxygenation level dependent (BOLD) brain activations. This has revealed exciting insights into the spatial and temporal changes underlying a broad range of brain functions, such as how we see, feel, move, understand each other, and lay down memories [5]. Thus, in the last few years there has been growing interest in the use of pattern recognition techniques for analyzing fMRI data to decipher brain patterns, and toward this interest we need reliable and accurate classifiers. Many studies have employed different methods to analyze human fMRI data. For example, [12] compared Gaussian naïve Bayes (GNB), support vector machine (SVM), and k-nearest-neighbor (kNN) classifiers; their results on three datasets showed that GNB and SVM achieve better accuracies than kNN. Correlation analysis and linear discriminant analysis (LDA) were reported to perform well in [8]. In [6], six multivariate classifiers were compared using machine learning tools; overall, Fisher's linear discriminant (with an optimal-shrinkage covariance estimator) and the linear support vector machine

performed best. Moreover, [7] compared the accuracy of six different machine learning algorithms applied to neuroimaging data: maximum accuracy was 92% for random forest, followed by 91% for AdaBoost, 89% for naïve Bayes, 87% for a J48 decision tree, 86% for K*, and 84% for support vector machine. Because different studies use different fMRI data sets and apply diverse preprocessing techniques, including dimensionality reduction, outlier elimination, and denoising, the classifier that previous studies recommend as most reliable and accurate varies considerably. To address this problem, in this study we compared the performance of eleven popular pattern recognition methods, measuring the prediction accuracy and the standard deviation of accuracy for each classifier, to answer this question: which classifier is the most reliable and accurate method? When faced with a situation involving high-dimensional data, before performing pattern recognition it is natural to consider projecting those data onto a lower-dimensional subspace, without losing important information regarding some characteristic of the original variables, to improve classification performance. One way of accomplishing this reduction of dimensionality is through variable selection, also called feature selection. Another way is by creating a reduced set of linear or nonlinear transformations of the input variables; the creation of such composite variables (or features) by projection methods is often referred to as feature extraction. Usually, we wish to find those low-dimensional projections of the input data that enjoy some sort of optimality property [1]. Therefore, in the present study we not only compared the classifiers' performance in categorizing stimulus 1 from stimulus 2, but also compared seven dimensionality reduction methods over regions of interest (ROIs) to answer the following question: which dimensionality reduction procedure performs best?
This paper is organized as follows. Section II describes the dimensionality reduction methods. Section III briefly explains the pattern recognition techniques and the result evaluation procedure. Section IV presents the fMRI data set. Results of the experiments and their comparison are described in Section V. Finally, Section VI presents conclusions and future work directions.

II. DIMENSIONALITY REDUCTION METHODS

Dimensionality reduction methods can improve classification performance. Several dimensionality reduction methods were successfully used in [12] to perform classification analyses. Also, in [4], two texture-feature-extraction algorithms (SF and GLDS) and maxDis feature selection were applied to ultrasound images. We used such methods and performed a comparative analysis of them.

A. Active
Voxels are selected by a t-test on the relative activity level in the target class with respect to the fixation condition. We select the 140 most active voxels over the entire brain.

B. RoiActive
Active voxels are uniformly selected from all the regions of interest (ROIs) within the brain. We select the 20 most active voxels from each ROI.

C. maxDis
A simple way to identify potentially good features is to compute the distance between the two classes for each feature as:

dis = |m1 − m2| / √(σ1² + σ2²)    (1)

where in equation (1), m1 and m2 are the mean values, and σ1 and σ2 are the standard deviations, of the two classes. The best features are considered to be the ones with the greatest distance. We select 20 voxels from each ROI.

D. Average
We average all voxels in an ROI into a single 'super voxel'.
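As an illustration of the maxDis criterion in equation (1), the per-voxel score and top-k selection can be sketched in Python. This is a minimal sketch under an assumed data layout (each class is a list of trials, each trial a list of voxel values); the function names are ours, not from the study's MATLAB code.

```python
import math
import statistics

def maxdis_scores(class1, class2):
    """Score each voxel (column) by equation (1):
    |m1 - m2| / sqrt(s1^2 + s2^2), computed per voxel."""
    n_voxels = len(class1[0])
    scores = []
    for v in range(n_voxels):
        x1 = [trial[v] for trial in class1]
        x2 = [trial[v] for trial in class2]
        m1, m2 = statistics.mean(x1), statistics.mean(x2)
        s1, s2 = statistics.stdev(x1), statistics.stdev(x2)
        scores.append(abs(m1 - m2) / math.sqrt(s1 ** 2 + s2 ** 2))
    return scores

def select_top_voxels(class1, class2, k=20):
    """Return indices of the k voxels with the greatest distance."""
    scores = maxdis_scores(class1, class2)
    return sorted(range(len(scores)), key=lambda v: -scores[v])[:k]
```

In practice the selection would be run per ROI, as the text describes, with k = 20.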

E. ActiveAvg
This method first selects the Active voxels from all of the ROIs, and then creates a single 'super voxel' for each ROI.

F. Statistical Features (SF)
In the SF method the following statistical features were computed: (1) mean value, (2) median value, (3) standard deviation, (4) skewness, and (5) kurtosis.

G. Gray-Level Difference Statistics (GLDS)
The GLDS algorithm uses first-order statistics of local property values, based on absolute differences between pairs of gray levels or of average gray levels, to extract the following texture measures: (1) contrast, (2) angular second moment, (3) entropy, and (4) mean.

III. CLASSIFICATION TECHNIQUES

In this section, we describe our exploration of pattern recognition methods for classification of Sentence (S) from Picture (P), and the result evaluation procedure. The format of the function that we use in this work is similar to the classification function estimated successfully in [10, 11]:

fs: fMRIs(t, t+8) → Cognitive State (S or P)    (2)

where fMRIs(t, t+8) is the sequence of 16 observed fMRI images for subject s. We explored a number of classifier training methods, including: support vector machine (SVM) [1], linear discriminant analysis (LDA) [1], Gaussian naïve Bayes (GNB) [11], k-nearest-neighbor classifier (kNN) [8], decision tree (DT) [2], AdaBoost [2], random forest [1], multilayer perceptron (MLP) [1], and pattern correlation analysis (Cor.) [6]. These classifiers were selected because most of them have been used successfully in other applications involving high-dimensional data. In this study, a set of MATLAB scripts was developed to perform the pre- and post-processing, and several toolboxes were used to implement the classifiers:

• LIBSVM-3.12 package (Chang and Lin, 2008) for three SVM classifiers: linear C-SVM, nonlinear C-SVM with a radial basis function (RBF) kernel, and linear ν-SVM.

• Stprtool-2.11 package (Franc and Hlava, 2006) for AdaBoost analyses.

• Other classifiers were implemented using MATLAB toolboxes, such as the Neural Network toolbox.

For the kNN classifier, the Euclidean distance metric was used, considering values of 1, 3, 5, 7, and 9 for k. As can be seen in Fig. 1, the 5NN classifier outperformed the other kNN classifiers, achieving the minimum classification error and standard deviation. Thus we chose the 5NN classifier when comparing performance with the other pattern recognition methods described above.

To evaluate results, we used the 'leave-one-example-out from each class' cross-validation described in [3] to train and test classifiers. This procedure follows these steps for each subject:

• Leave one example out from each class; learn the classifier from the remaining examples, whose labels it can see. Then use the trained classifier to make a prediction for the test set, whose labels it cannot see. The predicted labels are compared to the true labels, and the accuracy of the classifier (the fraction of examples where the prediction was correct) is computed.

• Repeat for each example in turn.

• Compute the accuracy of the predictions made for all examples.

The procedure was repeated 40 times for each SP or PS data set, and results were averaged across these 40 runs.
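The cross-validation loop above can be sketched as follows. This is an illustrative sketch, not the study's implementation: a simple nearest-class-mean classifier stands in for the classifiers actually compared, and pairing every held-out S example with every held-out P example is one plausible reading of "leave one example out from each class".

```python
import statistics
from itertools import product

def nearest_mean_predict(train_x, train_y, x):
    """Predict by the closer class-mean vector (a stand-in classifier)."""
    means = {}
    for label in set(train_y):
        rows = [xi for xi, yi in zip(train_x, train_y) if yi == label]
        means[label] = [statistics.mean(col) for col in zip(*rows)]
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(means, key=lambda label: dist(means[label], x))

def leave_one_per_class_cv(xs, ys):
    """Each fold holds out one 'S' and one 'P' example, trains on the
    rest, and scores the two held-out predictions; returns accuracy."""
    idx_s = [i for i, y in enumerate(ys) if y == "S"]
    idx_p = [i for i, y in enumerate(ys) if y == "P"]
    correct = total = 0
    for i, j in product(idx_s, idx_p):
        held_out = {i, j}
        train_x = [x for k, x in enumerate(xs) if k not in held_out]
        train_y = [y for k, y in enumerate(ys) if k not in held_out]
        for k in (i, j):
            correct += nearest_mean_predict(train_x, train_y, xs[k]) == ys[k]
            total += 1
    return correct / total
```

On well-separated toy data this procedure reaches perfect accuracy, mirroring the evaluation scheme rather than any particular classifier's strength.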

Fig. 1. The mean and standard deviation of the kNN classifier accuracy for various values of k.

IV. THE STAR/PLUS EXPERIMENT

For our study we used the Star/Plus experiment data (Center for Cognitive Brain Imaging, Carnegie Mellon University). The details of the experiment are as follows: subjects performed a sequence of trials, during which they were shown a sentence and a simple picture, and then asked whether the sentence correctly described the picture. In half of the trials the picture was presented first, followed by the sentence; we refer to this as the SP data set. In the remaining trials, the sentence was presented first, followed by the picture; we call this the PS data set. Six subjects participated in this study, and from each subject three datasets were extracted: SP, PS, and SP+PS. The timing within each trial was as follows: the first stimulus (S or P) was presented at the beginning of the trial (image 1); after 4 seconds (image 9) the stimulus was removed and replaced by a blank screen; 4 seconds later (image 17) the second stimulus was presented. This remained on the screen for 4 seconds, or until the subject pressed the mouse button, whichever came first. Finally, a rest period of 15 seconds (30 images) was added after the second stimulus was removed from the screen. Thus, each trial lasted a total of approximately 27 seconds (approximately 54 images). As [12] mentioned, given the different sizes and shapes of different brains, it is not possible to directly map the voxels in one brain to those in another; hence the region of interest (ROI) mapping method was used to produce representations of fMRI data usable across multiple subjects. In this case, the 7 ROIs judged most relevant by a domain expert are used in the ROI mapping method.

V. RESULTS AND COMPARISON

A. Which Dimensionality Reduction Method Performs Best?
As an illustration, TABLE I shows the values of the classification accuracy, averaged across 6 (subjects) × 3 (data sets) × 1 (linear C-SVM classifier) = 18 experiments, to compare the accuracy of feature selection and feature extraction methods easily. From the obtained results we can decide with which preprocessing method the classifier has the best performance. Here the performance metrics are the percentage classification accuracy and the standard deviation over the 18 experiments for the classifier. In this study, all dimensionality reduction methods select or extract voxels within the seven ROIs, except one feature selection method: Active. The SF and GLDS feature extraction methods have a weak effect on classifier performance. The best linear C-SVM classifier accuracy was found on low-dimensional subsets prepared by Active or maxDis feature selection.

TABLE I. EFFECTS OF DIMENSIONALITY REDUCTION METHOD ON ACCURACY OF THE LINEAR C-SVM CLASSIFIER. THE TABLE SHOWS THE MEAN AND STANDARD DEVIATION OF CLASSIFICATION ACCURACY.

Dimensionality Reduction Method | Mean    | Standard Deviation
Active                          | 93.1%   | ± 6.28
maxDis                          | 91.9%   | ± 7.72
RoiActive                       | 89.7%   | ± 8.63
ActiveAvg                       | 86.1%   | ± 12.51
Average                         | 84.5%   | ± 11.07
SF                              | 75.2%   | ± 12.91
GLDS                            | 60.2%   | ± 15.91

B. Comparing Classification Results
In this section, we experimented with eleven classifier learning methods: LDA, linear C-SVM, linear ν-SVM, nonlinear C-SVM, GNB, random forest, AdaBoost, MLP, decision tree, correlation analysis, and kNN. The performance of each classifier was averaged over the two dimensionality reduction methods that obtained the most accurate and reliable results in Section V, Part A; in other words, to compare the accuracy of the classifiers easily, we averaged over 6 (subjects) × 3 (data sets) × 2 (feature selection methods) = 36 experiments for each classifier, as TABLE II shows, with the value of each experiment averaged across the cross-validation folds. All results are highly significant compared to the 50% accuracy expected by chance. The best classifiers based on mean accuracy were found to be C-SVM with a linear kernel, linear ν-SVM, and random forest. Comparing the three types of SVM, linear ν-SVM yielded results similar to linear C-SVM, while nonlinear C-SVM with a radial basis function (RBF) kernel gave worse results. The kNN classifier (with k = 5) and GNB perform approximately equivalently (with classification accuracy of about 87%), though the standard deviations of their accuracies differ slightly. Also, comparing the performance of LDA and Cor., to our surprise, they did not perform as well relative to the other methods as reported in [8].
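The aggregation behind TABLE II (mean accuracy and its standard deviation over the per-classifier experiments) can be sketched as follows; the accuracy values below are illustrative, not the study's data.

```python
import statistics

def summarize(accuracies):
    """Mean and sample standard deviation of per-experiment
    accuracies (in %), one value per experiment
    (subject x data set x feature selection method)."""
    return (round(statistics.mean(accuracies), 1),
            round(statistics.stdev(accuracies), 2))

# Illustrative: three experiment accuracies for one classifier.
print(summarize([90.0, 92.0, 94.0]))  # -> (92.0, 2.0)
```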

TABLE II. MEAN AND STANDARD DEVIATION OF ACCURACIES FOR THE DIFFERENT CLASSIFIERS.

Classifier           | Mean    | Standard Deviation
Linear C-SVM         | 92.5%   | ± 7.06
Linear ν-SVM         | 92.4%   | ± 7.05
Random Forest        | 88.0%   | ± 10.69
GNB                  | 87.6%   | ± 10.79
kNN (k = 5)          | 87.0%   | ± 11.62
C-SVM (RBF kernel)   | 85.9%   | ± 10.19
LDA                  | 83.4%   | ± 13.63
AdaBoost             | 82.9%   | ± 11.53
Cor.                 | 81.4%   | ± 12.81
MLP                  | 77.2%   | ± 11.53
DT                   | 76.5%   | ± 13.89

Fig. 2. Effect of penalty parameter in C-SVM classifier accuracy with linear kernel.

To examine the effect of the penalty parameter (C) on the accuracy of C-SVM classification with a linear kernel, we compared the accuracy of the classifier for different values of C within the interval [0, 1]. Fig. 2 shows the results, averaged across the 6 (subjects) × 3 (data sets) × 2 (Active and Average dimensionality reduction) = 36 experiments for the linear C-SVM. As is clear from Fig. 2, the value of the penalty parameter has a very weak effect on the performance of the classifier, so any value of the penalty parameter can be used in the linear C-SVM classifier without concern about the result.

Generally speaking, according to previous studies and our results based on two performance metrics, SVMs with a linear kernel (C-SVM and ν-SVM) are able to make use of information distributed across a large number of voxels, each of which is only weakly informative. In addition to linear SVMs, random forest also performs well in classifying brain activity patterns.

C. In Which Data Set Does the Classifier Perform Best?
In this section we examined the effect of the data set, reduced by Active feature selection, on linear ν-SVM classifier performance. As illustrated in Fig. 3, the accuracies on the SP or PS data sets are better than the accuracies on their union (SP+PS). Moreover, the accuracies on SP are better than on PS. The authors of [12] conjecture that this phenomenon occurs because making a mental picture of a sentence's semantic content is easier than extracting and mentally rehearsing a semantic sentence from a picture, making it harder to separate the two cognitive states in PS. Another conjecture is that the cognitive state of the subjects, and the way they respond to a new sentence, may be influenced when they have just seen a picture that they expect to compare to the upcoming sentence.

Fig. 3. Effect of data set on the accuracy of picture or sentence prediction.

VI. CONCLUSION

In this study we compared different classification techniques combined with several dimensionality reduction methods for detecting cognitive states in fMRI data. Our experiments show that SVMs with a linear kernel (C-SVM and ν-SVM) and random forest give more accurate and reliable results than the other approaches for distinguishing stimuli. Across the dimensionality reduction methods, the Active and maxDis methods produce the most accurate results. Accordingly, we suggest using a linear SVM or random forest to decipher brain patterns on subsets of fMRI data prepared by maxDis or Active feature selection. There are many opportunities for future machine learning research in this field. For example, we plan to use other dimensionality reduction algorithms, such as manifold learning methods. It is also of interest to examine the classification algorithms on different data sets and more varied tasks to achieve more reliable results.

REFERENCES
[1] A.J. Izenman, Modern Multivariate Statistical Techniques, Springer, New York, 2008.
[2] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, New York, 2007.
[3] F. Pereira, T. Mitchell, and M. Botvinick, "Machine learning classifiers and fMRI: a tutorial overview," NeuroImage, vol. 45, issue 1, pp. S199-S209, March 2009.
[4] L. Costaridou, Medical Image Analysis Methods, vol. 3, CRC Press, 2005.
[5] L. Zhang, D. Samaras, D. Tomasi, N. Volkow, and R. Goldstein, "Machine learning for clinical diagnosis from functional magnetic resonance imaging," Computer Vision and Pattern Recognition Conference, 2005.
[6] M. Misaki, Y. Kim, P.A. Bandettini, and N. Kriegeskorte, "Comparison of multivariate classifiers and response normalizations for pattern-information fMRI," NeuroImage, vol. 53, issue 1, pp. 103-118, 15 October 2010.
[7] P.K. Douglas, S. Harris, A. Yuille, and M.S. Cohen, "Performance comparison of machine learning algorithms and number of independent components used in fMRI decoding of belief vs. disbelief," NeuroImage, vol. 56, issue 2, pp. 544-553, 15 May 2011.
[8] S.-P. Ku, A. Gretton, J. Macke, and N.K. Logothetis, "Comparison of pattern recognition methods in classifying high-resolution BOLD signals obtained at high magnetic field in monkeys," Magnetic Resonance Imaging, vol. 26, pp. 1007-1014, 2008.
[9] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer, 2008.
[10] T. Mitchell, R. Hutchinson, M.A. Just, R.S. Niculescu, F. Pereira, and X. Wang, "Classifying instantaneous cognitive states from fMRI data," American Medical Informatics Association Annual Symposium, Washington, DC, p. 469, 2003.
[11] T. Mitchell, R. Hutchinson, R. Niculescu, F. Pereira, X. Wang, M. Just, and S. Newman, "Learning to decode cognitive states from brain images," Machine Learning, vol. 57, issue 1-2, pp. 145-175, 2004.
[12] X. Wang, T.M. Mitchell, and R. Hutchinson, "Using machine learning to detect cognitive states across multiple subjects," CALD KDD project paper, 2003.