2013 3rd International Workshop on Pattern Recognition in Neuroimaging
Brain Decoding via Graph Kernels

Sandro Vega-Pons∗† and Paolo Avesani∗†
∗ NeuroInformatics Laboratory (NILab), Fondazione Bruno Kessler, Trento, Italy
† Centro Interdipartimentale Mente e Cervello (CIMeC), Università di Trento, Italy
Email: [email protected]

Abstract—An emerging trend in the analysis of functional brain recordings is based on multivariate pattern recognition. Unlike univariate approaches, it is framed as a prediction task: decoding the brain state. fMRI brain decoding is a challenging classification problem because the data are noisy, redundant and spatiotemporally correlated, and there are generally many more features than samples. The use of a classifier requires that raw data be mapped into n-dimensional real vectors, in which the structural information of the data is not taken into account. Alternative methods propose a different data representation based on a graph encoding. While graphs provide a more powerful representation, machine learning algorithms for this type of encoding become computationally intensive. The contribution of this paper is the introduction of a graph kernel with a lower computational complexity that allows taking advantage of both the representative power of graphs and the discrimination power of kernel-based classifiers such as Support Vector Machines. We provide experimental results for a discrimination task between faces and houses on an fMRI dataset. We also investigate, on synthetic data, how the brain decoding task differs between the vectorial and the graph-based encodings. A remarkable feature of the graph approach is its capability to handle data from different subjects without the need for any inter-subject alignment. An inter-subject decoding experiment is also performed for the faces versus houses problem.

Keywords-brain decoding; connectivity graphs; graph kernels;

978-0-7695-5061-9/13 $26.00 © 2013 IEEE DOI 10.1109/PRNI.2013.43

I. INTRODUCTION
Advances in neuroimaging techniques, and particularly fMRI studies, have shown the possibility of decoding the information from a single trial of functional brain recording. Traditional fMRI analyses are usually based on univariate voxel analysis. However, it is difficult to accomplish brain decoding tasks by following this approach. Multi-voxel pattern analysis is an alternative approach, in which the brain decoding task is shaped as a classification problem. In recent years, different neurocognitive studies have been designed according to this method of data analysis, drawing inferences from the estimated misclassification error [1]. However, fMRI-based brain decoding represents a source of challenging classification problems due to the extremely large feature-to-instance ratio, the low signal-to-noise ratio and the strong spatiotemporal correlation among features in the data. Besides, raw data is commonly mapped into n-dimensional feature vectors in which the structural information of the original data is not taken into account.

A structural representation is an alternative that addresses this problem. In brain connectivity studies, graphs have been shown to be suitable for representing brain data. Most of the related studies are concerned with the small-world properties of functional brain networks, and experiments have been carried out on resting-state fMRI data [2]. Usually, topological descriptors such as modularity, centrality and node-degree distribution are computed to characterize these networks. Although the analysis of these topological properties of a graph might be useful in brain decoding studies, applying classifiers directly to graph data would be a more robust approach.

Recently, the idea of graph encoding followed by graph embedding into a vectorial space was explored by Richiardi et al. [3]. After a simple graph encoding of the data, the vectors obtained by unfolding the upper triangular part of the adjacency matrix of each graph are used. Once the data is embedded into a vectorial space, traditional classifiers can be applied to solve the brain decoding problem. However, this method requires the realignment of functional data and the computation of brain atlases with a fixed number of regions. Furthermore, all graphs must have the same number of nodes and there must be an isomorphism between nodes in different graphs. These constraints restrict the flexibility and representative power of graph structures.

In this paper, we present a more general graph-based representation approach that allows using any kind of graph encoding of the data. Graphs can have different numbers of nodes and edges, and a mapping between nodes in different graphs is not required. In this approach, graph kernels are the keystones that allow taking advantage of both the representative power of graphs and the discrimination power of kernel-based classifiers. Roughly speaking, a graph kernel is a similarity measure between graphs that can be seen as an inner product in an embedding feature space. Machine learning tasks such as classification can be carried out with linear methods based entirely on the inner product computed via the graph kernel in this feature space.

To the best of our knowledge, only two methods based on graph kernels have previously been proposed for brain decoding problems [4], [5]. They show the suitability of this approach. However, in [4] the realignment of functional data is required and the computational complexity of the proposed graph kernel is very high. The workaround is to reduce the number of nodes and edges of the graph by upscaling the representation of the brain data. In [5], a region-based encoding is also proposed, in which functional, structural and geometrical information is taken into account, and a simple graph kernel is used to compare graphs. In both cases, since a single node represents a large region of the brain, the encoding prevents the detection of relevant changes inside these regions. In our proposal, the graph resolution can reach the voxel level, i.e. each node is associated with a single voxel. Besides, an efficient and expressive graph kernel is used as the similarity measure between graphs.
II. METHODS
Let $T = \{T_1, T_2, \ldots, T_n\}$ be the set of $n$ trials (which can be intra- or inter-subject) in an fMRI experiment and let $Y = \{Y_1, Y_2, \ldots, Y_n\}$ be the class labels (stimuli) associated with the $n$ trials. In the multi-voxel pattern analysis approach, the raw four-dimensional volume data of each trial $T_i$ is mapped into a real vector $X_i$. In this way, a dataset of class-labeled samples $D_V = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ is obtained and standard machine learning classifiers are applied to this vectorial representation. In the proposed approach, each trial $T_i$ is mapped into a graph $G_i$ and the brain decoding problem is transformed into a graph classification problem with the class-labeled graph dataset $D_G = \{(G_1, Y_1), \ldots, (G_n, Y_n)\}$.

Although graphs can be more informative structures than feature vectors, working with graphs is more challenging. They are harder to manipulate, and the computational complexity of many algorithms increases when using graphs instead of vectors. One way to overcome these limitations is to use kernel methods for structured data, such as graph kernels [6]. In the following, we first present our graph encoding for each trial of fMRI data. Then, kernel functions, and graph kernels in particular, are briefly introduced. Afterwards, we review the Weisfeiler-Lehman graph kernel and give the arguments for selecting this kernel for our method.

A. Graph encoding of fMRI data
We use simple, undirected and node-labeled graphs $G = (V, E, \ell)$ to encode the information in each fMRI trial $T_i$, where $V$ is a set of nodes, $E \subset V \times V$ a set of undirected edges and $\ell : V \to \Sigma$ a function that assigns a label from an alphabet $\Sigma$ to each node in the graph. Each trial $T_i$ is composed of a set of voxel time series $T_i = \{w_1, w_2, \ldots, w_m\}$; these voxels can correspond to whole-brain data or to specific ROI(s). In the graph $G$, there is a node associated with each voxel in $T_i$, i.e. $V = \{v_1, v_2, \ldots, v_m\}$, where $v_j$ is the node associated with the voxel $w_j$. Edges are computed by using a bounded similarity measure $\Gamma$ between time series (voxels) and a threshold value $\tau$. Formally, there is an edge $e_{jk} \in E$ connecting the nodes $v_j$ and $v_k$ if and only if the similarity of voxels $w_j$ and $w_k$ is greater than or equal to the threshold $\tau$, i.e., $e_{jk} \in E \Leftrightarrow \Gamma(w_j, w_k) \geq \tau$. In practice, different similarity measures can be used to compare time series (see [7] for a review); for example, the Pearson correlation coefficient and the cosine similarity are two suitable options. The node labeling function $\ell$ is defined such that the degree of each node is used as its label.

B. Graph Kernels
Given a non-empty set $\mathcal{X}$, a positive definite kernel (hereafter called kernel for simplicity) $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is a function that satisfies the symmetry and positive definiteness properties [8]. It is known that if $k$ is a kernel function, there is a mapping $\phi : \mathcal{X} \to \mathcal{H}$ from $\mathcal{X}$ to some Hilbert space $\mathcal{H}$ such that $k(x, x') = \langle \phi(x), \phi(x') \rangle_{\mathcal{H}}$ for all $x, x' \in \mathcal{X}$, where $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ denotes the dot product in $\mathcal{H}$. As a particular case, a graph kernel $k : \mathcal{G} \times \mathcal{G} \to \mathbb{R}$ is a kernel function defined on a graph domain $\mathcal{G}$ (the set of all possible graphs). Therefore, given a class-labeled graph dataset $D_G = \{(G_1, Y_1), \ldots, (G_n, Y_n)\}$ and a graph kernel $k$, kernel methods such as SVM can be directly applied to the graph classification problem.

Graph kernels are commonly based on the idea of convolution kernels, i.e. they decompose graphs into a set of simpler structures such as walks, trees and subgraphs. The comparison of two graphs is then based on the similarity between all pairs of these structures. Based on this idea, several graph kernels have been proposed in the last ten years, for example kernels based on random walks, graphlets and subtree patterns [9]. The first graph kernels were computationally very expensive. Therefore, despite their nice theoretical properties, their application scope was very limited. They were mainly used in molecule classification problems, in which graphs are usually very small. Besides computational complexity (efficiency), expressivity is another important property of a graph kernel. Expressivity means that the kernel should be a non-trivial similarity measure between graphs that takes into account enough information from both graphs to make a comparison. Recently, the Weisfeiler-Lehman subtree kernel was proposed [10], which represents a very good trade-off between expressivity and efficiency.

C. Weisfeiler-Lehman graph kernels
The Weisfeiler-Lehman subtree kernel is a graph kernel based on the 1-dimensional variant of the Weisfeiler-Lehman test of graph isomorphism [10]. Given two graphs $G$ and $G'$, this test consists of an iterative process that starts by comparing the node labels of both graphs. Afterwards, a new artificial label is computed for each node by compressing the node labels of its neighboring nodes, and the graphs with the compressed labels are compared. This process is repeated until the node label sets of $G$ and $G'$ differ, or a maximum number of iterations is reached. In the graph kernel case, the goal is to compute a similarity value between the pair of graphs, instead of trying to determine whether they are isomorphic or not. The kernel is also computed through an iterative process, in which the common original and compressed labels in the two graphs are counted as in the original test. More formally, given two graphs $G$ and $G'$, the Weisfeiler-Lehman subtree kernel with $h$ iterations is defined as:

$$k_{WL}^{(h)}(G, G') = \left\langle \phi^{(h)}(G), \phi^{(h)}(G') \right\rangle \qquad (1)$$

where, defining $\Sigma^{(h)}$ as the set of all labels (original and compressed) that occur as node labels at least once in $G$ or $G'$ at the end of the $h$-th iteration and $c : \{G, G'\} \times \Sigma^{(h)} \to \mathbb{N}$ a function such that $c(G, \sigma)$ is the number of occurrences of the letter $\sigma \in \Sigma^{(h)}$ in the graph $G$, we have:

$$\phi^{(h)}(G) = \left( c(G, \sigma_1), c(G, \sigma_2), \ldots, c(G, \sigma_{|\Sigma^{(h)}|}) \right) \qquad (2)$$

Because this kernel is based on the Weisfeiler-Lehman test of isomorphism, we consider it expressive enough to reflect the real similarity between graphs. Moreover, it can be computed in time $O(h|E|)$, which makes it very efficient. Furthermore, there is an optimization for the computation of this kernel on $n$ graphs that improves on the naive $n^2$-fold application of the kernel to all possible pairs of graphs. Efficiency is the key property that allows its application in our method.
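The graph encoding of Section II-A can be sketched in a few lines. The following is only an illustrative sketch, not the authors' implementation: the function names `pearson` and `encode_trial`, the adjacency-set representation and the toy trial are assumptions made here, and the Pearson correlation coefficient is just one of the similarity measures the text mentions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant series is treated as uncorrelated
    return cov / sqrt(var_x * var_y)


def encode_trial(time_series, tau):
    """Map one trial (a list of voxel time series) to a node-labeled graph.

    Returns (adjacency, labels): adjacency[j] is the set of neighbors of
    node j, and labels[j] is the degree of node j, used as its label.
    """
    m = len(time_series)
    adjacency = [set() for _ in range(m)]
    for j in range(m):
        for k in range(j + 1, m):
            # Edge rule: similarity greater than or equal to the threshold.
            if pearson(time_series[j], time_series[k]) >= tau:
                adjacency[j].add(k)
                adjacency[k].add(j)
    labels = [len(neighbors) for neighbors in adjacency]
    return adjacency, labels


# Hypothetical toy trial: voxels 0 and 1 co-vary, voxel 2 moves the other way.
trial = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.5],
    [4.0, 3.0, 2.0, 1.0],
]
adjacency, labels = encode_trial(trial, tau=0.5)
```

The pairwise loop costs O(m^2) similarity evaluations for m voxels, which is acceptable for a region of interest but would probably need a vectorized implementation for whole-brain data.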
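The label counting behind the Weisfeiler-Lehman subtree kernel of Section II-C can be illustrated as follows. This is a minimal sketch under stated assumptions, not the optimized algorithm of [10]: the names `wl_feature_counts` and `wl_kernel_matrix` are hypothetical, graphs are given as (adjacency, labels) pairs, labels must be hashable and mutually comparable, and, following [10], each node's own label is compressed together with the sorted labels of its neighbors. All graphs are relabeled with one shared compression table, so the label counts are computed once per graph rather than once per pair.

```python
from collections import Counter


def wl_feature_counts(graphs, h):
    """Count original and compressed labels of each graph over h iterations.

    Each graph is a pair (adjacency, labels): adjacency[v] is an iterable
    with the neighbors of node v and labels[v] is its initial label.
    """
    current = [list(labels) for _, labels in graphs]
    # Iteration 0: tag labels with the iteration index so that original
    # and compressed alphabets never collide.
    counts = [Counter((0, lab) for lab in labs) for labs in current]
    for it in range(1, h + 1):
        table = {}  # neighborhood signature -> compressed label (shared)
        relabeled = []
        for (adjacency, _), labs in zip(graphs, current):
            new_labs = []
            for v, lab in enumerate(labs):
                signature = (lab, tuple(sorted(labs[u] for u in adjacency[v])))
                new_labs.append(table.setdefault(signature, len(table)))
            relabeled.append(new_labs)
        current = relabeled
        for c, labs in zip(counts, current):
            c.update((it, lab) for lab in labs)
    return counts


def wl_kernel_matrix(graphs, h):
    """Gram matrix K[i][j] = <phi(G_i), phi(G_j)> of the label-count vectors."""
    counts = wl_feature_counts(graphs, h)
    n = len(graphs)
    return [
        [sum(counts[i][key] * counts[j][key] for key in counts[i]) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy graphs with node degrees as labels, as in Section II-A.
path = ([[1], [0, 2], [1]], [1, 2, 1])            # 3-node path
triangle = ([[1, 2], [0, 2], [0, 1]], [2, 2, 2])  # 3-node cycle
K = wl_kernel_matrix([path, triangle], h=1)       # [[10, 3], [3, 18]]
```

In the toy example the only label shared by the two graphs is the original label 2 (once in the path, three times in the triangle), which gives the off-diagonal value 3. The resulting Gram matrix is the kind of input a kernel classifier with a precomputed kernel would expect.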
a given stimulus the area has the same size and the same location (SFO). The results are reported in Table I, second column. Both encodings are equally effective. The second experiment considered a different pattern of response, where for stimulus A the size of the activation area is not stationary across trials. The results are reported in Table I, third column. In this case, the vector encoding is more effective, since the graph encoding is sensitive to whether the activation area is stationary across trials. In the third experiment, the patterns of activation for both stimuli were stationary, but the location of the activation areas varied across trials of the same stimulus. The results are reported in Table I, fourth column. The graph encoding yields better results in this experiment, since the vectorial encoding is very sensitive to features that overlap only partially across trials. The three experiments are graphically represented in Fig. 1. The first row (panels (a), (b) and (c)) shows stimulus A for the vectorial encoding in the three configurations. The second row (panels (d), (e) and (f)) shows the difference between the average graph embedding φ (see eq. (2)) for stimuli A and B in the three cases.
III. MATERIALS
A. Synthetic Data
To investigate the behavior of graph kernels on fMRI recordings, we designed 5 different synthetic datasets. All of them are shaped as a two-dimensional square of 10 by 10 voxels. The timecourse of each voxel was computed according to the Balloon model [11]. Voxel behaviour is designed either as auto-regressive Gaussian noise or as the response to a block of 6 stimuli with TR = 2.5 seconds. Each trial was 32 seconds long, with stimulus onset after the first 5 seconds. Auto-regressive Gaussian noise was added to the timecourse computed by the Balloon model. The design of the different datasets follows the idea of having two different stimuli, A and B, and different patterns of response. The first pattern of response for stimulus A was defined as a stationary full overlapping (SFO) area of 5 by 5 voxels in the lower left corner, while the pattern of response for stimulus B was defined as an area of 3 by 3 voxels in the top right corner. We generated 32 samples for each pattern, matching the size of the real data presented in the next section. A further 32 samples for stimulus A were generated with mixed patterns of response, half of them with 5 by 5 voxel areas and the other half with 3 by 3 voxel areas, as non-stationary overlapping (NSO) areas. Finally, we generated a fourth and a fifth pattern of response, for stimuli A and B, where each sample has only a partial overlap (SPO) with the other samples.
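As an illustration of the spatial layout only, the two SFO activation areas can be written as binary masks on the 10 by 10 grid. This sketch does not simulate the Balloon model or the noise; the helper `activation_mask` and the orientation convention (row 0 at the top, column 0 at the left) are assumptions made here.

```python
GRID = 10  # the synthetic slice is 10 by 10 voxels


def activation_mask(rows, cols):
    """Return a GRID x GRID 0/1 mask that is 1 inside the given row/column ranges."""
    return [
        [1 if r in rows and c in cols else 0 for c in range(GRID)]
        for r in range(GRID)
    ]


# Assumption: row 0 is the top of the slice and column 0 is its left edge.
sfo_a = activation_mask(range(5, 10), range(0, 5))  # 5x5 area, lower left corner
sfo_b = activation_mask(range(0, 3), range(7, 10))  # 3x3 area, top right corner
```

Active voxels would then receive a simulated hemodynamic response plus noise, and the remaining voxels noise only.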
TABLE I
AVERAGE CLASSIFICATION ACCURACY OF BOTH VECTORIAL- AND GRAPH-BASED APPROACHES WITH SYNTHETIC DATA. THREE DIFFERENT CONFIGURATIONS OF STIMULUS A AND B ARE PRESENTED. RESULTS CORRESPOND TO A LEAVE-ONE-OUT CROSS-VALIDATION.

Encoding   SFO-A vs. SFO-B   NSO-A vs. SFO-B   SPO-A vs. SPO-B
Vector     1.0               1.0               0.47
Graph      1.0               0.70              0.96

B. Real Data
We use a subset of stimuli, Face and House, and a specific region of the brain, the ventral temporal cortex, from the dataset of Haxby et al. [12], provided within the PyMVPA Python package (footnote 3). It includes the recordings of 6 subjects with 12 runs per subject. In each run, eight visual stimuli were presented to each subject in random order. Images of the same category were displayed in a block for 22.5 s, and blocks were separated by 12.5 s rest periods. A volume was taken every 2.5 s (scan repetition time TR = 2500 ms). Therefore, for each run 72 volumes were recorded (9 TRs for each of 8 categories). The total number of voxels is 43193, but the original data comes with 5 masks in functional space that allow using specific regions of interest. We used three of them ("mask4_vt", "mask8_face_vt" and "mask8_house_vt") in our experiments. In the mask names, "vt" refers to ventral temporal cortex, and the "face" and "house" masks are GLM-contrast-based localizer maps. In our experiments, as we only use the face and house categories, we consider a subset of 144 samples coming from the 6 subjects (24 from each one) that belong to the 2 specified classes (categories).
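The counts quoted for the real data in Section III-B can be checked with a short calculation. The first three identities use only numbers given in the text; the last one, which splits the 24 samples per subject into 12 runs times 2 categories, is an assumption that is consistent with the description but not stated explicitly.

```latex
\begin{align*}
  \text{TRs per block}       &= \frac{22.5\ \mathrm{s}}{2.5\ \mathrm{s}} = 9 \\
  \text{volumes per run}     &= 9 \times 8 = 72 \\
  \text{samples in total}    &= 6 \times 24 = 144 \\
  \text{samples per subject} &= 12 \times 2 = 24 \quad \text{(assumed: one Face and one House block per run)}
\end{align*}
```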
[Figure 1 panels: (a) SFO-A, (b) NSO-A, (c) SPO-A, (d) SFO-A vs. SFO-B, (e) NSO-A vs. SFO-B, (f) SPO-A vs. SPO-B]
Fig. 1. Graphical representation of experiments in Table I. Top [(a),(b),(c)]: Vectorial-based approach for stimulus A. Bottom [(d),(e),(f)]: Difference in the average mapping of graphs from stimulus A and B.
IV. RESULTS
A. Synthetic data analysis
The empirical investigation on synthetic data aimed to understand the influence of different patterns of response on the results of the discrimination task when adopting either the traditional vectorial encoding or the graph-based one. The first experiment was designed as a binary contrast between stimuli A and B where the patterns of activation were stationary and fully overlapping, i.e. for each trial and for

B. Intra-subject analysis
We applied leave-one-out cross-validation to both the vectorial and the graph-based approaches with the data of each subject in the Haxby dataset. For the vectorial approach, we unfolded the four-dimensional volume data of each trial and used each time series value of each voxel as a feature. An SVM with a linear kernel was applied as the classifier. For the graph approach, the graph encoding was done as explained in Section II, with threshold τ = 0.5 and number of iterations h = 3. An SVM was also applied as the classifier, but using the Weisfeiler-Lehman graph kernel. Average classification accuracy values for each individual subject using three data masks are presented in Table II.

3 http://data.pymvpa.org/datasets/haxby2001
Results with real data corroborate those obtained with synthetic data. In the intra-subject experiment, where patterns of activation are expected to be around the same location, the vectorial encoding approach generally performs better. The good performance of the graph approach nevertheless shows that graph encoding at the voxel level can also be used for discrimination tasks. However, chance-level results are obtained in some cases, namely subject 5 with the first mask and subject 3 with the other two masks. In the inter-subject studies, the location of information can vary across subjects due to differences in anatomical structure and functional organization. The graph encoding approach has been shown to be robust to this variability. Results in Table III suggest the suitability of the graph-based approach for this kind of problem. Although the accuracy values are not very high, the fact that they are above chance level means that this approach is able to discriminate between classes in this kind of problem. The table also shows that the classification results do not change much when the threshold τ takes values in a reasonable interval. This robustness with respect to the parameter is important, since the optimum value could vary from one dataset to another. The two approaches are complementary: each can outperform the other in different cases. The graph-based approach seems to be especially promising for inter-subject analysis problems, which are a difficult and important kind of brain decoding problem.
TABLE II
AVERAGE CLASSIFICATION ACCURACY OF BOTH VECTORIAL- AND GRAPH-BASED APPROACHES FOR INTRA-SUBJECT EXPERIMENT WITH THE HAXBY DATASET (FACE VS. HOUSE). DIFFERENT ROIS (MASKS) ARE USED. RESULTS CORRESPOND TO A LEAVE-ONE-OUT CROSS-VALIDATION.

Mask       Encod.   Sub1   Sub2   Sub3   Sub4   Sub5   Sub6
vt         Vector   1.0    1.0    1.0    1.0    1.0    1.0
vt         Graph    1.0    0.91   0.73   0.86   0.43   0.78
face vt    Vector   1.0    0.79   0.83   1.0    0.91   1.0
face vt    Graph    0.82   0.73   0.57   0.83   0.70   0.85
house vt   Vector   0.92   0.92   0.96   1.0    0.91   0.83
house vt   Graph    1.0    1.0    0.46   0.96   0.86   1.0
C. Inter-subject analysis
In order to use the vectorial approach for inter-subject experiments, an alignment of functional data is required. In the case of the Haxby dataset, where there are different masks for each subject, this problem is even more complex. Therefore, we decided to use only the graph-based approach in this inter-subject experiment. For the graph approach, the inter-subject problem is completely transparent. We used the same configuration as in the previous (intra-subject) experiment, even though subjects are not aligned and the size of the masks (number of voxels) varies from one subject to another. Average classification accuracies using different threshold values (τ ∈ {0.2, 0.4, 0.6, 0.8}) are presented in Table III. These results correspond to a leave-one-subject-out cross-validation, i.e., in each of the 6 folds, data from 5 subjects (120 samples) is used to train the classifier and the data from the remaining subject (24 samples) is used for testing.
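The fold structure of the leave-one-subject-out protocol can be made concrete with a short sketch. The helper name `leave_one_subject_out` and the layout of `subject_ids` are assumptions introduced for illustration; only the counts (6 subjects, 24 samples each) come from the text.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), one fold per distinct subject."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


# 6 subjects with 24 samples each, as in the Face vs. House subset.
subject_ids = [subject for subject in range(1, 7) for _ in range(24)]
folds = list(leave_one_subject_out(subject_ids))
```

With a precomputed graph-kernel Gram matrix, each fold would train on the rows and columns indexed by `train_idx` and evaluate on the rows indexed by `test_idx` against the columns indexed by `train_idx`.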
ACKNOWLEDGMENT
This research has been supported by the RESTATE Programme, co-funded by the European Union under the FP7 COFUND Marie Curie Action - Grant agreement no. 267224.

REFERENCES
[1] F. Pereira, T. Mitchell, and M. Botvinick, “Machine learning classifiers and fMRI: A tutorial overview,” NeuroImage, vol. 45, no. 1, pp. S199–S209, 2009.
[2] E. Bullmore and O. Sporns, “Complex brain networks: graph theoretical analysis of structural and functional systems,” Nat Rev Neurosci, vol. 10, no. 3, pp. 186–198, Mar. 2009.
[3] J. Richiardi, H. Eryilmaz, S. Schwartz, P. Vuilleumier, and D. V. D. Ville, “Decoding brain states from fMRI connectivity graphs,” NeuroImage, vol. 56, no. 2, pp. 616–626, 2011.
[4] F. Mokhtari and G.-A. Hossein-Zadeh, “Decoding brain states using backward edge elimination and graph kernels in fMRI connectivity networks,” J Neurosci Methods, vol. 212, no. 2, pp. 259–268, 2013.
[5] S. Takerkart, G. Auzias, B. Thirion, D. Schön, and L. Ralaivola, “Graph-based inter-subject classification of local fMRI patterns,” in Machine Learning in Medical Imaging, LNCS, 2012, vol. 7588, pp. 184–192.
[6] S. Vishwanathan, N. N. Schraudolph, R. Kondor, and K. M. Borgwardt, “Graph kernels,” J. Mach. Learn. Res., vol. 11, pp. 1201–1242, 2010.
[7] T. W. Liao, “Clustering of time series data - a survey,” Pattern Recognition, vol. 38, no. 11, pp. 1857–1874, 2005.
[8] T. Hofmann, B. Schölkopf, and A. J. Smola, “Kernel methods in machine learning,” The Annals of Statistics, vol. 36, no. 3, pp. 1171–1220, 2008.
[9] M. Rupp and G. Schneider, “Graph kernels for molecular similarity,” Molecular Informatics, vol. 29, no. 4, pp. 266–273, 2010.
[10] N. Shervashidze, P. Schweitzer, E. J. van Leeuwen, K. Mehlhorn, and K. M. Borgwardt, “Weisfeiler-Lehman Graph Kernels,” J. Mach. Learn. Res., vol. 12, pp. 2539–2561, 2011.
[11] Y. Zheng, J. Martindale, D. Johnston, M. Jones, J. Berwick, and J. Mayhew, “A Model of the Hemodynamic Response and Oxygen Delivery to Brain,” NeuroImage, vol. 16, no. 3, pp. 617–637, 2002.
[12] J. V. Haxby, M. I. Gobbini, M. L. Furey, A. Ishai, J. L. Schouten, and P. Pietrini, “Distributed and overlapping representations of faces and objects in ventral temporal cortex,” Science, vol. 293, no. 5539, pp. 2425–2430, 2001.
TABLE III
AVERAGE CLASSIFICATION ACCURACY (WITH STANDARD DEVIATION BETWEEN FOLDS) ON LEAVE-ONE-SUBJECT-OUT CROSS-VALIDATION OF THE GRAPH APPROACH ON HAXBY DATASET (FACE VS. HOUSE). DIFFERENT MASKS AND THRESHOLD VALUES ARE USED.

Mask       τ = 0.2       τ = 0.4       τ = 0.6       τ = 0.8
vt         0.64 (0.10)   0.63 (0.11)   0.66 (0.12)   0.64 (0.10)
face vt    0.64 (0.11)   0.64 (0.11)   0.70 (0.10)   0.61 (0.08)
house vt   0.62 (0.09)   0.76 (0.12)   0.75 (0.11)   0.74 (0.11)
V. DISCUSSION
In the experiments with synthetic data, we explored the behavior of both the vectorial and the graph-based approaches when the patterns of activation and the location of the patterns vary. The results in Table I and the plots in Fig. 1 reveal two scenarios in which the two approaches give opposite results. When the location was fixed but the pattern changed, the vectorial encoding outperformed the graph encoding. However, when information is given in a fixed pattern, but the location of the pattern can vary across trials of the same stimulus, the graph encoding is a better choice. Moreover, in the case where the location and the pattern of information were similar, both approaches reached high accuracy values.