CONVOLUTIONAL NEURAL NETWORKS ON IRREGULAR DOMAINS THROUGH APPROXIMATE TRANSLATIONS ON INFERRED GRAPHS


Bastien Pasdeloup, Vincent Gripon, Jean-Charles Vialatte, Dominique Pastor
IMT Atlantique — UMR CNRS Lab-STICC
[email protected]

This work was partially funded by the Labex CominLabs project Neural Communication, and by Région Bretagne. Additionally, the authors would like to thank NVIDIA Corporation for the donation of the Titan X GPU used in the experiments.

ABSTRACT

We propose a generalization of convolutional neural networks (CNNs) to irregular domains, through the use of an inferred graph structure. In more detail, we introduce a three-step methodology to create convolutional layers adapted to the signals to process: 1) From a training set of signals, infer a graph representing the topology on which they evolve; 2) Identify translation operators in the vertex domain; 3) Emulate a convolution operator by translating a localized kernel on the graph. Using these layers, a convolutional neural network is built and trained on the initial signals to perform a classification task. Contributions are twofold. First, we adapt a definition of translations on graphs to make them more robust to irregularities, and to take into account the locality of the kernel. Second, we introduce a procedure to build CNNs from data. We apply our methodology to a scrambled version of the CIFAR-10 dataset and to the Haxby dataset. Without using any knowledge of the signals, we significantly outperform existing methods. Moreover, our approach extends classical CNNs on images, in the sense that such networks are a particular case of our approach when the inferred graph is a grid.

Index Terms— graph signal processing, graph inference, translations on graphs, convolutional neural networks

1. INTRODUCTION

Convolutional neural networks have considerably improved the classification of images, thanks to their ability to identify localized objects at various locations in input signals [1]. For this reason, generalizing such networks to non-image signals has recently been a very active field of research [2, 3, 4, 5, 6, 7, 8, 9]. In CNNs, a convolutional layer is built by shifting a localized kernel to every possible location in input signals. This is motivated by the fact that objects in images are typically spatially coherent, localized, and have similar representations when shifted. These observations are

likely to be valid for many other kinds of signals, e.g., brain signals or sensor networks. A promising avenue of research in this area is graph signal processing, an emerging field that aims at generalizing harmonic operators to signals evolving on complex topologies represented by graphs. In particular, authors have been interested in the notions of translation, smoothness [10] and stationarity [11], which are the counterparts of the properties mentioned above. Nevertheless, most graph translation operators are defined in the spectral domain, and do not match Euclidean translations when used on grid graphs [12].

In this article, we introduce a novel methodology that generalizes classical CNNs, allowing their application to signals on irregular domains. Moreover, we do not assume that the graph on which the signals evolve is known, which makes our approach potentially applicable to any kind of signals. In detail, we proceed in three steps:

1. From a training set of signals, infer a graph representing the topology on which they evolve;
2. Identify translation operators in the vertex domain;
3. Emulate a convolution operator by translating a localized kernel on the graph [13].

We illustrate the performance of our approach on a scrambled version of the CIFAR-10 dataset of images [14], and on the Haxby dataset of functional magnetic resonance imaging (fMRI) signals [15]. For each of these application cases, we obtain significant improvements when compared to state-of-the-art methods.

This article is organized as follows: in Sec. 2, we review existing approaches to generalize CNNs to irregular domains. In Sec. 3, we detail the three steps of our methodology. In particular, we present the graph inference methods used, and propose a relaxation of the translations introduced in [16, 12] to make them more robust to irregularities in the graph and to take into account the locality of the kernel. Then, in Sec. 4, we perform experiments on the above-mentioned datasets. Finally, Sec. 5 concludes.

2. RELATED WORK

2.1. Convolution in the spectral domain

A first attempt to generalize CNNs was proposed by Bruna et al. [2], followed by Henaff et al. [4], in which the authors use the translation operator from the graph signal processing framework in place of Euclidean translations. This operator is defined in the spectral domain associated with the graph, as a dot product between the signal spectrum and the spectrum of a Dirac signal [10]. Due to this spectral definition, translating a signal on a graph with this operator preserves neither locality nor existing objects in the signal. The problem of locality preservation has been addressed by Defferrard et al. [6] by considering Chebyshev polynomial filters. Levie et al. [17] introduce Cayley polynomials, which also localize in the frequency domain. However, due to their spectral definitions, none of these approaches forces the entries of the kernel to remain the same upon translation.

2.2. Convolution in the vertex domain

Another approach is to perform convolutions directly in the vertex domain. Typically, the output of these convolutions at each vertex is a function of its neighbors and a weight kernel [18, 3, 5]. Locally, this operation can amount to a scalar product as in [13, 19, 7], exactly like convolutions on images. However, without an explicit translation on graphs, matching vertices from neighborhoods centered at different locations is performed arbitrarily. Some strategies make use of random walks [8], or learn kernels and matchings jointly [9]. In this paper, we identify translations explicitly. An advantage of vertex-domain approaches is that the graph can generally change over time; for this reason, the graph can also be considered as an input of the CNN. Other methods exist that only take a graph as an input [18, 19], which is a different problem.

3. METHODOLOGY

3.1. Inferring a graph from data

Graph inference from data has received a lot of attention in the past years, as it enables the use of graph signal processing tools when the topology is unknown. Selecting a graph implies making some assumptions, either on the graph or on the signals. In this article, we are particularly interested in graphs with the two following attributes:

• Given that objects are localized and have small variability, we expect signals to be smooth on the graph [10];
• In order to identify objects at various locations, signals should be stationary [11] on the graph.

To these ends, we consider the following graphs:

• $G_{\tilde{\Sigma}}$: Use of the sample covariance matrix $\tilde{\Sigma}$. This matrix features higher values for entries of the signals behaving similarly, which therefore favors smoothness;
• $G_K$: The matrix inferred by the method of Kalofolias [20] (for $\alpha = \beta = 10^{-3}$), which infers a graph with a smoothness prior;
• $G_S$: The matrix inferred using the method introduced in [21], which enforces consistency of the graph with stationary signals, and promotes sparsity of the solution;
• $G_K^*$: The regularization of $G_K$ to have it match a stationarity assumption on the signals. Details on the regularization method can be found in [21].

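As an illustration, the following is a minimal sketch of the simplest of these constructions, $G_{\tilde{\Sigma}}$ (the binarization step is described in the next paragraph). It is not the authors' implementation; in particular, using absolute covariance values as edge weights is our assumption, and the function name is illustrative only.

```python
import numpy as np

# Minimal sketch (not the authors' code): weight edges by the sample
# covariance of the training signals, as in the G_Sigma construction.
# Taking absolute values is our assumption; X holds one signal per row.
def covariance_graph(X):
    sigma = np.cov(X, rowvar=False)  # one variable (vertex) per column
    weights = np.abs(sigma)          # similar behavior -> high weight
    np.fill_diagonal(weights, 0)     # no self-loops
    return weights
```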
None of these methods infers a binary, sparse graph. To this end, we binarize each of them by keeping only the H highest entries per row of the adjacency matrix, then symmetrizing. Note that symmetry of the adjacency matrix is not required for the proposed method, but helps ensure connectivity of the graph.

3.2. Identifying translations on the inferred graph

In [16, 12], the authors have proposed a definition of translations that are analogous to Euclidean translations on grid graphs, while not making use of the metric space. Let $G = \langle V, E \rangle$ be a graph. A translation as introduced in [12] is a function $\psi : V \to V \cup \{\bot\}$, where $\bot$ allows the disappearance of signal entries, such that:

• $\psi$ is injective for vertices whose image is in $V$, i.e., $\forall v_1, v_2 \in V : (\psi(v_1) = \psi(v_2) \neq \bot) \Rightarrow (v_1 = v_2)$;
• $\psi$ is edge-constrained (EC), i.e., $\forall v \in V : \big((v, \psi(v)) \in E\big) \vee \big(\psi(v) = \bot\big)$;
• $\psi$ is strongly neighborhood-preserving (SNP), i.e., $\forall v_1, v_2 \in V : (v_1, v_2) \in E \Leftrightarrow \big((\psi(v_1), \psi(v_2)) \in E \vee (\psi(v_1) = \bot) \vee (\psi(v_2) = \bot)\big)$.

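To make these three properties concrete, here is a minimal sketch checking them for a candidate function, assuming the graph is given as a dict mapping each vertex to its set of neighbors, with None standing for $\bot$. Following the score function (2) below, the SNP check only compares pairs whose images are both kept.

```python
# Sketch, assuming adj: dict {vertex: set of neighbors} for an undirected
# graph, and psi: dict {vertex: image}, with None standing for ⊥.

def is_injective(psi):
    images = [w for w in psi.values() if w is not None]
    return len(images) == len(set(images))   # no two vertices share an image

def is_edge_constrained(psi, adj):
    # every vertex is sent to one of its neighbors, or disappears
    return all(w is None or w in adj[v] for v, w in psi.items())

def is_snp(psi, adj):
    # edges map to edges and non-edges to non-edges, whenever
    # neither endpoint of the pair is sent to ⊥
    kept = [v for v in psi if psi[v] is not None]
    return all((v2 in adj[v1]) == (psi[v2] in adj[psi[v1]])
               for v1 in kept for v2 in kept if v1 != v2)
```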
Two limitations prevent us from using this method directly. First, the authors have shown that identifying such functions is an NP-complete problem. Second, these definitions are meant to apply to grid graphs, or to very small deformations of grid graphs. In order to cope with these restrictions, we introduce two modifications:

3.2.1. Localization of the translation kernel

We define a kernel of parameter $P \in \mathbb{N}^+$, centered on a vertex $v_1 \in V$, as the maximal subset $V_{(v_1,P)} \subseteq V$ such that $\forall v_2 \in V_{(v_1,P)} : d(v_1, v_2) \leq P$, where $d(v_1, v_2)$ is the geodesic distance between vertices on the graph.

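Since the inferred graph is unweighted, the geodesic ball $V_{(v_1,P)}$ can be obtained by a breadth-first search truncated at depth $P$. A minimal sketch:

```python
from collections import deque

# Sketch: compute the kernel support V_(v1,P), i.e., all vertices at
# geodesic distance at most P from v1, by truncated breadth-first search.
def kernel_support(adj, v1, P):
    dist = {v1: 0}
    queue = deque([v1])
    while queue:
        v = queue.popleft()
        if dist[v] < P:                 # only expand inside the ball
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
    return set(dist)
```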
Observing that the kernel to translate is localized on the graph, and therefore that no vertex outside of $V_{(v_1,P)}$ contains signal energy, we restrict our study to functions $\tilde{\psi}$ such that:

$$\tilde{\psi} : V_{(v_1,P)} \to V_{(v_1,P+C)} \cup \{\bot\}, \quad (1)$$

where $C \in \mathbb{N}^+$ allows the definition of a larger set of vertices that includes $V_{(v_1,P)}$. We are only interested in such functions for which the vertex $v_1$ is sent to one of its neighbors. We call them approximate translations. In order to identify them, we choose in this paper to perform an exhaustive search. In the experiments in Sec. 4, we consider small values of $P$ and $C$, so that this search remains computationally tractable.

3.2.2. Relaxation of the constraints

We propose to controllably relax the properties of the translations in Sec. 3.2. More precisely, we are interested in functions $\tilde{\psi}$ that minimize the following score $\tilde{S}(\tilde{\psi})$:

$$\begin{aligned} \tilde{S}(\tilde{\psi}) = {} & \alpha \left|\left\{ v \in V_{(v_1,P)} : \tilde{\psi}(v) = \bot \right\}\right| \\ & + \beta \left|\left\{ v \in V_{(v_1,P)} : (v, \tilde{\psi}(v)) \notin E \wedge \tilde{\psi}(v) \neq \bot \right\}\right| \\ & + \gamma \Big|\Big\{ (v_1, v_2) \in V_{(v_1,P)}^2 : \Big( \big( (v_1, v_2) \in E \wedge (\tilde{\psi}(v_1), \tilde{\psi}(v_2)) \notin E \big) \\ & \quad \vee \big( (v_1, v_2) \notin E \wedge (\tilde{\psi}(v_1), \tilde{\psi}(v_2)) \in E \big) \Big) \wedge (\tilde{\psi}(v_1) \neq \bot) \wedge (\tilde{\psi}(v_2) \neq \bot) \Big\}\Big| \end{aligned} \quad (2)$$

where $\alpha$, $\beta$ and $\gamma$ are parameters controlling the respective importance of 1) the function loss (i.e., the number of vertices sent to $\bot$); 2) EC violations; and 3) SNP violations.

3.3. Creating a convolutional layer

Now that we have inferred a graph from the signals, and have introduced a method to identify approximate translations of a kernel, we proceed as follows (a sketch of the score and of the search is given after this list):

1. We initialize a kernel $V_{(v_1,P)}$ on one of the most central vertices in the graph, according to the closeness centrality of vertices [22];
2. For each neighbor $v_2$ of $v_1$, we identify an approximate translation allowing us to center the kernel on $v_2$;
3. For every possible new location of the kernel, we start again from step 2, until we reach a fixed point of the process. When multiple potential kernels are centered on the same vertex $v_3 \in V$, we only keep the one that minimizes the total score $\tilde{S}(\tilde{\psi}_1) + \tilde{S}(\tilde{\psi}_2) + \dots$ along the path from $v_1$ to $v_3$.

The algorithm stops when the best kernel (according to the score function) has been found for every possible center. As the number of elementary paths between any pair of vertices in a connected graph is finite, the algorithm terminates.
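As announced above, the sketch below scores a candidate function according to (2) and enumerates all candidates exhaustively, reusing kernel_support and is_injective from the previous sketches. It is a naive search whose cost grows exponentially with the kernel size, which is why small values of $P$ and $C$ are used in Sec. 4; it is an illustration under these assumptions, not the authors' implementation.

```python
from itertools import product

def score(psi, adj, alpha, beta, gamma):
    """Relaxed score of Eq. (2); psi maps V_(v1,P) to V_(v1,P+C) or None (⊥)."""
    loss = sum(1 for v in psi if psi[v] is None)
    ec = sum(1 for v in psi if psi[v] is not None and psi[v] not in adj[v])
    kept = [v for v in psi if psi[v] is not None]
    snp = sum(1 for v1 in kept for v2 in kept if v1 != v2
              and (v2 in adj[v1]) != (psi[v2] in adj[psi[v1]]))
    return alpha * loss + beta * ec + gamma * snp

def approximate_translations(adj, v1, P, C, alpha, beta, gamma):
    """Best approximate translation of V_(v1,P) towards each neighbor of v1."""
    src = sorted(kernel_support(adj, v1, P))
    dst = sorted(kernel_support(adj, v1, P + C)) + [None]
    best = {}
    for images in product(dst, repeat=len(src)):   # exhaustive search
        psi = dict(zip(src, images))
        if psi[v1] is None or psi[v1] not in adj[v1]:
            continue                               # v1 must move to a neighbor
        if not is_injective(psi):
            continue
        s = score(psi, adj, alpha, beta, gamma)
        if psi[v1] not in best or s < best[psi[v1]][1]:
            best[psi[v1]] = (psi, s)
    return best                                    # neighbor -> (psi, score)
```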

Method     $G_{\tilde{\Sigma}}$   $G_K$    $G_S$    $G_K^*$   [23]    3×3 (non-scrambled)
Accuracy   82.14%                 82.36%   79.17%   82.52%    72.7%   85.98%

Fig. 1. Correct classification rates obtained with our method, as a function of the graph inference technique used. The performances achieved by Cheng et al. [23], and by a classical CNN (3×3 kernels) on the non-scrambled version of the dataset, are also given for reference.

4. EXPERIMENTS

4.1. On the CIFAR-10 dataset

To demonstrate the performance of our methodology for classification, we consider a scrambled version of the CIFAR-10 dataset of images [14]. We choose a simple CNN architecture, consisting of seven convolutional layers defined using the methodology in Sec. 3, with various numbers of feature maps (96, 96, 96, 192, 192, 192, 96). Batch normalization is added after each convolutional layer, followed by a ReLU nonlinearity. A dropout of 0.5 is added on each layer after the third, in order to prevent overfitting. Finally, a dense layer of 10 neurons is added, followed by a SoftMax¹. Approximate translations are obtained with a kernel of parameters P = 1 and C = 1, and are evaluated using the score function in (2), with parameters α = 0.5, β = 1 and γ = 0.4. The network architecture, as well as these parameters, were chosen arbitrarily; the results could likely be improved with a grid search.

We apply our three-step algorithm to each of the graphs introduced in Sec. 3.1 (with H = 4), and compare the classification performance we obtain with the current best state-of-the-art method on the scrambled CIFAR-10 dataset [23]. In addition, we compare our results with those obtained by a classical CNN with the same architecture on the non-scrambled version of the dataset, for reference. For the latter, we use a 3×3 kernel; a sketch of this baseline is given at the end of this subsection.

The results, given in Fig. 1, demonstrate that our methodology significantly outperforms state-of-the-art classification methods that do not use prior knowledge of the signals. Moreover, our method almost reaches the performance of CNNs on the non-scrambled version of the dataset, as shown by the small difference between our best result and the CNN baseline on images. Due to the simplicity of the network considered here, the latter does not perform much better than the former. However, state-of-the-art methods that exploit the knowledge that signals are images can reach classification performance of up to 97.28% [25] on the non-scrambled dataset. This suggests that applying our method to define the layers of a more complex CNN could reach very high performance. This requires a proper definition of pooling on graphs, though, which is a direction for future work.

Additionally, we observe that the graphs built with smoothness assumptions perform best, suggesting that smoothness of the signals is a good assumption for the inference part. It is also interesting to notice that the regularization of the solution of Kalofolias, which enforces a graph on which signals are smooth to also match a stationarity assumption on the signals, slightly improves the result. Finally, using classical CNNs directly on the scrambled dataset leads to a correct classification rate of 56.74%, illustrating the interest of inferring an adapted topology.

¹The Python3/Keras implementation of this CNN is available at https://github.com/BastienPasdeloup/icassp2018
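As announced above, here is a minimal Keras sketch of the baseline architecture described in this subsection, instantiated with classical 3×3 convolutions as used on the non-scrambled dataset. In the graph version, each Conv2D would be replaced by the translation-based convolutional layer of Sec. 3 (the actual implementation is at the repository given in the footnote). The use of tensorflow.keras, the padding, and the optimizer are our assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the baseline CNN described in the text: seven convolutional
# layers (96, 96, 96, 192, 192, 192, 96 feature maps), each followed by
# batch normalization and ReLU, dropout of 0.5 on each layer after the
# third, then a dense SoftMax layer of 10 neurons.
def build_baseline_cnn(input_shape=(32, 32, 3), n_classes=10):
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for i, n_maps in enumerate([96, 96, 96, 192, 192, 192, 96]):
        model.add(layers.Conv2D(n_maps, 3, padding="same"))  # 3x3 kernel
        model.add(layers.BatchNormalization())
        model.add(layers.ReLU())
        if i >= 3:                        # layers after the third
            model.add(layers.Dropout(0.5))
    model.add(layers.Flatten())
    model.add(layers.Dense(n_classes, activation="softmax"))
    return model

model = build_baseline_cnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```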

4.2. On the Haxby dataset

The Haxby dataset of fMRI signals [15] consists of measurements of brain activity during a visualization task, in which images from 8 classes are shown to 6 subjects over 12 sessions. For each subject, activity in 163,840 voxels is measured. A parcellation is then applied to group the voxels using functional or geographical properties, reducing the number of variables to 444 [26]. As these locations are associated with coordinates in the brain, we choose here to build a graph $G_{geo}$ by computing the Euclidean distances between vertices, keeping the H = 6 highest entries per row, then symmetrizing. Contrary to CIFAR-10, this dataset does not come with predefined training and test sets. For this reason, we split the 12 sessions into a training set of 10 sessions and a test set of 2 sessions. To provide baselines, we adapted the classifiers in [24] to match these settings. Additionally, we compare with a multilayer perceptron (thus making no use of the underlying structure), made of three layers whose parameters were found by grid search. The CNN we consider consists of three layers with ReLU nonlinearities: one convolutional layer and two dense layers, followed by dropout, with parameters found by grid search. Approximate translations on $G_{geo}$ are obtained with a kernel of parameters P = 1 and C = 1, and are evaluated using the score function in (2), with parameters α = 1, β = 0.7 and γ = 0.5. The results we obtain are given in Fig. 2.

It is worth mentioning that we also considered the graph constructions introduced in Sec. 3.1. However, these graphs feature a few vertices with a high degree, due to the existence of hubs in brain activity. Because of these hubs, the kernel covers a large number of vertices; computation of approximate translations then becomes intractable, in addition to losing the locality aspect of CNNs. When using the coordinates of the vertices to build a graph, Fig. 2 demonstrates that the results of our method are comparable to existing methods, and in most cases better. This illustrates the applicability of the method to complex datasets that were previously not addressable with CNNs, provided that the inferred graph is not too irregular.

            $G_{geo}$  MLP     SVM     LR (C=1, l1)  LR (C=50, l1)  LR (C=1, l2)  Ridge
Subject 1   56%        54%     50%     51%           47%            49%           35%
Subject 2   52%        44%     42%     48%           47%            47%           34%
Subject 3   47%        48%     35%     44%           38%            41%           28%
Subject 4   53%        42%     32%     34%           28%            34%           22%
Subject 5   42%        39%     31%     42%           34%            31%           23%
Subject 6   49%        49%     40%     42%           40%            38%           31%
Average     49.9%      45.9%   38.2%   43.6%         38.9%          39.7%         28.8%

Fig. 2. Results of our method with the graph $G_{geo}$ on the Haxby dataset. Other results are obtained with a multilayer perceptron (MLP), and with the classifiers adapted from [24].
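Before concluding, here is a minimal sketch of the $G_{geo}$ construction used above. The conversion of distances to similarities before the top-H selection is our assumption (the text only specifies Euclidean distances, H = 6, and symmetrization), as is the Gaussian kernel.

```python
import numpy as np

# Sketch: build G_geo from the 3D coordinates of the 444 parcels.
# Distances are turned into similarities so that "highest entries"
# correspond to nearest parcels; this kernel choice is an assumption.
def build_geo_graph(coords, H=6):
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise Euclidean distances
    sim = np.exp(-(dist / dist.std()) ** 2)         # closer -> higher similarity
    np.fill_diagonal(sim, 0)                        # no self-loops
    adj = np.zeros_like(sim)
    rows = np.arange(len(sim))[:, None]
    adj[rows, np.argsort(sim, axis=1)[:, -H:]] = 1  # keep H highest per row
    return np.maximum(adj, adj.T)                   # symmetrize
```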

5. CONCLUSIONS

In this article, we have introduced a generalization of classical CNNs that allows their application to signals evolving on irregular, unknown topologies. Our method proceeds in three steps: it first infers a graph modeling the topology on which the signals evolve, then identifies approximate translations on this graph, and finally uses them to shift a convolutional kernel. This allows the creation of convolutional layers that are adapted to the data to process. On images, our algorithm significantly outperforms state-of-the-art methods that do not use prior knowledge of the signals. We have also applied it to fMRI data with relative success, illustrating the impact our method can have on the analysis of complex signals. Directions to extend this work are numerous. Inference of graphs with vertices of high degree remains a challenge, and we should investigate methods to address it. Extending other notions from classical neural networks, such as pooling or strides, is a direct continuation.

6. REFERENCES

[1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.

[14] Alex Krizhevsky and Geoffrey Hinton, “Learning multiple layers of features from tiny images,” 2009.

[2] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun, “Spectral networks and locally connected networks on graphs,” arXiv preprint:1312.6203, 2013.

[15] James V Haxby, M Ida Gobbini, Maura L Furey, Alumit Ishai, Jennifer L Schouten, and Pietro Pietrini, “Distributed and overlapping representations of faces and objects in ventral temporal cortex,” Science, vol. 293, no. 5539, 2001.

[3] James Atwood and Don Towsley, "Diffusion-convolutional neural networks," arXiv preprint:1511.02136, 2015.

[4] Mikael Henaff, Joan Bruna, and Yann LeCun, “Deep convolutional networks on graph-structured data,” arXiv preprint:1506.05163, 2015.

[16] Nicolas Grelier, Bastien Pasdeloup, Jean-Charles Vialatte, and Vincent Gripon, "Neighborhood-preserving translations on graphs," in GlobalSIP 2016: IEEE Global Conference on Signal and Information Processing, Arlington, United States, Dec 2016.

[5] Federico Monti, Davide Boscaini, Jonathan Masci, Emanuele Rodolà, Jan Svoboda, and Michael M Bronstein, "Geometric deep learning on graphs and manifolds using mixture model CNNs," arXiv preprint:1611.08402, 2016.

[17] Ron Levie, Federico Monti, Xavier Bresson, and Michael M. Bronstein, "CayleyNets: Graph convolutional neural networks with complex rational spectral filters," arXiv preprint:1705.07664, 2017.

[6] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst, "Convolutional neural networks on graphs with fast localized spectral filtering," in Advances in Neural Information Processing Systems, 2016.

[18] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams, "Convolutional networks on graphs for learning molecular fingerprints," in Advances in Neural Information Processing Systems, 2015.

[7] Gilles Puy, Srdan Kitic, and Patrick Pérez, "Unifying local and non-local signal processing with graph CNNs," arXiv preprint:1702.07759, 2017.

[8] Yotam Hechtlinger, Purvasha Chakravarti, and Jining Qin, "A generalization of convolutional neural networks to graph-structured data," arXiv preprint:1704.08165, 2017.

[9] Jean-Charles Vialatte, Vincent Gripon, and Gilles Coppin, "Learning local receptive fields and their weight sharing scheme on graphs," arXiv preprint:1706.02684, 2017.

[10] David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst, "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Magazine, vol. 30, no. 3, 2013.

[11] Benjamin Girault, "Stationary graph signals using an isometric graph translation," in EUSIPCO, Nice, France, Aug. 2015.

[12] Bastien Pasdeloup, Vincent Gripon, Nicolas Grelier, Jean-Charles Vialatte, and Dominique Pastor, "Translations on graphs with neighborhood preservation," arXiv preprint:1709.03859, 2017.

[13] Jean-Charles Vialatte, Vincent Gripon, and Grégoire Mercier, "Generalizing the convolution operator to extend CNNs to irregular domains," arXiv preprint:1606.01166, 2016.

[19] Mathias Niepert, Mohamed Ahmed, and Konstantin Kutzkov, "Learning convolutional neural networks for graphs," in International Conference on Machine Learning, 2016.

[20] Vassilis Kalofolias, "How to learn a graph from smooth signals," in Artificial Intelligence and Statistics, 2016.

[21] Bastien Pasdeloup, Vincent Gripon, Grégoire Mercier, Dominique Pastor, and Michael G Rabbat, "Characterization and inference of graph diffusion processes from observations of stationary signals," IEEE Transactions on Signal and Information Processing over Networks, 2017.

[22] Gert Sabidussi, "The centrality index of a graph," Psychometrika, vol. 31, no. 4, 1966.

[23] Xiuyuan Cheng, Xu Chen, and Stéphane Mallat, "Deep Haar scattering networks," Information and Inference: A Journal of the IMA, vol. 5, no. 2, 2016.

[24] "Different classifiers in decoding the Haxby dataset," goo.gl/pux2oG, 2015, Accessed: 2017-10-02.

[25] Anuvabh Dutt, Denis Pellerin, and Georges Quénot, "Coupled ensembles of neural networks," arXiv preprint:1709.06053, 2017.

[26] Pierre Bellec, Pedro Rosa, Oliver Lyttelton, Habib Benali, and Alan Evans, "Multi-level bootstrap analysis of stable clusters (BASC) in resting-state fMRI," NeuroImage, vol. 51, 2010.