A Graph Embedding Method Using The Jensen-Shannon Divergence

Lu Bai, Edwin R. Hancock⋆ and Lin Han

Department of Computer Science, University of York, UK
Deramore Lane, Heslington, York, YO10 5GH, UK
{lu,erh,lin}@cs.york.ac.uk
Abstract. Riesen and Bunke recently proposed a novel dissimilarity based approach for embedding graphs into a vector space. One drawback of their approach is the computational cost of the graph edit operations required to compute the dissimilarity between graphs. In this paper we explore whether the Jensen-Shannon divergence can be used as a means of computing a fast similarity measure between a pair of graphs. We commence by computing the Shannon entropy of a graph associated with a steady state random walk. We establish a family of prototype graphs by using an information theoretic approach to construct generative graph prototypes. With the required graph entropies and a family of prototype graphs to hand, the Jensen-Shannon divergence between a sample graph and a prototype graph can be computed. It is defined in terms of the entropies of the two separate graphs and the entropy of a composite structure formed from the pair. Since the required graph entropies can be computed efficiently, the proposed graph embedding using the Jensen-Shannon divergence avoids the burdensome graph edit operations. We evaluate our approach on several graph datasets abstracted from computer vision and bioinformatics databases.
1 Introduction

In pattern recognition, graph based object representations offer a versatile alternative to vector based representations. The main advantage of graph representations is their rich mathematical structure. Unfortunately, most of the standard pattern recognition and machine learning algorithms are formulated for vectors, and cannot be applied directly to graphs. One way to overcome this problem is to embed the graph data into a vector space, and then deploy vectorial methods. However, vector space embedding presents two obstacles. First, since graphs can be of different sizes, the vectors may be of different lengths. Second, the information residing on the edges of a graph is discarded. To overcome these problems, Riesen and Bunke recently proposed a method for embedding graphs into a vector space [10], which bridges the gap between the powerful graph based representation and the algorithms available for vector based representations. The ideas underpinning the graph dissimilarity embedding framework were first described in Duin and Pekalska's work [8]. Riesen and Bunke generalized and substantially extended the
⋆ Edwin R. Hancock is supported by a Royal Society Wolfson Research Merit Award.
methods to the graph mining domain. The key idea is to use the edit distance from a sample graph to a number of class prototype graphs to give a vectorial description of the sample graph in the embedding space. Furthermore, this approach potentially allows any (dis)similarity measure on graphs to be used for graph (dis)similarity embedding. Unfortunately, computing the edit distance between a sample graph and a prototype graph is burdensome, and as a result the graph dissimilarity embedding using the edit distance cannot be computed efficiently for graphs.

To address this inefficiency, in this paper we investigate whether the Jensen-Shannon divergence can be used as a means of establishing a computationally efficient similarity measure between a pair of graphs, and then use such a measure to develop a novel fast graph embedding approach. In information theory, the Jensen-Shannon divergence is a nonextensive mutual information theoretic measure based on nonextensive entropies. An extensive entropy is defined as the sum of the individual entropies of two probability distributions; the definition of nonextensive entropy generalizes this sum to composite operations. The Jensen-Shannon divergence is defined as a similarity measure between probability distributions, and is related to the Shannon entropy [7]. The problem of establishing Jensen-Shannon divergence measures for graphs is that of computing the required entropies for individual and composite graphs. In [1], we have used the steady state random walk on a graph to establish a probability distribution for this purpose. The Jensen-Shannon divergence between a pair of graphs is then defined as the difference between the entropy of a composite structure and the individual entropies of the pair. To determine a set of prototype graphs for vector space embedding, we use an information theoretic approach to construct the required graph prototypes [4]. Once the vectorial descriptions of a set of graphs are established, we perform graph classification in the principal component space. Experiments on graph datasets abstracted from bioinformatics and computer vision databases demonstrate the effectiveness and efficiency of the proposed graph embedding method.

This paper is organized as follows. Section 2 develops a Jensen-Shannon divergence measure between graphs. Section 3 reviews the concept of graph dissimilarity embedding, and shows how to compute the vectorial similarity descriptions for a set of graphs using the Jensen-Shannon divergence. Section 4 provides the experimental evaluation. Finally, Section 5 presents conclusions and future work.
2 The Jensen-Shannon Divergence on Graphs
In this section, we exploit the Jensen-Shannon divergence to develop a computationally efficient similarity measure for graphs. We commence by defining the Shannon entropy of a graph associated with its steady state random walk. We then develop the similarity measure by taking the Jensen-Shannon divergence between the graph entropies.
2.1 Graph Entropies

Consider a graph $G(V, E)$ with vertex set $V$ and edge set $E \subseteq V \times V$. The adjacency matrix $A$ of $G(V, E)$ has elements

$$A(i, j) = \begin{cases} 1 & \text{if } (i, j) \in E; \\ 0 & \text{otherwise.} \end{cases} \tag{1}$$

The vertex degree matrix of $G(V, E)$ is a diagonal matrix $D$ with diagonal elements given by $D(v_i, v_i) = d(i) = \sum_{j \in V} A(i, j)$.

Shannon Entropy. For the graph $G(V, E)$, the probability of the steady state random walk on $G(V, E)$ visiting vertex $i$ is $P_G(i) = d(i) / \sum_{j \in V} d(j)$. The Shannon entropy associated with the steady state random walk on $G(V, E)$ is

$$H_S(G) = -\sum_{i=1}^{|V|} P_G(i) \log P_G(i). \tag{2}$$
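As a concrete illustration, a minimal Python sketch of Eq. (2) follows; this is our own illustration rather than code from the paper, and the function name and NumPy adjacency matrix representation are our choices.

```python
import numpy as np

def shannon_entropy(A):
    """Shannon entropy of the steady state random walk on a graph,
    following Eq. (2). A is a symmetric 0/1 adjacency matrix."""
    d = A.sum(axis=1)             # vertex degrees d(i)
    P = d / d.sum()               # visiting probabilities P_G(i)
    P = P[P > 0]                  # adopt the convention 0 log 0 = 0
    return -np.sum(P * np.log(P))
```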
Time Complexity. For a graph $G(V, E)$ with $n = |V|$ vertices, computing the Shannon entropy $H_S(G)$ requires time complexity $O(n^2)$.

2.2 A Composite Entropy of A Pair of Graphs

To compute the Jensen-Shannon divergence between a pair of random walks on a pair of graphs $G_p(V_p, E_p)$ and $G_q(V_q, E_q)$, we require a method for constructing a composite structure $G_p \oplus G_q$ for the pair of graphs. For reasons of efficiency, we use the disjoint union as the composite structure. Following [2], the disjoint union graph of $G_p(V_p, E_p)$ and $G_q(V_q, E_q)$ is

$$G_{DU} = G_p \cup G_q = \{V_p \cup V_q, E_p \cup E_q\}. \tag{3}$$
Let the graphs $G_p$ and $G_q$ be the connected components of the disjoint union graph $G_{DU}$, and let $\rho_p = |V(G_p)|/|V(G_{DU})|$ and $\rho_q = |V(G_q)|/|V(G_{DU})|$. The entropy (i.e. the composite entropy) [5] of $G_{DU}$ is

$$H(G_{DU}) = \rho_p H(G_p) + \rho_q H(G_q). \tag{4}$$
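A hedged sketch of Eq. (4), again with names of our own choosing: the composite entropy is simply a size-weighted average of the component entropies.

```python
def composite_entropy(H_p, H_q, n_p, n_q):
    """Entropy of the disjoint union of G_p and G_q, following Eq. (4),
    given the component entropies and vertex counts."""
    n = n_p + n_q
    return (n_p / n) * H_p + (n_q / n) * H_q
```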
Here the entropy function $H(\cdot)$ is the Shannon entropy $H_S(\cdot)$ defined in Eq. (2).

2.3 The Jensen-Shannon Divergence on Graphs

The Jensen-Shannon divergence between the (discrete) probability distributions $P = (p_1, p_2, \ldots, p_K)$ and $Q = (q_1, q_2, \ldots, q_K)$, associated with the random walks on the graphs $G_p(V_p, E_p)$ and $G_q(V_q, E_q)$, is the negative definite (nd) function

$$D_{JS}(P, Q) = H_S\left(\frac{P + Q}{2}\right) - \frac{H_S(P) + H_S(Q)}{2}, \tag{5}$$
PK where HS (P ) = k=1 pk log pk is the Shannon entropy of the probability distribution P . Given a pair of graphs Gp (Vp , Eq ) and Gq (Vq , Eq ), the Jensen-Shannon divergence for them is DJS (Gp , Gq ) = H(Gp ⊕ Gq ) −
H(Gp ) + H(Gq ) . 2
(6)
where H(Gp ⊕ Gq ) is the entropy of the composite structure. Here we use the disjoint union defined in Sec.2.2 as the composite structure, and the entropy function H(·) is the Shannon entropy HS (·) defined in Eq.(2). Time Complexity For a pair of graphs Gp (Vp , Ep ) and Gq (Vq , Eq ) both having n vertices, computing the Jensen-Shannon divergence DJS (Gp , Gq ) defined in Eq. (6) requires time complexity O(n2 ).
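Combining the two sketches above, a minimal (illustrative, not the authors') implementation of Eq. (6) reads:

```python
def jensen_shannon_divergence(A_p, A_q):
    """Jensen-Shannon divergence between two graphs, Eq. (6), using the
    disjoint union of Sec. 2.2 as the composite structure."""
    H_p = shannon_entropy(A_p)
    H_q = shannon_entropy(A_q)
    H_du = composite_entropy(H_p, H_q, A_p.shape[0], A_q.shape[0])  # Eq. (4)
    return H_du - (H_p + H_q) / 2.0
```

Each call is dominated by the degree computations on the two adjacency matrices, matching the $O(n^2)$ complexity stated above.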
3 Graph Embedding Using The Jensen-Shannon Divergence

In this section, we explore how to use the Jensen-Shannon divergence as a means of embedding graph structures into a vector space. We commence by reviewing the definition of the graph dissimilarity embedding.

3.1 Graph Dissimilarity Embedding

In [10], Riesen and Bunke proposed a graph dissimilarity embedding that maps a sample graph to a vectorial description by computing the edit distances between the sample graph and a number of prototype graphs. For a sample graph $G_i(V_i, E_i)$ $(i = 1, \ldots, N)$ and a set of prototype graphs $T = \{T_1, \ldots, T_m, \ldots, T_n\}$, we take the dissimilarity between $G_i(V_i, E_i)$ and each prototype graph $T_m \in T$ as the $m$-th element of the $n$-dimensional vectorial description $V(G_i)$ of $G_i$. The mapping $\varphi_T^n : G_i \rightarrow \mathbb{R}^n$ is defined as

$$V(G_i) = \big(d(G_i, T_1), \ldots, d(G_i, T_m), \ldots, d(G_i, T_n)\big), \tag{7}$$
where $d(G_i, T_m)$ is the graph dissimilarity measure between $G_i(V_i, E_i)$ and the $m$-th prototype graph $T_m$. Riesen and Bunke proposed to use the graph edit distance as the dissimilarity measure, although their approach allows any (dis)similarity measure of graphs to be used.

3.2 A Graph Embedding Method using the Jensen-Shannon Divergence

The graph embedding procedure described in Section 3.1 offers us a principled way to develop a new graph embedding approach using the Jensen-Shannon divergence. Considering the sample graph $G_i(V_i, E_i)$ and the set of prototype graphs $T = \{T_1, \ldots, T_m, \ldots, T_n\}$, we compute the similarity between $G_i(V_i, E_i)$ and each prototype graph using the Jensen-Shannon divergence. As a result, the mapping $\varphi_T^n : G_i \rightarrow \mathbb{R}^n$ defined in Eq. (7) can be re-written as

$$V_{D_{JS}}(G_i) = \big(D_{JS}(G_i, T_1), \ldots, D_{JS}(G_i, T_m), \ldots, D_{JS}(G_i, T_n)\big), \tag{8}$$

where $D_{JS}(G_i, T_m)$ is the Jensen-Shannon divergence between the sample graph $G_i(V_i, E_i)$ and the $m$-th prototype graph $T_m$. Since the Jensen-Shannon divergence between graphs can be computed efficiently, the proposed embedding method is more efficient than the dissimilarity embedding using the costly graph edit distance. A minimal sketch of this embedding appears at the end of this section.

3.3 The Prototype Graph Selection

For our approach, the prototype graphs $T = \{T_1, \ldots, T_m, \ldots, T_n\}$ serve as reference points that transform graphs into real vectors. Hence, the aim of prototype graph selection is to find reference points which yield a meaningful vector in the embedding space. Intuitively, the prototype graphs should be able to characterize the structural variations present in the set of sample graphs to be embedded. Furthermore, these prototype graphs should be neither too redundant nor too simple. To locate the prototype graphs we make use of Luo and Hancock's probabilistic model of graph structure described in [6], and develop an information theoretic approach to selecting prototype graphs. By using a two-part minimum description length criterion [13, 11, 12], the selected prototype graphs trade off goodness-of-fit to the sample data against their intrinsic complexity. To formalize this idea, we locate the prototype graphs that minimize an overall code-length. The code-length of a set of sample graphs is the average of their Shannon-Fano code, which is equivalent to the negative logarithm of their likelihood function. The code-length describing the complexity of the prototype graphs is measured using the approximate von Neumann entropy [3]. To minimize the overall code-length, we develop a variant of the EM algorithm in which we view both the structure of the prototype graphs and the vertex correspondence information between the sample and prototype graphs as missing data. In the two interleaved steps of the EM algorithm, the expectation step involves recomputing the a posteriori probabilities of vertex correspondence, while the maximization step involves updating both the structure of the prototype graphs and the vertex correspondence information. More details of how we apply the minimum description length criterion and how the EM algorithm works can be found in [4].
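As promised above, a minimal sketch of the embedding of Eq. (8), building on the earlier sketches (all function names are our own):

```python
def jsd_embedding(A_sample, prototypes):
    """Vectorial description of Eq. (8): the Jensen-Shannon divergence
    from the sample graph to each of the n prototype graphs."""
    return np.array([jensen_shannon_divergence(A_sample, A_t)
                     for A_t in prototypes])
```

Here `prototypes` is a list of prototype adjacency matrices; stacking the resulting vectors row-wise gives the embedded dataset.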
4 Experimental Evaluation

In this section, we demonstrate the performance of our proposed method on several graph datasets abstracted from real-world image and bioinformatics databases. These datasets are ALOI, CMU, MUTAG and NCI109 [14]. The ALOI dataset consists of 54 graphs extracted from selected images of three similar boxes, and the CMU dataset consists of 54 graphs extracted from selected images of three similar toy houses. For each object in the ALOI and CMU datasets, there are 18 images captured from different viewpoints. The graphs are the Delaunay triangulations of feature points extracted from the different
images using the SIFT detector. The maximum and minimum numbers of vertices are 1288 and 295 for ALOI, and 495 and 27 for CMU. The MUTAG dataset is based on graphs representing 188 chemical compounds, and the aim is to predict whether each compound possesses mutagenicity. The maximum and minimum numbers of vertices are 28 and 10 respectively. As the vertices and edges of each compound are labeled with real numbers, we transform these graphs into unweighted graphs. The NCI109 dataset is based on unweighted graphs representing 4127 chemical compounds, and the aim is to predict whether each compound is active in an anti-cancer screen. The maximum and minimum numbers of vertices are 111 and 4 respectively.

4.1 Experiments on Graph Datasets

Experimental Setup: We evaluate the performance of our proposed graph embedding method using the Jensen-Shannon divergence (DEJS) on the four graph datasets. We compare our method against several alternative graph based learning methods: a) pattern vectors from coefficients of the Ihara zeta function (CIZF) [9], b) pattern vectors from algebraic graph theory (PVAG) [15], and c) the graph dissimilarity embedding using the edit distance (DEED) [10]. For our method and DEED, on each dataset we randomly divide the graphs into 20 folds and use 6 of these folds to learn 6 prototype graphs, from which we construct the 6-dimensional vectorial description of each testing graph. For the alternative methods CIZF and PVAG, we construct the vectorial description of each testing graph directly. We then perform 10-fold cross-validation with a KNN classifier to evaluate the performance of our method and the alternatives, using nine folds for training and one fold for testing. All the KNN classifiers were trained, and their parameters optimised, on the Weka workbench. We repeat the whole experiment 10 times and report the average classification accuracies in Table 1. We also report in Table 1 the runtime each method requires to establish the graph feature vectors, measured under Matlab R2011a on an Intel i5 3.2GHz 4-core processor.
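The paper performs classifier training and parameter optimisation in Weka; purely as an illustrative stand-in, the sketch below shows 10-fold cross-validation with a KNN classifier in Python using scikit-learn (the value of k and the dummy data are hypothetical placeholders):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# X: one 6-dimensional embedded vector per graph (Eq. (8)); y: class labels.
# Random dummy data stands in for the real embeddings here.
rng = np.random.default_rng(0)
X = rng.random((60, 6))
y = rng.integers(0, 3, size=60)

knn = KNeighborsClassifier(n_neighbors=3)    # k = 3 is a hypothetical choice
scores = cross_val_score(knn, X, y, cv=10)   # 10-fold cross-validation
print(f"mean accuracy: {scores.mean():.3f}")
```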
Table 1. Experimental Comparisons on Graph Datasets

Classification accuracy (%):

        ALOI    CMU     NCI109   MUTAG
DEJS    91.35   100     65.49    80.75
CIZF    −       100     67.19    80.85
PVAG    −       62.59   64.59    82.44
DEED    −       100     63.34    83.55

Runtime for establishing the graph feature vectors:

        ALOI    CMU     NCI109   MUTAG
DEJS    2"      1"      2"       1"
CIZF    −       2′33"   14"      1"
PVAG    −       5"      19"      1"
DEED    −       3h55′   17h49′   49′23"
Experimental Results: On the ALOI dataset, which contains graphs of more than one thousand vertices, our method takes 2 seconds, while DEED takes over one day and CIZF and PVAG both generate overflows during the computation. The runtimes of CIZF and PVAG are competitive with our method DEJS only on the MUTAG, NCI109 and CMU datasets, which contain smaller graphs. This reveals that our DEJS can easily scale up to graphs with thousands of vertices. DEED achieves classification accuracies competitive with our DEJS, but requires far more computation time. The graph similarity embedding using the Jensen-Shannon divergence measure is thus more efficient than that using the edit distance dissimilarity measure proposed by Riesen and Bunke.
The reason for this is that computing the Jensen-Shannon divergence between graphs requires time only quadratic in the number of vertices. Furthermore, both our embedding method and DEED require extra runtime for learning the required prototype graphs. For the ALOI, CMU, NCI109 and MUTAG datasets, the average times for learning a prototype graph are 5 hours, 30 minutes, 15 minutes and 5 minutes respectively. This reveals that for graphs of large size, our embedding method may require additional, and potentially expensive, computation for learning the prototype graphs. However, for graphs of fewer than 300 vertices, the learning of prototype graphs can still be completed in polynomial time.

4.2 Stability Evaluation

In this subsection, we investigate the stability of our proposed method DEJS. We randomly select three seed graphs from the ALOI dataset. We then apply random edit operations to the three seed graphs to simulate the effects of noise. The edit operations are vertex deletion and edge deletion. For each seed graph, we randomly delete a predetermined fraction of vertices or edges to obtain noise corrupted variants. The feature distance between an original seed graph $G_0$ and its noise corrupted counterpart $G_n$ is defined as their Euclidean distance

$$d_{G_0, G_n} = \sqrt{\big(V_{D_{JS}}(G_0) - V_{D_{JS}}(G_n)\big)^T \big(V_{D_{JS}}(G_0) - V_{D_{JS}}(G_n)\big)}. \tag{9}$$

We show the experimental results in Fig. 1 and Fig. 2, which show the effects of vertex and edge deletion respectively. The x-axis represents the fraction of vertices or edges deleted, from 1% to 35%, and the y-axis shows the Euclidean distance $d_{G_0, G_n}$ between the original seed graph $G_0$ and its noise corrupted counterpart $G_n$. From Fig. 1 and Fig. 2, there is an approximately linear relationship in each case. This implies that the proposed method possesses the ability to distinguish graphs under controlled structural error.
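A hedged sketch of the corruption procedure and the distance of Eq. (9); this is our own illustration, where `jsd_embedding` is the sketch from Section 3.2 and `prototypes` is assumed given.

```python
def delete_random_vertices(A, fraction, rng):
    """Return a noise corrupted copy of the graph with the given
    fraction of vertices deleted uniformly at random."""
    n = A.shape[0]
    keep = np.sort(rng.choice(n, size=int(round(n * (1 - fraction))),
                              replace=False))
    return A[np.ix_(keep, keep)]

def feature_distance(A0, An, prototypes):
    """Euclidean distance of Eq. (9) between the embedded vectors of a
    seed graph and its noise corrupted counterpart."""
    return np.linalg.norm(jsd_embedding(A0, prototypes)
                          - jsd_embedding(An, prototypes))
```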
[Figure] Fig. 1. Stability evaluation, vertex edit operations. Panels (a)-(c) plot the Euclidean distance (y-axis) against the number of node edit operations (x-axis, up to 35) for the three seed graphs.
[Figure] Fig. 2. Stability evaluation, edge edit operations. Panels (a)-(c) plot the Euclidean distance (y-axis) against the number of edge edit operations (x-axis, up to 35) for the three seed graphs.
5 Conclusion and Future Work

In this paper, we have shown how to use the Jensen-Shannon divergence as a means of embedding a sample graph into a vector space. We use an information theoretic approach to construct the required prototype graphs, and embed a sample graph into the feature space by computing the Jensen-Shannon divergence between the sample graph and each of the prototype graphs. We perform 10-fold cross-validation with a KNN classifier to assign the graphs to classes. Experimental results demonstrate the effectiveness and efficiency of the proposed method. Since learning the prototype graphs usually requires expensive computation, our future work is to define a faster approach to learning the prototype graphs, which will in turn yield a faster graph embedding method.

Acknowledgments

We thank Dr. Peng Ren for providing the Matlab implementation of the graph Ihara zeta function method.
References

1. Bai, L., Hancock, E.R.: Graph kernels from the Jensen-Shannon divergence. Journal of Mathematical Imaging and Vision (to appear)
2. Gadouleau, M., Riis, S.: Graph-theoretical constructions for graph entropy and network coding based communications. IEEE Transactions on Information Theory 57, 6703–6717 (2011)
3. Han, L., Hancock, E.R., Wilson, R.C.: Characterizing graphs using approximate von Neumann entropy. In: IbPRIA, pp. 484–491 (2011)
4. Han, L., Hancock, E.R., Wilson, R.C.: An information theoretic approach to learning generative graph prototypes. In: SIMBAD, pp. 133–148 (2011)
5. Körner, J.: Coding of an information source having ambiguous alphabet and the entropy of graphs. In: Proceedings of the 6th Prague Conference on Information Theory, Statistical Decision Functions, Random Processes, pp. 411–425 (1971)
6. Luo, B., Hancock, E.R.: Structural graph matching using the EM algorithm and singular value decomposition. IEEE Trans. Pattern Analysis and Machine Intelligence 23, 1120–1136 (2001)
7. Martins, A.F., Smith, N.A., Xing, E.P., Aguiar, P.M., Figueiredo, M.A.: Nonextensive information theoretic kernels on measures. Journal of Machine Learning Research 10, 935–975 (2009)
8. Pekalska, E., Duin, R.P.W., Paclík, P.: Prototype selection for dissimilarity-based classifiers. Pattern Recognition 39, 189–208 (2006)
9. Ren, P., Wilson, R.C., Hancock, E.R.: Graph characterization via Ihara coefficients. IEEE Transactions on Neural Networks 22, 233–245 (2011)
10. Riesen, K., Bunke, H.: Graph Classification and Clustering Based on Vector Space Embedding. World Scientific Press (2010)
11. Rissanen, J.: Modelling by shortest data description. Automatica 14, 465–471 (1978)
12. Rissanen, J.: A universal prior for integers and estimation by minimum description length. Annals of Statistics 11, 417–431 (1983)
13. Rissanen, J.: Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989)
14. Shervashidze, N., Borgwardt, K.M.: Fast subtree kernels on graphs. In: Proceedings of the Neural Information Processing Systems, pp. 1660–1668 (2009)
15. Wilson, R.C., Hancock, E.R., Luo, B.: Pattern vectors from algebraic graph theory. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1112–1124 (2005)