Cogn Comput (2015) 7:346–358 DOI 10.1007/s12559-014-9295-7
Uncertain Graph Classification Based on Extreme Learning Machine
Donghong Han · Yachao Hu · Shuangshuang Ai · Guoren Wang
Received: 22 August 2013 / Accepted: 7 July 2014 / Published online: 14 August 2014
© Springer Science+Business Media New York 2014
Abstract The problem of graph classification has attracted much attention in recent years. The existing work on graph classification has only dealt with precise and deterministic graph objects. However, the linkages between nodes in many real-world applications are inherently uncertain. In this paper, we focus on the classification of graph objects with uncertainty. The method we propose can be divided into three steps: firstly, we put forward a framework for classifying uncertain graph objects; secondly, we extend the traditional algorithm used in the process of extracting frequent subgraphs to handle uncertain graph data; thirdly, based on the Extreme Learning Machine (ELM), which has a fast learning speed, a classifier is constructed. Extensive experiments on uncertain graph objects show that our method can produce better efficiency and effectiveness compared with other methods.

Keywords Uncertain graph · Classification · Extreme Learning Machine
D. Han · Y. Hu · S. Ai · G. Wang
Key Laboratory of Medical Image Computing (NEU), MOE, Shenyang, China

D. Han · Y. Hu · S. Ai · G. Wang
College of Information Science and Engineering, Northeastern University, Shenyang, China

Introduction

Recently, cognitive science and information processing have attracted great attention in many research areas,
including neurobiology, cognitive psychology, machine learning, and pattern recognition [1–5]. As the basis of cognitive science, cognitive computation is used to analyze and compute datasets in a real-time manner so that cognitive systems are capable of self-learning and cognition, similar to the human brain. Classification is an important task in machine learning and is also a key part of cognitive computation [6]. In the era of big data, analyzing massive data so that cognitive systems can learn the mechanisms of the human cognitive process is an inevitable trend.

Nowadays, massive amounts of data are generated in applications such as bioinformatics, chemoinformatics, social network analysis, and web information management. As one of the most important data structures, the graph is suitable for describing and modeling problems in these fields. The problem of graph classification, an important branch of graph data mining, has become a hot research topic. Graph classification aims to construct a classification prediction model by learning graph data with category labels and then automatically classifying graph data without category labels. For example, in the field of bioinformatics, protein function can be predicted using protein secondary and tertiary structure. These structures are formed through folding of the protein amino acid sequence, where the secondary structure can be represented by plane labeled graphs and the tertiary structure by special geometric graphs carrying geometry information. Constructing a classification prediction model on protein secondary and tertiary structure can help biologists analyze and identify basic protein functions.

Existing methods of graph classification can be divided into two groups: kernel-based approaches [7–9] and methods based on frequent subgraphs [10–13]. A graph
kernel is a similarity measure between two graphs; the two most popular are the cyclic pattern graph kernel [8] and the walk-based graph kernel [9]. In contrast to approaches whose kernel functions are based on frequent patterns, the authors of [8] propose a kernel function based on a natural set of cyclic and tree patterns, independent of their frequency. Alternative graph kernels were proposed in [9]; the feature space of these kernels is conceptually based on the label sequences of all possible walks in the graphs.

Classification methods based on frequent subgraphs use the frequent subgraphs as classification features. These methods have good classification performance and scalability for any topological or geometric structure. Their main steps are: (1) mining frequent subgraphs, (2) extracting features, and (3) constructing the classifier. Pattern growth is one of the best strategies for mining frequent subgraphs. In this approach, one first mines a frequent subgraph p and then generates the children of p by extension. For each child of p (a supergraph pattern of p), its support is computed and it is extended in a depth-first way until all frequent subgraphs are found. Among the existing algorithms for frequent subgraph mining [14–19], gSpan (graph-based substructure pattern mining) [15] is outstanding in terms of its mining efficiency, memory utilization, and scalability. Other studies replace frequent subgraphs with discriminative subgraph patterns. The GAIA algorithm [10] can discover discriminative subgraph patterns in large graph databases; this method exploits a novel subgraph encoding approach to support an arbitrary subgraph pattern exploration order. The authors of [11] propose a method that optimizes a submodular quality criterion to select features among frequent subgraphs; this criterion can be integrated into gSpan to prune the search space for discriminative frequent subgraphs.

The above-mentioned methods can solve specific classification problems for graph patterns, but all of them classify certain graphs. However, graph objects are inherently uncertain in many real-world applications, such as sensor network monitoring and moving object detection. As we know, vertices and edges in biological or social networks usually exist only with some probability. The difference between uncertain and certain graphs is that an uncertain graph represents a probability distribution over all the certain graphs it implicates (introduced in the "Problem Definition" section). Classifying uncertain graph objects faces important challenges. For example, in traditional algorithms for mining frequent subgraphs, the support of a subgraph is the proportion of graphs in a certain graph database in which it appears. However, such a definition is unavailable for uncertain graphs, because it cannot be known whether a
subgraph can match an uncertain graph or not. As a result, existing methods for classifying certain graphs are unable to resolve the classification problem for uncertain graphs effectively.

In this paper, we propose a classification algorithm based on ELM [20] to solve the classification problem on uncertain graphs. ELM [20] (introduced in the "ELM" section) is an emergent computational intelligence technique that can be used for prediction, regression, and classification problems. ELM was chosen to train our classifiers because of its good generalization performance and fast learning speed [21–23]. To the best of our knowledge, there is no prior work on classifying uncertain graph objects.

In this paper, we study the problem of uncertain graph classification. The main contributions of our work can be summarized as follows:

• A classification framework is put forward for uncertain graph objects;
• We extend the gSpan algorithm; that is, we propose an algorithm to find all specific embedded graphs in uncertain graph sets, enabling the treatment of uncertain graph data;
• We provide a novel method using ELM based on equality constraints to train classifiers that solve the multiclass classification problem on uncertain graphs;
• We conduct extensive experiments to evaluate the performance of our proposed classification framework. The statistics verify the efficiency and effectiveness of our proposed algorithms.
In the remainder of this paper, we define the problem in the "Problem Definition and Preliminaries" section. Then, the "Uncertain Graph Classification Algorithm Based on ELM" section discusses the algorithms proposed in this paper, while experimental results are given in the "Experiments and Performance Evaluation" section. Conclusions are presented in the "Conclusions" section.
Problem Definition and Preliminaries

Problem Definition

Definition 2.1 (Uncertain Graph) An uncertain graph is $G = (V, E, \Sigma, L, P)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $\Sigma$ is a label set for vertices and edges, $L$ is a function assigning labels to the vertices and edges, and $P : E \to (0, 1]$ is a function assigning conditional existence probability values to the edges.

In an uncertain graph, the probability of an edge $e \in E$ is the probability that $e$ exists in practice. If the probability of an edge $e$ is 1, $e$ must exist in the graph.
A special uncertain graph whose edges all have existence probability 1 is called a certain graph, denoted $g = (V, E, \Sigma, L)$. $V(G)$ and $E(G)$ represent the set of vertices and the set of edges of the graph $G$, respectively. For the sake of simplicity, it is assumed that the existence probabilities of the edges of an uncertain graph are mutually independent.

A certain graph $g$ is implicated by an uncertain graph $G$ if $g$ and $G$ have the same set of vertices and the edge set of $g$ is a subset of $G$'s. Thus, an uncertain graph $G$ with $|E|$ edges implicates $2^{|E|}$ certain graphs $g$, and $umus(G)$ denotes the set of certain graphs implicated by $G$. The probability that an uncertain graph $G = (V, E, \Sigma, L, P)$ is implicated to a certain graph $g = (V', E', \Sigma', L')$ is given by

$$p(G \Rightarrow g) = \prod_{e \in E'} P(e) \prod_{e \in (E - E')} \big(1 - P(e)\big). \quad (1)$$
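To make Eq. (1) concrete, the following is a minimal C++ sketch (the function name, the edge probabilities, and the subset encoding are illustrative, not from the paper) that computes $p(G \Rightarrow g)$ for one implicated graph and enumerates all $2^{|E|}$ implicated graphs of a small uncertain graph by iterating over edge subsets.

```cpp
#include <cstdio>
#include <vector>

// Edge existence probabilities of an uncertain graph G (one entry per edge).
// The edges kept in the implicated certain graph g are encoded as a bitmask.
double implicationProbability(const std::vector<double>& edgeProb, unsigned subset) {
    double p = 1.0;
    for (std::size_t i = 0; i < edgeProb.size(); ++i) {
        // Eq. (1): multiply P(e) for kept edges and 1 - P(e) for dropped edges.
        p *= (subset >> i & 1u) ? edgeProb[i] : 1.0 - edgeProb[i];
    }
    return p;
}

int main() {
    // Hypothetical uncertain graph with four edges (cf. G1 in Figs. 1 and 2).
    std::vector<double> edgeProb = {0.9, 0.8, 0.7, 0.6};
    double total = 0.0;
    // An uncertain graph with |E| edges implicates 2^|E| certain graphs.
    for (unsigned s = 0; s < (1u << edgeProb.size()); ++s) {
        double p = implicationProbability(edgeProb, s);
        total += p;
        std::printf("subset %u: p = %.4f\n", s, p);
    }
    std::printf("sum over all implicated graphs = %.4f (should be 1)\n", total);
}
```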
An uncertain graph dataset $D$ is composed of multiple uncertain graphs, i.e., $D = (G_1, G_2, \ldots, G_i, \ldots)$. As shown in Fig. 1, the uncertain graph dataset $D$ is made up of the uncertain graphs $G_1, G_2, G_3$. The uncertain graph $G_1$ in the dataset $D$ above has four edges, and it implicates $2^4 = 16$ certain graphs, as shown in Fig. 2.

Definition 2.2 (Graph Isomorphism) A graph $g = (V, E, \Sigma, L)$ is said to be graph isomorphic to a graph $g' = (V', E', \Sigma', L')$ if there exists a bijective function $f : V \leftrightarrow V'$ such that:
(1) $\forall u \in V$, $L(u) = L'(f(u))$;
(2) $\forall u, v \in V$, $((u, v) \in E) \Leftrightarrow ((f(u), f(v)) \in E')$;
(3) $\forall (u, v) \in E$, $L(u, v) = L'(f(u), f(v))$.

Definition 2.3 (Subgraph Isomorphism) A graph $g = (V, E, \Sigma, L)$ is said to be subgraph isomorphic to a graph $g' = (V', E', \Sigma', L')$, denoted by $g \subseteq g'$, if there exists an injective function $f : V \to V'$ such that:
(1) $\forall u \in V$, $L(u) = L'(f(u))$;
(2) $\forall u, v \in V$, $((u, v) \in E) \Rightarrow ((f(u), f(v)) \in E')$;
(3) $\forall (u, v) \in E$, $L(u, v) = L'(f(u), f(v))$.
Fig. 1 A dataset of uncertain graph D
A certain graph $s$ is a frequent subgraph of an uncertain graph $G$ only if $s$ is subgraph isomorphic to at least one certain graph implicated by $G$; the probability of $G$ containing the subgraph $s$ is defined as follows:

$$P(s, G) = \sum_{g \in umus(G) \wedge s \subseteq g} p(G \Rightarrow g). \quad (2)$$

Definition 2.4 (Embedded Graph) Given a certain graph $s$ that is a subgraph of an uncertain graph $G = (V, E, \Sigma, L, P)$, $s_i$ (with $V(s_i) \subseteq V(G)$ and $E(s_i) \subseteq E(G)$) is an embedded graph of $s$ in $G$ only if $s_i$ is isomorphic to $s$.

In traditional frequent subgraph mining, the absolute support of a certain graph $s$ in a precise graph database $d$ is $sup(s, d) = |\{g \mid s \subseteq g, g \in d\}|$, and the relative support is $sup'(s, d) = \frac{|\{g \mid s \subseteq g, g \in d\}|}{|d|}$. Unfortunately, these definitions are unavailable when mining uncertain graphs, because it cannot be known whether a subgraph matches an uncertain graph or not. Therefore, we need to redefine the concept of support.

Definition 2.5 (Support) The support of $s$ in $D$ is

$$sup(s, D) = \sum_{G \in D} P(s, G), \quad (3)$$

where $D$ is the uncertain graph database and $s$ is a certain graph; $s$ is a frequent subgraph of $D$ only if the support of $s$ is no less than a given threshold, minsup.

Preliminaries

ELM

ELM was originally proposed for single-hidden-layer feedforward neural networks and then extended to generalized single-hidden-layer feedforward networks (SLFNs) where the hidden layer need not be neuron-like [24–26]. In ELM, the input weights and the biases of the hidden nodes are randomly generated, and the output weights
Fig. 2 The certain graphs implicated in the uncertain graph G1 and their probability
can be calculated without iterative tuning. The output function of ELM for generalized SLFNs is

$$f_L(x) = \sum_{i=1}^{L} \beta_i g_i(x) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x), \quad x \in \mathbf{R}^d,\ \beta_i \in \mathbf{R}^m, \quad (4)$$

where $\beta_i$ denotes the vector of output weights between the hidden layer of $L$ nodes and the output nodes, and $g_i$ denotes the output function $G(a_i, b_i, x)$ of the $i$th hidden node. For $N$ distinct samples $(x_i, t_i) \in \mathbf{R}^d \times \mathbf{R}^m$, SLFNs with $L$ hidden-layer nodes are modeled as

$$\sum_{i=1}^{L} \beta_i g_i(x_j) = \sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = o_j, \quad j = 1, \ldots, N. \quad (5)$$

That SLFNs can approximate these $N$ samples with zero error means that $\sum_{j=1}^{N} \| o_j - t_j \| = 0$, i.e., there exist $(a_i, b_i)$ and $\beta_i$ such that

$$\sum_{i=1}^{L} \beta_i G(a_i, b_i, x_j) = t_j, \quad j = 1, \ldots, N. \quad (6)$$

The above equations can be written compactly as

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{T}, \quad (7)$$

where

$$\mathbf{H} = \begin{pmatrix} h(x_1) \\ \vdots \\ h(x_N) \end{pmatrix} = \begin{pmatrix} G(a_1, b_1, x_1) & \cdots & G(a_L, b_L, x_1) \\ \vdots & \ddots & \vdots \\ G(a_1, b_1, x_N) & \cdots & G(a_L, b_L, x_N) \end{pmatrix}_{N \times L},$$

$$\boldsymbol{\beta} = \begin{pmatrix} \beta_1^T \\ \vdots \\ \beta_L^T \end{pmatrix}_{L \times m} \quad \text{and} \quad \mathbf{T} = \begin{pmatrix} t_1^T \\ \vdots \\ t_N^T \end{pmatrix}_{N \times m}.$$
$\mathbf{H}$ is called the hidden layer output matrix of the network [27, 28]; the $i$th column of $\mathbf{H}$ is the output of the $i$th hidden node with respect to the inputs $x_1, x_2, \ldots, x_N$. $h(x) = [G(a_1, b_1, x), \ldots, G(a_L, b_L, x)]$ is called the hidden layer feature mapping, and the $i$th row of $\mathbf{H}$ is the hidden layer feature mapping with respect to the $i$th input $x_i$, i.e., $h(x_i)$.

Since ELM can approximate any target continuous function and the output of the ELM classifier $h(x)\boldsymbol{\beta}$ can be as close to the class labels in the corresponding regions as possible, the classification problem for the proposed constrained-optimization-based ELM with a single output node can be formulated as

$$\text{Minimize: } L_{P_{ELM}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2$$
$$\text{Subject to: } h(x_i)\boldsymbol{\beta} = t_i - \xi_i, \quad i = 1, \ldots, N, \quad (8)$$

where $C$ is a user-specified parameter. Based on the Karush–Kuhn–Tucker (KKT) theorem [29], training ELM is equivalent to solving the following dual optimization problem:

$$L_{D_{ELM}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \xi_i^2 - \sum_{i=1}^{N} \alpha_i \big(h(x_i)\boldsymbol{\beta} - t_i + \xi_i\big), \quad (9)$$

where each Lagrange multiplier $\alpha_i$ corresponds to the $i$th training sample. An alternative approach for multiclass applications is to let ELM have multiple output nodes instead of a single output node; $m$-class classifiers have $m$ output nodes. If the original class label is $p$, the expected output vector of the $m$ output nodes is $t_i = [0, \ldots, 0, 1, 0, \ldots, 0]^T$; in this case, only the $p$th element of $t_i = [t_{i,1}, \ldots, t_{i,m}]^T$ is 1, while the rest of the elements are set to zero. The classification problem for ELM with multiple output nodes can be formulated as

$$\text{Minimize: } L_{P_{ELM}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \|\xi_i\|^2$$
$$\text{Subject to: } h(x_i)\boldsymbol{\beta} = t_i^T - \xi_i^T, \quad i = 1, \ldots, N, \quad (10)$$

where $\xi_i = [\xi_{i,1}, \ldots, \xi_{i,m}]^T$ is the training error vector of the $m$ output nodes with respect to the training sample $x_i$. As in the single-output-node case, training ELM is equivalent to solving the following dual optimization problem:

$$L_{D_{ELM}} = \frac{1}{2}\|\boldsymbol{\beta}\|^2 + C\,\frac{1}{2}\sum_{i=1}^{N} \|\xi_i\|^2 - \sum_{i=1}^{N}\sum_{j=1}^{m} \alpha_{i,j}\big(h(x_i)\boldsymbol{\beta}_j - t_{i,j} + \xi_{i,j}\big), \quad (11)$$
where $\boldsymbol{\beta}_j$ is the vector of weights linking the hidden layer to the $j$th output node and $\boldsymbol{\beta} = [\boldsymbol{\beta}_1, \ldots, \boldsymbol{\beta}_m]$. From the above equation, we can see that the single-output-node case can be considered a specific case of multiple output nodes when the number of output nodes is set to 1.

For the aforementioned optimization problem, two kinds of solutions can be obtained for training datasets of different sizes. In the first case, the number of training samples is not huge. The output function of the ELM classifier for this solution is

$$f(x) = h(x)\boldsymbol{\beta} = h(x)\mathbf{H}^T\left(\frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T\right)^{-1}\mathbf{T}. \quad (12)$$

In the second case, the number of training samples is huge. In this case, the output function of the ELM classifier is

$$f(x) = h(x)\boldsymbol{\beta} = h(x)\left(\frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T}. \quad (13)$$

gSpan

This subsection provides a brief overview of gSpan. We first introduce several techniques developed in gSpan, including mapping each graph to a depth-first search (DFS) code (a sequence), building a novel lexicographic ordering among those codes, and constructing a search tree based on that lexicographic order.

When performing a depth-first search [30] in a graph, we construct a DFS tree. One graph can have several different DFS trees; for example, the graphs in Fig. 3b–d are isomorphic to that in Fig. 3a, and the thickened edges in Fig. 3b–d represent three different DFS trees for the graph in Fig. 3a. The depth-first discovery of the vertices forms a linear order. We use subscripts to label this order according to the discovery time [30], where $i < j$ means $v_i$ is discovered before $v_j$. We call $v_0$ the root and $v_n$ the rightmost vertex. The straight path from $v_0$ to $v_n$ is named the rightmost path. We denote such a subscripted $G$ as $G_T$.

Forward Edge and Backward Edge. Given $G_T$, an ordered pair $(i, j)$ represents an edge. If $i < j$, it is a forward edge; otherwise, it is a backward edge. A linear order $\prec_T$ is built among all the edges in $G$ by the following rules (assume $e_1 = (i_1, j_1)$ and $e_2 = (i_2, j_2)$): (1) if $i_1 = i_2$ and $j_1 < j_2$, then $e_1 \prec_T e_2$; (2) if $i_1 < j_1$ and $j_1 = i_2$, then $e_1 \prec_T e_2$; and (3) if $e_1 \prec_T e_2$ and $e_2 \prec_T e_3$, then $e_1 \prec_T e_3$.

Definition 2.6 (DFS Code) Given a DFS tree $T$ for a graph $G$, an edge sequence $(e_i)$ can be constructed based on $\prec_T$, such that $e_i \prec_T e_{i+1}$, where $i = 0, \ldots, |E| - 1$. $(e_i)$ is called a DFS code, denoted as code(G, T).

For simplicity, an edge can be represented by a 5-tuple $(i, j, l_i, l_{i,j}, l_j)$, where $l_i$ and $l_j$ are the labels of $v_i$ and $v_j$, respectively, and $l_{i,j}$ is the label of the edge between them.
Fig. 3 DFS tree (panels a–d)

Table 1 DFS codes for Fig. 3b–d
Edge   (Fig. 3b)      (Fig. 3c)      (Fig. 3d)
0      (0,1,X,a,Y)    (0,1,Y,a,X)    (0,1,X,a,X)
1      (1,2,Y,b,X)    (1,2,X,a,X)    (1,2,X,a,Y)
2      (2,0,X,a,X)    (2,0,X,b,Y)    (2,0,Y,b,X)
3      (2,3,X,c,Z)    (2,3,X,c,Z)    (2,3,Y,b,Z)
4      (3,1,Z,b,Y)    (3,0,Z,b,Y)    (3,0,Z,c,X)
5      (1,4,Y,d,Z)    (0,4,Y,d,Z)    (2,4,Y,d,Z)
For example, $(v_0, v_1)$ in Fig. 3b is represented by (0,1,X,a,Y). Table 1 presents the corresponding DFS codes for Fig. 3b–d.

Definition 2.7 (DFS Lexicographic Order) DFS lexicographic order is a linear order defined as follows: if $\alpha = code(G_\alpha, T_\alpha) = (a_0, a_1, \ldots, a_m)$ and $\beta = code(G_\beta, T_\beta) = (b_0, b_1, \ldots, b_n)$, then $\alpha \le \beta$ if either of the following is true:
(1) $\exists t,\ 0 \le t \le \min(m, n)$, such that $a_k = b_k$ for $k < t$ and $a_t \prec_e b_t$;
(2) $a_k = b_k$ for $0 \le k \le m$, and $n \ge m$.

For the graph in Fig. 3a, there exist different DFS codes. Three of them, based on the DFS trees in Fig. 3b–d, are listed in Table 1. According to the DFS lexicographic order, the code for Fig. 3d precedes the code for Fig. 3b, which in turn precedes the code for Fig. 3c.

Definition 2.8 (Minimum DFS Code) Given a graph $G$, let $Z(G) = \{code(G, T) \mid T$ is a DFS tree of $G\}$; based on the DFS lexicographic order, the minimum element, $\min(Z(G))$, is called the minimum DFS code of $G$. It is also a canonical label of $G$.
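To illustrate Definitions 2.6–2.8, the following C++ sketch (illustrative only, not the authors' implementation) represents a DFS code as a sequence of 5-tuples and compares two codes lexicographically; the per-edge comparison in full gSpan is more involved (it distinguishes forward and backward edges), but applied to the first edges of the codes in Table 1 this simplified comparison reproduces the ordering stated above.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <tuple>
#include <vector>

// 5-tuple (i, j, l_i, l_{i,j}, l_j) of Definition 2.6.
struct DfsEdge {
    int i, j;
    std::string li, lij, lj;
    bool operator<(const DfsEdge& o) const {
        return std::tie(i, j, li, lij, lj) < std::tie(o.i, o.j, o.li, o.lij, o.lj);
    }
};

using DfsCode = std::vector<DfsEdge>;

// Simplified DFS lexicographic order (Definition 2.7): compare edge by edge;
// a code that is a strict prefix of another is considered smaller.
bool dfsLess(const DfsCode& a, const DfsCode& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

int main() {
    DfsCode codeB = {{0, 1, "X", "a", "Y"}};  // first edge of the code for Fig. 3b
    DfsCode codeC = {{0, 1, "Y", "a", "X"}};  // first edge of the code for Fig. 3c
    DfsCode codeD = {{0, 1, "X", "a", "X"}};  // first edge of the code for Fig. 3d
    std::cout << std::boolalpha
              << dfsLess(codeD, codeB) << " "   // true: code(Fig. 3d) < code(Fig. 3b)
              << dfsLess(codeB, codeC) << "\n"; // true: code(Fig. 3b) < code(Fig. 3c)
}
```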
Theorem 2.1 Given two graphs $G$ and $G'$, $G$ is isomorphic to $G'$ if and only if $\min(G) = \min(G')$. (Proof omitted.)

Thus, the problem of mining frequent connected subgraphs is equivalent to mining their corresponding minimum DFS codes. This problem turns out to be a sequential pattern mining problem with a slight difference, which conceptually can be solved by existing sequential pattern mining algorithms. In fact, to construct a valid DFS code, the new edge must grow only from the vertices on the rightmost path. In Fig. 4, the graph shown in Fig. 4a has several potential children with one-edge growth, which are shown in Fig. 4b–f (assuming that the darkened vertices constitute the rightmost path). Among them, Fig. 4b–d grow from the rightmost vertex, while Fig. 4e, f grow from other vertices on the rightmost path. Fig. 4b.0–b.3 are children of Fig. 4b, and e.0–e.2 are children of Fig. 4e. Backward edges can only grow from the rightmost vertex, while forward edges can grow from any vertex on the rightmost path. The enumeration order of these children follows the DFS lexicographic order, i.e., it should be in the order of Fig. 4b–f.

Fig. 4 DFS code/graph growth (panels a–f)

Algorithm 1 (gSpan) is shown below:

Algorithm 1 gSpan
Input: s: a DFS code; D: a graph set; minsup: the minimum support threshold
Output: S: the set of frequent graphs
Call gSpan(s, D, minsup, S)

Procedure gSpan(s, D, minsup, S)
1:  if s ≠ dfs(s) then          // s is not a minimum DFS code
2:      return
3:  end if
4:  add s to S
5:  C ← ∅
6:  scan D once, finding every edge e by which s can be rightmost-extended to s ◇r e;
    insert each s ◇r e into C and compute its support
7:  sort C in DFS lexicographic order
8:  for each frequent s ◇r e in C do
9:      gSpan(s ◇r e, D, minsup, S)
10: end for
11: return
Uncertain Graph Classification Algorithm Based on ELM

In this section, we introduce the classification algorithm proposed in this paper in detail. We propose a framework to address the problem of classifying uncertain graph objects, as
shown in Fig. 5. The framework mainly contains three separate components: (1) mining frequent subgraphs, (2) extraction of features, and (3) construction of the classifier. The first step is to mine all frequent subgraphs in each class of uncertain graphs with the same label from the training set; these form a candidate set of classification features. The second step is to extract a subset of these subgraphs as classification features through the feature extraction algorithm; the selected frequent subgraphs have very strong distinguishing ability. Next, each uncertain graph is mapped into the feature space; that is, we use a one-dimensional vector of length n (where n is the number of classification features) to represent an uncertain graph. Finally, a classifier is constructed with the ELM algorithm.

In this framework, mining frequent subgraphs and then extracting features are the groundwork for constructing the uncertain graph classifier. Based on the discriminative features selected in the previous steps, we train the classifier using the ELM technique because of its high learning speed. These three components play equally important roles in the framework of uncertain graph classification. We explain the three steps in detail below.

Mining Frequent Subgraphs

In this work, we improve gSpan in two aspects: (1) discovering all embedded graphs of s in an uncertain graph G, rather than judging whether s exists in G or not; and (2) calculating the support of the graph s in an uncertain graph G. Apart from these two improvements, there is no essential difference between the improved gSpan and the traditional one; the detailed process can be found in the "gSpan" subsection. Next, the improvements are discussed in detail.
Mining All Embedded Graphs (MAEG)

Given two graphs G and G', we need to discover all embedded graphs of G' in G. Assume a = |E(G)| and b = |E(G')|. In the algorithm, we use a one-dimensional vector F = (F_1, F_2, ..., F_i, ..., F_a) of length a to represent the usage of each edge in dfs(G): F_i = 1 means the ith edge has been used, while F_i = 0 indicates the opposite. Similarly, we let a one-dimensional vector H = (H_1, H_2, ..., H_i, ..., H_b) of length b represent the mapping from dfs(G') to dfs(G): H_i = j means that the ith edge in dfs(G') has been mapped to the jth edge in dfs(G), while H_i = 0 means that the ith edge in dfs(G') has not been mapped to any edge of dfs(G).

Algorithm 2 MAEG(G, G', F, H, d)
Input: G: the minimum DFS code of an uncertain graph G, i.e., dfs(G);
       G': dfs(G');
       F: one-dimensional vector of length a, all values initially zero;
       H: one-dimensional vector of length b, all values initially zero;
       d: the depth of the search, initially zero
Output: R: the set of embedded graphs found, initially empty
1:  if d ≥ b then
2:      add the embedded graph corresponding to F to R
3:      return
4:  end if
5:  for each k with 1 ≤ k ≤ a do
6:      if F_k ≠ 0, or the vertex labels and the edge label of the kth edge in dfs(G)
        do not equal those of the dth edge in dfs(G') then
7:          continue with the next k
8:      end if
9:      // the kth edge of G is a candidate match for the dth edge of G'
10:     H_d = k
11:     F_k = 1
12:     MAEG(G, G', F, H, d+1)
13:     F_k = 0
14: end for
Fig. 5 Classification framework for uncertain graph data
Lines 1–4 in Algorithm 2 test whether a complete embedded graph of G' in G has been discovered; if so, the DFS code of the embedded graph is added to R. The next step is to enumerate all edges that have not been used yet but meet the label conditions (lines 5–6) and then to set the vectors H and F (lines 10–11). Line 12 calls the algorithm recursively, which starts the next round of the test. Line 13 undoes the assignment so that the search can backtrack and find all embedded graphs. Through Algorithm 2, we find all embedded graphs of G' in G and store them in the set R.
Calculating the Support

Calculating the support of a certain graph s in an uncertain graph G directly from Eqs. (1) and (2) requires listing all certain graphs implicated by G together with their existence probabilities and then judging whether s is subgraph isomorphic to each of them, which makes the algorithm expensive. Here, we simplify Eq. (2) through the following mathematical transformation.

Theorem 3.1 Assume that there are $n$ embedded graphs $(s_1, s_2, \ldots, s_n)$ of $s$ in the uncertain graph $G$; then

$$P(s, G) = \sum_{1 \le i \le n} \Big( \prod_{e \in E(s_i)} P(e) \Big) - \sum_{1 \le i < j \le n} \Big( \prod_{e \in E(s_i) \cup E(s_j)} P(e) \Big) + \cdots + (-1)^{t-1} \sum_{1 \le i_1 < \cdots < i_t \le n} \Big( \prod_{e \in E(s_{i_1}) \cup \cdots \cup E(s_{i_t})} P(e) \Big) + \cdots + (-1)^{n-1} \prod_{e \in E(s_1) \cup \cdots \cup E(s_n)} P(e). \quad (14)$$

According to Eq. (14), we present Algorithm 3 for calculating the probability of a certain graph s in an uncertain graph G as follows:

Algorithm 3 CalPro(s, G)
Input: s: the subgraph to be queried; G: an uncertain graph
Output: the probability P(s, G) of s in G
1: find the embedded graph set R = {s_1, s_2, ..., s_|R|} of s in G through Algorithm 2
2: p ← 0
3: for 1 ≤ k ≤ |R| do
4:     p ← p + (-1)^(k-1) · Σ_{1 ≤ i_1 < ... < i_k ≤ |R|} ∏_{e ∈ E(s_{i_1}) ∪ ... ∪ E(s_{i_k})} P(e)
5: end for
6: return p
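The following C++ sketch (illustrative, not the authors' code) evaluates Eq. (14)/Algorithm 3 by enumerating the non-empty subsets of embedded graphs with a bitmask. Each embedded graph is represented simply by the set of edge indices of G that it uses, and the edge probabilities are assumed to be given; the embeddings themselves would come from Algorithm 2.

```cpp
#include <cstdio>
#include <set>
#include <vector>

// Eq. (14): inclusion-exclusion over the embedded graphs s_1..s_n of s in G.
double calPro(const std::vector<std::set<int>>& embeddings,
              const std::vector<double>& edgeProb) {
    const std::size_t n = embeddings.size();
    double p = 0.0;
    for (unsigned mask = 1; mask < (1u << n); ++mask) {    // every non-empty subset
        std::set<int> unionEdges;
        int k = 0;
        for (std::size_t i = 0; i < n; ++i) {
            if (mask >> i & 1u) {
                ++k;
                unionEdges.insert(embeddings[i].begin(), embeddings[i].end());
            }
        }
        double prod = 1.0;
        for (int e : unionEdges) prod *= edgeProb[e];       // all edges of the union must exist
        p += (k % 2 == 1 ? prod : -prod);                   // sign (-1)^(k-1)
    }
    return p;
}

int main() {
    // Hypothetical example: two embedded graphs sharing edge 1.
    std::vector<std::set<int>> embeddings = {{0, 1}, {1, 2}};
    std::vector<double> edgeProb = {0.9, 0.8, 0.7};
    // P = 0.9*0.8 + 0.8*0.7 - 0.9*0.8*0.7 = 0.776
    std::printf("P(s, G) = %.3f\n", calPro(embeddings, edgeProb));
}
```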
Algorithm 4, which judges whether a certain graph s is a frequent subgraph with respect to an uncertain graph set D, is as follows:

Algorithm 4 JudgeFreGra(s, D, minsup)
Input: s: the subgraph to be queried; D: an uncertain graph set; minsup: the minimum support threshold
Output: Boolean: whether s is a frequent subgraph of D
1: calculate P(s, G_i) for each uncertain graph G_i in D through Algorithm 3
2: sup(s, D) = Σ_{1 ≤ i ≤ |D|} P(s, G_i)
3: return (sup(s, D) ≥ minsup)
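As a minimal C++ companion to Algorithm 4 (names are illustrative), the sketch below assumes the per-graph probabilities P(s, G_i) have already been computed, sums them per Eq. (3), and compares the result with minsup.

```cpp
#include <cstdio>
#include <numeric>
#include <vector>

// Eq. (3): sup(s, D) = sum over all uncertain graphs G_i in D of P(s, G_i).
bool judgeFreGra(const std::vector<double>& probPerGraph, double minsup) {
    double support = std::accumulate(probPerGraph.begin(), probPerGraph.end(), 0.0);
    return support >= minsup;
}

int main() {
    std::vector<double> probPerGraph = {0.776, 0.31, 0.0, 0.95};  // hypothetical P(s, G_i)
    std::printf("frequent: %s\n", judgeFreGra(probPerGraph, 1.5) ? "yes" : "no");
}
```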
Extraction of Features

Given a training database $DS = \{(G_i, C_i) \mid i = 1, 2, \ldots, n\}$, where $G_i$ is a graph and $C_i$ is the label of $G_i$, we use the improved gSpan to process the uncertain graphs of $D_i$ that share the same label $C_i$ and thereby discover all frequent subgraphs in the set $D_i$. However, this algorithm produces too many subgraphs. Such a large number of subgraphs is unacceptable for our classification algorithm and leads to three main problems: (1) it increases the training time of the classification model, (2) it reduces the accuracy of the classifier, and (3) it increases the processing time of the classification. To solve these problems, we adopt a scoring function to select a subset of all found features while guaranteeing the accuracy of the classification model.
The principle of the scoring function is: the higher the frequency of a subgraph in a specific graph set $D_i$ and the lower its frequency in the other graph sets $D_j$ ($j \ne i$), the higher the score. The scoring function of a subgraph $s$ in $D_i$ is defined as follows:

$$Score(s, D_i) = \ln\left(\frac{sup(s, D_i)}{\sum_{j \ne i} sup(s, D_j)}\right). \quad (15)$$

As can be seen, Eq. (15) has a singularity when $\sum_{j \ne i} sup(s, D_j) = 0$; in this case, we take $Score(s, D_i)$ to be positive infinity. Given another threshold minsco, for every class, if a graph $s$ satisfies $Score(s, D_i) \ge$ minsco, it is added to the final feature set $C$. When choosing minsco, we must ensure that at least one frequent subgraph is selected and added to the final feature set $C$ for every uncertain graph set $D_i$ sharing the same class label.

Construction of the Classifier

After the two steps above, we have obtained a feature set $C$. We map the uncertain graphs onto the feature set $C$; that is, we use a one-dimensional vector $F$ of length $|C|$ to represent an uncertain graph $G$, where the value $F_i$ of the $i$th component of $F$ is the existence probability of the corresponding feature $s_i$, i.e., $F_i = P(s_i, G)$. The vector $F$ is then normalized. We convert all uncertain graphs in the training set into one-dimensional feature vectors of length $|C|$ and stack these vectors to obtain a feature matrix.

Owing to the characteristics of graph data and of the classification task itself, not all classification algorithms are suitable for constructing the classifier. For the following reasons, this paper chooses an equality-constrained optimization strategy based on ELM to construct the classifier: (1) the feature matrix is sparse, because most uncertain graphs contain only a few of the features in $C$; (2) the problem we deal with is multiclass classification; (3) the learning efficiency of the ELM algorithm is very high without sacrificing classification performance. In this algorithm, the number of features is much smaller than the number of training samples, so we adopt Eq. (13) to solve the problem.
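To make the classifier construction concrete, here is a small C++ sketch of the equality-constrained ELM solution of Eq. (13), using the Eigen library for the linear algebra (an assumption; the authors implemented the classifier in MATLAB). The matrix X stands in for the stacked, normalized feature vectors F, labels holds class indices, and the hidden layer uses a sigmoid activation with randomly generated weights and biases, as in the ELM formulation above.

```cpp
#include <Eigen/Dense>
#include <iostream>

using Eigen::MatrixXd;

// Hidden-layer output matrix H of Eq. (7) with a sigmoid activation:
// H(i, k) = G(a_k, b_k, x_i) = 1 / (1 + exp(-(a_k . x_i + b_k))).
MatrixXd hiddenOutput(const MatrixXd& X, const MatrixXd& W, const Eigen::VectorXd& b) {
    MatrixXd Z = X * W;              // N x L
    Z.rowwise() += b.transpose();    // add the bias of each hidden node
    return ((-Z.array()).exp() + 1.0).inverse().matrix();
}

int main() {
    const int N = 8, d = 4, L = 20, m = 3;  // samples, features |C|, hidden nodes, classes
    const double C = 100.0;                 // regularization parameter of Eq. (10)

    MatrixXd X = MatrixXd::Random(N, d).cwiseAbs();  // stand-in for the normalized F vectors
    Eigen::VectorXi labels(N);
    labels << 0, 1, 2, 0, 1, 2, 0, 1;

    // One-hot target matrix T: only the p-th element of t_i is 1.
    MatrixXd T = MatrixXd::Zero(N, m);
    for (int i = 0; i < N; ++i) T(i, labels(i)) = 1.0;

    // Randomly generated input weights and biases (no iterative tuning).
    MatrixXd W = MatrixXd::Random(d, L);
    Eigen::VectorXd b = Eigen::VectorXd::Random(L);

    MatrixXd H = hiddenOutput(X, W, b);

    // Eq. (13): beta = (I/C + H^T H)^{-1} H^T T  (the L x L system used when N >> L).
    MatrixXd A = MatrixXd::Identity(L, L) / C + H.transpose() * H;
    MatrixXd beta = A.ldlt().solve(H.transpose() * T);

    // Classify the training samples: predicted class = argmax of h(x) * beta.
    MatrixXd scores = H * beta;
    for (int i = 0; i < N; ++i) {
        Eigen::Index pred;
        scores.row(i).maxCoeff(&pred);
        std::cout << "sample " << i << ": predicted class " << pred << "\n";
    }
}
```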
Experiments and Performance Evaluation

Experimental Environment

We conducted a large number of experiments to evaluate the performance of the proposed algorithms. The algorithm
for feature extraction was implemented in C++, whereas the classification algorithm was implemented through MATLAB simulations. All experiments were performed on a 2-GHz Intel Core2 Duo PC with 3 GB of memory, running Windows 7.

Data Selection

In our experiments, we used a protein database and a compound structure database. The protein structures of the protein database were obtained from the Protein Data Bank [31] and classified (as listed in Table 2) using the Structural Classification of Proteins (SCOP) [32]. We randomly chose four superfamilies from the SCOP classes, which are, in order, All alpha proteins, All beta proteins, Alpha and beta proteins (a/b), and Alpha and beta proteins (a+b). Then, we chose six families from the All alpha proteins, which are, in order, Bacillus cereus metalloprotein-like, C-terminal domain of alpha and beta subunits of F1 ATP synthase, Cellulases catalytic domain, Cytochrome c3-like, Calmodulin-like, and S100 proteins; five families were chosen from the All beta proteins, which are, in order, NF-kappa-B/REL/DORSAL transcription factors, C-terminal domain, E-set domains of sugar-utilizing enzymes, V set domains (antibody variable domain-like), C1 set domains (antibody constant domain-like), and I set domains; we chose four families from the Alpha and beta proteins (a/b), which are, in order, Adenosine/AMP deaminase, Class I aldolase, Class I DAHP synthetase, and Xylose isomerase; and three families were chosen from the Alpha and beta proteins (a+b), which are, in order, Ubiquitin-related, First domain of FERM, and Double-stranded RNA-binding domain (dsRBD). The four selected superfamilies were named BASE1–BASE4. We carried out classification tests on the data of BASE1–BASE4; they are multiclass classification problems with 6, 5, 4, and 3 classes, respectively.

As stated in the literature [10], we obtained the protein graphs according to the following rules: to generate a protein graph, each graph node denotes an amino acid, whose location is represented by the location of its alpha carbon. There is an edge between two nodes if the distance between the two alpha carbons is less than 11.5 angstroms. Nodes are labeled with their amino acid type, and edges are labeled with the distances between the alpha carbons. On average, each protein graph has 210 nodes and 2,200 edges.
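As a sketch of this graph-construction rule (illustrative only; the residue names, coordinates, and struct layout are placeholders), the following C++ fragment builds a labeled protein graph from alpha-carbon coordinates using the 11.5-angstrom threshold.

```cpp
#include <cmath>
#include <cstdio>
#include <string>
#include <vector>

struct Residue { std::string aminoAcid; double x, y, z; };  // node: amino acid type + C-alpha position
struct ProteinEdge { int u, v; double distance; };          // edge label: C-alpha distance

// Connect two residues when their alpha carbons are closer than 11.5 angstroms.
std::vector<ProteinEdge> buildProteinGraph(const std::vector<Residue>& residues) {
    std::vector<ProteinEdge> edges;
    for (std::size_t i = 0; i < residues.size(); ++i) {
        for (std::size_t j = i + 1; j < residues.size(); ++j) {
            double dx = residues[i].x - residues[j].x;
            double dy = residues[i].y - residues[j].y;
            double dz = residues[i].z - residues[j].z;
            double dist = std::sqrt(dx * dx + dy * dy + dz * dz);
            if (dist < 11.5) edges.push_back({(int)i, (int)j, dist});
        }
    }
    return edges;
}

int main() {
    // Hypothetical three-residue fragment.
    std::vector<Residue> residues = {{"ALA", 0, 0, 0}, {"GLY", 3.8, 0, 0}, {"SER", 20, 0, 0}};
    for (const auto& e : buildProteinGraph(residues))
        std::printf("%s-%s  %.1f A\n", residues[e.u].aminoAcid.c_str(),
                    residues[e.v].aminoAcid.c_str(), e.distance);
}
```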
The database of compound structures, obtained from PubChem [33], was classified by compound activity (as listed in Table 3). In this paper, we only chose biologically active compounds as our classification data. We randomly chose 11 groups of NCI cancer biology identification compounds, which are assigned into four databases (BASE5–BASE8). BASE5 contains six groups of compounds, which are, in order, Lung cancer (NCI11), Melanoma (NCI23), Prostate cancer (NCI41), Nervous sys. tumor (NCI45), Small cell lung cancer (NCI61), and Colon cancer (NCI81); BASE6 contains five groups of compounds, which are, in order, Lung cancer (NCI11), Nervous sys. tumor (NCI45), Breast cancer (NCI83), Ovarian tumor (NCI109), and Leukemia (NCI123); BASE7 contains four groups of compounds, which are, in order, Melanoma (NCI23), Breast cancer (NCI83), Renal cancer (NCI145), and Yeast anticancer drug (NCI161); BASE8 contains three groups of compounds, which are, in order, Colon cancer (NCI81), Leukemia (NCI123), and Yeast anticancer drug (NCI161). We carried out classification tests on the data of BASE5–BASE8; they are multiclass classification problems with 6, 5, 4, and 3 classes, respectively.

We obtained the compound graphs according to the following rules: each atom is represented by a node labeled with the atom type, and each chemical bond is represented by an edge labeled with the bond type. On average, each compound graph has 48 nodes and 54 edges.

The data obtained as described above are certain graph data; we randomly assigned a probability p (0 < p ≤ 1) to each edge of all the graphs to obtain uncertain graphs.

Parameter Selection

In this algorithm framework, there are two key parameters in the processes of frequent subgraph mining and feature extraction, named min_sup and minsco, respectively. In the frequent subgraph mining stage, the smaller the value of min_sup, the more frequent subgraphs will be generated, which increases the processing time. In the feature extraction stage, the smaller the value of minsco, the more features will be extracted, which is helpful for classification but increases the computational complexity of the classifiers. Therefore, we need to choose appropriate values for these parameters to balance the conflict between processing time and classification efficiency.

In our experiments, we adopted the following strategies to determine the values of min_sup and minsco. For minsco, we set it to 0, 1, 2, 4, 8, etc. in turn and chose the largest value such that at least one frequent subgraph is still selected as a final feature in each class; this is called the best value of minsco.
Table 3 List of selected compound data
Database   Number of selected compounds   Classes
BASE5      12,778                         6
BASE6      12,818                         5
BASE7      979                            4
BASE8      901                            3
For min_sup, we set it to 0.1, 0.2, 0.4, 0.8, etc. in turn and chose the smallest value such that the final number of selected features does not affect the efficiency of the classifier; this is called the best min_sup. Tables 4 and 5 present the numbers of frequent subgraphs and features generated from the SCOP database and the compound database under the best min_sup and minsco.

ELM Classification Efficiency

For the classification strategy, we used fivefold cross-validation to evaluate the efficiency of the ELM classification. Tables 6 and 7 present the classification efficiency for the SCOP database and the compound database, respectively, with different excitation functions (Gaus, Sig, and radial basis function [RBF]). We evaluate the classification capacity using the following measure:

$$\text{accuracy rate} = \frac{\text{the number of data classified correctly}}{\text{the total number of data in the database}}.$$
Table 2 List of selected SCOP data
Database   Number of selected proteins   Classes
BASE1      338                           6
BASE2      309                           5
BASE3      242                           4
BASE4      148                           3

Table 4 Number of frequent subgraphs and features generated from the SCOP database
Database   Number of frequent subgraphs   Number of features
BASE1      260                            159
BASE2      223                            101
BASE3      165                            87
BASE4      109                            64

Table 5 Number of frequent subgraphs and features generated from the compound database
Database   Number of frequent subgraphs   Number of features
BASE5      848                            227
BASE6      694                            132
BASE7      473                            136
BASE8      255                            73
Table 6 Training time and accuracy rate of the SCOP database
             ELM (Gaus)               ELM (Sig)                ELM (RBF)
Database   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)
BASE1      0.0078    58.03          0.0209    60.38          0.0264    58.96
BASE2      0.0101    64.59          0.0312    63.79          0.0279    65.17
BASE3      0.0024    65.88          0.0028    66.74          0.0048    63.45
BASE4      0.0000    76.51          0.0000    79.57          0.0035    77.89

Table 7 Training time and accuracy rate of the compound database
             ELM (Gaus)               ELM (Sig)                ELM (RBF)
Database   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)
BASE5      0.0749    54.60          0.1721    55.23          0.1664    54.35
BASE6      0.0442    62.18          0.1660    60.59          0.1957    60.86
BASE7      0.0707    68.61          0.1348    65.19          0.1191    64.27
BASE8      0.0769    73.10          0.1320    72.74          0.0973    73.57

Table 8 Comparison of training time and accuracy rate of BP, SVM, and ELM on the SCOP database
             ELM                      BP                       SVM
Database   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)
BASE1      0.1092    63.4           0.8736    61.8           0.7956    66.4
BASE2      0.0780    64.3           0.9048    62.1           0.8236    57.2
BASE3      0.0780    66.7           0.8736    70.9           0.9516    72.8
BASE4      0.0936    75.4           0.8580    77.6           0.9672    79.2

Table 9 Comparison of training time and accuracy rate of BP, SVM, and ELM on the compound database
             ELM                      BP                       SVM
Database   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)   Time (s)  Accuracy (%)
BASE5      0.2964    76.4           52.71     73.4           44.45     62.4
BASE6      0.2340    61.2           42.99     77.6           23.42     74.2
BASE7      0.2184    67.6           38.52     77.6           27.63     76.5
BASE8      0.1716    74.6           33.48     71.3           17.57     71.1

Table 6 presents the training time and testing accuracy rate of the SCOP database (BASE1–BASE4) using the algorithm based on ELM with different excitation functions (Gaus, Sig, and RBF). Under the excitation function Sig, BASE2 has the longest training time, 0.0312 s, which is acceptable. Under a random assignment, the accuracy rates of BASE1–BASE4 would be, in order, 16.7, 20, 25, and 33.3 %. Using the algorithm proposed in this paper, BASE1 has the best classification effect: its accuracy rate is 60.38 % under the excitation function Sig, 3.6 times the accuracy rate of the random assignment. Even for the worst classification effect, the accuracy rate of BASE4 is 76.51 % under the excitation function Gaus, 2.3 times the accuracy rate of the random assignment.
Table 7 presents the training time and testing accuracy rate of the compound database (BASE5–BASE8) using the algorithm based on ELM with different excitation functions (Gaus, Sig, and RBF). Under the excitation function Sig, BASE6 has the longest training time, 0.1957 s, which is acceptable. Under a random assignment, the accuracy rates of BASE5–BASE8 would be, in order, 16.7, 20, 25, and 33.3 %. Using the algorithm proposed in this paper, BASE5 has the best classification effect: its accuracy rate is 55.23 % under the excitation function Sig, 3.3 times the accuracy rate of the random assignment. Even for the worst classification effect, the accuracy rate of BASE8 is 72.74 % under the excitation function Sig, 2.2 times the accuracy rate of the random assignment.
Training Time and Accuracy Rate Comparison

To verify the effectiveness of the proposed algorithm, we adopted ELM, back-propagation (BP), and support vector machine (SVM) to construct the classifier, according to the proposed classification framework and feature extraction algorithm. BP and SVM were realized through MATLAB and the LIBSVM package, respectively. LIBSVM, designed by Chih-Jen Lin et al. of National Taiwan University, is an effective package for SVM pattern recognition and regression whose advantage is its ease of use. The SVM uses a sigmoid kernel function. For the ELM algorithm, the number of hidden-layer nodes was set to 60; for BP, the number was 10; for the SVM algorithm, we set the penalty parameter C to 100.

Table 8 presents the training time and accuracy rate comparison for constructing the classifier on the SCOP database (BASE1–BASE4). The number of training data is 200; the differences in classification accuracy among the three algorithms are small, while the average training time of ELM is more than 10 times shorter than that of BP or SVM.

Table 9 presents the training time and accuracy rate comparison for constructing the classifier on the compound database (BASE5–BASE8). In this experiment, we increase the training data to around 10,000. The superiority of the ELM algorithm is obvious: the differences in classification accuracy among the three algorithms are small, while the training time of ELM is more than 100 times shorter than that of BP or SVM.

Conclusions

With the use of advanced hardware and software technologies, uncertain big data are widely produced in many applications. As a basic data structure, the graph is suitable for modeling some kinds of uncertain big data. The task of classifying uncertain graph data, which is one of the most important components of cognitive computation, now faces huge challenges. In this paper, we investigate the problem of efficiently classifying uncertain graphs and propose a framework to classify uncertain graph objects. In this framework, we first extend the gSpan algorithm to mine all frequent subgraphs from an uncertain graph database; these frequent subgraphs form a candidate set of features. Then, a subset of discriminative features is extracted from the candidate set as the final features. Finally, based on the ELM technique, we train the multiclass classifier over uncertain graph data objects. A large number of experiments were carried out to verify the efficiency and effectiveness of our proposed algorithms.
Acknowledgments This research was partially supported by the National Natural Science Foundation of China under Grant Nos. 61173029 and 61272182; New Century Excellent Talents in University (NCET-11-0085).
References 1. Taylor JG. Cognitive computation. Cogn Comput. 2009;1(1): 4–16. 2. Wollmer M, Eyben F, Graves A, Schuller B, Rigoll G. Bidirectional LSTM networks for context-sensitive keyword detection in a cognitive virtual agent framework. Cogn Comput. 2010;2(3):180–90. 3. Mital P, Smith T, Hill R, Henderson J. Clustering of gaze during dynamic scene viewing is predicted by motion. Cogn Comput. 2011;3(1):5–24. 4. Cambria E, Hussain A. ‘‘Sentic computing: techniques, tools, and applications’’, SpringerBriefs in cognitive computation. Dordrecht: Springer; 2012. 5. Wang Q, Cambria E, Liu C, Hussain A. Common sense knowledge for handwritten Chinese recognition. Cogn Comput. 2013;5(2):234–42. 6. Xu Y, Guo R, Wang L. A twin multi-class classification support vector machine. Cogn Comput. 2013;5(4):580–8. 7. Tsivtsivadze E, Urban J, Geuvers H, Heskes T. Semantic graph kernels for automated reasoning. In: SDM, 2011. pp. 795–803. 8. Tamas H, Thomas G, Stefan W. Cyclic pattern kernels for predictive graph mining. In: KDD, 2004. pp. 158–167. 9. Thomas G, Peter F, Stefan W. On graph kernels: hardness results and efficient alternatives. In: COLT, 2003. pp. 129–143. 10. Jin N, Young C, Wang W. GAIA: graph classification using evolutionary computation. In: SIGMOD, 2010. pp. 879–890. 11. Thoma M, Cheng H, Gretton A, Han J, Kriegel H-P, Smola AJ, Song L, Yu PS, Yan X, K.M. Borgwardt: near-optimal supervised feature selection among frequent subgraphs. In: SDM, 2009. pp. 1075–1086. 12. Jin N, Young C, Wang W. Graph classification based on pattern co-occurrence. In: CIKM, 2009. pp. 573–582. 13. Kong X, Yu PS. Semi-supervised feature selection for graph classification. In: SIGKDD, 2010. pp. 793–802. 14. Bifet A, Holmes G, Pfahringer B, Gavald R. Mining frequent closed graphs on evolving data streams. In: SIGKDD, 2011. pp. 591–599. 15. Yan X, Han J. gSpan: graph-based substructure pattern mining. ICDM, 2002. pp. 721–724. 16. Parthasarathy S, Tatikonda S, Duygu U. A survey of graph mining techniques for biological datasets. In: Managing and mining graph data, 2010. pp. 547–580. 17. Jiang C, Coenen F, Zito M. A survey of frequent subgraph mining algorithms. In: Knowledge Engineering Review, 2013. pp. 75–105. 18. Thoma M, Cheng H, Gretton A, Han J, Kriegel HP, Smola A, Orgwardt KM. Discriminative frequent subgraph mining with optimality guarantees. Stat Anal Data Min. 2010;3(5):302–18. 19. Shelokar P, Quirin A, Cordn O. MOSubdue: a Pareto dominancebased multiobjective subdue algorithm for frequent subgraph mining. Knowl Inf Syst. 2013;34(1):75–108. 20. Huang G-B, Zhu QY, Siew CK. Extreme learning machine: theory and applications. Neurocomputing. 2006;70(1):489–501.
21. Zhao Z, Chen Z, Chen Y, Wang S, Wang H. A class incremental extreme learning machine for activity recognition. Cogn Comput. 2014. doi:10.1007/s12559-014-9259-y. 22. Wang X, Shao Q, Miao Q, Zhai J. Architecture selection for networks trained with extreme learning machine using localized generalization error model. Neurocomputing. 2013;102:3–9. 23. Huang G-B, Zhou H, Ding X, Zhang R. Extreme learning machine for regression and multiclass classification. IEEE Trans Syst. 2012;42(2):513–29. 24. Miche Y, Sorjamaa A, Bas P, Simula Q, Jutten C, Lendasse A. OP-ELM: optimally pruned extreme learning machine. Neural Netw. 2010;21(1):158–62. 25. Mishra A, Goel A, Singh R, Chetty G, Singh L. A novel image watermarking scheme using extreme learning machine. In: Neural Networks (IJCNN), 2012. pp. 1–6. 26. Huang G-B, Wang DH, Lan Y. Extreme learning machines: a survey. Int J Mach Learn Cybern. 2011;2(2):107–22.
27. Zong W, Huang G-B. Learning to rank with extreme learning machine. Neural Process Lett. 2013;39(2):1–12. 28. Zong W, Huang G-B, Chen Y. Weighted extreme learning machine for imbalance learning. Neurocomputing. 2013;101:229–42. 29. Fletcher R. Practical methods of optimization. John Wiley & Sons, 2013. p. 2. 30. Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to algorithms. In: Constrained optimization, 2001. 31. The protein structure. Retrieved May 6, 2013 from http://www.rcsb.org/pdb/. 32. Structural classification of proteins. Retrieved May 10, 2013 from http://scop.mrc-lmb.cam.ac.uk/scop/. 33. The database of compound structures. Retrieved May 8, 2013 from http://pubchem.ncbi.nlm.nih.gov.