A survey of Uncertainty Handling in Frequent Subgraph Mining Algorithms

Mohamed Moussaoui
BESTMOD Laboratory, Higher Institute of Management and Central Polytechnic School of Tunis, Tunisia
Email: [email protected]

Montaceur Zaghdoud
Information System Department, Prince Sattam bin Abdulaziz University, KSA
Email: [email protected]

Jalel Akaichi
Information System Department, King Khalid University, KSA
Email: [email protected]

Abstract—Frequent subgraph mining is useful in most knowledge discovery tasks such as classification, clustering and indexing. Many algorithms and methods have been developed to mine frequent subgraphs. To understand the various frequent subgraph mining algorithms, it is advantageous to establish a common framework for their study. In this paper, we propose a comparative study of several approaches by focusing on the intrinsic characteristics of these algorithms. A set of existing approaches in the literature are reviewed and categorized according to the nature of their input, which can be exact or uncertain graphs.

I. INTRODUCTION

Graphs are commonly used to model complex structures such as protein interaction networks in biology, where each node represents a protein and each edge represents the presence of an interaction [1]. Generally, any system comprised of entities having connections between them can be represented as a graph, where the entities are represented by nodes and the relationships are represented by edges. The expressive power of graphs encourages their use in extremely diverse domains. For example, in biology, biochemical networks such as metabolic pathways and the genetic regulation known as signal transduction networks constitute significant graphs of interactions. Likewise, graphs are used in chemical data processing, where a molecule structure is described as a graph, which implies that molecule catalogs are processed as graph sets. In the Internet area, the rise of social networks has shown the need to model social interactions as graphs [2].

Data mining is the process of discovering patterns or models from data. The patterns often consist of previously unknown and implicit information and knowledge embedded within a data set. Graph mining is nowadays an active and attractive data mining research axis due to the increasing demand for the analysis of complex structured data. The first problem that arises is whether classical data mining techniques are still applicable to graph models. Furthermore, most useful data mining measures, such as similarity and distance, cannot be defined for graphs as intuitively as for multidimensional data. As a matter of fact, graph mining algorithms are more challenging to implement because of the structural nature of the data.


The second challenge that arises in many applications of graph mining is that the graphs are usually very large in scale. Indeed, in practice, such graphs can reach a significant size, as in social networks or interaction graphs [2]. One way of describing one large graph or several graphs is to extract all the subgraphs that frequently occur in a set of graphs. In this context, many frequent subgraph mining algorithms have been proposed, which are based on exact or uncertain graph databases. Therefore, the choice of an appropriate algorithm for any particular graph mining application becomes a hard task because of the constraints and other requirements of the application or its environment. This paper discusses many graph mining methods and compares them according to a set of criteria, with specific attention given to uncertain reasoning in graph mining applications.

II. ALGORITHMS WITHOUT UNCERTAINTY

In this section, we present algorithms for mining frequent subgraphs from a set of exact graphs. Generally, there are two types of approaches for mining frequent subgraphs, namely the Apriori-based approach and the pattern growth approach. The distinction between these two types lies in the way candidate subgraphs are constructed. Both approaches start with the construction of candidate subgraphs of small size, and then proceed in a bottom-up manner, increasing the size of the candidate subgraphs at each iteration.

A. Apriori-based Approaches

The basic principle of Apriori-based approaches is to minimize the number of occurrences (subgraph isomorphisms) to check. The main idea of Apriori-based substructure mining algorithms is to generate candidate subgraphs of size k+1 by joining two frequent subgraphs of size k. Apriori-based frequent subgraph algorithms are described below.

AGM algorithm. AGM [3] is the first algorithm based on the Apriori property for mining frequent subgraphs. AGM extracts both connected and unconnected frequent subgraphs. The main contribution of this approach lies in the way candidate subgraphs are constructed. Indeed, AGM generates a candidate subgraph by adding a vertex to a frequent subgraph. This expansion is made by joining two frequent subgraphs that differ only slightly.
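To make the level-wise, join-based scheme described above concrete, the following Python sketch illustrates the generic Apriori-style loop on a deliberately simplified representation: each graph is reduced to a set of labeled edges, so set containment stands in for subgraph isomorphism and no canonical-form or connectivity check is performed. It is an illustrative sketch only, not the actual AGM or FSG procedure.

    from itertools import combinations

    def apriori_subgraph_mining(db, minsup):
        # db: list of graphs, each simplified to a frozenset of labeled edges.
        # minsup: absolute support threshold.
        # Level 1: frequent single-edge patterns.
        counts = {}
        for g in db:
            for e in g:
                p = frozenset([e])
                counts[p] = counts.get(p, 0) + 1
        frequent = {p for p, s in counts.items() if s >= minsup}
        result = set(frequent)
        k = 1
        while frequent:
            # Join step: two frequent size-k patterns sharing k-1 edges
            # yield a size-(k+1) candidate (Apriori-style generation).
            candidates = {p | q for p, q in combinations(frequent, 2)
                          if len(p & q) == k - 1}
            # Support counting and pruning.
            frequent = set()
            for c in candidates:
                if len(c) != k + 1:
                    continue
                support = sum(1 for g in db if c <= g)
                if support >= minsup:
                    frequent.add(c)
            result |= frequent
            k += 1
        return result

    # Toy usage: an edge is a (label_u, edge_label, label_v) triple.
    db = [frozenset({("C", "-", "O"), ("C", "-", "N")}),
          frozenset({("C", "-", "O"), ("C", "-", "N"), ("N", "-", "H")}),
          frozenset({("C", "-", "O")})]
    print(apriori_subgraph_mining(db, minsup=2))

The Apriori property is visible in the join step: a pattern of size k+1 can only be frequent if the two size-k patterns it is built from are frequent.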

FSG algorithm. Similarly to AGM, FSG [4] is also one of the first methods based on the Apriori property. The main difference between AGM and FSG is that AGM generates a candidate subgraph of size k by adding a vertex to a frequent subgraph of size k-1, whereas FSG generates a candidate subgraph of size k by adding an edge to a frequent subgraph of size k-1.

FFSM algorithm. FFSM [5] uses the Canonical Adjacency Matrix (CAM) to represent labeled graphs and to detect their isomorphism. It generates patterns not by extension but by joining operators adapted to the proposed canonical encoding. Since a single graph can be represented by several CAMs, the authors of FFSM proposed using the maximal code for the isomorphism test, and an embedding set is maintained for each frequent subgraph.

SPIN algorithm. SPIN, proposed by Huan et al. [6], mines only the maximal frequent subgraphs of a set of large graphs. A frequent subgraph g is maximal if none of its proper supergraphs is frequent. To extract the maximal frequent subgraphs from a set of graphs, a new framework was created which combines the exploration of both trees and subgraphs: all frequent trees of the graph set are discovered first, and the frequent subgraphs are then built from the extracted trees. Experiments have shown the effectiveness of this algorithm, which exponentially reduces the number of frequent subgraphs discovered. SPIN offers very good scalability to large graph databases.

B. Pattern Growth Approaches

In the pattern growth approach, candidate subgraphs of size k are constructed by extending the frequent subgraphs of size k-1 by a node or an edge in all possible positions. Pattern growth approaches are less expensive and faster. Pattern growth based frequent subgraph algorithms are described below.

Subdue algorithm. Subdue [7] searches for frequent substructures in a single graph. Subdue addresses both frequent pattern discovery and compression of the graph data set, and performs a heuristic search using the minimum description length principle.

gSpan algorithm. gSpan [8] is the most popular algorithm based on the pattern growth strategy. This approach makes two contributions: the first is to effectively minimize the number of candidate subgraphs, and the second is to accelerate the isomorphism phase. To minimize the number of candidate subgraphs, gSpan uses a concept called rightmost extension, where a new candidate subgraph is made by adding a new edge between the rightmost node and another node on the rightmost path of a graph. Given a graph G and one of its depth-first search trees T, the rightmost path of G with respect to T is the rightmost path of the tree T. gSpan chooses only one depth-first search tree T, the one that produces the canonical form of G, for extension.
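To contrast the two families, here is a minimal pattern-growth sketch in the same simplified edge-set setting as the Apriori sketch above. A sorted tuple of edges plays the role of gSpan's canonical (minimum DFS) code for duplicate detection; the real algorithm restricts growth to rightmost-path extensions of a DFS code, which this sketch does not implement.

    def pattern_growth_mining(db, minsup):
        # db: list of graphs simplified to frozensets of labeled edges.
        def support(pattern):
            return sum(1 for g in db if pattern <= g)

        seen, result = set(), set()

        def grow(pattern):
            key = tuple(sorted(pattern))   # stand-in for a canonical code
            if key in seen:
                return                     # duplicate: already expanded
            seen.add(key)
            result.add(pattern)
            # Extend the pattern by one edge taken from any graph containing it.
            for g in db:
                if pattern <= g:
                    for e in g - pattern:
                        ext = pattern | frozenset([e])
                        if support(ext) >= minsup:
                            grow(ext)

        for e in {e for g in db for e in g}:   # frequent single edges as seeds
            seed = frozenset([e])
            if support(seed) >= minsup:
                grow(seed)
        return result

The key design difference with the Apriori sketch is that no join between two patterns is ever performed: each frequent pattern is extended directly, and duplicates are filtered by the canonical key instead of by candidate generation discipline.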

MoFa algorithm. MoFa [9] has been specifically designed to identify fragments in a set of molecules, but it can be used in other application contexts. In this approach, new candidate subgraphs are constructed by extending the discovered frequent subgraphs with an edge and, if necessary, a node. To reduce the number of isomorphism tests, MoFa checks for each candidate subgraph whether it has already been discovered. It saves all the possible extensions, which can be voluminous, especially when dealing with small fragments, and this justifies its excessive memory consumption.

CloseGraph algorithm. CloseGraph [10] was developed by Yan and Han in 2003 as an improvement of gSpan. A frequent subgraph G is closed if there is no proper supergraph of G with the same support as G. With a simple extension of the gSpan algorithm and without losing any information, this method is faster than gSpan and discovers a much smaller number of subgraphs. The concept of closed subgraph not only avoids producing unnecessary subgraphs, but also substantially increases the efficiency of mining, particularly in the presence of large graph patterns.
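The two output restrictions used by SPIN and CloseGraph can be stated formally; the following is a standard formulation rather than a quotation from either paper. Writing sup(g) for the number of database graphs containing a subgraph isomorphic to g and minsup for the support threshold:

\[
g \text{ is closed} \iff \nexists\, g' \supsetneq g \text{ with } \mathrm{sup}(g') = \mathrm{sup}(g),
\qquad
g \text{ is maximal} \iff \nexists\, g' \supsetneq g \text{ with } \mathrm{sup}(g') \geq minsup .
\]

Every maximal frequent subgraph is closed, so the three result sets are nested (all frequent ⊇ closed ⊇ maximal); the maximal set is the most compact, at the price of losing the exact support of the non-maximal patterns.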

Gaston algorithm. Gaston [2] stores the possible extensions of candidate subgraphs in order to generate new subgraphs that effectively appear in the database, which allows rapid isomorphism tests and accelerates the search for frequent subgraphs. By analysing all frequent substructures, it can be observed that this set is composed of a majority of trees and paths and a minority of general subgraphs, and that searching for frequent subgraphs is more expensive than searching for frequent trees or paths. The main idea of Gaston is therefore to separate the different types of structures (path, tree, graph) and to establish a dedicated procedure for each type in order to accelerate the search for frequent substructures. The gain in time corresponds to the difference in complexity between mining frequent subgraphs, frequent trees and frequent paths. However, this strategy can waste time and memory.

RING algorithm. RING [11] finds frequent subgraphs by measuring the distance between graphs. RING uses an invariant-vector methodology for subgraph pattern mining and uses an invariant distance (the Euclidean distance) between the graphs instead of the edit distance or other types of graph distance. The patterns are mapped to invariant vectors in a multidimensional space. Two important steps of the RING algorithm can be cited: first, it extracts a random set of frequent subgraphs and partitions them into groups, then selects the centers of the clusters (groups) as representative patterns. Second, it adopts a depth-first search (DFS) algorithm to mine representative subgraphs within a predefined space limit.

RP-GD algorithm. RP-GD [12] generates the set of representative patterns directly from a set of graphs instead of from all frequent closed subgraph patterns. RP-GD uses the delta-jump pattern technique, a DFS search strategy and the rightmost extension. RP-GD cannot provide a guaranteed ratio bound, but it is more efficient than RP-FP. Generally, RP-GD can select a representative set by scanning the closed frequent subgraphs.

RP-FP algorithm. RP-FP [13] derives a representative set from the frequent closed subgraphs. It works well when the size of the set of frequent closed patterns is not very large. However, in real applications where the number of frequent closed patterns is usually very large, the RP-FP algorithm does not scale well. It is therefore important to use an alternative algorithm that finds all representative graph patterns effectively while maintaining the quality of the results.

III. ALGORITHMS FOR UNCERTAIN GRAPHS

Recent research has shown that uncertainty is inherent in data in general and in graphs in particular, such as social networks, chemical compounds, proteins, etc. This uncertainty can be a source of error in many applications. For example, graph theory is widely used to analyze social networks because of its representational power and its simplicity; however, uncertainty arises for various reasons in large social networks [26]. The probability of an edge can represent uncertainty in link prediction [27] or the influence of one person on another [28], [29]. Several methods are dedicated to detecting interactions between proteins, but these methods produce a significant amount of noisy interactions that do not really exist, which reflects the inaccurate nature of protein-protein interactions (PPIs). In this case, it is advisable to use an uncertain graph to represent a PPI network, where the uncertainty of each edge represents the probability that the interaction exists in practice [30]. Indeed, an uncertain graph [31] mainly carries a probability distribution that indicates whether an edge or a node exists in practice. Various problems of mining uncertain graphs are discussed below.
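Before turning to the individual algorithms, it helps to fix notation. The following is a hedged formalization of the model commonly used in this line of work (the exact definitions vary slightly from paper to paper, e.g. [31], [32]): an uncertain graph is written G = (V, E, P), where P assigns each edge an independent existence probability, so that G encodes a distribution over exact graphs ("possible worlds") obtained by keeping or dropping each edge. Denoting by E_g the edges kept in a possible world g,

\[
\Pr(G \Rightarrow g) \;=\; \prod_{e \in E_g} P(e) \;\prod_{e \in E \setminus E_g} \bigl(1 - P(e)\bigr).
\]

An uncertain graph with k uncertain edges therefore implicates 2^k exact graphs, which is the source of the hardness results mentioned below.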

MUSE algorithm. MUSE [32], proposed by Zou, Li, Gao and Zhang, was the first algorithm developed to find an approximate set of frequent subgraph patterns from uncertain graph data. This algorithm introduced a new measure called expected support, which is used to evaluate each candidate subgraph and is approximated using a Monte Carlo algorithm [33]. Thus, MUSE is essentially based on a support threshold minsup and an error tolerance ε ∈ [0, 1] to find an approximate set of frequent subgraphs. All subgraphs with an expected support greater than or equal to minsup are frequent, whereas all subgraphs with an expected support less than (1 - ε) * minsup are not frequent; the decision is arbitrary for subgraphs whose expected support falls in [(1 - ε) * minsup, minsup].

Weighted MUSE algorithm. Weighted MUSE [34] is an extended version of the MUSE algorithm, mainly dedicated to extracting frequent subgraphs from DBLP (Digital Bibliography & Library Project) data. This method provides an alternative way to compute the recurrence frequency of a subgraph in uncertain graph data by assigning a weight factor w ∈ (0, 1) to the edges of the embeddings of each subgraph pattern. This avoids recomputing the expected support of a subgraph that shares edges with another subgraph assessed previously.
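The expected support of a pattern S over an uncertain database is the expectation, over possible worlds, of S's support; since enumerating 2^k worlds is infeasible, MUSE approximates it by sampling. The Python sketch below illustrates that idea on the simplified edge-set representation used earlier; the sampling loop, the edge-independence assumption and the containment test standing in for subgraph isomorphism are simplifications of this survey's illustration, not MUSE's actual procedure.

    import random

    def sample_world(ug):
        # ug: dict mapping each labeled edge to its existence probability.
        return frozenset(e for e, p in ug.items() if random.random() < p)

    def expected_support(pattern, db, n_samples=1000):
        # Monte Carlo estimate of E[support(pattern)] over an uncertain database.
        # db: list of uncertain graphs (edge -> probability dicts).
        total = 0
        for _ in range(n_samples):
            total += sum(1 for ug in db if pattern <= sample_world(ug))
        return total / n_samples

    # Toy usage with two uncertain graphs; the true expected support is 0.9 + 0.2 = 1.1.
    db = [{("A", "-", "B"): 0.9, ("B", "-", "C"): 0.5},
          {("A", "-", "B"): 0.2}]
    print(expected_support(frozenset({("A", "-", "B")}), db))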

Top-k Maximal Cliques algorithm (TMC). The Top-k Maximal Cliques algorithm was proposed by Zou et al. in 2010 [35]. In graph theory, a clique in an undirected graph G = (V, E) is a set of vertices such that every two vertices in the set are connected by an edge, and a maximal clique is a clique that is not contained in any larger clique. Due to the uncertain nature of the data, a set of nodes does not form a maximal clique in all the implicated graphs; rather, each vertex set has a probability of being a maximal clique over the implicated graphs, known as its maximal-clique probability. The top-k maximal cliques are then defined as the k vertex sets with the highest maximal-clique probability.

MUSE-P algorithm. Another method was proposed by [36] in the context of probabilistic semantics. It is essentially based on a measure called the ϕ-frequent probability, which evaluates the degree of recurrence of a subgraph in an uncertain graph database, and it also takes an error tolerance ε ∈ [0, 1] into account. Generally, all subgraphs with a ϕ-frequent probability greater than or equal to a minimum threshold τ ∈ [0, 1] can be reported as frequent, while all subgraphs with a ϕ-frequent probability less than τ - ε cannot be frequent. It is rigorously proved that computing the ϕ-frequent probability of a subgraph S in a set of uncertain graphs is #P-hard. For this reason, an approximate mining algorithm is proposed to compute an (ε, δ)-approximate set of frequent subgraphs, in which each frequent subgraph is returned with probability at least ((1 - δ)/2)^s, where δ is a prespecified parameter and s is the number of edges of the subgraph S.
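The probabilistic semantics can be written down explicitly; the following is a standard formulation with adapted notation, not a quotation from [36]. For an uncertain database D of n graphs, with a possible world W = (g_1, ..., g_n) drawn edge-wise as in the formalization above, and sup_W(S) the number of graphs of W containing a copy of S,

\[
\Pr_{\varphi}(S) \;=\; \Pr\!\bigl[\, \mathrm{sup}_W(S) \,\geq\, \varphi \cdot n \,\bigr],
\]

i.e. the probability, over possible worlds, that S is frequent in the classical sense with relative threshold ϕ; S is reported when this probability reaches τ. This contrasts with the expected-support semantics of MUSE, which instead thresholds the expectation of sup_W(S).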

UGRAP algorithm. UGRAP was proposed by [37] in order to significantly reduce the complexity of the computations required to determine the expected support of a candidate subgraph. UGRAP introduces an index of the uncertain graph database containing information about the edges of the graphs with their probabilities, together with connectivity information between the nodes of the graphs. This information is summarized in order to reduce memory requirements for large graph databases, particularly in the case of dense graphs, and the index is used later to calculate the expected support of each candidate subgraph pattern. UGRAP shows high efficiency in terms of time and memory consumption.

T-PS algorithm. T-PS [38] is a solution for answering threshold-based probabilistic subgraph search over uncertain graph databases. This approach uses a filter-and-verification framework to speed up the search using a probabilistic inverted index.

IV. UNCERTAINTY IN FREQUENT SUBGRAPH MINING

In this section, we present several approaches that handle uncertainty in frequent subgraph mining from a set of exact graphs.

GraphSig algorithm. GraphSig was proposed by [14] as a solution for mining discriminative subgraph patterns with low frequencies. The first step converts graphs into feature vectors through random walks with restart on each node; domain knowledge is used to select a meaningful feature set. GraphSig assumes that graphs with similar feature vectors share highly frequent subgraphs. Therefore, prior probabilities of features are computed empirically to estimate the statistical significance of patterns in the feature space. Then, it mines frequent subgraphs in each group with high frequency thresholds, which greatly reduces the computation cost. Results show that GraphSig is able to find discriminative patterns in large graph datasets, even with low frequencies.

OSS algorithm. In [15], the authors proposed an approach for sampling interesting subgraph patterns without enumerating the entire set of candidate frequent patterns. This approach is useful in cases where traditional approaches fail to run. The sampling distribution is predefined by the user, and sampling is performed by a random walk on the partial order of candidate subgraphs. When the walk converges to the desired distribution, the algorithm returns the discovered subgraph samples and stops. Although the authors successfully performed experiments on small and large graph datasets, their approach stores the entire database in memory, which makes it inefficient if the database does not fit in memory.

Musk algorithm. Musk [16] suggests an alternative formulation for representative frequent patterns. It obtains representative patterns by sampling uniformly from all maximal frequent patterns using a variant of the Markov Chain Monte Carlo (MCMC) algorithm. Uniformity is achieved in the stationary distribution of the walk, in which all maximal frequent pattern nodes of the partial order graph are collected uniformly.

Cork algorithm. CORK [17] is a method based on the binary classification of subgraphs. The main idea of CORK is to discover frequent subgraphs that eliminate the correspondences between the graphs of the positive and negative classes. CORK uses the number of correspondences to measure the discrimination power of a subgraph pattern and thereby achieves a theoretically near-optimal solution. Given a set of subgraphs, the number of correspondences is the total number of pairs of graphs that cannot be discriminated by these subgraphs. This measure usually gives good results; however, it is not perfect, because subgraphs with different discriminative power may have the same number of correspondences.
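The correspondence count described above can be made explicit; this is a sketch of the idea in this survey's notation rather than CORK's exact objective. Let D⁺ and D⁻ be the positive and negative graph sets and, for a feature set F of subgraphs, let I_F(g) ∈ {0,1}^{|F|} be the containment indicator vector of graph g. Then

\[
\mathrm{corr}(F) \;=\; \bigl|\{ (g^{+}, g^{-}) \in D^{+} \times D^{-} \;:\; I_F(g^{+}) = I_F(g^{-}) \}\bigr| ,
\]

and subgraph features are selected so as to minimize corr(F), i.e. to leave as few indistinguishable cross-class pairs as possible.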

gPLS algorithm. gPLS [18] attempts to select the most discriminative frequent subgraphs. It uses subgraph pattern mining to discover features for partial least squares regression, so that the mining process only searches for patterns that can improve the accuracy of the resulting classifier. gPLS is based on the mathematical concept of PLS (Partial Least Squares): it is an iterative algorithm based on partial least squares regression, which iteratively extracts latent attributes from a large attribute space to allow better prediction. An interesting point of PLS is that it depends only on basic matrix computations (e.g., addition and multiplication); therefore, it is more effective than other methods based on mathematical programming. The mining algorithm is called iteratively with different weight vectors. However, latent variables have the well-known disadvantage of poor interpretability.

Leap algorithm. Leap [19] is another approach developed to extract the most discriminative subgraph patterns. It is designed to exploit the correlation between structural similarity and significance similarity, so that the most significant pattern can be identified quickly by skipping over similar branches of the search space. Leap uses two new concepts, structural leap search and frequency-descending mining. The first exploits the fact that structurally similar subgraph patterns have similar discrimination power, which can be used to compute a tight upper bound on their discrimination scores. The second takes advantage of the observation that subgraphs with a higher frequency are more likely to be discriminative, and can therefore reach the optimal solution faster.

COM algorithm. COM [20] is a graph classification method which follows a process of pattern mining and classifier learning. COM employs a pattern exploration order such that complementary discriminative patterns are examined first. Based on subgraph co-occurrence information, it constructs classification rules by assembling weak features in order to generate strong ones. Patterns are grouped into co-occurrence rules during pattern exploration, leading to an integrated process of pattern mining and classifier learning. Evaluation of COM on protein and chemical compound datasets showed competitive results in terms of classification accuracy and execution time; besides, it produces an interpretable classifier.

ORIGAMI algorithm. ORIGAMI [21] is an effective method for the extraction of representative patterns that selects representatives based on distances in the pattern space. ORIGAMI first uses a randomized algorithm to browse the search space, looking for areas that are not yet explored, in order to return a random sample of maximal frequent subgraphs. It then extracts an α-orthogonal, β-representative set from the extracted maximal patterns. However, the selection in ORIGAMI is straightforward and is performed only after discovering the maximal frequent subgraphs.

GraphRank algorithm. In [22], the authors proposed a technique for evaluating the statistical significance of frequent subgraphs in a graph database. The main idea of this approach is to transform a subgraph into a feature vector and to compute the importance of each subgraph taking into account the importance of the presence of the corresponding vector. In order to obtain a probability distribution on the support of the vector, GraphRank uses the probability of the vector in a random vector database based on the prior probabilities of the basic elements. Results show that GraphRank is efficient and useful for ranking frequent subgraphs by their statistical significance.

GTRACE algorithm. GTRACE was proposed by [23]. In this work, the authors propose a novel class of graph subsequences by introducing axiomatic rules of graph transformation, their admissibility constraints and a union graph. The main idea of this approach is to enumerate frequent transformation subsequences (FTSs) of graphs from a given set of graph sequences. Experiments on artificial datasets show a satisfactorily high success rate.

EVDENSE algorithm. In [24], the authors proposed a technique to mine frequent dense subgraphs from PPI networks by extending vertices. In this work, the frequent patterns are generated based on a relative support, due to the unbalanced character of PPI networks. Evaluation of EVDENSE on PPI datasets showed that it is effective at mining frequent coherent dense subgraphs across a large number of massive graphs.

GIS algorithm. In [25], the authors were interested in mining frequent substructures in directed labelled graphs. The main idea is to reduce the size of the graph database using an equivalence class principle. The authors presented a combination of an L-R join operation and serial and mixed extensions to generate candidate substructures, and showed that GIS does not miss any candidate substructure. Besides, a modified adjacency list representation is used for storing the graphs and substructures efficiently.

V. COMPARING GRAPH MINING ALGORITHMS

The graph mining literature contains several algorithms that have been developed using different approaches and that differ in graph representation, nature of input and output, search strategy and completeness of output. We propose a framework for classifying them in order to help in understanding and analyzing their properties.

A. Nature Of Input

The nature of the input may differ from one algorithm to another. It may be a database consisting of many small graphs, such as chemical molecules, or a single big graph generated from the association of many small subgraphs, such as a social network.

B. Nature Of Output

The main objective of each algorithm is to extract a reduced set of frequent subgraphs. Indeed, the nature of the output differs from one algorithm to another.

C. Graph Representation

The graph representation is one of the attributes with the greatest influence on memory consumption and runtime. Generally, graphs can be represented by an adjacency matrix, an adjacency list, a hash table or a trie data structure [39].

D. Search Strategy

The search strategy is categorized into DFS and BFS. BFS visits the vertices level by level, i.e., it does not visit any vertex of level n+1 before it has explored all vertices of level n; searching in this manner covers all nodes of the graph. DFS, in contrast, recursively explores the successors of each visited vertex, going as deep as possible along each branch before backtracking. The disadvantage of depth-first search is that the likelihood of redundancy among the produced subgraphs is very high.

E. Completeness Of Output

The completeness of the search is a very interesting attribute for defining the nature of the search for frequent subgraphs in graph databases. In fact, there are two techniques for extracting the set of frequent subgraphs: the first extracts the complete set of frequent subgraphs, while the second returns only a partial set. Thus, there is a tradeoff between performance and completeness, and it is necessary to consider that mining all frequent subgraphs is not always effective.

F. Summary

As there are currently many subgraph mining approaches, it is difficult and even unfair to compare them in general, since the majority of the approaches were originally designed to solve a particular issue. Hence, the choice of an appropriate method highly depends on the users' needs and the application constraints. TABLE I lists the subgraph mining approaches without uncertainty in order to assist such a choice. Approximation is usually used when the exact result is unknown or difficult to obtain, such that the obtained inexact result is within the required limits of accuracy; generally, uncertainty in subgraph mining is handled by structural approximation or label approximation. TABLE II lists the approaches that handle uncertainty in frequent subgraph mining, and TABLE III summarizes the related works on discovering frequent subgraphs in uncertain graph databases.

TABLE I: ALGORITHMS WITHOUT UNCERTAINTY

Alg | Input | Output | Graph represent. | Search Strategy | Com. search | Limit.
MoFa | Exact graphs | Frequent subgraphs | Adjacency list | DFS | Yes | Frequent graphs generated may not be exactly frequent
AGM | Exact graphs | Frequent subgraphs | Adjacency matrix | BFS | Yes | High complexity due to multiple candidate generation
FSG | Exact graphs | Connected subgraphs | Adjacency list | BFS | Yes | High complexity due to multiple candidate generation
FFSM | Exact graphs | Frequent subgraphs | Adjacency matrix | BFS | Yes | NP-complete problem
SUBDUE | Exact graphs | Frequent subgraphs | Adjacency matrix | DFS | Yes | Inefficient in dense databases
gSpan | Exact graphs | Frequent subgraphs | Adjacency list | DFS | Yes | Unable to process large datasets
Gaston | Exact graphs | Maximal subgraphs | Hash table | DFS | Yes | Interesting patterns may be lost
CloseGraph | Exact graphs | Closed subgraphs | Adjacency list | DFS | Yes | Failure detection takes a lot of time overhead
SPIN | Exact graphs | Maximal subgraphs | Adjacency matrix | DFS | Yes | Interesting patterns may be lost
RP-FP | Exact graphs | Representative subgraphs | Adjacency list | DFS | Yes | Time for summarizing the patterns exceeds that for mining
RP-GD | Exact graphs | Representative subgraphs | Adjacency list | DFS | Yes | Time for summarizing the patterns exceeds that for mining
RING | Exact graphs | Representative subgraphs | Adjacency matrix | DFS | No | Needs post-processing

TABLE II: UNCERTAINTY IN FREQUENT SUBGRAPH MINING

Alg | Input | Output | Graph represent. | Search Strategy | Com. search | Limit.
Cork | Exact graphs | Discriminative subgraphs | Adjacency list | DFS | Yes | Not scalable
gPLS | Exact graphs | Discriminative subgraphs | Adjacency matrix | DFS | No | Unable to process large datasets
Leap | Exact graphs | Discriminative subgraphs | Adjacency list | DFS | No | Not suitable for unlabelled graphs
COM | Exact graphs | Discriminative subgraphs | Canonical adjacency matrix | DFS | Yes | Unable to process large datasets
ORIGAMI | Exact graphs | Representative subgraphs | Similarity matrix | DFS | No | May result in poor quality
Musk | Exact graphs | Representative subgraphs | Transition probability matrix | DFS | No | Sampling is limited to maximal patterns only
GraphSig | Exact graphs | Discriminative subgraphs | Feature vector | DFS | Yes | Lacks discriminative subgraph patterns of low frequency
OSS | Exact graphs | Sample of frequent subgraphs | Transition probability matrix | DFS | No | Inefficient if the database does not fit in memory
GraphRank | Exact graphs | Significant subgraphs | Feature vector | DFS | Yes | Needs pre-processing and not scalable
GTRACE | Exact graphs | Frequent subsequences of graphs | Adjacency list | DFS | Yes | Needs pre-processing; inefficient in dense databases
EVDENSE | Exact graphs | Frequent dense patterns | Adjacency list | DFS | Yes | High execution time
GIS | Exact graphs | Frequent substructures | Adjacency list | DFS | Yes | High memory consumption

TABLE III: ALGORITHMS FOR UNCERTAIN GRAPHS

Alg | Input | Output | Graph represent. | Search Strategy | Com. search | Limit.
MUSE | Uncertain graphs | Frequent subgraphs | Adjacency matrix | BFS | Yes | Difficulty of calculating the expected support of each candidate pattern
Weighted MUSE | Uncertain graphs | Frequent subgraphs | Adjacency matrix | BFS | No | Lack of scalability with large graphs
MUSE-P | Uncertain graphs | Probabilistic frequent subgraphs | Probabilistic matrix | BFS | No | Not scalable
TMC | Uncertain graphs | Top-k Maximal Cliques | Adjacency matrix | BFS | Yes | Interesting patterns may be lost
UGRAP | Uncertain graphs | Frequent subgraphs | Adjacency list | BFS | Yes | Large number of subgraph isomorphism tests required
T-PS | Uncertain graphs | Probabilistic subgraphs | Probabilistic matrix | DFS | No | Does not consider edge correlations
DCR | Uncertain graphs | Probabilistic subgraphs | Probabilistic matrix | DFS | Yes | Does not consider edge correlations
MFCS | Uncertain graphs | Highly reliable subgraphs | Probabilistic matrix | DFS | Yes | Requires all vertices in each subgraph to be fully connected
Prolog | Uncertain graphs | Frequent patterns | Probabilistic matrix | DFS | Yes | Needs pre-processing and not scalable

VI. DISCUSSION

Frequent subgraph mining from graph databases is a difficult task for two main reasons. The first is the high complexity of candidate subgraph generation, which is known to be NP-complete [9]. The second is related to the computation of the support of each candidate subgraph: the latter requires a subgraph isomorphism test between each candidate and all the graphs in the database, and in addition, a graph isomorphism test between each candidate and all the already tested subgraphs is required in order to avoid duplicates. The task of frequent subgraph mining becomes even more difficult in the case of uncertain graphs. In fact, in all existing approaches, an uncertain graph with k edges implicates a set of 2^k exact graphs; consequently, one uncertain frequent subgraph mining task is equivalent to 2^k exact frequent subgraph mining tasks.

Frequent subgraph mining from uncertain graph databases is generally based on either expected semantics or probabilistic semantics. The expected semantics is more suitable for exploring motifs in a set of uncertain graphs, while the probabilistic semantics is more suitable for extracting features from a set of uncertain graphs. However, the probability formalism suffers from certain deficiencies when imprecise information has to be taken into account. Moreover, uncertainty may also be inherent when mining frequent subgraphs from exact graphs; one should therefore mine not only frequent subgraphs under exact similarity, but also extend the search to non-exact similarity. Because the similarities between subgraphs are often incomplete in social networks, we propose frequent subgraph mining in social networks using a possibilistic approach. This approach can effectively avoid the limitations of the traditional probabilistic treatment. In fact, there is an extensive formal correspondence between probability and possibility theories, where the addition operator corresponds to the maximum operator. Hence, the modelling of positive and negative information in possibility theory [40] can be exploited to make observations precise in a well-defined way: a disparity can be drawn between what is supposedly possible because it is not inconsistent with the available knowledge, and what is actually possible as revealed by observations.
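For reference, and as standard background not spelled out in the paper, the formal correspondence alluded to above can be stated with the usual possibility axioms: a possibility measure Π is maxitive where a probability measure P is additive, and the dual necessity measure N grades what is entailed by the available knowledge:

\[
\Pi(A \cup B) = \max\bigl(\Pi(A), \Pi(B)\bigr), \qquad
P(A \cup B) = P(A) + P(B) \ \text{for disjoint } A, B, \qquad
N(A) = 1 - \Pi(\bar{A}).
\]

In the positive/negative information reading of [40], N and Π are built from negative information (what is not ruled out by the available knowledge), while a separate guaranteed-possibility measure is built from positive information (what has actually been observed), which is exactly the disparity discussed above.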

Mainly, in frequent subgraph mining (uncertain and exact) there are three key steps: candidate generation, graph isomorphism and subgraph isomorphism. They represent the bottleneck, especially graph and subgraph isomorphism, and each method tries to resolve them in its own way. Yet, no known frequent subgraph mining method has completely resolved these issues, especially for large-scale applications.

VII. CONCLUSION

This paper reviews in detail popular approaches for mining frequent subgraphs, highlighting various algorithmic aspects. Most of the existing algorithms for frequent subgraph mining are dedicated to exact graph data, and very few of them are devoted to uncertain graphs.

REFERENCES

[1] S. Asthana, O. D. King, F. D. Gibbons, and F. P. Roth, “Predicting protein complex membership using probabilistic network reliability,” Genome Res., vol. 14, pp. 1170–1175, 2004. [2] Nijssen, “The gaston tool for frequent subgraph mining,” Electron. Notes Theor. Comput. Sci., vol. 127, no. 1, pp. 77–87, Mar. 2005. [3] Inokuchi, “An apriori-based algorithm for mining frequent substructures from graph data,” 2000, pp. 13–23. [4] M. Kuramochi and G. Karypis, “Frequent subgraph discovery,” in Proceedings of the 2001 IEEE International Conference on Data Mining, ser. ICDM ’01. Washington, DC, USA: IEEE Computer Society, 2001, pp. 313–320. [5] Jia, “A fast frequent subgraph mining algorithm,” in Young Computer Scientists, 2008. ICYCS 2008. The 9th International Conference for, Nov. 2008, pp. 82–87.

[6] Huan, “Spin: mining maximal frequent subgraphs from graph databases,” in Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’04. New York, NY, USA: ACM, 2004, pp. 581–586. [7] Cook, “Substructure discovery using minimum description length and background knowledge,” J. Artif. Int. Res., vol. 1, no. 1, pp. 231–255, Feb. 1994. [8] Yan, “gspan: Graph-based substructure pattern mining,” in Proceedings of the 2002 IEEE International Conference on Data Mining, ser. ICDM ’02. Washington, DC, USA: IEEE Computer Society, 2002, pp. 721–. [9] Borgelt, “Moss: a program for molecular substructure mining,” in Proceedings of the 1st international workshop on open source data mining: frequent pattern mining implementations, ser. OSDM ’05. New York, NY, USA: ACM, 2005, pp. 6–15. [10] Yan, “Closegraph: mining closed frequent graph patterns,” in Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’03. New York, NY, USA: ACM, 2003, pp. 286–295. [11] Shijie, “Ring: An integrated method for frequent representative subgraph mining,” in Data Mining, 2009. ICDM ’09. Ninth IEEE International Conference on, dec. 2009, pp. 1082 –1087. [12] Liu, “Efficient algorithms for summarizing graph patterns,” Knowledge and Data Engineering, IEEE Transactions on, vol. PP, no. 99, p. 1, 2010. [13] Jianzhong, “Efficient algorithms for summarizing graph patterns,” IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. PrePrints, 2010. [14] S. Ranu and A. K. Singh, “Graphsig: A scalable approach to mining significant subgraphs in large graph databases.” in ICDE, Y. E. Ioannidis, D. L. Lee, and R. T. Ng, Eds. IEEE, 2009, pp. 844–855. [15] R. M. Karp and M. Luby, “Monte-carlo algorithms for enumeration and reliability problems,” in Foundations of Computer Science, 1983., 24th Annual Symposium on, Nov 1983, pp. 56–64. [16] Zaki, “Effective graph classification based on topological and label attributes,” Statistical Analysis and Data Mining, vol. 5, no. 4, pp. 265– 283, Aug 2012. [17] Marisa, “Discriminative frequent subgraph mining with optimality guarantees,” Statistical Analysis and Data Mining, vol. 3, no. 5, pp. 302–318, 2010. [18] Saigo, “Partial least squares regression for graph mining,” in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, ser. KDD ’08. New York, NY, USA: ACM, 2008, pp. 578–586. [19] X. Yan, H. Cheng, J. Han, and P. S. Yu, “Mining significant graph patterns by leap search,” in in SIGMOD 08. [20] Jin, “Graph classification based on pattern co-occurrence,” in Proceedings of the 18th ACM conference on Information and knowledge management, ser. CIKM ’09. New York, NY, USA: ACM, 2009, pp. 573–582. [21] Zaki, “Origami: A novel and effective approach for mining representative orthogonal graph patterns,” Stat. Anal. Data Min., vol. 1, no. 2, pp. 67–84, Jun. 2008. [22] H. He and A. Singh, “Graphrank: Statistical modeling and mining of significant subgraphs in the feature space,” in Data Mining, 2006. ICDM ’06. Sixth International Conference on, Dec 2006, pp. 885–890. [23] A. Inokuchi, H. Ikuta, and T. Washio, “GTRACE-RS: efficient graph sequence mining using reverse search,” CoRR, vol. abs/1110.3879, 2011. [24] M. Wang, X. Shang, D. Xie, and Z. Li, “Mining frequent dense subgraphs based on extending vertices from unbalanced ppi networks,” in Bioinformatics and Biomedical Engineering , 2009. ICBBE 2009. 
3rd International Conference on, June 2009, pp. 1–7. [25] B. Chandra and S. Bhaskar, “A new algorithm for graph mining,” in Neural Networks (IJCNN), The 2011 International Joint Conference on, July 2011, pp. 988–995. [26] E. Adar and C. R, “Managing uncertainty in social networks,” IEEE DATA ENGINEERING BULLETIN, vol. 30, p. 2007, 2007. [27] Linyua and T. Zhou, “Link prediction in complex networks: A survey,” Physica A: Statistical Mechanics and its Applications, vol. 390, no. 6, pp. 1150 – 1170, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S037843711000991X [28] L. D. Raedt, A. Kimmig, and H. Toivonen, “Problog: a probabilistic prolog and its application in link discovery,” in In Proceedings of 20th International Joint Conference on Artificial Intelligence. AAAI Press, 2007, pp. 2468–2473.

[29] A. Kimmig and L. De Raedt, “Local query mining in a probabilistic prolog,” in Proceedings of the 21st International Jont Conference on Artifical Intelligence, ser. IJCAI’09. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2009, pp. 1095–1100. [Online]. Available: http://dl.acm.org/citation.cfm?id=1661445.1661620 [30] S. Skhiri and S. Jouili, “Large graph mining: Recent developments, challenges and potential solutions,” in Business Intelligence - Second European Summer School, eBISS 2012, Brussels, Belgium, July 15-21, 2012, Tutorial Lectures, 2012, pp. 103–124. [31] Z. Zou, H. Gao, and J. Li, “Discovering frequent subgraphs over uncertain graph databases under probabilistic semantics,” in Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’10. New York, NY, USA: ACM, 2010, pp. 633–642. [32] Z. Zou, J. Li, H. Gao, and S. Zhang, “Mining frequent subgraph patterns from uncertain graph data,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, pp. 1203–1218, 2010. [33] C.-K. Chui, B. Kao, and E. Hung, “Mining frequent itemsets from uncertain data,” in Proceedings of the 11th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, ser. PAKDD’07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 47–58. [34] S. Jamil, A. Khan, Z. Halim, and A. Baig, “Weighted muse for frequent sub-graph pattern finding in uncertain dblp data,” in Internet Technology and Applications (iTAP), 2011 International Conference on, Aug 2011, pp. 1–6. [35] Z. Zou, J. Li, H. Gao, and S. Zhang, “Finding top-k maximal cliques in an uncertain graph,” in Data Engineering (ICDE), 2010 IEEE 26th International Conference on, March 2010, pp. 649–652. [36] J. Li, Z. Zou, and H. Gao, “Mining frequent subgraphs over uncertain graph databases under probabilistic semantics,” The VLDB Journal, vol. 21, no. 6, pp. 753–777, 2012. [37] O. Papapetrou, E. Ioannou, and D. Skoutas, “Efficient discovery of frequent subgraph patterns in uncertain graph databases,” in Proceedings of the 14th International Conference on Extending Database Technology, ser. EDBT/ICDT ’11. New York, NY, USA: ACM, 2011, pp. 355–366. [38] Z. Zou, J. Li, H. Gao, and S. Zhang, “Mining frequent subgraph patterns from uncertain graph data,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 9, pp. 1203–1218, 2010. [39] D. Kempe, J. Kleinberg, and E. Tardos, “Maximizing the spread of influence through a social network,” in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ser. KDD ’03. New York, NY, USA: ACM, 2003, pp. 137–146. [Online]. Available: http://doi.acm.org/10.1145/956750.956769 [40] S. Benferhat, D. Dubois, S. Kaci, and H. Prade, “Modeling positive and negative information in possibility theory,” International Journal of Information Systems (IJIS), vol. 23, no. 10, pp. 1094–1118, oct 2008.
