MARGIN: Maximal Frequent Subgraph Mining
Lini T Thomas
Satyanarayana R Valluri
Kamalakar Karlapalem
ABSTRACT
The exponential number of possible subgraphs makes the problem of frequent subgraph mining a challenge. Maximal frequent subgraph mining has triggered much interest since the set of maximal frequent subgraphs is much smaller than the set of frequent subgraphs. We propose an algorithm that mines the maximal frequent subgraphs while pruning the lattice space considerably. This reduces the number of isomorphism computations, which are the kernel of all frequent subgraph mining problems. Experimental results validate the utility of the technique proposed.
1. INTRODUCTION
Discovering interesting patterns in large datasets has a wide range of applications. Data mining techniques are applied to solve complex problems in a variety of domains. Graphs are ubiquitous in nature and many real-life applications in drug discovery, link analysis, semantic web, social networks, VLSI, etc., can be modelled using graphs. In the area of drug discovery, chemical compounds are modelled as graphs and graph mining is used in classification [12] and to find frequent substructures [11]. Interesting patterns like web communities can be mined using the link information when the web is seen as a graph [19]. Many applications require the computation of maximal frequent subgraphs, such as mining contact maps [14], finding maximal frequent patterns in metabolic pathways [20], and finding large cohesive web pages.
For a graph of $n$ edges, the number of possible frequent subgraphs could be as large as $2^n$. Thus, frequent graph mining requires finding an exponential number of subgraphs. As the core operation of subgraph isomorphism testing is NP-complete, it is critical to minimise the number of subgraphs that need to be considered. In this paper, we propose a technique that mines the maximal frequent subgraphs of a graph database. The set of maximal frequent subgraphs is significantly smaller than the set of frequent subgraphs [16], thus providing scope for ample pruning of the exponentially large search space.
A typical approach to the frequent subgraph mining problem has been to find frequent subgraphs incrementally in an apriori manner. Canonical labeling has been adopted to reduce the number of times each candidate is generated [29, 15]. The apriori based approach has been further modified to suit closed subgraph mining [30] and maximal subgraph mining [16] with added pruning.
In this paper, we propose the Margin algorithm to find maximal frequent subgraphs. The set of candidate subgraphs which are likely to be maximally frequent is the set of $k$-edge frequent subgraphs that have a $(k+1)$-edge infrequent supergraph. In this paper we refer to such a set as the $f(\dagger)$-nodes (Figure 1). The Margin algorithm computes such a candidate set efficiently. By a post-processing step it then finds all maximally frequent subgraphs. The ExpandCut step invoked within the Margin algorithm recursively finds the candidate subgraphs. The search space of apriori based algorithms corresponds to the region below the $f(\dagger)$-nodes in the graph lattice as shown in Figure 1. On the other hand, Margin explores a much smaller search space by visiting the lattice around the $f(\dagger)$-nodes.
[Figure 1: Search Space Explored. The graph lattice with the $f(\dagger)$-nodes, the apriori based search space and the Margin search space.]
Web pages that are related can be determined based on the information about the set of web pages that are visited by users in one session. Each sequence of pages visited can be modelled as a graph. Given a set of such graphs, the cumulative behaviour of the web users can be mined by finding the maximal frequent subgraphs. Each maximal frequent subgraph might denote the cohesive set of topics that are of interest to most of the visitors of the website. We have run experiments on a dataset of page views of msnbc.com and reported the maximal cohesive set of topics (see section 4).
We compared the performance of the Margin algorithm with that of gSpan [29]. The experimental results show that our algorithm performs up to 20 times faster than gSpan for certain datasets. We generated all frequent subgraphs from the maximal frequent subgraphs obtained by Margin and cross-validated the results with those of gSpan. Table 1 shows the running time of the Margin algorithm as compared to the gSpan algorithm when executed on synthetic datasets of various sizes with support value 20. The description of the dataset and more results are discussed in section 4.
Size          200     300     400     500      600      700      800      900      1000
Margin        32.73   44.51   45.97   64.23    90.43    132.1    156.07   183.78   265.43
gSpan         616.56  932.43  1222.2  1409.55  1532.23  1654.32  1839.38  2432.43  2842.42
gSpan/Margin  18.84   20.95   26.59   21.94    16.94    12.52    11.79    13.24    10.71
Table 1: Running time with support 20
The main contributions of our paper are as follows. A novel technique to find maximal frequent subgraphs is presented. An algorithm based on this technique is designed and its correctness is proved. The viability of this technique in efficiently finding maximal frequent subgraphs is shown through experimental results on both synthetic and real-life data sets. In section 2, we develop the formalism used in the paper. In section 3 we present the Margin algorithm. We report our performance results in section 4, discuss the related work in section 5 and conclude our study in section 6.
[Figure 2: Graph Database $D = \{G_1, G_2\}$.]
2. PRELIMINARY CONCEPTS
In this section we first provide the necessary background and notation. We adopt the same definitions as in [29] for Labeled Graph, Isomorphism and Frequent Subgraphs.
Definition 1. Labeled Graph: A labeled graph can be represented as a tuple $G = (V, E, L, l)$, where $V$ is a set of vertices, $E \subseteq V \times V$ is a set of edges, $L$ is a set of labels, and $l : V \cup E \rightarrow L$ is a function assigning labels to the vertices and edges.
Definition 2. Isomorphism, Subgraph Isomorphism: An isomorphism is a bijective function $f : V(G) \rightarrow V(G')$ such that
$\forall u \in V(G),\ l_G(u) = l_{G'}(f(u))$, and
$\forall (u, v) \in E(G),\ (f(u), f(v)) \in E(G')$ and $l_G(u, v) = l_{G'}(f(u), f(v))$.
[Figure 3: Lattice $L_i$. The lattices $L_1$ and $L_2$ of $G_1$ and $G_2$ for MinSup = 2, arranged in levels 0 to 3, with the frequency count of each node and the cuts marked.]
Size 200 300 400 500 600 700 800 900 1000
A subgraph isomorphism from $G$ to $G'$ is an isomorphism from $G$ to a subgraph of $G'$.
For ease of presentation, in this paper we assume undirected connected labeled graphs. We denote the relationship "subgraph of" using $\subseteq$. $D = \{G_1, G_2, \ldots, G_n\}$ is a database of graphs. Given a graph database $D$ and a minimum support $minSup$, let
$$\varsigma(g, G) = \begin{cases} 1 & \text{if } g \text{ is isomorphic to a subgraph of } G, \\ 0 & \text{if } g \text{ is not isomorphic to a subgraph of } G, \end{cases} \qquad \sigma(g, D) = \sum_{G_i \in D} \varsigma(g, G_i).$$
$\sigma(g, D)$ denotes the occurrence frequency of $g$ in $D$, i.e., the support of $g$ in $D$. A frequent subgraph is a graph $g$ such that $\sigma(g, D)$ is greater than or equal to $minSup$. Let $F = \{g_1, g_2, \ldots, g_n\}$ be the set of frequent subgraphs of $D$ for a given support $minSup$. A graph $g \in F$ is said to be maximal if there exists no graph $g' \in F$, $g' \neq g$, such that $g \subseteq g'$. Let $MF \subseteq F$ be the set of maximal frequent subgraphs of $D$. Given the graph database $D$ and a minimum support $minSup$, the problem of maximal frequent subgraph mining is to compute all maximal frequent subgraphs in $D$. We conceptualise the search space for finding $MF$ in the form of a graph lattice.
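As an illustration of the support definition, the following minimal Python sketch counts $\sigma(g, D)$ by brute force. The graph encoding (a vertex-label dictionary plus an edge list), the function names `contains` and `support`, and the toy two-graph database are assumptions made for this example only. Edge labels are ignored, and the exhaustive matching is exponential, so it is meant for very small graphs.

```python
from itertools import permutations

def contains(g, G):
    """Return 1 if labeled graph g is isomorphic to a subgraph of G, else 0.

    A graph is a pair (labels, edges): labels maps vertex -> label and
    edges is a list of undirected (u, v) pairs. Edge labels are ignored.
    """
    g_labels, g_edges = g
    G_labels, G_edges = G
    G_edge_set = {frozenset(e) for e in G_edges}
    g_vertices = list(g_labels)
    # Try every injective mapping of g's vertices into G's vertices.
    for image in permutations(G_labels, len(g_vertices)):
        f = dict(zip(g_vertices, image))
        labels_match = all(g_labels[v] == G_labels[f[v]] for v in g_vertices)
        edges_match = all(frozenset((f[u], f[v])) in G_edge_set
                          for u, v in g_edges)
        if labels_match and edges_match:
            return 1
    return 0

def support(g, D):
    """Occurrence frequency of g in D: the number of graphs containing g."""
    return sum(contains(g, G) for G in D)

# Hypothetical toy database, chosen for illustration.
G1 = ({0: "c", 1: "a", 2: "c"}, [(0, 1), (1, 2)])
G2 = ({0: "a", 1: "c", 2: "b"}, [(0, 1), (1, 2)])
D = [G1, G2]
a_c = ({0: "a", 1: "c"}, [(0, 1)])   # single edge a-c
c_b = ({0: "c", 1: "b"}, [(0, 1)])   # single edge c-b
min_sup = 2
print(support(a_c, D) >= min_sup)    # is a-c frequent?
print(support(c_b, D) >= min_sup)    # is c-b frequent?
```

Counting each database graph at most once, however many embeddings it contains, mirrors the definition of $\varsigma$ above.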
[Figure 4: Upper-$\Diamond$-property. A node $P$ with children $C_i$ and $C_j$, obtained by adding edges $e_1$ and $e_2$ respectively, and their common child $A$.]
Lattice $L_i$ is a graph lattice of graph $G_i \in D$. Every node in the lattice is the embedding of a connected subgraph of $G_i$. Every embedding of a subgraph in $G_i$ occurs exactly once in the lattice. The bottom-most node corresponds to the empty subgraph $\phi$ and the top-most node corresponds to $G_i$. A node $C$ is a child of the node $P$ in the lattice $L_i$ if $P \subseteq C$ and $P$ and $C$ differ by exactly one edge. The node $P$ is a parent of such a node $C$ in the lattice. All single node subgraphs are the children of the node $\phi$. $\phi$ is thus the parent of all the single node subgraphs. An edge exists in the lattice between every pair of child and parent nodes.
PROOF. Let $e_1$ and $e_2$ be the edges incident on the vertices of $P$ in $C_i$ and $C_j$ respectively. Let $C_i = P \cup \{e_1\}$ and $C_j = P \cup \{e_2\}$ as shown in Figure 4. Hence $e_2$ would be incident on $P$ in $C_i$ and $e_1$ would be incident on $P$ in $C_j$. Let $A = C_i \cup \{e_2\}$. Then $A = P \cup \{e_1\} \cup \{e_2\} = P \cup \{e_2\} \cup \{e_1\} = C_j \cup \{e_1\}$. Hence, $A$ is the common child of $C_i$ and $C_j$, proving the Upper-$\Diamond$-property.
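The argument can be checked on a small instance. In the sketch below, subgraph embeddings are modelled as frozensets of edges; the function name `common_child` and the sample edges are illustrative assumptions, not part of the paper.

```python
def common_child(p, e1, e2):
    """Given node p and two edges e1, e2 incident on p, return (Ci, Cj, A).

    Ci = p + e1 and Cj = p + e2 are children of p. A extends Ci by e2,
    which is the same node as Cj extended by e1.
    """
    ci = p | {e1}
    cj = p | {e2}
    a = ci | {e2}
    # Set union is commutative, so both extension orders reach the same node.
    assert a == cj | {e1}
    return ci, cj, a

# Hypothetical example: p is the single edge (0, 1).
p = frozenset({(0, 1)})
ci, cj, a = common_child(p, (1, 2), (1, 3))
print(sorted(a))
```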
Example: Consider $D = \{G_1, G_2\}$ in Figure 2. To keep the example simple, we assume that all the edge labels are identical and hence not shown in the figure. The corresponding lattices $L_1$ and $L_2$ of $G_1$ and $G_2$ respectively are given in Figure 3. The bottom-most node corresponds to the empty subgraph $\phi$ and the top-most nodes correspond to the graphs $G_i \in D$. The children of a node $P$ in the lattice denote all the supergraphs of $P$ that can be obtained by extending $P$ by one edge. For instance the child of the subgraph in $L_1$ is subgraph (by adding the edge ). Similarly, subgraphs and are the parents of the subgraph in $L_2$. The subgraphs and occur twice in $L_1$ since there are two embeddings of and of in $G_1$.
3. OUR APPROACH
In subsection 3.1 we discuss the lattice models; in 3.2 we provide the intuition behind the algorithm proposed to find the maximal frequent subgraphs. In subsection 3.3 we present the Margin algorithm, followed by the proof in the final subsection 3.4.
3.1 Lattice Representation Models
Definition 3. Cut: A cut between two nodes in a lattice $L_i$, represented by $(C \dagger P)$, is defined as an ordered pair $(C, P)$ where $C$ is the child of the node $P$ for $C, P \in L_i$, and $C$ is not frequent while $P$ is frequent. The cut $(C \dagger P)$ belongs to a graph $G_i \in D$ if $C \subseteq G_i$. The frequent subgraph of a cut is represented by $f(\dagger)$ (frequent-$\dagger$) and the infrequent subgraph is represented by $I(\dagger)$ (infrequent-$\dagger$).
Example: Consider Figure 3 with $minSup = 2$. The pair ( , ) in $L_2$ is marked as a cut since the subgraph is frequent with a count of 2 while its child is infrequent with a count of 1. The child is the $I(\dagger)$-node and the parent is the $f(\dagger)$-node. Figure 3 shows the frequency count of each node in the example lattice along with all the existing cuts in the lattices $L_1$ and $L_2$ respectively.
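To make the definition of a cut concrete, the sketch below enumerates every cut of a small lattice by brute force. It is an illustration under simplifying assumptions, not the paper's method: nodes are connected edge sets of one graph (the $\phi$ and single-vertex levels are left out), and frequency is given by a hypothetical oracle `is_frequent` instead of being counted over a database. The names `is_connected`, `all_cuts`, `GRAPH`, `A` and `B` are invented for the example.

```python
from itertools import combinations

def is_connected(edges):
    """True if the non-empty collection of edges forms one connected subgraph."""
    edges = list(edges)
    if not edges:
        return False
    reached = set(edges[0])          # vertices reached so far
    pending = set(edges[1:])         # edges not yet attached
    grew = True
    while grew:
        grew = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.discard(e)
                grew = True
    return not pending

def all_cuts(graph, is_frequent):
    """Every (child, parent) pair with an infrequent child and a frequent parent."""
    nodes = [frozenset(c)
             for r in range(1, len(graph) + 1)
             for c in combinations(sorted(graph), r)
             if is_connected(c)]
    return {(c, p) for c in nodes for p in nodes
            if len(c) == len(p) + 1 and p < c
            and is_frequent(p) and not is_frequent(c)}

# Hypothetical path graph 0-1-2-3 and a downward-closed frequency oracle.
GRAPH = frozenset({(0, 1), (1, 2), (2, 3)})
A = frozenset({(0, 1), (1, 2)})
B = frozenset({(2, 3)})
is_frequent = lambda s: s <= A or s <= B

cuts = all_cuts(GRAPH, is_frequent)
f_nodes = {p for _, p in cuts}       # the frequent sides of the cuts
```

Such an exhaustive enumeration is only feasible for tiny graphs; it is useful mainly as a reference against which a cut-following procedure can be checked.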
We observe that the following property holds in every lattice $L_i$ of $G_i \in D$, which we exploit in our algorithm.
Property 1. Upper Diamond property (Upper-$\Diamond$-property): Any two children $C_i$, $C_j$ of a node $P$, where $C_i, C_j \in L_i$ for $G_i \in D$, will have a common child.
For a given graph $G$, the size of the graph (denoted by $|G|$) refers to the number of edges present in $G$. All the subgraphs of equal size form a level in the lattice $L_i$ of $G_i$. The node corresponding to $\phi$ forms level 0, singleton vertex graphs form level 1, and the nodes of size $k$ form level $k + 1$ for $k \geq 1$ (Figure 3).
[Figure 5: Embedded Model. Each node of the common lattice $L$ lists the embeddings of its subgraph in $G_1$ and $G_2$.]
The lattice structure for the Margin algorithm can be represented using two models, namely the replication model and the embedded model.
In the replication model (Figure 3), for each subgraph $g \subseteq G_i$, $G_i \in D$, each embedding of $g$ is represented by a separate node in the lattice $L_i$ of $G_i$. Hence if a subgraph $g \subseteq G_i$ occurs $k$ times in $G_i$ then there are $k$ nodes of $g$ in the lattice $L_i$.
In the embedded model, a common lattice $L$ of the database $D$ is constructed where all isomorphic forms of any subgraph of the graphs in $D$ are represented exactly once. Each node in the lattice $L$ stores the subgraph $g$ it represents along with all the embeddings of $g$ in the database $D$. The bottom-most node corresponds to the empty subgraph $\phi$ and the top-most nodes correspond to the graphs in $D$. The embedded model of the graph database in Figure 2 is shown in Figure 5.
In the rest of the paper we proceed using the replication model unless mentioned otherwise. With minor modifications the Margin algorithm can be applied to the embedded model.
3.2 Intuition
The set of candidate subgraphs that are likely to be maximal frequent are the $f(\dagger)$-nodes. This is because they are frequent subgraphs having an infrequent child. The remaining nodes of the lattice cannot be maximal frequent subgraphs. In this paper, we present an approach that avoids traversing the lattice bottom up and instead traverses the cuts alone in each lattice $L_i$ for $G_i \in D$. We prune the set of $f(\dagger)$-nodes to give the set of maximal frequent subgraphs. The Margin algorithm, unlike the apriori based algorithms, goes directly to any one of the $f(\dagger)$-nodes of the lattice $L_i$ and then finds all other $f(\dagger)$-nodes by cutting across the lattice $L_i$. We give an insight below into the approach developed.
Finding the initial $f(\dagger)$-node of our result set is a trivial dropping of edges from the initial graph $G_i \in D$, ensuring that the resulting subgraph is connected, until we find the first frequent subgraph $R_i$. Our initial cut is thus $(CR_i \dagger R_i)$ where $CR_i$ is the infrequent child of $R_i$. We devise an algorithm ExpandCut which, for each cut that belongs to $G_i \in D$, recursively extends the cut to explore all cuts in $G_i$.
Step 1: The node $C$ in lattice $L_i$ can have many parents that are frequent or infrequent, one of which is $P$. Consider the frequent parent $P_1$ in Figure 6(b). Cut $(C \dagger P_1)$ exists since $P_1$ is frequent while $C$ is infrequent. Thus, for an initial cut $(C \dagger P)$, all frequent parents of $C$ are reported as $f(\dagger)$-nodes.
Step 2: Consider all the children $C_1$, $C_2$ of any frequent parent $P_1$ of $C$ as in Figure 6(c). Each of them can be frequent or infrequent.
(a): Consider an infrequent child $C_2$. Cut $(C_2 \dagger P_1)$ exists since $P_1$ is frequent while $C_2$ is infrequent. Thus, for an initial cut $(C \dagger P)$, for each frequent parent $P_i$ of $C$ that has an infrequent child $C_i$, the cut $(C_i \dagger P_i)$ is reported.
(b): Consider a frequent child $C_1$. By the Upper-$\Diamond$-property, the nodes $C$ and $C_1$ have a common child $M$. $M$ is infrequent as its parent $C$ is infrequent. Hence, cut $(M \dagger C_1)$ exists. Thus, for an initial cut $(C \dagger P)$, for each frequent parent $P_i$ of $C$ consider each of its frequent children $C_i$. The cut $(M \dagger C_i)$ is reported where $M$ is the common child of $C$ and $C_i$.
Step 3: Consider all parents $S_1$, $S_2$, $S_3$ of an infrequent parent $P_2$ of $C$ as in Figure 6(d). Each such parent can be frequent or infrequent. Consider frequent parents $S_1$, $S_3$ (Figure 6(d)) of an infrequent parent $P_2$ of $C$. Hence, the cuts $(P_2 \dagger S_1)$ and $(P_2 \dagger S_3)$ exist. However, if step 1 is called on the cut $(P_2 \dagger S_1)$, the cut $(P_2 \dagger S_3)$ is found. Thus, for an initial cut $(C \dagger P)$, for each infrequent parent $P_i$ of $C$, consider any one frequent parent $S_i$ of $P_i$. ExpandCut is invoked on the cut $(P_i \dagger S_i)$.
The three steps above give the intuition behind the ExpandCut algorithm, which finds the nearby cuts given an initial cut $(C \dagger P)$ (Figure 6(a)) in the lattice $L_i$ of $G_i$. Recursively invoking ExpandCut on each newly found cut with these three steps finds all cuts in $G_i$.
[Figure 6: ExpandCut. Panels (a) to (e) show the initial cut, the three steps and the worked example; frequent and infrequent subgraphs are marked.]
Example: Consider a section of the lattice as illustrated in Figure 6(e). Let cut $(C \dagger P)$ be the initial cut. Node $C$ has three parents $P$, $P_1$, $P_2$, of which subgraphs $P$, $P_1$ are frequent and $P_2$ is infrequent.
Step 1: Among the parents of $C$, consider the frequent parents $P$ and $P_1$. All frequent parents are reported as $f(\dagger)$-nodes as their child $C$ is infrequent.
Step 2: $P_1$ has three children $C$, $C_1$ and $C_2$ while $P$ has two children $C$ and $C_3$. In step 2(a), considering the frequent parent $P_1$, the cuts $(C \dagger P_1)$ and $(C_2 \dagger P_1)$ are reported. Considering the frequent parent $P$, the cuts $(C \dagger P)$ and $(C_3 \dagger P)$ are reported. In step 2(b), considering the frequent parent $P_1$, the cut $(M \dagger C_1)$ is found. $P$ does not have any frequent children and hence does not report any cuts in this step.
Step 3: The infrequent parent $P_2$ of $C$ is considered and the cuts $(P_2 \dagger S_1)$ and $(P_2 \dagger S_3)$ are reported.
All the cuts found by the first iteration of the ExpandCut function are marked in Figure 6(e). The algorithm ExpandCut is recursively called for each cut that is found. Hence, due to ExpandCut being invoked on the cut $(M \dagger C_1)$, applying step 2(b) a further cut is found. Similarly, due to ExpandCut being invoked on cut $(P_2 \dagger S_1)$, applying step 1, the cut $(P_2 \dagger S_3)$ is found and, applying step 3, the cuts $(S_2 \dagger T_1)$ and $(S_2 \dagger T_2)$ are found. The proof that the algorithm finds all the cuts in the lattice is given in section 3.4. The detailed algorithm ExpandCut is given in section 3.3. To the best of our knowledge, no such approach has been proposed so far in the literature of maximal frequent subgraph mining.
3.3 The Margin Algorithm
Table 2 shows the Margin algorithm to find the globally maximal frequent subgraphs $MF$. Initially, $MF = \emptyset$ (line 1) and the graphs in $D$ are unexplored. $LF$ is the set of locally maximal subgraphs in each $G_i$, which is initially empty (line 3). A representative $R_i$ of $G_i$ is the first $f(\dagger)$-node found among the subgraphs of $G_i$. Initially, given the graphs $D = \{G_1, G_2, \ldots, G_n\}$, for each $G_i \in D$ we find the representative $R_i$ of $G_i$ (line 4). This is done by iteratively dropping an edge from $G_i$ until a connected frequent subgraph is found. The ExpandCut algorithm finds the nearby cuts and recursively calls ExpandCut on each newly found cut. The algorithm functions in a manner that finding one cut in $G_i \in D$ would find all cuts in $G_i$, as shall be proved in section 3.4. In line 6 the globally maximal frequent subgraphs $MF$ found so far are merged with the local maximal frequent subgraphs $LF$ found in $G_i$. The merge function finds the new globally maximal set by removing all subgraphs that are not maximal due to the subgraphs in $LF$, and adds subgraphs from $LF$ that are globally maximally frequent so far including $G_i$.
Input: Graph database $D = \{G_1, G_2, \ldots, G_n\}$
Output: Set of maximal frequent graphs $MF$
Algorithm:
1. $MF = \emptyset$
2. For each $G_i \in D$ do
3.   $LF = \emptyset$
4.   Find the representative $R_i$ of $G_i$
5.   ExpandCut($LF$, $CR_i \dagger R_i$) where $CR_i$ is the infrequent child of $R_i$
6.   Merge($MF$, $LF$)
Table 2: Algorithm Margin
Input: $LF$: the maximal frequent subgraphs seen so far in $G_i$. Cut: $C \dagger P$.
Output: The updated set of maximal frequent subgraphs $LF$.
Algorithm:
1. Let $Y_1, Y_2, \ldots, Y_n$ be the parents of $C$
2. for each $Y_i$, $i = 1, \ldots, n$ do
3.   if $Y_i$ is frequent
4.     $LF = LF \cup Y_i$
5.     for each child $CY_i$ of $Y_i$ do
6.       if $CY_i$ is infrequent do
7.         ExpandCut($LF$, $CY_i \dagger Y_i$)
8.       if $CY_i$ is frequent do
9.         Find the common child $M$ of $C$ and $CY_i$
10.        ExpandCut($LF$, $M \dagger CY_i$)
11.  if $Y_i$ is infrequent
12.    if one frequent parent $PY_i$ of $Y_i$ exists
13.      ExpandCut($LF$, $Y_i \dagger PY_i$)
Table 3: Algorithm ExpandCut($LF$, $C \dagger P$)
Table 3 shows the ExpandCut algorithm, which expands a given cut such that its neighboring cuts will be explored. The inputs to the algorithm are the set of maximal frequent subgraphs $LF$ found so far (initially empty) and the cut $C \dagger P$. For each parent $Y_i$ of $C$, if $Y_i$ is frequent, $Y_i$ is added to $LF$ (lines 3-4). For each infrequent child $CY_i$ of $Y_i$, ExpandCut is called on the cut $(CY_i \dagger Y_i)$ (lines 6-7). For each frequent child $CY_i$ of $Y_i$, let $M$ be the common child of $C$ and $CY_i$. ExpandCut is called on the cut $(M \dagger CY_i)$ (lines 8-10). On the other hand, if $Y_i$ is infrequent and there exists at least one frequent parent of $Y_i$, then ExpandCut is called on the cut $(Y_i \dagger PY_i)$ where $PY_i$ is one frequent parent of $Y_i$ (lines 11-13).
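The recursion of Table 3 can be sketched in Python for a single graph. This is a simplified illustration under stated assumptions rather than the authors' implementation: nodes are connected edge sets of one graph, the graph itself is infrequent, every single edge is frequent (so the $\phi$ and single-vertex levels are never reached), and frequency comes from a hypothetical oracle `is_frequent`. All function names, the explicit `cuts` set used to skip already expanded cuts, and the toy data are assumptions of this sketch.

```python
def is_connected(edges):
    """True if the non-empty collection of edges forms one connected subgraph."""
    edges = list(edges)
    if not edges:
        return False
    reached, pending, grew = set(edges[0]), set(edges[1:]), True
    while grew:
        grew = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.discard(e)
                grew = True
    return not pending

def parents(node):
    """Connected subgraphs obtained by dropping one edge."""
    return [node - {e} for e in node if is_connected(node - {e})]

def children(node, graph):
    """Connected supergraphs inside graph obtained by adding one incident edge."""
    vertices = {v for e in node for v in e}
    return [node | {e} for e in graph - node if vertices & set(e)]

def find_initial_cut(graph, is_frequent):
    """Drop edges from the infrequent graph until a frequent parent appears."""
    child = graph
    while True:
        parent = parents(child)[0]
        if is_frequent(parent):
            return child, parent
        child = parent

def expand_cut(cut, graph, is_frequent, cuts, f_nodes):
    """Collect all cuts reachable from cut = (C, P), following Table 3."""
    if cut in cuts:                       # already expanded
        return
    cuts.add(cut)
    c, _ = cut
    for y in parents(c):
        if is_frequent(y):
            f_nodes.add(y)                # lines 3-4
            for k in children(y, graph):
                if not is_frequent(k):    # lines 6-7
                    expand_cut((k, y), graph, is_frequent, cuts, f_nodes)
                else:                     # lines 8-10: m is the common child
                    expand_cut((c | k, k), graph, is_frequent, cuts, f_nodes)
        else:                             # lines 11-13: one frequent parent
            for s in parents(y):
                if is_frequent(s):
                    expand_cut((y, s), graph, is_frequent, cuts, f_nodes)
                    break

def margin_single_graph(graph, is_frequent):
    """Return (locally maximal frequent edge sets, all cuts) for one graph."""
    cuts, f_nodes = set(), set()
    start = find_initial_cut(graph, is_frequent)
    expand_cut(start, graph, is_frequent, cuts, f_nodes)
    maximal = {f for f in f_nodes if not any(f < g for g in f_nodes)}
    return maximal, cuts

# Hypothetical path graph 0-1-2-3 with a downward-closed frequency oracle.
GRAPH = frozenset({(0, 1), (1, 2), (2, 3)})
A = frozenset({(0, 1), (1, 2)})
B = frozenset({(2, 3)})
maximal, cuts = margin_single_graph(GRAPH, lambda s: s <= A or s <= B)
```

Within one lattice the maximality check reduces to a proper-subset test on edge sets; merging results across the graphs of $D$, as the Merge step does, would additionally require subgraph isomorphism tests.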
3.3.1 Further Optimisations
Each invocation of the ExpandCut algorithm finds a list of cuts. ExpandCut is recursively invoked on each newly found cut. Optimisations to reduce the number of revisited cuts are essential in order to reduce the computation time spent on verifying whether a cut has been visited previously. We discuss these optimisations separately in order to keep the main algorithm simple. A few of the optimisations that help speed up the algorithm are given below.
Let the cut $(C \dagger P)$ be the cut on which ExpandCut has been invoked. Line 1 of the ExpandCut algorithm computes the parents of $C$. Consider a frequent parent $Y_i$ of $C$. If there exists at least one frequent child $CY_i$ of $Y_i$, then by the Upper-$\Diamond$-property, there exists a node $M$ which is the common child of $CY_i$ and $C$. $C$ being infrequent, $M$ is infrequent. It can be shown that lines 5-10 of the ExpandCut algorithm, which iterate over all the children of $Y_i$, can be replaced by calling ExpandCut on just one cut $(M \dagger CY_i)$. This reduces the number of revisited cuts. Also, a single frequent child $CY_i$ needs to be generated instead of all the children of $Y_i$ in line 5.
Consider an invocation of ExpandCut on a cut $(C \dagger P)$. Lines 11-13 of the algorithm check for infrequent parents $Y_i$ of $C$. If $Y_i$ is found among the set of infrequent graphs already visited, then the ExpandCut invoked on cut $(C \dagger P)$ skips executing lines 12-13 on $Y_i$.
Consider an invocation of ExpandCut on cut $(C \dagger P)$. The parents $Y_1, \ldots, Y_n$ of $C$ are computed in line 1. For some $Y_i$, $Y_i = P$ since $P$ is a parent of $C$. Therefore, in line 5 all the children $CP_1, \ldots, CP_m$ of $P$ are computed and explored. If any child $CP_j$ is infrequent, then ExpandCut is invoked on cut $(CP_j \dagger P)$. In the invocation of ExpandCut on the cut $(CP_j \dagger P)$, the children of $P$ are recomputed and revisited. Since the children of $P$ are already explored in the invocation of ExpandCut on the cut $(C \dagger P)$, their reexploration can be avoided in the invocation of ExpandCut on cut $(CP_j \dagger P)$ by passing the appropriate information.
[Figure 7: Finding $f(\dagger)$-nodes. The supergraph and subgraph spaces around a cut in the lattice.]
Frequent subgraphs, wherein the count value of each frequent subgraph need not be reported, can be obtained by reporting all subgraphs of $MF$ or the $f(\dagger)$-nodes. Isomorphism computations are required only in order to eliminate duplicates from the subgraphs generated from $MF$. On the other hand, the apriori based approach computes the frequency for each generated subgraph. This requires finding isomorphic forms of each generated subgraph in the entire graph database. Furthermore, techniques in [24] can be adapted to provide an approximate frequency count value for the frequent subgraphs.
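The first point, generating the frequent subgraphs from the maximal ones, can be sketched as follows. The sketch is illustrative only and works under a simplifying assumption: subgraphs are connected edge sets within a single lattice, so duplicates are removed by plain set equality, whereas across a database the duplicate check would need isomorphism tests as noted above. The function names and the sample maximal sets are hypothetical.

```python
from itertools import combinations

def is_connected(edges):
    """True if the non-empty collection of edges forms one connected subgraph."""
    edges = list(edges)
    if not edges:
        return False
    reached, pending, grew = set(edges[0]), set(edges[1:]), True
    while grew:
        grew = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.discard(e)
                grew = True
    return not pending

def frequent_from_maximal(maximal):
    """All connected, non-empty sub-edge-sets of the given maximal edge sets."""
    found = set()
    for m in maximal:
        for r in range(1, len(m) + 1):
            for c in combinations(sorted(m), r):
                if is_connected(c):
                    found.add(frozenset(c))   # the set drops duplicates
    return found

# Hypothetical maximal frequent edge sets that share the edge (1, 2).
MF = {frozenset({(0, 1), (1, 2)}), frozenset({(1, 2), (2, 3)})}
frequent = frequent_from_maximal(MF)
```

No support counting is involved here, which is why frequency values are not available for the generated subgraphs.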
3.4 Proof of correctness
In this subsection, we prove the correctness of the Margin algorithm. We first prove the essential claim, which trivially implies that having found one cut in $G_i \in D$ leads to all cuts in $G_i \in D$. Finding an initial cut in each $G_i \in D$ would thus find us all the cuts in $D$, which are pruned to compute the maximal frequent subgraphs. Note that $\phi$ is always considered frequent.
Claim 1. Given two cuts $(C_1 \dagger P_1)$ and $(C_2 \dagger P_2)$ in $G_i \in D$, invoking ExpandCut on one cut finds the other cut.
PROOF. Let $P_j = (V_j, E_j, L, l)$ for $j = 1, 2$. We consider a sub-lattice, constructed as described below, which is contained in the lattice $L_i$ of $G_i$. Case 1: If $P_1 \cap P_2 \neq \emptyset$, consider the largest common subgraph $t$ of $P_1$ and $P_2$ in lattice $L_i$ (Figure 8). The node $t$ is the lowermost node of the sub-lattice and hence is at level 0 of the sub-lattice. Note that
3.3.2 Discussion
Consider a cut $(C \dagger P)$. The bold lines in Figure 7 represent the lattice $L_1$ of the graph $G_1$. $P \subseteq G_1$ being frequent, all the parents of $P$ are frequent, due to which no $I(\dagger)$-node (hence no cut) can lie among the subgraphs of $P$ in the lattice $L_1$ in the figure. Similarly, all the supergraphs of $C$ being infrequent, no $f(\dagger)$-node (hence no cut) would lie among the supergraphs of $C$. Hence the subgraph space of $P$ and the supergraph space of $C$ can be pruned out of the candidate set for frequent maximal subgraphs. Hence, each cut found prunes our search space substantially. Margin traverses all the nodes on the cuts in the lattice $L_i$, along with the infrequent parents of the $I(\dagger)$-nodes and their infrequent parents. The remaining nodes on the lattice need not be explored. We prove in section 3.4 that our approach finds all the $f(\dagger)$-nodes in all the lattices $L_i$ of $G_i \in D$, which are finally pruned to get the maximal frequent subgraphs.
+Þ / &Ñ o Figure 8: Lattice 2FÛ6ÜÝ : o o the level 0 of 2FÛ6ÜÝ occurs at theo level "ur of lattice 2 since, as mentioned earlier, in lattice 2 , the level of a node is one added + to the adding an edge SZØAJ + size of the node. Extend " too d by ^A g in lattice 2 oÛ6ÜÝ to
oúùb+ "m incident on " . Extend d for + o o küû d , by taking an edge S Ø Jý d incident on d until oúùb+F& + + + o d . Extend in the lattice + / 2Û6ÜÝ to . Similary, extend o " tooúùb + by adding an edge ¡mØ Jþ "m incident ono " . Extend too / g ü k û for ^ by taking an edge ¡ Ø Jþ incident on oúùb+¦& / g o until . For every two nodes in level ^F of 2FÛ6ÜÝ which o o has a common parent in 2 Û6ÜÝ , construct the common child in 2 Û6ÜÝ . Such a common child always exists by the Fÿ~ - property. + Þ / &Ñ & -m57-.8 Case2: If , / consider +
+4 / *\o the shortest path V ? V ? that connects and in as in Figure 9. Hence, * o forms a connected path is a sequence o ( ] +-XWXWXWs- subgraph 0 of . A shortest ] oüù+ for ] of edges where edge ] is adjacent + + to & X X W W X W ^ ] ] is # , / is incident on an +¦edge & ] in+ and incident on an edge in . Let subgraph d ? Z S Ø where SZØ + + +7& ] ? ¡mØ is the edge of incident on/ ] . Similarly let ] where ¡mØ is the edge of incident on # . The lowermost o node of the lattice 2 Û6ÜÝ is | S at level 0 (Figure 10). Level 1 of the o lattice(2FÛ6ÜÝ ) consists of the set of vertices + 4 + , which are children ?Öd o ?> oüù+ forms the nodes of of the node |MS . The set of edges o the level 2 + of theo lattice 2 ÛhÜÝ . Extend d g to d o oúby ùb+¦taking & + an edge kúû S Ø Jx d incident on d for ^F . Extend until d + + o oúùb+ o in lattice 2Û6Ü/ Ý to o . Similarly, extend to by taking o oúùb+ an kúû edge ¡ Ø JU in incident on to form a subgraph o the lattice 2FÛ6ÜÝ . For every two nodes in level ^T which have a o common parent in the lattice 2FÛ6ÜÝ , construct the common child. Below is the proof for case 1 +which uses Figure 8 that also holds for / o case 2. All supergraphs of + and / in lattice 2 Û6ÜÝ are & infrequent +-XWXWXWX- [ while all subgraphs of and are frequent.+ Let
o 6 Û Ü Ý be the ordered sequence of supergraphs of in lattice such 2 + WWXW [ -XWXWsWXthat / . Let be the& ordered
o 2 & in lattice such that sequence of supergraphs of 6 Û Ü Ý + WsWXW E as illustrated in Figure 8. Any nodeo6- |S o in the lattice 2Û6ÜÝ is now referred to as a unique tuple ( w ) such
above claim+ by induction. Base Case: To [ prove that the claim + &C 7 holds for . + The + cut on the path - [D+ corresponds to the + ). & The +Xchild(
) of is infrequent. initial cut ( - [D+ . If7 [node Consider node + + - [D+ N D+ is frequent, cut - [EDathe is found by lines(8( N R × N ) on the path ! #" algorithm. If node is infrequent, cut 10) + of the + + - [EDa+ ( N 7 ) on the path is found by lines(6-7) of the Hence the next cut is found either on path + 7[D+ !#" algorithm. or path . Hence, conditions 1 and 2 of the claim are satisfied.
[Figure 9: Path. The shortest path connecting $P_1$ and $P_2$ in $G_i$.]
+ Þ / &zÑ o Figure 10: Lattice 2 Û6ÜÝ : o that is the smallest supergraph of in / and w - is the smallest supergraph of in . By construction, =( . ). The node o |MS in lattice 2 Û6ÜÝ is numbered differently. + Consider& the child o - [ /10 of | S such that / 0 is a subgraph of . Let / 0
N - for ^ 2] (see Figure 8). Node |S corresponds to tuples ( w [ ) some 3 & ^ -WXWXWX- ] (Figure 10) for 4 as and in the proof. x +Kwhen & +X-srequired WXWXWX5
Further we define the set of paths each o + oo - ùwhere + & of form (( g -WX6 J WXWX5 - is the union of edges Xw ),( 71w o )) for 4 T / 8 & 7 + X X W X W X W 9 7 [ / . Similary 5 J WWs5 - is the ooúùb+ - where each & g -XWX: ] union of edges of form (( Xw ),( Xw )) for ^ (see Figure 8). oo- [ Notice that I ^×( g ) is infrequent while ( + ) is frequent o g for each in . Hence, for each w J;5 - , o for 4 , there ex ists exactly one cut. Arguing similarly, ( ) is infrequent while I - o o g ) is frequent g in , ^ (< for . Hence, there lies one / I each 7 cut on each wx= . Thus,+ there+ are ] ; cuts in the J 5 , 4 o lattice / 2Û6ÜÝ / . Given the initial cut + ( + ¦ ), we are - [ to prove +s- that [ M F N 7 [ ) cut ( ) is reachable. Cut ( )=cut ( N o by the construction / / of the lattice 2 Û6ÜÝ and lies on the path . The . final cut ( ) to be reached lies on the path > The Rechability Sequence: + -sWXWXWXù [ $ 7 [ The order of paths ono which o + / the cuts from path to are found where ? J 5 or J?5 satisfies the following: 1. for w 2. for w
&
&;7
Û
J@5
+
,
Û
ùb+F&
/ 7 b ù +R& J@5 , Û Û
A 4 A 2 4
17 [ P ROOF . Since to , the initial +&B 7[ we traverse the pathsù$from [& . We prove the
and the final path path
+ -sWXWXWXInduction Step: Assuming that the claim holds for , F2 ] , we prove that the claim holds for ù+ . Let +X- /-XWXÛ WXWs-