MARGIN: Maximal Frequent Subgraph Mining

Lini T Thomas

Satyanarayana R Valluri

Kamalakar Karlapalem

ABSTRACT

The exponential number of possible subgraphs makes the problem of frequent subgraph mining a challenge. Maximal frequent mining has triggered much interest since the set of maximal frequent subgraphs is much smaller than the set of frequent subgraphs. We propose an algorithm that mines the maximal frequent subgraphs while pruning the lattice space considerably. This reduces the number of isomorphism computations, which are the kernel of all frequent subgraph mining problems. Experimental results validate the utility of the proposed technique.

1. INTRODUCTION

Discovering interesting patterns in large datasets has a wide range of applications. Data mining techniques are applied to solve complex problems in a variety of domains. Graphs are ubiquitous in nature and many real-life applications in drug discovery, link analysis, semantic web, social networks, VLSI, etc., can be modelled using graphs. In the area of drug discovery, chemical compounds are modelled as graphs and graph mining is used in classification [12] and to find frequent substructures [11]. Interesting patterns like web communities can be mined using link information when the web is seen as a graph [19]. Many applications require the computation of maximal frequent subgraphs, such as mining contact maps [14], finding maximal frequent patterns in metabolic pathways [20], and finding large cohesive web pages.

For a graph of e edges, the number of possible frequent subgraphs could be as large as 2^e. Thus, frequent graph mining requires finding an exponential number of subgraphs. As the core operation of subgraph isomorphism testing is NP-complete, it is critical to minimise the number of subgraphs that need to be considered. In this paper, we propose a technique that mines the maximal frequent subgraphs of a graph database. The set of maximal frequent subgraphs is significantly smaller than the set of frequent subgraphs [16], thus providing scope for ample pruning of the exponentially large search space.

A typical approach to the frequent subgraph mining problem has been to find frequent subgraphs incrementally in an apriori manner. Canonical labeling has been adopted to reduce the number of times each candidate is generated [29, 15]. The apriori based approach has been further modified to suit closed subgraph mining [30] and maximal subgraph mining [16] with added pruning.

In this paper, we propose the Margin algorithm to find maximal frequent subgraphs. The candidate subgraphs which are likely to be maximally frequent are the n-edge frequent subgraphs that have an (n+1)-edge infrequent supergraph. In this paper we refer to such a set as the f(†)-nodes (Figure 1). The Margin algorithm computes such a candidate set efficiently. By a post-processing step it then finds all maximally frequent subgraphs. The ExpandCut step invoked within the Margin algorithm recursively finds the candidate subgraphs. The search space of apriori based algorithms corresponds to the region below the f(†)-nodes in the graph lattice as shown in Figure 1. On the other hand, Margin explores a much smaller search space by visiting the lattice around the f(†)-nodes.

[Figure 1 labels: graph lattice, f(†)-nodes, Margin search space, apriori based search space]

Figure 1: Search Space Explored

Web pages that are related can be determined based on the information about the set of web pages that are visited by users in one session. Each sequence of pages visited can be modelled as a graph. Given a set of such graphs, the cumulative behaviour of the web users can be mined by finding the maximal frequent subgraphs. Each maximal frequent subgraph might denote a cohesive set of topics that are of interest to most of the visitors of the website. We have run experiments on a dataset of page views of msnbc.com and reported the maximal cohesive set of topics (see section 4).

We compared the performance of the Margin algorithm with that of gSpan [29]. The experimental results show that our algorithm performs up to 20 times faster than gSpan for certain datasets. We generated all frequent subgraphs from the maximal frequent subgraphs obtained by Margin and cross-validated the results with those of gSpan. Table 1 shows the running time of the Margin algorithm as compared to the gSpan algorithm when executed on synthetic datasets of various sizes with support value 20. The description of the dataset and more results are discussed in section 4.

Table 1: Running time with support 20

Size    Margin    gSpan      Ratio gSpan : Margin
200     32.73     616.56     18.84
300     44.51     932.43     20.95
400     45.97     1222.2     26.59
500     64.23     1409.55    21.94
600     90.43     1532.23    16.94
700     132.1     1654.32    12.52
800     156.07    1839.38    11.79
900     183.78    2432.43    13.24
1000    265.43    2842.42    10.71

The main contributions of our paper are as follows. A novel technique to find maximal frequent subgraphs is presented. An algorithm based on this technique is designed and its correctness is proved. The viability of this technique in efficiently finding maximal frequent subgraphs is shown through experimental results on both synthetic and real-life data sets. In section 2, we develop the formalism used in the paper. In section 3 we present the Margin algorithm. We report our performance results in section 4, discuss the related work in section 5 and conclude our study in section 6.

Figure 2: Graph Database D = {G1, G2}

2. PRELIMINARY CONCEPTS

In this section we first provide the necessary background and notation. We adopt the same definitions as in [29] for Labeled Graph, Isomorphism and Frequent Subgraphs.

Definition 1. Labeled Graph: A labeled graph can be represented as a tuple G = (V, E, L, l), where V is a set of vertices, E ⊆ V × V is a set of edges, L is a set of labels, and l : V ∪ E → L is a function assigning labels to the vertices and edges.

Definition 2. Isomorphism, Subgraph Isomorphism: An isomorphism is a bijective function f : V(G) → V(G') such that

  for all u ∈ V(G): l_G(u) = l_G'(f(u)), and
  for all (u, v) ∈ E(G): (f(u), f(v)) ∈ E(G') and l_G(u, v) = l_G'(f(u), f(v)).

A subgraph isomorphism from G to G' is an isomorphism from G to a subgraph of G'.

Figure 3: Lattice (lattices L1 and L2 of G1 and G2 with MinSup = 2; levels 0 to 3 and the cuts are marked)

For ease of presentation, in this paper we assume undirected connected labeled graphs. We denote the relationship "subgraph of" using ⊆_graph. D = {G_1, G_2, ..., G_n} is a database of n graphs. Given a graph database D of n graphs and a minimum support minSup, let

  σ(g, G) = 1 if g is isomorphic to a subgraph of G,
  σ(g, G) = 0 if g is not isomorphic to a subgraph of G,

  σ(g, D) = Σ_{G_i ∈ D} σ(g, G_i).

σ(g, D) denotes the occurrence frequency of g in D, i.e., the support of g in D. A frequent subgraph is a graph g such that σ(g, D) is greater than or equal to minSup. Let F = {g_1, g_2, ..., g_n} be the set of frequent subgraphs of D for a given support minSup. A graph g_i ∈ F is said to be maximal if there exists no graph g_j ∈ F such that g_i ⊂_graph g_j. Let MF ⊆ F be the set of maximal frequent subgraphs of D. Given the graph database D and a minimum support minSup, the problem of maximal frequent subgraph mining is to compute all maximal frequent subgraphs in D.

We conceptualise the search space for finding MF in the form of a graph lattice. In the embedded model of the lattice (section 3.1), the Upper Diamond property holds as follows: two graphs g_1 and g_2 having a common parent g have a common child in the lattice if at least one embedding of g_1 and one embedding of g_2 are extensions of the same embedding of g.

Figure 4: Upper-◇-property
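The support and maximality definitions of this section can be made concrete with a small sketch. It is only an illustration under stated assumptions: graphs carry vertex labels only, subgraph isomorphism is tested by brute force (suitable for toy graphs only), and the two database graphs and the candidate list are hypothetical examples, not data from the paper. The helper names `make_graph`, `occurs_in`, `support` and `maximal_frequent` are introduced here for illustration.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: vertex -> label; edges: iterable of undirected vertex pairs."""
    return labels, {frozenset(e) for e in edges}

def occurs_in(pattern, graph):
    """Brute-force subgraph isomorphism test for tiny vertex-labeled graphs."""
    p_labels, p_edges = pattern
    g_labels, g_edges = graph
    p_vertices = sorted(p_labels)
    for image in permutations(sorted(g_labels), len(p_vertices)):
        f = dict(zip(p_vertices, image))  # injective vertex mapping
        labels_match = all(p_labels[v] == g_labels[f[v]] for v in p_vertices)
        edges_match = all(frozenset(f[v] for v in e) in g_edges for e in p_edges)
        if labels_match and edges_match:
            return True
    return False

def support(pattern, database):
    """sigma(g, D): number of database graphs containing the pattern."""
    return sum(1 for g in database if occurs_in(pattern, g))

def maximal_frequent(candidates, database, min_sup):
    """Names of frequent candidates not contained in another frequent candidate."""
    frequent = {n: g for n, g in candidates.items()
                if support(g, database) >= min_sup}
    return sorted(n for n, g in frequent.items()
                  if not any(m != n and occurs_in(g, h)
                             for m, h in frequent.items()))

# Hypothetical toy database: G1 = a-c-a, G2 = c-a-b.
G1 = make_graph({0: "a", 1: "c", 2: "a"}, [(0, 1), (1, 2)])
G2 = make_graph({0: "c", 1: "a", 2: "b"}, [(0, 1), (1, 2)])
D = [G1, G2]
candidates = {
    "a": make_graph({0: "a"}, []),
    "b": make_graph({0: "b"}, []),
    "c": make_graph({0: "c"}, []),
    "a-c": make_graph({0: "a", 1: "c"}, [(0, 1)]),
    "a-b": make_graph({0: "a", 1: "b"}, [(0, 1)]),
    "a-c-a": G1,
    "c-a-b": G2,
}
result = maximal_frequent(candidates, D, min_sup=2)
```

The candidate list is assumed to contain pairwise non-isomorphic graphs; a real miner would generate candidates rather than receive them.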

Lattice L_i is a graph lattice of graph G_i ∈ D. Every node in the lattice is the embedding of a connected subgraph of G_i. Every embedding of a subgraph in G_i occurs exactly once in the lattice. The bottommost node corresponds to the empty subgraph φ_graph and the topmost node corresponds to G_i. A node C is a child of the node P ≠ φ_graph in the lattice L_i if P ⊂_graph C and C and P differ by exactly one edge. The node P is a parent of such a node C in the lattice. All single-node subgraphs are the children of the node φ_graph. φ_graph is thus the parent of all the single-node subgraphs. An edge exists in the lattice between every pair of child and parent nodes.

PROOF. Let e_1 and e_2 be the edges, each incident on a vertex of P, such that C_i = P ∪ {e_1} and C_j = P ∪ {e_2} as shown in Figure 4. Let M = C_i ∪ {e_2}. Then M = P ∪ {e_1} ∪ {e_2} = P ∪ {e_2} ∪ {e_1} = C_j ∪ {e_1}. Hence, M is the common child of C_i and C_j, proving the Upper-◇-property.
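The construction in the proof can be sketched with edge sets. This is only an illustration: subgraphs are modelled as frozensets of edges, and the vertex names and the helper `common_child` are hypothetical.

```python
def common_child(parent, child_a, child_b):
    """Return P ∪ {e1} ∪ {e2} for two distinct one-edge extensions of P."""
    e1 = child_a - parent
    e2 = child_b - parent
    if len(e1) != 1 or len(e2) != 1 or e1 == e2:
        raise ValueError("expected two distinct one-edge extensions of parent")
    return parent | e1 | e2

# Hypothetical subgraph P with a single edge u-v and two extensions.
P = frozenset({("u", "v")})
Ci = P | {("v", "w")}   # P extended by e1
Cj = P | {("u", "x")}   # P extended by e2
M = common_child(P, Ci, Cj)
```

Both extension orders give the same node M, which is what makes it a common child of Ci and Cj.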

Example: Consider D = {G_1, G_2} in Figure 2. To keep the example simple, we assume that all the edge labels are identical and hence not shown in the figure. The corresponding lattices L_1 and L_2 of G_1 and G_2 respectively are given in Figure 3. The bottommost node corresponds to the empty subgraph φ_graph and the topmost nodes correspond to the graphs G_i ∈ D. The children of a node N in the lattice denote all the supergraphs of N that can be obtained by extending N by one edge. For instance, the child of the subgraph a-c in L_1 is the subgraph a-c-a (obtained by adding the edge c-a). Similarly, subgraphs c-a and a-b are the parents of the subgraph c-a-b in L_2. The subgraphs c-a and a occur twice in L_1 since there are two embeddings of c-a and two embeddings of a in G_1.

3. OUR APPROACH

In subsection 3.1 we discuss the lattice models; in 3.2, we provide the intuition behind the algorithm proposed to find the maximal frequent subgraphs. In subsection 3.3 we present the Margin algorithm, followed by the proof in the final subsection 3.4.

3.1 Lattice Representation Models

Definition 3. Cut: A cut between two nodes in a lattice L_i, represented by (C†P), is defined as an ordered pair (C, P) where C is the child of the node P for C, P ∈ L_i, and C is not frequent while P is frequent. The cut (C†P) belongs to a graph G_i ∈ D if C ⊆_graph G_i. The frequent subgraph P of a cut is represented by f(†) (frequent-†) and the infrequent subgraph C is represented by I(†) (infrequent-†).

Example: Consider Figure 3 with minSup = 2. The pair (a-b † a) in L_2 is marked as a cut since the subgraph a is frequent with a count of 2 while its child a-b is infrequent with a count of 1. a-b is the I(†) node and a is the f(†) node. Figure 3 shows the frequency count of each node in the example lattice along with all the existing cuts in the lattices L_1 and L_2 respectively.

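The cut definition can be checked mechanically on a tiny lattice. The sketch below is an illustration only: the node names, the supports and the parent lists are assumptions describing a hypothetical lattice of a graph c-a-b in a two-graph database, with φ_graph (written "phi") given the support of the whole database since it is always considered frequent.

```python
# Assumed supports of the nodes of a hypothetical lattice (minSup = 2).
supports = {"phi": 2, "a": 2, "b": 1, "c": 2,
            "c-a": 2, "a-b": 1, "c-a-b": 1}

# Each node maps to its parents (nodes with exactly one edge less).
parents = {"a": ["phi"], "b": ["phi"], "c": ["phi"],
           "c-a": ["c", "a"], "a-b": ["a", "b"],
           "c-a-b": ["c-a", "a-b"]}

def find_cuts(supports, parents, min_sup):
    """All pairs (C, P) with C an infrequent child of a frequent parent P."""
    return sorted((c, p) for c, ps in parents.items() for p in ps
                  if supports[c] < min_sup <= supports[p])

cuts = find_cuts(supports, parents, min_sup=2)
f_dagger_nodes = sorted({p for _, p in cuts})
```

This brute-force scan visits every lattice edge, which is exactly what the approach of the paper tries to avoid; it is useful only as a reference for small examples.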
We observe that the following property holds in every lattice L_i of G_i ∈ D, which we exploit in our algorithm.

Property 1. Upper Diamond property (Upper-◇-property): Any two children C_i, C_j of a node P, where C_i and C_j lie in the lattice of a graph G ∈ D, will have a common child.


For a given graph G, the size of the graph (denoted by |G|) refers to the number of edges present in G. All the subgraphs of equal size form a level in the lattice L_i of G_i. The node corresponding to φ_graph forms level 0, singleton vertex graphs form level 1 and the nodes of size i form level i+1 for i ≥ 1 (Figure 3).
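The level structure can be sketched for one small graph. This is an illustration under assumptions: the graph is a hypothetical three-vertex path, each node of size i ≥ 1 is represented by its edge set (one node per embedding, as in the replication model), and the helper names are introduced here.

```python
from itertools import combinations

def is_connected(edges):
    """True for an empty edge collection or one connected component."""
    edges = list(edges)
    if not edges:
        return True
    reached = set(edges[0])
    pending = edges[1:]
    progress = True
    while pending and progress:
        progress = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.remove(e)
                progress = True
    return not pending

def lattice_levels(vertices, edges):
    """Level 0: phi, level 1: single vertices, level i+1: connected i-edge subgraphs."""
    levels = {0: ["phi"], 1: sorted(vertices)}
    for size in range(1, len(edges) + 1):
        nodes = [frozenset(c) for c in combinations(sorted(edges), size)
                 if is_connected(c)]
        if nodes:
            levels[size + 1] = nodes
    return levels

# Hypothetical path graph with vertices 0-1-2.
levels = lattice_levels({0, 1, 2}, {(0, 1), (1, 2)})
sizes = [len(levels[k]) for k in sorted(levels)]
```

For this two-edge graph the lattice has four levels, 0 to 3, with the whole graph as the single topmost node.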


Figure 5: Embedded Model

The lattice structure used by the Margin algorithm can be represented using two models, namely the replication model and the embedded model.

In the replication model (Figure 3), for each subgraph g ⊆_graph G_i with G_i ∈ D, each embedding of g is represented by a separate node in the lattice L_i of G_i. Hence if a subgraph g ⊆_graph G_i occurs k times in G_i then there are k nodes of g in the lattice L_i.

In the embedded model, a common lattice L of the database D is constructed where all isomorphic forms of any subgraph of the graphs in D are represented exactly once. Each node in the lattice L stores the subgraph g it represents along with all the embeddings of g in the database D. The bottommost node corresponds to the empty subgraph φ_graph and the topmost nodes correspond to the graphs in D. The embedded model of the graph database in Figure 2 is shown in Figure 5.

In the rest of the paper we proceed using the replication model unless mentioned otherwise. With minor modifications the Margin algorithm can be applied to the embedded model.

3.2 Intuition

The set of candidate subgraphs that are likely to be maximal frequent are the f(†) nodes. This is because they are frequent subgraphs having an infrequent child. The remaining nodes of the lattice cannot be maximal frequent subgraphs. In this paper, we present an approach that avoids traversing the lattice bottom up and instead traverses the cuts alone in each lattice L_i for G_i ∈ D. We prune the set of f(†) nodes to give the set of maximal frequent subgraphs. The Margin algorithm, unlike the apriori based algorithms, goes directly to any one of the f(†) nodes of the lattice L_i and then finds all other f(†) nodes by cutting across the lattice L_i. We give an insight below into the approach developed.

Finding the initial f(†) node R of our result set is a trivial dropping of edges from the initial graph G_i ∈ D, ensuring that the resulting subgraph is connected, until we find the first frequent subgraph R. Our initial cut is thus (CR†R) where CR is the infrequent child of R. We devise an algorithm ExpandCut which, for each cut that belongs to G_i ∈ D, recursively extends the cut to explore all cuts in G_i.

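The edge-dropping search for the initial cut can be sketched as follows. This is a minimal illustration under assumptions: a subgraph is a frozenset of edges of one database graph, frequency is supplied by a hypothetical oracle `is_frequent` that is anti-monotone and treats the empty subgraph as frequent, and the vertex level of the lattice is skipped so that the empty edge set is the bottom node.

```python
def is_connected(edges):
    """True for an empty edge collection or one connected component."""
    edges = list(edges)
    if not edges:
        return True
    reached = set(edges[0])
    pending = edges[1:]
    progress = True
    while pending and progress:
        progress = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.remove(e)
                progress = True
    return not pending

def find_initial_cut(graph_edges, is_frequent):
    """Drop one edge at a time, keeping the subgraph connected, until it is frequent.

    Returns (CR, R): the representative R and its infrequent child CR,
    or (None, graph) when the whole graph is already frequent.
    """
    node = frozenset(graph_edges)
    child = None
    while not is_frequent(node):
        child = node
        node = next(node - {e} for e in sorted(node)
                    if is_connected(node - {e}))
    return child, node

# Hypothetical path graph 0-1-2-3; assume only subgraphs of {1-2, 2-3} are frequent.
graph = {(0, 1), (1, 2), (2, 3)}
frequent_region = frozenset({(1, 2), (2, 3)})
cut = find_initial_cut(graph, lambda n: n <= frequent_region)
```

The edge to drop is chosen in sorted order purely to keep the sketch deterministic; any removable edge would do.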
Next, we provide an intuition for the ExpandCut algorithm used to find the nearby cuts given an initial cut (C†P) (Figure 6(a)) in the lattice L_i of G_i. Recursively invoking ExpandCut on each newly found cut with the three steps below finds all cuts in G_i.

• Step 1: The node C in lattice L_i can have many parents that are frequent or infrequent, one of which is P. Consider the frequent parent P_1 in Figure 6(b). Cut (C†P_1) exists since P_1 is frequent while C is infrequent. Thus, for an initial cut (C†P), all frequent parents of C are reported as f(†) nodes.

• Step 2: Consider all the children C_1, C_2, C of any frequent parent P of C as in Figure 6(c). Each of them can be frequent or infrequent.
(a): Consider an infrequent child C_2. Cut (C_2†P) exists since P is frequent while C_2 is infrequent. Thus, for an initial cut (C†P), for each frequent parent P_i of C that has an infrequent child C_j, the cut (C_j†P_i) is reported.
(b): Consider a frequent child C_1. By the Upper-◇-property, the nodes C_1 and C have a common child M. M is infrequent as its parent C is infrequent. Hence, cut (M†C_1) exists. Thus, for an initial cut (C†P), for each frequent parent P_i of C consider each of its frequent children C_j. The cut (M†C_j) is reported where M is the common child of C and C_j.

• Step 3: Consider all parents S_1, S_2, S_3 of an infrequent parent P_2 of C as in Figure 6(d). Each such parent can be frequent or infrequent. Consider the frequent parents S_1, S_3 (Figure 6(d)) of an infrequent parent P_2 of C. Hence, the cuts (P_2†S_1) and (P_2†S_3) exist. However, if step 1 is called on the cut (P_2†S_1), the cut (P_2†S_3) is found. Thus, for an initial cut (C†P), for each infrequent parent P_i of C, consider any one frequent parent S_j of P_i. ExpandCut is invoked on the cut (P_i†S_j).

Example: Consider a section of the lattice as illustrated in Figure 6(e). Let cut (C†P) be the initial cut. Node C has three parents P, P_1, P_2 of which subgraphs P and P_1 are frequent and P_2 is infrequent.
Step 1: Among the parents of C, consider the frequent parents P and P_1. All frequent parents are reported as f(†) nodes as their child C is infrequent.
Step 2: P has three children C, C_1 and C_2 while P_1 has two children C and C_3. In step 2(a), considering the frequent parent P, the cuts (C_2†P) and (C†P) are reported. Considering the frequent parent P_1, the cuts (C†P_1) and (C_3†P_1) are reported. In step 2(b), considering the frequent parent P, the cut (M†C_1) is found. P_1 does not have any frequent children and hence does not report any cuts in this step.
Step 3: The infrequent parent P_2 of C is considered and the cuts (P_2†S_1) and (P_2†S_3) are reported.
All the cuts found by the first iteration of the ExpandCut function are marked in Figure 6(e). The algorithm ExpandCut is recursively called for each cut that is found. Hence, due to ExpandCut being invoked on the cut (M†C_1), applying step 2(b) the cut (N†M_1) is found. Similarly, due to ExpandCut being invoked on cut (P_2†S_1), applying step 1, the cut (P_2†S_3) is found and applying step 3, the cuts (S_2†T_1) and (S_2†T_2) are found.

The proof that the algorithm finds all the cuts in the lattice is given in section 3.4. The detailed algorithm ExpandCut is given in section 3.3. To the best of our knowledge, no such approach has been proposed so far in the literature of maximal frequent subgraph mining.

Figure 6: ExpandCut (panels (a) to (e); legend: infrequent subgraph, frequent subgraph)

3.3 The Margin Algorithm

Table 2 shows the Margin algorithm to find the globally maximal frequent subgraphs MF. Initially, MF = φ (line 1) and the graphs in D are unexplored. LF is the set of locally maximal subgraphs in each G_i, which is initially empty (line 3). A representative R_i of G_i is the first f(†) node found among the subgraphs of G_i. Initially, given the graphs D = {G_1, G_2, ..., G_n}, for each G_i ∈ D, we find the representative R_i for G_i (line 4). This is done by iteratively dropping an edge from G_i until a connected frequent subgraph is found. The ExpandCut algorithm finds the nearby cuts and recursively calls ExpandCut on each newly found cut. The algorithm functions in a manner that finding one cut in G_i ∈ D would find all cuts in G_i, as shall be proved in section 3.4. In line 6 the globally maximal frequent subgraphs MF found so far are merged with the local maximal frequent subgraphs LF found in G_i. The merge function finds the new globally maximal set by removing all subgraphs that are not maximal due to the subgraphs in LF and adds subgraphs from LF that are globally maximally frequent so far including G_i.

Table 2: Algorithm Margin

Input: Graph Database D = {G_1, G_2, ..., G_n}
Output: Set of Maximal Frequent Graphs MF
Algorithm:
1. MF = φ
2. For each G_i ∈ D do
3.     LF = φ
4.     Find the representative R_i of G_i
5.     ExpandCut(LF, CR_i†R_i) where CR_i is the infrequent child of R_i
6.     Merge(MF, LF)

Table 3: Algorithm ExpandCut(LF, C†P)

Input: LF: The maximal frequent subgraphs seen so far in G_i. Cut: C†P
Output: The updated set of maximal frequent subgraphs LF.
Algorithm:
1. Let Y_1, Y_2, ..., Y_c be the parents of C.
2. for each Y_i, i = 1, ..., c do
3.     if Y_i is frequent
4.         LF = LF ∪ Y_i
5.         for each child CY_i of Y_i do
6.             if CY_i is infrequent do
7.                 ExpandCut(LF, CY_i†Y_i)
8.             if CY_i is frequent do
9.                 Find the common child M of C and CY_i
10.                ExpandCut(LF, M†CY_i)
11.    if Y_i is infrequent
12.        if one frequent parent PY_i of Y_i exists
13.            ExpandCut(LF, Y_i†PY_i)

Table 3 shows the ExpandCut algorithm, which expands a given cut such that its neighbouring cuts will be explored. The inputs to the algorithm are the set of maximal frequent subgraphs LF found so far (initially empty) and the cut C†P. For each parent Y_i of C, if Y_i is frequent, Y_i is added to LF (lines 3-4).

• For each infrequent child CY_i of Y_i, ExpandCut is called on the cut (CY_i†Y_i) (lines 6-7).
• For each frequent child CY_i of Y_i, let M be the common child of C and CY_i. ExpandCut is called on the cut (M†CY_i) (lines 8-10).

On the other hand, if Y_i is infrequent and there exists at least one frequent parent of Y_i, then ExpandCut is called on the cut (Y_i†PY_i) where PY_i is one frequent parent of Y_i (lines 11-13).

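The steps of ExpandCut can be sketched in a simplified setting. This is not the authors' implementation; it is a minimal illustration under loudly stated assumptions: a single database graph, subgraphs represented as frozensets of its edges (replication model), the vertex level of the lattice skipped so that the empty edge set is the bottom node, a hypothetical anti-monotone oracle `is_frequent` in place of support counting, an explicit stack with a visited set in place of recursion, and a common child used only when the union of the two nodes is connected.

```python
def is_connected(edges):
    """True for an empty edge collection or one connected component."""
    edges = list(edges)
    if not edges:
        return True
    reached = set(edges[0])
    pending = edges[1:]
    progress = True
    while pending and progress:
        progress = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.remove(e)
                progress = True
    return not pending

def parents(node):
    """Connected subgraphs with exactly one edge less."""
    return [node - {e} for e in sorted(node) if is_connected(node - {e})]

def children(node, graph):
    """Connected subgraphs of graph with exactly one edge more."""
    return [node | {e} for e in sorted(graph - node) if is_connected(node | {e})]

def expand_cut(graph, is_frequent, initial_cut):
    """Explore cuts reachable from initial_cut; return (cuts, maximal frequent sets)."""
    graph = frozenset(graph)
    visited, lf, stack = set(), set(), [initial_cut]
    while stack:
        cut = stack.pop()
        if cut in visited:
            continue
        visited.add(cut)
        c, _p = cut
        for y in parents(c):                        # lines 1-2 of Table 3
            if is_frequent(y):                      # lines 3-4
                lf.add(y)
                for cy in children(y, graph):       # line 5
                    if not is_frequent(cy):         # lines 6-7
                        stack.append((cy, y))
                    elif is_connected(c | cy):      # lines 8-10, M = C ∪ CY
                        stack.append((c | cy, cy))
            else:                                   # lines 11-13
                frequent_parents = [py for py in parents(y) if is_frequent(py)]
                if frequent_parents:
                    stack.append((y, frequent_parents[0]))
    maximal = {f for f in lf if not any(f < g for g in lf)}
    return visited, maximal

# Hypothetical path graph 0-1-2-3; assume only subgraphs of {0-1, 1-2} are frequent.
graph = {(0, 1), (1, 2), (2, 3)}
AB = frozenset({(0, 1), (1, 2)})
C_ONLY = frozenset({(2, 3)})
cuts, maximal = expand_cut(graph, lambda n: n <= AB, (C_ONLY, frozenset()))
```

The final filter plays the role of pruning the f(†) nodes to the locally maximal ones; merging results across several database graphs, as in line 6 of Table 2, would additionally need isomorphism tests and is left out of this sketch.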
3.3.1 Further Optimisations

Each invocation of the ExpandCut algorithm finds a list of cuts. ExpandCut is recursively invoked on each newly found cut. Optimisations to reduce the number of revisited cuts are essential in order to reduce the computation time spent on verifying whether a cut has been visited previously. We discuss these optimisations separately in order to keep the main algorithm simple. A few of the optimisations that help speed up the computation are given below.

• Let the cut (C†P) be the cut on which ExpandCut has been invoked. Line 1 of the ExpandCut algorithm computes the parents of C. Consider a frequent parent Y_i of C. If there exists at least one frequent child CY_i of Y_i, then by the Upper-◇-property there exists a node M which is the common child of CY_i and C. C being infrequent, M is infrequent. It can be shown that lines 5-10 of the ExpandCut algorithm, which iterate over all the children of Y_i, can be replaced by calling ExpandCut on just one cut (M†CY_i). This reduces the number of revisited cuts. Also, a single frequent child CY_i needs to be generated instead of all the children of Y_i in line 5.

• Consider an invocation of ExpandCut on a cut (C†P). Lines 11-13 of the algorithm check for infrequent parents Y_i of C. If Y_i is found among the set of infrequent graphs already visited, then the ExpandCut invoked on cut (C†P) skips executing lines 12-13 on Y_i.

• Consider an invocation of ExpandCut on cut (C†P). The parents Y_1, ..., Y_c of C are computed in line 1. For some i, Y_i = P since P is a parent of C. Therefore, in line 5 all the children of P are computed and explored. If any such child C' is infrequent, then ExpandCut is invoked on cut (C'†P). In the invocation of ExpandCut on the cut (C'†P), the children of P are recomputed and revisited. Since the children of P are already explored in the invocation of ExpandCut on the cut (C†P), their re-exploration can be avoided in the invocation of ExpandCut on cut (C'†P) by passing the appropriate information.

Figure 7: Finding f(†)-nodes (labels: supergraphs of P, P, common cut, subgraphs of P)

Frequent subgraphs, wherein the count value of each frequent subgraph need not be reported, can be obtained by reporting all subgraphs of MF or of the f(†) nodes. Isomorphism computations are required only in order to eliminate duplicates from the subgraphs generated from MF. On the other hand, the apriori based approach computes the frequency of each generated subgraph. This requires finding isomorphic forms of each generated subgraph in the entire graph database. Furthermore, techniques in [24] can be adapted to provide an approximate frequency count value for the frequent subgraphs.

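Generating the frequent subgraphs from the maximal ones can be sketched as a plain enumeration. This is an illustration under assumptions: the maximal frequent subgraphs are given as edge sets of a single graph, so duplicates are removed by edge-set identity rather than by the isomorphism tests a real database would need, and single-vertex subgraphs are not listed. The helper `frequent_from_maximal` is a name introduced here.

```python
from itertools import combinations

def is_connected(edges):
    """True for an empty edge collection or one connected component."""
    edges = list(edges)
    if not edges:
        return True
    reached = set(edges[0])
    pending = edges[1:]
    progress = True
    while pending and progress:
        progress = False
        for e in list(pending):
            if reached & set(e):
                reached |= set(e)
                pending.remove(e)
                progress = True
    return not pending

def frequent_from_maximal(maximal_sets):
    """All non-empty connected sub-edge-sets of the given maximal frequent sets."""
    result = set()
    for m in maximal_sets:
        for k in range(1, len(m) + 1):
            for sub in combinations(sorted(m), k):
                if is_connected(sub):
                    result.add(frozenset(sub))  # the set removes duplicates
    return result

# Hypothetical maximal frequent subgraphs of a path graph 0-1-2-3.
maximal = [frozenset({(0, 1), (1, 2)}), frozenset({(2, 3)})]
frequent = frequent_from_maximal(maximal)
```

No support counting is involved, which mirrors the point made above: the subgraphs are known to be frequent, only their exact counts are unavailable.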
3.4 Proof of correctness

In this subsection, we prove the correctness of the Margin algorithm. We first prove the essential claim, which trivially implies that having found one cut in G_i ∈ D leads to all cuts in G_i ∈ D. Finding an initial cut in each G_i ∈ D would thus find all the cuts in D, which are pruned to compute the maximal frequent subgraphs. Note that φ_graph is always considered frequent.

Claim 1. Given two cuts (C_1†P_1) and (C_2†P_2) in G_i ∈ D, invoking ExpandCut on one cut finds the other cut.

PROOF. Let P_k = (V_k, E_k, L, l) for k = 1, 2. We consider a sub-lattice, described below, which is contained in the lattice L_i of G_i.
Case 1: If V_1 ∩ V_2 ≠ φ, consider the largest common subgraph of P_1 and P_2 in lattice L_i (Figure 8). This node is the lowermost node of the sub-lattice and hence is at level 0 of the sub-lattice. Note that


3.3.2 Discussion

Consider a cut (C†P). The bold lines in Figure 7 represent the lattice L_1 of the graph G_1. P ⊆_graph G_1 being frequent, all the parents of P are frequent, due to which no I(†) node (hence no cut) can lie among the subgraphs of P in the lattice L_1 in the figure. Similarly, all the supergraphs of C being infrequent, no f(†) node (hence no cut) would lie among the supergraphs of C. Hence the subgraph space of P and the supergraph space of C can be pruned out of the candidate set for maximal frequent subgraphs. Hence, each cut found prunes our search space substantially. Margin traverses all the nodes on the cuts in the lattice L_i, i.e., the infrequent parents of the I(†) nodes and their infrequent parents. The remaining nodes of the lattice need not be explored. We prove in section 3.4 that our approach finds all the f(†) nodes in all the lattices L_i of G_i ∈ D, which are finally pruned to get the maximal frequent subgraphs.


Figure 8: Sub-lattice for the case V_1 ∩ V_2 ≠ φ

the level 0 of the sub-lattice occurs at the level one more than the size of the largest common subgraph in lattice L_i since, as mentioned earlier, in lattice L_i the level of a node is one added to the size of the node. Extend the largest common subgraph to g_1 by adding an edge of P_1 that is incident on it. Extend g_j to g_{j+1}, for j ≥ 1, by taking an edge of P_1 not in g_j that is incident on g_j, until g_{j+1} = P_1. Extend P_1 in the sub-lattice to C_1. Similarly, extend the largest common subgraph to f_1 by adding an edge of P_2 that is incident on it, and extend f_j to f_{j+1}, for j ≥ 1, by taking an edge of P_2 not in f_j that is incident on f_j, until f_{j+1} = P_2. For every two nodes in the same level of the sub-lattice which have a common parent in the sub-lattice, construct the common child in the sub-lattice. Such a common child always exists by the Upper-◇-property.

Case 2: If V_1 ∩ V_2 = φ, consider the shortest path that connects P_1 and P_2 in G_i as in Figure 9. Hence, P_1, the path and P_2 together form a connected subgraph of G_i. A shortest path is a sequence of edges {m_1, ..., m_k} where edge m_j is adjacent to m_{j+1} for j = 1, ..., k-1, m_1 is incident on an edge in P_1 and m_k is incident on an edge in P_2. Let subgraph g_1 = m_1 ∪ e where e is the edge of P_1 incident on m_1. Similarly let f_1 = m_k ∪ e' where e' is the edge of P_2 incident on m_k. The lowermost node of the sub-lattice is φ_graph at level 0 (Figure 10). Level 1 of the sub-lattice consists of the set of vertices, which are children of the node φ_graph. The set of edges forms the nodes of level 2 of the sub-lattice. Extend g_j to g_{j+1} by taking an edge of P_1 not in g_j that is incident on g_j, for j ≥ 1, until g_{j+1} = P_1. Extend P_1 in the sub-lattice to C_1. Similarly, extend f_j to f_{j+1} by taking an edge of P_2 not in f_j that is incident on f_j to form a subgraph in the sub-lattice. For every two nodes in the same level which have a common parent in the sub-lattice, construct the common child.

Below is the proof for case 1, which uses Figure 8; it also holds for case 2. All supergraphs of C_1 and C_2 in the sub-lattice are infrequent while all subgraphs of P_1 and P_2 are frequent. Let X_1, ..., X_n

o 6 Û  Ü Ý be the ordered sequence of supergraphs of in lattice such  2 + WWXW  [ -XWXWsWXthat  ™   ™    / ™  . Let    be the& ordered

o  2   &   in lattice such that  sequence of supergraphs of 6 Û  Ü Ý + WsWXW       E as illustrated in Figure 8. Any nodeo6- †  |S o in the lattice 2žÛ6ÜÝ is now referred to as a unique tuple ( ™ w ) such


above claim+ by induction. Base Case: To [ prove that the claim + &C 7 holds for  . + The + cut on the path  - [D + corresponds to the + € ). & The +Xchild(

 ™ ) of  is infrequent. initial cut (  - [D + ™  . If7 [node Consider node + + - † [D + N D + † is frequent, cut - [EDathe is found by lines(8( N ™ R × N ™  ) on the path             !  #" † algorithm. If node is infrequent, cut 10) + of the + + - [EDa+ ( N ™ 7 € ) on the path  is found by lines(6-7) of the Hence the next cut is found either on path + 7[D + !#" algorithm. or path  . Hence, conditions 1 and 2 of the claim are satisfied.

[Figure 9: Path. Labels: P1, P2, path edges m1, m2, ..., graph Gi]

+ Þ / &zÑ o Figure 10: Lattice 2 Û6ÜÝ : € € o that  is the smallest supergraph of † in  / and ™w - is the smallest supergraph of † in . By construction,  =(  ™. ). The node o |MS in lattice 2 Û6ÜÝ is numbered differently. + Consider& the child o - [  /10 of | S such that   / 0 is a subgraph of  . Let   / 0

N ™  - for ^ 2‹] (see Figure 8). Node |S corresponds to tuples ( w ™ [ ) some 3 & ^ -WXWXWX- ] (Figure 10) for 4 as and in the proof. x +Kwhen & +X-srequired WXWXWX5  

 Further we define the set of paths each o + oo - ùwhere + & of form ((  g -WX6 J WXWX5 - is the union of edges  ™Xw ),( ™71w o )) for 4 T / 8 &  7  + X X W X W X W 9  7 [ / ˆ   . Similary 5 J WWs5 - is the ooúùb+ - where each & g -XWX: ] ˆ  union of edges of form (( ™Xw ),( ™Xw )) for ^ (see Figure 8). oo- [ Notice that I ^×(  g ™ ) is infrequent while ( ™ + ) is frequent o g for each  in  . Hence, for each  w J;5 - , o for 4  , there ex ists exactly one cut. Arguing similarly, ( ) is infrequent while  ™ I - o o g   ™ ) is frequent ™ g in , ^ž (< for . Hence, there lies one / I each 7 cut on each wx= . Thus,+ there+ are ] ; cuts in the J 5 , 4  o lattice / 2žÛ6ÜÝ / . Given the initial cut + (  + ¦€ ), we are - [ to prove +s- that [  ™ M F N 7 [ ™  ) cut (  € ) is reachable. Cut (  € )=cut ( N o by the construction / / of the lattice 2 Û6ÜÝ and lies on the path . The   . final cut (  € ) to be reached lies on the path > The Rechability Sequence: + -sWXWXWXù [ $ 7 [ The order of paths  ono which o   + / the cuts from path to   are found where  ? J 5 or  J?5 satisfies the following: 1. for  w 2. for  w

& 

&;7

Û

J@5

+

,

Û

ùb+F&

/ 7 b ù +R& J@5 , Û Û

A Ž  4 A Ž 2 4

17 [ P ROOF . Since to   , the initial +‹&B 7[ we traverse the pathsù$from [‘&   . We prove the

and the final path  path 

+ -sWXWXWXInduction Step: Assuming that the claim holds for   , F2 ] ” , we prove that the claim holds for  ù+ . Let  +X-  /-XWXÛ WXWs- 
