
Supplementary paper
Mining frequent stem patterns from unaligned RNA sequences ∗

Michiaki Hamada a,b,c, Koji Tsuda a,d, Taku Kudo e, Taishin Kin a and Kiyoshi Asai a,f

a Computational Biology Research Center, National Institute of Advanced Industrial Science and Technology (AIST), 2-41-6, Aomi, Koto-ku, Tokyo, 135-0064, Japan
b Mizuho Information & Research Institute, Inc., 2-3, Kanda-Nishikicho, Chiyoda-ku, Tokyo, 101-8443, Japan
c Department of Computational Intelligence and System Science, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama, 226-8503, Japan
d Max Planck Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
e Google, Inc., 26-1, Sakuracho, Shibuya, Tokyo, 150-8512, Japan
f Graduate School of Frontier Sciences, University of Tokyo, 5-1-5, Kashiwanoha, Kashiwa, 277-8562, Japan

July 3, 2006

∗ This paper is a supplementary paper for “Mining frequent stem patterns from unaligned RNA sequences”, submitted to Bioinformatics.


1 Technical details of graph mining with label taxonomy

In this section we describe in detail the graph mining algorithm used in the original paper. We first summarize the elementary definitions.

1.1 Preliminary definition

Throughout this supplementary paper, the word “graph” means a labeled directed graph with no multi-edges or self-loops. Such a graph is represented as a 5-tuple (V, E, L_V, L_E, lb), where V = {v_1, ..., v_k} is a set of vertices, E = {(v_i, v_j) | v_i, v_j ∈ V} is a set of edges ((v_i, v_j) denotes the edge directed from vertex v_i to vertex v_j), L_V is the set of vertex labels, L_E is the set of edge labels, and lb : V ∪ E → L_V ∪ L_E is a mapping from each vertex or edge to its label. The topology of a graph G is the graph obtained by removing all labels from G.

First, we introduce a label taxonomy with cost for the purpose of flexible pattern matching.

Definition 1 (Label taxonomy with cost) Given a label set LS, a label taxonomy with cost is a DAG (directed acyclic graph) in which each vertex v is a label in LS and has a cost cost(v) ≥ 0.

In the above definition, an edge A → B means that label A is a generalized label of B (equivalently, label B is a specialized label of A). Modeling a label taxonomy as a DAG rather than a forest enables us to represent multiple taxonomies [6]. cost(A) and cost(B) are the generalization costs of labels A and B, respectively, and usually satisfy cost(B) < cost(A), because it is natural for a label to have a lower cost than its generalized label. For a label v in a taxonomy T, τ_T(v) is defined as the set of labels consisting of v and its ancestors in the taxonomy.

When we do not consider a label taxonomy, a graph G′ is said to be a subgraph of G if there exists a mapping from V′ to V that preserves all vertex labels, edge labels, and edge directions. The concept of a subgraph with label taxonomy is defined as follows.

Definition 2 (subgraph [4]) Let T be a taxonomy and G = (V, E, L_V, L_E, lb) a graph. A graph G′ = (V′, E′, L′_V, L′_E, lb′) is called a subgraph of G with taxonomy T (denoted by G′ ⊑ G) if there exists a map φ : V′ → V such that:
1. For all v ∈ V′, lb′(v) ∈ τ_T(lb(φ(v))).
2. For all (v_i, v_j) ∈ E′, (φ(v_i), φ(v_j)) ∈ E and lb′(v_i, v_j) ∈ τ_T(lb(φ(v_i), φ(v_j))).

It is easily seen that the above definition generalizes the usual definition of a subgraph. For simplicity, whenever a label taxonomy is given, we use the word “subgraph” to mean a subgraph with taxonomy in the remainder of this paper.

Definition 3 (Clique) A graph is said to be a clique if there exists an edge between every pair of distinct vertices.

Definition 4 (support [4]) For a set of graphs GS = {G_1, ..., G_N} and a taxonomy T, the support of a pattern P is defined as

support(P) = |{G_i | G_i ∈ GS, P ⊑ G_i}| / |GS|.   (1)

Definition 5 (cost of pattern) Given a taxonomy, the cost of a pattern P is defined as the average cost of all labels in the pattern and is denoted by cost(P).
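The ancestor set τ_T(v), the label test used in Definition 2, and the pattern cost of Definition 5 can be illustrated with a minimal Python sketch. The taxonomy, its labels and its cost values below are hypothetical and chosen only for illustration; they are not the taxonomy used in the original paper.

```python
from typing import Dict, List, Set

# Hypothetical toy taxonomy (assumption): each label maps to its direct
# generalizations, i.e. the tails of the DAG edges A -> B pointing at it.
PARENTS: Dict[str, List[str]] = {
    "A": ["R"], "G": ["R"],
    "C": ["Y"], "U": ["Y"],
    "R": ["N"], "Y": ["N"],
    "N": [],
}

# Hypothetical costs (assumption): a label is cheaper than its generalization.
COST: Dict[str, float] = {
    "A": 0.0, "G": 0.0, "C": 0.0, "U": 0.0,
    "R": 1.0, "Y": 1.0, "N": 2.0,
}


def tau(label: str) -> Set[str]:
    """Return the label together with all of its ancestors in the taxonomy."""
    seen: Set[str] = set()
    stack = [label]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def label_matches(pattern_label: str, graph_label: str) -> bool:
    """Label test of Definition 2: the pattern label must lie in tau(graph label)."""
    return pattern_label in tau(graph_label)


def pattern_cost(labels: List[str]) -> float:
    """Definition 5: average cost over all labels of a pattern."""
    return sum(COST[label] for label in labels) / len(labels)
```

In this toy setting a pattern vertex labeled R can be mapped onto a graph vertex labeled A or G, but not onto one labeled C, and a pattern using more general labels has a higher cost.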

1.2 Problem formulation of our biological problem

As described in the original paper, our biological problem is formulated as follows using the above definitions.

Formulation 1 Given a set of stem graphs, a label taxonomy, a minimum support minsup and a maximum cost maxcost, completely enumerate every stem pattern P that satisfies the following conditions:
1. support(P) ≥ minsup.
2. P is a clique.
3. cost(P) ≤ maxcost.

Enumerating the patterns that satisfy condition 1 in Formulation 1 is a direct application of standard graph mining techniques, e.g., [4], [8]. Our formulation has the additional conditions 2 and 3, and we use them to enumerate patterns more efficiently.
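As a rough illustration of the three conditions in Formulation 1, the following Python sketch checks them for a single candidate pattern whose support and cost are assumed to be already computed. The function names are hypothetical, and the clique test assumes that an edge in either direction between two vertices counts as "an edge between" them, which is one possible reading of Definition 3 for directed graphs.

```python
from itertools import combinations
from typing import Set, Tuple

Edge = Tuple[int, int]  # directed edge (u, v) between vertex indices


def is_clique(n_vertices: int, edges: Set[Edge]) -> bool:
    """True if every pair of distinct vertices is joined in at least one direction."""
    return all(
        (u, v) in edges or (v, u) in edges
        for u, v in combinations(range(n_vertices), 2)
    )


def satisfies_formulation(
    support: float,
    cost: float,
    n_vertices: int,
    edges: Set[Edge],
    minsup: float,
    maxcost: float,
) -> bool:
    """Check conditions 1 to 3 of Formulation 1 for one candidate pattern."""
    return support >= minsup and is_clique(n_vertices, edges) and cost <= maxcost
```

Applying such a check to every frequent pattern after enumeration is the kind of post-filtering that a search-space reduction is meant to avoid.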

1.3 Graph mining with label taxonomy

Our approach to enumerating the patterns that satisfy every condition in Formulation 1 is to use a tree-based data structure whose nodes represent patterns. This data structure, called the DFS code tree, was proposed by Yan et al. [8, 9]. Our basic strategy is to search the DFS code tree while pruning its subtrees by using the constraints on the derived patterns.

1.3.1 DFS code

A DFS code is a string representation of a graph. Performing a depth-first search in a graph yields a corresponding DFS tree (e.g., [1]), and while building the DFS tree we subscript the vertices of the graph in order of their depth-first discovery time. We denote G subscripted with a DFS tree T by G_T, called a DFS subscripting of G. Given G_T, the forward edge set, denoted by E_T^f, contains all the edges in the DFS tree, and the backward edge set, denoted by E_T^b, contains all the edges that are not in the DFS tree. Panels (b) and (c) of Figure S1 show two DFS subscriptings of graph (a). From now on, (v_i, v_j) (simply written as (i, j)) is viewed as an ordered pair representing an edge: if i < j, it is a forward edge; otherwise, it is a backward edge.

Yan et al. [8] introduce a linear order ≺_T on E_T^f ∪ E_T^b: for e_1 = (i_1, j_1) and e_2 = (i_2, j_2), e_1 ≺_T e_2 if and only if one of the following statements is true:
1. e_1, e_2 ∈ E_T^f, and j_1 < j_2 or (i_1 > i_2 and j_1 = j_2).
2. e_1, e_2 ∈ E_T^b, and i_1 < i_2 or (i_1 = i_2 and j_1 < j_2).
3. e_1 ∈ E_T^b, e_2 ∈ E_T^f, and i_1 < j_2.
4. e_1 ∈ E_T^f, e_2 ∈ E_T^b, and j_1 ≤ i_2.

For G_T, the DFS code, denoted by code(G, T), is the sequence of edges {e_i} in the above order, augmented with label and direction information; that is, each e_i is a 6-tuple (i, j, l_i, l_j, l_(i,j), d_(i,j)) ∈ {E_T^f ∪ E_T^b} × L_V × L_V × L_E × {+1, −1}, where (i, j) is a forward or backward edge, l_i and l_j are the labels of v_i and v_j respectively, l_(i,j) is the label of edge (i, j), and d_(i,j) is the direction of the edge (see footnote 1).

For example, the DFS codes of (b) and (c) in Figure S1 are (0, 1, A, B, a, 1) (1, 2, B, C, b, −1) (1, 3, B, D, a, −1) (3, 0, D, A, c, 1) and (0, 1, D, B, a, 1) (1, 2, B, A, a, −1) (2, 0, A, D, c, −1) (1, 3, B, C, b, −1), respectively. As this example shows, the DFS code of a graph is not unique.
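The four rules of the edge order can be sketched in Python as follows. Edges are given as pairs (i, j) of DFS subscripts, the function names are hypothetical, and the sketch only reproduces the order stated above, not any particular gSpan implementation.

```python
from functools import cmp_to_key
from typing import List, Tuple

Edge = Tuple[int, int]  # (i, j) in DFS subscripts


def is_forward(edge: Edge) -> bool:
    """An edge (i, j) is a forward edge if i < j, otherwise a backward edge."""
    return edge[0] < edge[1]


def precedes(e1: Edge, e2: Edge) -> bool:
    """Return True if e1 comes before e2 in the edge order."""
    (i1, j1), (i2, j2) = e1, e2
    f1, f2 = is_forward(e1), is_forward(e2)
    if f1 and f2:          # rule 1: both forward
        return j1 < j2 or (i1 > i2 and j1 == j2)
    if not f1 and not f2:  # rule 2: both backward
        return i1 < i2 or (i1 == i2 and j1 < j2)
    if not f1 and f2:      # rule 3: backward before forward
        return i1 < j2
    return j1 <= i2        # rule 4: forward before backward


def sort_edges(edges: List[Edge]) -> List[Edge]:
    """Sort the edges of one DFS subscripting into DFS code order."""
    def compare(a: Edge, b: Edge) -> int:
        if a == b:
            return 0
        return -1 if precedes(a, b) else 1
    return sorted(edges, key=cmp_to_key(compare))
```

Sorting the edge sets of subscriptings (b) and (c) with this comparison reproduces the edge sequences of the two DFS codes given in the example above.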


Figure S1: Depth-first search and DFS code of a graph. (a) Example graph. (b), (c) Two different DFS subscriptings of graph (a). Red numbers represent DFS subscripts; bold edges and dashed edges represent forward edges and backward edges, respectively. The DFS code of (b) is (0, 1, A, B, a, 1) (1, 2, B, C, b, −1) (1, 3, B, D, a, −1) (3, 0, D, A, c, 1) and that of (c) is (0, 1, D, B, a, 1) (1, 2, B, A, a, −1) (2, 0, A, D, c, −1) (1, 3, B, C, b, −1).

By definition, generating a new DFS code by adding one edge to an existing DFS code requires several conditions; that is, an extension of a DFS code must be a right-most extension of that code. Formally: given a graph G and a DFS tree T, for α = code(G, T) = (a_0, a_1, ..., a_n) with a_k = (i_k, j_k) and a_{k+1} = (i_{k+1}, j_{k+1}), the following statements hold:
1. If a_k is a forward edge and a_{k+1} is a forward edge, then i_{k+1} ≤ j_k and j_{k+1} = j_k + 1.
2. If a_k is a forward edge and a_{k+1} is a backward edge, then i_{k+1} = j_k and j_{k+1} < i_k.
3. If a_k is a backward edge and a_{k+1} is a forward edge, then i_{k+1} ≤ i_k and j_{k+1} = i_k + 1.
4. If a_k is a backward edge and a_{k+1} is a backward edge, then i_{k+1} = i_k and j_k < j_{k+1}.

Footnote 1: d_(i,j) is defined as +1 if the edge is directed from v_i to v_j, and as −1 if it is directed from v_j to v_i.
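The four statements on consecutive edges of a DFS code can be checked mechanically. The Python sketch below uses hypothetical function names and tests only these necessary conditions on the (i, j) parts of consecutive entries; it does not verify that a sequence is the DFS code of an actual graph.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (i, j) in DFS subscripts


def valid_successor(a_k: Edge, a_next: Edge) -> bool:
    """Check the right-most extension conditions for two consecutive edges."""
    (ik, jk), (in_, jn) = a_k, a_next
    k_forward, next_forward = ik < jk, in_ < jn
    if k_forward and next_forward:        # statement 1
        return in_ <= jk and jn == jk + 1
    if k_forward and not next_forward:    # statement 2
        return in_ == jk and jn < ik
    if not k_forward and next_forward:    # statement 3
        return in_ <= ik and jn == ik + 1
    return in_ == ik and jk < jn          # statement 4


def respects_rightmost_extension(edges: List[Edge]) -> bool:
    """True if every consecutive pair of edges satisfies the conditions."""
    return all(valid_successor(a, b) for a, b in zip(edges, edges[1:]))
```

Both edge sequences from Figure S1 pass this check, whereas a sequence such as (0, 1) (1, 3), which skips subscript 2, does not.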



Figure S2: Example of the DFS lexicographic order of patterns. The DFS lexicographic order of the above patterns is (a) < (b) < (c) < (d). This order of DFS codes is slightly different from Yan’s order (note that Yan’s order of the above patterns is (a) < (b) < (d) < (c)). See the main text for details.

1.3.2 DFS lexicographic order and minimum DFS code

In the remainder of this paper, we assume that both the set of vertex labels and the set of edge labels have a linear order. Like Yan et al. [9], we introduce a linear order on the set of all DFS codes, called the DFS lexicographic order, which is defined as the lexicographic combination of the orders on {E_T^f ∪ E_T^b} × L_V × L_V × L_E × {+1, −1}: for DFS codes α = (a_0, ..., a_m) and β = (b_0, ..., b_n), α ≤ β if and only if either of the following is true:
1. ∃t, 0 ≤ t ≤ min(m, n), such that a_k = b_k for k < t and a_t < b_t.
2. a_k = b_k for 0 ≤ k ≤ m, and m ≤ n.

The difference between Yan’s order and our order is that, in our lexicographic combination, the two vertex labels have higher priority than the edge label. This change is essential for Proposition 2 and Proposition 3. Figure S2 shows an example of the DFS lexicographic order of patterns.

As described in the previous section, a graph G has several DFS codes; we define the minimum DFS code of G as the smallest of these codes in the above order. The minimum DFS code of a graph can be regarded as a canonical string representation of that graph.

1.3.3 DFS code tree and gSpan’s algorithm

For efficient enumeration of patterns, Yan et al. introduce a tree structure whose nodes correspond to patterns, called the DFS code tree (Figure 3).

Definition 6 (DFS code tree [8, 9]) The DFS code tree, denoted by T, is a tree structure in which each node represents a DFS code, the relation between a node and its child nodes is given by right-most extension, and the child nodes of the same parent are ordered in DFS lexicographic order. We denote by T_min the tree obtained from the DFS code tree T by pruning every subtree whose root is a non-minimum DFS code.
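The DFS lexicographic order of Section 1.3.2 can be sketched in Python by mapping each 6-tuple to a sort key and comparing the resulting key lists. The sketch rests on several assumptions that are not taken from the original paper: labels are single strings ordered alphabetically, the direction is ordered numerically, and the positional part of the key is one possible encoding of the edge order of Section 1.3.1. All names are hypothetical.

```python
from typing import List, Tuple

# One DFS code entry: (i, j, l_i, l_j, l_ij, d) as in Section 1.3.1.
Entry = Tuple[int, int, str, str, str, int]


def entry_key(entry: Entry):
    """Sort key in which both vertex labels are compared before the edge label."""
    i, j, l_i, l_j, l_ij, d = entry
    # Positional key encoding the edge order: forward edges sort by j and then
    # by larger i first; backward edges sort by i and then by j; a backward
    # edge from i precedes a forward edge into j exactly when i < j.
    position = (j, 0, -i) if i < j else (i, 1, j)
    return (position, l_i, l_j, l_ij, d)


def code_less_or_equal(alpha: List[Entry], beta: List[Entry]) -> bool:
    """Compare two DFS codes; a proper prefix sorts before its extensions."""
    return [entry_key(e) for e in alpha] <= [entry_key(e) for e in beta]
```

For two codes that first differ in an entry at the same position, for example (1, 2, B, C, b, 1) versus (1, 2, B, D, a, 1), this key compares the vertex labels C and D before the edge labels b and a, so the first code is the smaller one.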
The main contributions of Yan’s work are to introduce the DFS code tree T, to prove that (1) T includes all patterns and (2) T_min also includes all patterns, and to propose an algorithm (implemented as the software gSpan) that enumerates the patterns whose support is no less than a given minimum support by a depth-first search of T_min, using the anti-monotonicity of support (namely, the support of a pattern is always smaller than or equal to that of any of its subgraphs).

1.3.4 Reducing search space in DFS code tree using our constraints

In this section we describe methods for reducing the search space when enumerating all patterns that satisfy every condition in Formulation 1. The naive way is to enumerate the patterns in T_min and discard those that do not satisfy condition 2 or 3, but this strategy is very inefficient. In this paper we propose a more efficient method that actively uses conditions 2 and 3 of Formulation 1 to reduce the search space. By the next proposition, the DFS code tree in our problem is much smaller than the original one.
