Mining Patterns from Structured Data by Beam-wise Graph-Based Induction Takashi Matsuda, Hiroshi Motoda, Tetsuya Yoshida and Takashi Washio Institute of Scientific and Industrial Research, Osaka University 8-1, Mihogaoka, Ibaraki, Osaka 567-0047, JAPAN
Abstract. Graph-Based Induction (GBI) extracts typical patterns from graph data by stepwise pair expansion (pairwise chunking). It is very efficient because of its greedy search strategy, but at the same time it suffers from incomplete search. Its search capability is improved without imposing much additional computational complexity by 1) incorporating a beam search, 2) using a different evaluation function to extract patterns that are more discriminatory than those simply occurring frequently, and 3) adopting canonical labeling to enumerate identical patterns accurately. This new algorithm, called Beam-wise GBI, B-GBI for short, was tested against the promoter dataset from the UCI repository and shown to successfully extract discriminatory substructures. The effect of beam width on the number of discovered attributes and on predictive accuracy was evaluated. The best result obtained by this approach was better than the previously best known result. B-GBI was then applied to a real-world dataset, the hepatitis dataset provided by Chiba University. Our very preliminary results indicate that B-GBI can actually handle graphs with a few thousand nodes and extract discriminatory patterns.
1 Introduction
Over the last few years there have been a number of research efforts on data mining seeking better performance. Better performance includes mining from structured data, which is a new challenge, and there have been only a few works on this subject. Since structure is represented by relations and a graph can easily represent relations, knowledge discovery from graph-structured data poses a general problem for mining from structured data. Some examples amenable to graph mining are finding typical web browsing patterns, identifying typical substructures of chemical compounds, finding typical subsequences of DNA, and discovering diagnostic rules from patient history records. The majority of widely used methods are designed for data that do not have structure and are represented by attribute-value pairs. Decision trees [14, 15] and induction rules [12, 3] relate attribute values to target classes. Association rules also use this representation. However, the attribute-value pair representation is not suited to represent more general data structures, and there exist problems that need a more powerful representation.
The most powerful representation that can handle relations, and thus structure, would be inductive logic programming (ILP) [13], which uses first-order predicate logic. It can represent general relationships embedded in data, and has the merit that domain knowledge and acquired knowledge can be utilized as background knowledge. However, its time complexity causes a problem [6] in exchange for its rich expressiveness. A much more efficient approach has recently been proposed that employs version-space analysis and limits the representation to graph fragments, i.e., linearly connected substructures [5]. A variety of constraints can be imposed, such as placing a minimum support for a positive class and a maximum support for a negative class [9]. AGM (Apriori-based Graph Mining) [8] is another recent work that can mine association rules in a given graph dataset. A graph transaction is represented by an adjacency matrix, and the frequent patterns are mined by an extended Apriori algorithm. AGM can extract all connected/disconnected induced subgraphs by complete search. It is reasonably efficient but, in theory, its computation time increases exponentially with input graph size and support threshold. These approaches can use only frequency as their evaluation function.

SUBDUE [4], which is probably the closest to our approach, extracts a subgraph which can best compress an input graph based on the MDL principle. The found substructure can be considered a concept. The algorithm is based on a computationally-constrained beam search. It begins with a substructure comprising a single vertex in the input graph and grows it by incrementally expanding a node in it. At each expansion it evaluates the total description length (DL) of the input graph, and stops when the substructure that minimizes the total description length is found. After the optimal substructure is found and the input graph is rewritten, the next iteration starts using the rewritten graph as a new input. This way, SUBDUE finds a more abstract concept at each round of iteration. Clearly, the algorithm can find only one substructure at each iteration.

Graph-Based Induction (GBI) [18, 10] is a technique which was devised for the purpose of discovering typical patterns in general graph data by recursively chunking two adjoining nodes. It can handle graph data having loops (including self-loops) with colored/uncolored nodes and links, and there can be more than one link between any two nodes. GBI is very efficient because of its greedy search. GBI does not lose any information about the graph structure after chunking, and it can use various evaluation functions insofar as they are based on frequency. It is not, however, well suited to graph-structured data where many nodes share the same label, because of its greedy recursive chunking without backtracking; it is still effective for graph-structured data where each node has a distinct label (e.g., World Wide Web browsing data) or where some typical structures exist even if some nodes share the same labels (e.g., chemical structure data containing benzene rings, etc.). The efficiency of GBI comes from its greedy search at the cost of search incompleteness: it cannot find all the important typical patterns, although our past applications to various domains produced acceptable results [10].

In this paper we first report on the improvements made to enhance the search capability
without sacrificing efficiency too much by 1) incorporating a beam search, 2) using a different evaluation function to extract patterns that are more discriminatory than those simply occurring frequently, and 3) adopting canonical labeling to enumerate identical patterns accurately. This new algorithm is implemented and is called Beam-wise GBI, B-GBI for short. Second, we report on an experiment using the promoter dataset (a small DNA dataset) from the UCI repository and show that the improvements work as intended. The effect of beam width on the number of discovered patterns and on predictive accuracy was evaluated. The best result obtained by this approach was better than the previously best known result [17]. Finally, we report on an initial result of the analysis in which B-GBI was applied to a real-world dataset, the hepatitis dataset provided by Chiba University. Preliminary results indicate that B-GBI can actually handle graphs with thousands of nodes and extract discriminatory patterns.

The paper is organized as follows. Section 2 briefly describes the framework of B-GBI, focusing on the improvements made to GBI. Section 3 shows the experimental results for the promoter dataset, and Section 4 the preliminary results for the hepatitis dataset. Section 5 concludes the paper with a summary of the results and the planned future work.
2 Beam-wise Graph-Based Induction
GBI employs the idea of extracting typical patterns by stepwise pair expansion, as shown in Fig. 1. "Typicality" is characterized by the pattern's frequency or by the value of some evaluation function calculated from the pattern's frequency. In Fig. 1 the shaded pattern consisting of nodes 1, 2 and 3 is considered typical because it occurs three times in the graph. GBI first finds the 1→3 pairs based on their frequency and chunks them into a new node 10, then in the next iteration finds the 2→10 pairs and chunks them into a new node 11. The resulting node represents the shaded pattern.
Fig. 1. The basic idea of the GBI method
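To make pairwise chunking concrete, the following is a minimal sketch of one round of it, not the authors' implementation: graphs are held as a directed edge list with a node-label dictionary, and every occurrence of the most frequent (source-label, target-label) pair is merged into a fresh chunk node. All names and the data layout are our own illustration.

```python
from collections import Counter

def chunk_most_frequent_pair(edges, labels, next_id):
    """One round of pairwise chunking: find the most frequent
    (source-label, target-label) pair among connected nodes and
    merge each occurrence into a new node with a fresh chunk label.

    edges   : list of (src, dst) node ids (directed links)
    labels  : dict node id -> label string (mutated in place)
    next_id : first unused node id for newly created chunk nodes
    """
    pair_counts = Counter((labels[s], labels[d]) for s, d in edges)
    if not pair_counts:
        return edges, labels, next_id
    best_pair, _ = pair_counts.most_common(1)[0]
    new_label = "chunk(%s->%s)" % best_pair

    merged, kept = {}, []
    for s, d in edges:
        # Merge this link if it matches the best pair and neither
        # endpoint was already consumed by another occurrence.
        if (labels[s], labels[d]) == best_pair and s not in merged and d not in merged:
            merged[s] = merged[d] = next_id
            labels[next_id] = new_label
            next_id += 1
        else:
            kept.append((s, d))
    # Rewire surviving links so they attach to the new chunk nodes.
    new_edges = [(merged.get(s, s), merged.get(d, d)) for s, d in kept]
    return new_edges, labels, next_id
```

Repeating this step, so that chunk labels themselves participate in later pairs, grows patterns exactly as node 10 grows into node 11 in Fig. 1.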
It is possible to extract typical patterns of various sizes by repeating the stepwise pair expansion (pairwise chunking). Note that the search is greedy and no backtracking is made. This means that, in enumerating pairs, any pattern which has once been chunked into one node is never restored to the original pattern. Because of this, not all the "typical patterns" that exist in the input graph are necessarily extracted. The problem of extracting all isomorphic subgraphs is known to be NP-complete. Thus, GBI aims at extracting only meaningful typical patterns of a certain size or less. Its objective is neither finding all typical patterns nor finding all frequent patterns.

As described earlier, GBI can use any criterion that is based on the frequency of paired nodes (or patterns). However, because of the nature of repeated chunking, for a pattern to be found its subpatterns must also be found. In Fig. 1 the pattern 1→3 must be typical for the pattern 2→10 to be typical; said differently, unless pattern 1→3 is chunked, there is no way of finding the pattern 2→10. The frequency measure satisfies this monotonicity. However, if the chosen typicality criterion does not satisfy this monotonicity, repeated chunking may not find good patterns even though the best pair based on the criterion is selected at each iteration. This motivated us to improve GBI by allowing the use of two criteria: a frequency measure for chunking, and a measure for finding typical (e.g., discriminatory) patterns after chunking. The latter criterion need not hold the monotonicity property (frequency itself can be used as the typicality measure, in which case the two criteria coincide). Any function that is discriminatory can be used, such as Information Gain [14], Gain Ratio [15] and the Gini index [2], among others (e.g., the one used in Section 4), all of which are based on frequency.

When each node has a distinct label in the input graph, no ambiguity arises in selecting a pair to be chunked and GBI performs well. However, since the search in GBI is greedy, when the same label is shared by several nodes, ambiguity arises when there are ties in the frequency or when there is a chain of nodes of the same label. For example, for a structure like a → a → a, we do not know which a → a is best to chunk. To relax this ambiguity, a beam search is incorporated into GBI within the framework of greedy search. A fixed number of pairs ranked from the top are allowed to be chunked individually in parallel. To prevent each branch from growing exponentially, the total number of pairs to chunk is fixed at each level of branching. Thus, at any iteration step, a fixed number of chunkings is performed in parallel. The new stepwise pair expansion repeats the following four steps (a code sketch follows Fig. 2).

Step 1: Extract all pairs consisting of two connected nodes in all graphs.

Step 2a: Select all typical pairs based on the typicality criterion from among the pairs extracted in Step 1, rank them according to the criterion and register them as typical patterns. If either or both nodes of the selected pairs have already been rewritten (chunked), they are restored to the original patterns before registration.

Step 2b: Select, from among the pairs extracted in Step 1, a fixed number of frequent pairs from the top and register them as the candidate patterns to chunk. If either or both nodes of the selected pairs have already been rewritten (chunked), they are restored to the original patterns before registration. Stop when there is no more pattern to chunk (i.e., there is no pair whose frequency is above an input-specified threshold).

Step 3: Replace each of the pairs selected in Step 2b with one node and assign a new label to it. Delete a graph for which no pair is selected and branch (copy) a graph for which more than one pair is selected. Rewrite each remaining graph by replacing all occurrences of the selected pair with a node carrying the newly assigned label. Go back to Step 1.

An example of the state transition of B-GBI with beam width 5 is shown in Fig. 2. The initial condition is the single state cs. All pairs in cs are enumerated and ranked according to both the frequency measure and the typicality measure. The top 5 pairs according to the frequency measure are selected, and each of them is used as a pattern to chunk, branching into 5 children c11, c12, ..., c15, each rewritten by the chunked pair. All pairs within these 5 states are enumerated and ranked according to the two measures, and again the top 5 pairs according to the frequency measure are selected. The state c11 is split into two states c21 and c22 because two of its pairs are selected, while the state c12 is deleted because none of its pairs is selected. This is repeated until the stopping condition is satisfied. The increase in the search space improves the pattern extraction capability of GBI. The output of B-GBI is the set of ranked typical patterns extracted at Step 2a. These patterns are typical in the sense that they are more discriminatory than non-selected patterns in terms of the typicality criterion used.
Fig. 2. An Example of State Transition of B-GBI when the beam width = 5
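The loop behind the state transition of Fig. 2 can be sketched as follows. This is schematic: `enumerate_pairs`, `frequency`, `typicality` and `chunk` are caller-supplied stand-ins for the operations described in Steps 1-3 above; the point of the sketch is only the beam bookkeeping, i.e., selecting a fixed number of chunkings per level over the whole beam.

```python
def beam_gbi(initial_state, beam_width, min_freq,
             enumerate_pairs, frequency, typicality, chunk):
    """Beam-wise stepwise pair expansion (schematic).
    A 'state' is one rewritten copy of all input graphs; the four
    helper functions are assumed to be supplied by the caller."""
    beam = [initial_state]
    typical_patterns = []
    while True:
        # Step 1: enumerate connected pairs in every state of the beam.
        scored = [(pair, state) for state in beam
                  for pair in enumerate_pairs(state)]
        # Step 2a: register pairs ranked by the typicality criterion.
        typical_patterns.extend(sorted(
            (p for p, _ in scored), key=typicality, reverse=True))
        # Step 2b: pick the top pairs by frequency over the *whole*
        # beam, so the number of parallel chunkings stays fixed.
        frequent = [(p, s) for p, s in scored if frequency(p) >= min_freq]
        if not frequent:
            break  # stopping condition: nothing above the threshold
        selected = sorted(frequent, key=lambda ps: frequency(ps[0]),
                          reverse=True)[:beam_width]
        # Step 3: a state branches if several of its pairs are selected
        # and dies if none is; each selected pair yields one child.
        beam = [chunk(state, pair) for pair, state in selected]
    return typical_patterns
```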
Fig. 3. Two Different Pairs Representing Identical Pattern
Another improvement made in conjunction with B-GBI is canonical labeling. GBI assigns a new label to each newly chunked pair. Because pairs are chunked recursively, it can happen that pairs with different labels represent the same pattern (subgraph). A simple example is shown in Fig. 3: the two pairs represent the same pattern, but the way they are constructed is different.
To identify whether two pairs represent the same pattern, each pair is represented by its canonical label [16, 7], and only when the labels are the same are the pairs regarded as identical. A graph can be represented by an adjacency matrix, and an adjacency matrix can be mapped into a corresponding code (see below). The same graph with a different node numbering results in a different adjacency matrix (note that node numbers are different from node labels); thus, the same graph can have different codes. The canonical label is the code which takes the maximum (or minimum) value for the graph. Thus, if the canonical labels of two graphs are the same, the graphs are ensured to be isomorphic.

The basic procedure of canonical labeling is as follows. Nodes in the graph are grouped according to their labels (node colors) and their degrees (number of links attached to the node) and ordered lexicographically. Then an adjacency matrix is created using this node ordering. When the graph is symmetric (undirected), the upper triangular elements are concatenated, scanning either horizontally or vertically, to codify the graph. When the graph is asymmetric (directed), all the elements in both triangles are used to codify the graph in a similar way. If there is more than one node with an identical node label and identical degree, the ordering which results in the maximum (or minimum) value of the code is searched for. Let M be the number of nodes in a graph, N the number of groups of nodes, and $p_i$ $(i = 1, 2, \ldots, N)$ the number of nodes within group i. By using the above node grouping, the search space is reduced from $M!$ to $\prod_{i=1}^{N}(p_i!)$. The code of an adjacency matrix, for the case in which the elements in the upper triangle are vertically concatenated, is defined as
$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ & a_{22} & \cdots & a_{2n} \\ & & \ddots & \vdots \\ & & & a_{nn} \end{pmatrix} \qquad (1)$$

$$\mathrm{code}(A) = a_{11}\,a_{12}\,a_{22}\,a_{13}\,a_{23}\cdots a_{nn} = \sum_{j=1}^{n}\sum_{i=1}^{j} a_{ij}\,(L+1)^{\left(\sum_{k=j+1}^{n} k\right) + j - i} \qquad (2)$$

Here $L$ is the number of different link labels.
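As a sanity check on Eq. (2), the code value can be computed by reading the upper-triangular elements column by column as digits of a base-(L+1) number. A small sketch, assuming link labels are encoded as integers 0..L with 0 meaning no link:

```python
def code_value(A, L):
    """Eq. (2): read the upper triangle of adjacency matrix A,
    concatenated column by column (a11 a12 a22 a13 a23 ... ann),
    as one number in base L+1, L being the number of link labels
    (entry 0 means 'no link', 1..L are link labels)."""
    n = len(A)
    digits = [A[i][j] for j in range(n) for i in range(j + 1)]
    value = 0
    for d in digits:            # Horner's rule over base-(L+1) digits
        value = value * (L + 1) + d
    return value
```

Two different numberings of the same graph generally give different values, which is why the maximum (or minimum) over the admissible orderings is taken as the canonical label.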
It is possible to further prune the search space. We choose the option of vertical concatenation. The elements of the adjacency matrix for higher-ranked nodes form the higher-order digits of the code. Thus, once the locations of the higher-ranked nodes in the adjacency matrix are fixed, the corresponding higher-order digits of the code are also fixed and are not affected by the ordering of the lower-ranked nodes. For example, in Eq. 1 the first three elements of code(A) are determined by the first two ranked nodes and are not affected by the nodes of lower ranks. Since each group can thus be ordered independently of the groups below it, the search space is reduced from $\prod_{i=1}^{N}(p_i!)$ to $\sum_{i=1}^{N}(p_i!)$. However, there still remains a problem of combinatorial explosion when there are many nodes with the same label and the same degree, as in the case of chemical compounds, because the value of $p_i$ becomes large.

We can make the best use of the already determined higher-ranked nodes. Assume that the nodes $v_i \in V(G)$ $(i = 1, 2, \ldots, N)$ have already been determined in a graph G, and consider finding the order of the nodes $u_i \in V(G)$ $(i = 1, 2, \ldots, k)$ within the same group that gives the maximum code value. The node that comes to position $v_{N+1}$ is the one among $u_i$ $(i = 1, \ldots, k)$ that has a link to the node $v_1$, because the highest digit that $v_{N+1}$ can contribute is $a_{1,N+1}$, and the node that makes this element non-zero, i.e., the node linked to $v_1$, gives the maximum code. If there is more than one such node, or none at all, the one that has a link to $v_2$ comes to $v_{N+1}$, and so on; repeating this process determines which node comes to $v_{N+1}$. If no node can be determined even after the last comparison at $v_N$, permutation within the group is needed. This is explained using the example in Fig. 4. Assume that nodes 1, 2 and 3 have already been determined, and nodes 4, 5 and 6 are in the same group. The fourth node is the one that has a link to the highest-ranked node 1, which is node 4. Likewise, the fifth and sixth nodes are nodes 5 and 6, respectively. In this case the node ordering is uniquely determined. If $l$ nodes can be determined by this procedure, the search space is reduced from $k!$ to $(k - l)!$.
Fig. 4. Determination of Node Ordering within a Group
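The ordering rule of Fig. 4 can be sketched as follows, with our own representation (adjacency as neighbour sets): candidates are narrowed rank by rank against the already-fixed nodes, and only what remains ambiguous needs explicit permutation, reducing k! to (k − l)!.

```python
def order_group(fixed, group, adj):
    """Greedily extend the fixed ordering `fixed` with the members of
    one group (same label and same degree).  adj[u] is the set of
    neighbours of node u.  Returns (extended ordering, still-ambiguous
    nodes); the latter must be resolved by explicit permutation."""
    ordering, remaining = list(fixed), set(group)
    while remaining:
        candidates = list(remaining)
        for v in ordering:                    # v1, v2, ... in rank order
            linked = [u for u in candidates if v in adj[u]]
            if len(linked) == 1:              # uniquely determined slot
                candidates = linked
                break
            if len(linked) > 1:               # keep only the tied nodes
                candidates = linked
        if len(candidates) != 1:
            break                             # ambiguous: permute the rest
        ordering.append(candidates[0])
        remaining.discard(candidates[0])
    return ordering, remaining
```

On the example of Fig. 4 (fixed nodes 1, 2, 3; group {4, 5, 6} with 4 linked to 1, 5 to 2 and 6 to 3), the loop fixes 4, then 5, then 6, and the ambiguous remainder is empty.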
3 Experimental Evaluation of B-GBI
The proposed method is tested against the promoter dataset in the UCI Machine Learning Repository [1]. A promoter is a genetic region which initiates the first step in the expression of an adjacent gene (transcription). The promoter dataset consists of strings that represent nucleotides (one of A, G, T or C). The input features are 57 sequential DNA nucleotides and the total number of instances is 106, including 53 positive instances (sample promoter sequences) and 53 negative instances (non-promoter sequences). This dataset was explained and analyzed in [17]. The data are prepared so that each sequence of nucleotides is aligned at a reference point, which makes it possible to assign the n-th attribute to the n-th nucleotide in an attribute-value representation. In a sense, this dataset is encoded using domain knowledge. This is confirmed by the following experiment: running C4.5 [15] gives a prediction error of 16.0% by leave-one-out cross validation, whereas randomly shifting the sequence by 3 elements gives 21.7%, and by 5 elements 44.3%. If the data are not properly aligned, standard classifiers such as C4.5 that use an attribute-value representation cannot solve this problem. One of the advantages of the graph representation is that it does not require the data to be aligned at a reference point.
In this paper, each sequence is converted to a graph representation assuming that each element interacts with up to 10 elements on each side (see Fig. 5). Each sequence thus results in a graph with 57 nodes and 515 links. The minimum support for chunking is set at 20% and the Gini index is used as the criterion to select typical patterns. Using the Gini index also as the measure for selecting pairs to chunk would not work, because almost all pairs exist in every instance and there are no discriminatory pairs at an early stage of chunking.
Fig. 5. Conversion of DNA Sequence Data to a graph
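A sketch of this conversion under the stated assumption (nucleotides become nodes; a directed link labeled with the positional distance d connects every pair of elements at most 10 apart); the function name and data layout are ours. For a 57-element sequence it produces exactly the 57 nodes and 515 links quoted above.

```python
def sequence_to_graph(seq, max_dist=10):
    """Convert a DNA string such as "atgc..." into (nodes, links):
    nodes[i] is the nucleotide at position i, and a link (i, j, d)
    connects positions i < j whenever d = j - i <= max_dist."""
    nodes = list(seq)
    links = [(i, j, j - i)
             for i in range(len(seq))
             for j in range(i + 1, min(i + max_dist, len(seq) - 1) + 1)]
    return nodes, links

nodes, links = sequence_to_graph("a" * 57)
assert len(nodes) == 57 and len(links) == 515   # matches the text
```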
First, the effect of employing canonical labeling was investigated using artificial datasets. The average number of links per node is fixed at 3 and links are randomly generated; the number of nodes is varied from 10 to 50. The first dataset has 3 labels for both nodes and links, and the second dataset has a single label for both (i.e., effectively unlabeled). The beam width was set at 1. Figure 6 shows how the number of extracted patterns differs with and without canonical labeling, and what fraction of the computation time is spent on canonical labeling. In theory the computational complexity of GBI without canonical labeling is quadratic in the size of a graph, and in practice it is linear [11]. For the first dataset there is no difference in the number of patterns, and the computation time for canonical labeling is negligible. For the second dataset there is a substantial difference in the number of patterns, but the computation time required for canonical labeling is still small. The promoter dataset has 4 node labels and 10 link labels and is thus closer to the first dataset, so the effect of canonical labeling is probably small; nevertheless, this option is used in the following analysis of the promoter dataset. Figure 7 shows how the computation time and the number of extracted typical patterns increase as the beam width is increased. Computation time increases almost linearly with the beam width, as projected. Note that the number of patterns does not necessarily increase monotonically; this depends on how states branch as chunking proceeds. To evaluate how discriminatory these patterns are, they are treated as binary attributes of each sequence to build a classifier by C4.5 [15].
Fig. 6. Effect of canonical labeling on the number of extracted patterns and the computation time for artificial graphs of different size. (No. of links per node is set to 3; Ln: no. of node labels, Ll: no. of link labels.)
Fig. 7. Effect of beam width on computation time and the number of extracted patterns for the promoter dataset
In [17], C4.5Rule was used instead of C4.5 and the accuracy was averaged over 10 trials of 10-fold cross validation. Further, 10 trials of windowing were used on the training dataset (9 folds of the whole dataset) to induce 10 rule sets, from which an optimized rule set was constructed. The details of what fraction of the data is used for each trial of windowing and what kind of optimization is performed are not given in the paper, so we were not able to reproduce the same results. In our experiment, windowing did not give any advantage and cross validation is performed only once. Figure 8 shows how the prediction error rate changes with the number of folds for various algorithms. C4.5Rule gave a 9.4% error for leave-one-out cross validation, which is consistent with [17].
Fig. 8. Effect of the number of folds of cross validation on prediction error rate for the promoter dataset (Beam width=2, top 50 patterns)
As Fig. 7 indicates, the number of patterns extracted by B-GBI is very large whereas the number of data instances is small. Thus, blindly using all of them as input attributes to C4.5 is impractical. Since the extracted patterns are ranked by typicality (selected by the Gini index), only the first n ranked attributes (n = 5, 10, 20, 30, 40, 50) are used. The beam width is varied from 1 to 25, and the best result was searched for within this parameter space. It was obtained with n = 40, beam width = 2 and leave-one-out cross validation. The results are summarized in Table 1. The best prediction error rate is 2.8%, which is better than the 3.8% obtained by KBANN using the M-of-N expression [17]. Since KBANN uses domain knowledge to configure an artificial neural network, refines its structure by training and then extracts rules, it is worth mentioning that C4.5Rule combined with B-GBI, which does not use any domain knowledge, induced rules that predict better.
Table 1. Summary of prediction error rate (%) for the promoter dataset (columns: number of folds of cross validation; lvo = leave-one-out)

Method            Output   10     30     50     lvo
C4.5              tree     24.4   18.9   16.7   16.0
C4.5              rules    18.1   14.2   12.0    9.4
C4.5w             tree     18.6   15.0   14.3   14.2
C4.5w             rules    12.3   15.3   14.7   11.3
ID3               tree     25.3   21.9   18.0   17.0
ID3               rules    16.7   12.2   15.3   12.3
GBI+C4.5 (n=20)   tree      8.6   12.2   10.0   11.3
GBI+C4.5 (n=20)   rules     8.6   10.0   10.0    9.4
GBI+C4.5 (n=40)   tree      7.5    3.9    4.7    2.8
GBI+C4.5 (n=40)   rules     7.5    3.9    4.7    2.8
GBI+C4.5 (n=50)   tree      6.5    4.7    4.7    4.7
GBI+C4.5 (n=50)   rules     6.5    4.7    4.7    4.7
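The attribute construction behind Table 1 amounts to a presence/absence matrix over the top-n ranked patterns. A minimal sketch, where `contains` is a caller-supplied (hypothetical) subgraph-occurrence test:

```python
def pattern_attributes(graphs, ranked_patterns, n, contains):
    """Turn the top-n typical patterns into binary attributes:
    row g, column p is 1 iff pattern p occurs in graph g.
    `contains(graph, pattern)` is a caller-supplied subgraph test."""
    top = ranked_patterns[:n]
    return [[1 if contains(g, p) else 0 for p in top] for g in graphs]
```

Each row, paired with its class label, can then be handed to any attribute-value learner such as C4.5.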
However, there are some problems in using all the top ranked patterns in C4.5, as Fig. 9 shows. For a typicality function like the Gini index, which is not monotonic over chunking steps, it is not necessarily true that a pattern extracted for one beam width is always included in the patterns extracted for a larger beam width. Further, as the beam width increases there are more patterns that are closely related to each other (similar patterns), so some of the top ranked patterns end up correlated with each other. There is no mechanism in B-GBI to extract patterns that are independent of each other, although each extracted pattern is good in terms of the typicality measure. This partly explains the strange behavior of the error curves versus beam width. The rules in Fig. 10 correspond to the best result. Unfortunately the patterns in the antecedents do not match the domain theory reported in [17]. Finding good rules (in terms of predictive performance) does not necessarily mean being able to identify the correct domain knowledge.
4 Application to the Analysis of Hepatitis Dataset
After confirming the basic performance of B-GBI, we have attempted to analyze the hepatitis dataset that was provided by Chiba University. The results shown in this paper are very preliminary. The purpose here is to show the potential utility of B-GBI on real-world data that has some structure.

4.1 Dataset used
The dataset contains long time-series data (about 20 years, from 1982 to 2001) on laboratory examinations of 771 patients of hepatitis B and C. The data can be broadly split into two categories. The first category includes administrative information such as patient information (age and date of birth), pathological classification of the disease, date of biopsy, result of biopsy, and duration of interferon therapy.
Fig. 9. Effect of beam width on prediction error rate of C4.5Rule using patterns found by GBI (LVO) for promoter dataset
If T?T???????T?A = y then Promoter
If C??????T???T?A = y then Promoter
If ATTT = y then Promoter
If C?AA = y then Promoter
If T?A???G?A = y then Promoter
If A?AT?A = y then Promoter
If T?T???????T?A = n
   A?T???T??????C = n
   ATTT = n
   C?AA = n
   A?AT?A = n
   T?A???G?A = n
then non-Promoter

Fig. 10. Classification rules induced by C4.5Rule using B-GBI generated patterns as binary attributes
The second category includes temporal records of blood tests and urinalysis. It can be further split into two subcategories: in-hospital and out-hospital examination data. In-hospital examination data contain the results of 230 examinations that were performed using the hospital's own equipment. Out-hospital examination data contain the results of 753 examinations, including comments of the staff, performed using special equipment at outside facilities. Consequently, the temporal data contain the results of 983 types of examinations. These were given in 6 different tables. The goal is broad, but our first attempt focuses on finding typical data patterns, if any, that are strongly correlated with fibrosis (progress of fibrosis, discrete values F0 (mild) to F4 (severe)) and activity (activity of the virus, discrete values A1 (mild) to A3 (severe)). The biopsy data are not measured for all patients, and limiting the analysis to non-acute hepatitis B resulted in 116 patients.

4.2 Preprocessing and Data Conversion to Graphs
Monthly data are averaged and a new, reduced dataset is generated, because the date of visit is not synchronized across patients and the progress of hepatitis is considered slow. The numerical average is taken for numeric attributes and the most frequent value is used for nominal attributes over each 28-day interval. Numeric values are further discretized into three intervals (low, normal and high) when the normal range is given. If there are no data in a 28-day interval, they are treated as missing values and no attempt is made to estimate them; a missing value simply means that there is no corresponding node in the graph. One patient record is mapped into one colored directed graph. The assumption is made that there is no direct correlation between two sets of measurements that are more than two years apart; thus, time correlation is considered only within an interval of two years. Figure 11 shows an example of a graph for a particular patient when the activity is taken as the class. The center of each star is a hypothetical node for a particular visit, and this node is connected to every later hypothetical visit node up to 700 days later.
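A minimal sketch of this preprocessing for one numeric examination of one patient, assuming records arrive as (day, value) pairs and a known normal range; the 28-day averaging and the low/normal/high discretization follow the description above, and windows without data stay missing.

```python
def preprocess(records, t0, n_windows, low, high):
    """records: (day, value) pairs for one examination of one patient;
    t0: first day of the observation period; windows are 28 days.
    Returns 'low'/'normal'/'high' per window, or None when missing
    (no node is created in the graph for a missing window)."""
    sums, counts = [0.0] * n_windows, [0] * n_windows
    for day, value in records:
        w = (day - t0) // 28
        if 0 <= w < n_windows:
            sums[w] += value
            counts[w] += 1
    out = []
    for s, c in zip(sums, counts):
        if c == 0:
            out.append(None)                 # missing value
        else:
            mean = s / c                     # average over the window
            out.append('low' if mean < low else
                       'high' if mean > high else 'normal')
    return out
```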
Fig. 11. An example of graph when the activity is taken as a class
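A sketch of the graph construction of Fig. 11, under our own encoding: one hypothetical visit node per 28-day window, one node per recorded attribute value hanging off it, and links between visit nodes at most 700 days apart labeled with the gap in months.

```python
def patient_graph(visits, max_gap_days=700):
    """visits: list of (day, {attribute: value}) per 28-day window.
    Builds a star per visit (visit node -> one node per attribute
    value) and connects each visit to every later visit within
    max_gap_days, labelling the link with the time gap."""
    nodes, links, visit_ids = [], [], []
    for day, attrs in visits:
        v = len(nodes)
        nodes.append(('visit', day))
        visit_ids.append((day, v))
        for name, value in attrs.items():
            a = len(nodes)
            nodes.append((name, value))
            links.append((v, a, name))        # visit -> attribute value
    for i, (di, vi) in enumerate(visit_ids):
        for dj, vj in visit_ids[i + 1:]:
            gap = dj - di
            if gap <= max_gap_days:           # two-year correlation window
                links.append((vi, vj, f"{gap // 28} months"))
    return nodes, links
```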
Eq. 3 was used as the typicality evaluation function. Here, $n_{C_k}$ denotes the number of instances of class $C_k$ that contain the typical pattern in question, and $N_{C_k}$ is the total number of instances of class $C_k$. This function takes the value $1/K$ when the pattern is irrelevant and the class distribution does not change.

$$\max\left(\frac{n_{C_1}/N_{C_1}}{\sum_{k=1}^{K} n_{C_k}/N_{C_k}},\; \frac{n_{C_2}/N_{C_2}}{\sum_{k=1}^{K} n_{C_k}/N_{C_k}},\; \ldots,\; \frac{n_{C_K}/N_{C_K}}{\sum_{k=1}^{K} n_{C_k}/N_{C_k}}\right) \qquad (3)$$
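A direct transcription of Eq. (3): the pattern's normalized support in each class, divided by the sum over all classes, maximized over classes. The check below reproduces the evaluation value 0.905 reported for the first pattern of Fig. 12 (class counts A1 = 1, A2 = 1, A3 = 4; class sizes from Table 2).

```python
def typicality(n, N):
    """Eq. (3): n[k] = instances of class k containing the pattern,
    N[k] = total instances of class k.  Equals 1/K for a pattern
    that leaves the class distribution unchanged."""
    ratios = [nk / Nk for nk, Nk in zip(n, N)]
    total = sum(ratios)
    return max(r / total for r in ratios)

# First pattern of Fig. 12: A1=1, A2=1, A3=4 out of 51/54/11 patients.
assert abs(typicality([1, 1, 4], [51, 54, 11]) - 0.905) < 5e-4
```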
The beam width was set at 3. Tables 2 and 3 show the size of the generated graphs and the number of extracted patterns for each threshold. The maximum graph size is over 5,000 nodes, and a huge number of patterns (more than 100,000) is extracted. A threshold of 1/(number of classes) indicates no discriminatory power (the same as the default distribution); thus a threshold of 0.8 indicates that the extracted patterns are highly discriminatory. The computation time for the threshold 0.4 is 16 hours on a PC with an Athlon MP 1600+ CPU and 1 GB of main memory.
Table 2. Size of the graphs (class = Activity)

Class                A1     A2     A3     All
No. of graphs        51     54     11     116
Avg. no. of nodes    1975   1758   1776   1855
Max. no. of nodes    6723   6688   4572   6723
Min. no. of nodes    63     53     214    53
Table 3. Number of extracted patterns (class = Activity)

Threshold   No. of patterns
0.4         192826
0.5         172437
0.6         140991
0.7         117628
0.8         105426
A few of the top ranked patterns are shown for each class in Figs. 12 and 13. It turns out that many of the highly discriminatory rules consist of patterns of one-shot measurements, which is contrary to our expectation; we show only two of them here. There are not many patterns that involve the time evolution of measurements; we show some of them as well. However, as the number of patients in each class whose measurements satisfy these patterns indicates, these patterns are highly discriminatory (see the initial distribution in Table 2 for class = activity; the initial distribution for class = fibrosis is F1 = 49, F2 = 33, F3 = 26, F4 = 12, where F0 was deleted from the analysis because there is only one such patient). We showed these results to the medical doctor who provided us with the data. He was not used to seeing the measurements this way and had difficulty interpreting the patterns, but most of the patterns were interpretable and so far there were no surprises.
5 Conclusion
Graph-Based Induction (GBI) is improved in three aspects by incorporating: 1) two criteria, one for chunking and the other a task-specific criterion to extract more discriminatory patterns, 2) beam search to enhance the search capability, and 3) canonical labeling to accurately count identical patterns.
Fig. 12. Example of extracted patterns for class = activity. (Abbreviations: inf = interferon therapy; GPT = glutamic-pyruvic transaminase; I-BIL = indirect bilirubin; LAP = leucine aminopeptidase; T-BIL = total bilirubin; T-CHO = total cholesterol; TP = total protein; UA = uric acid; UN = blood urea nitrogen; D-BIL = direct bilirubin; G.GL = gamma-globulin.)
Fig. 13. Example of extracted patterns for class = fibrosis
The improved algorithm, B-GBI, was tested against a classification problem of small DNA promoter sequences and the effect of the improvements was analyzed. B-GBI was able to extract discriminatory patterns, and using them as binary attributes C4.5Rule induced rules that outperformed the best known result without using any domain knowledge. With this confirmation, B-GBI was applied to real-world hepatitis data to search for measurement patterns that are strongly correlated with fibrosis of the liver and virus activity. Although the results are still too preliminary to discuss the value of the discovered patterns, we believe that B-GBI can actually handle graphs with a few thousand nodes and extract discriminatory patterns.
Immediate future work includes 1) evaluating the effect of each improvement individually in a more systematic way, 2) using feature selection methods to filter out less useful and redundant patterns, and 3) continuing to analyze the hepatitis data in close interaction with medical experts.
Acknowledgement. This work was partially supported by the grant-in-aid for scientific research on priority area "Active Mining" funded by the Japanese Ministry of Education, Culture, Sports, Science and Technology. Special thanks are due to Chiba University for providing us with the hepatitis dataset.
References

1. C. L. Blake, E. Keogh, and C. J. Merz. UCI repository of machine learning databases, 1998. http://www.ics.uci.edu/∼mlearn/MLRepository.html.
2. L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees. Wadsworth & Brooks/Cole Advanced Books & Software, 1984.
3. P. Clark and T. Niblett. The CN2 induction algorithm. Machine Learning, 3:261–283, 1989.
4. D. J. Cook and L. B. Holder. Graph-based data mining. IEEE Intelligent Systems, 15(2):32–41, 2000.
5. L. De Raedt and S. Kramer. The levelwise version space algorithm and its application to molecular fragment finding. In Proc. of the 17th International Joint Conference on Artificial Intelligence, pages 853–859, 2001.
6. L. Dehaspe, H. Toivonen, and R. D. King. Finding frequent substructures in chemical compounds. In Proc. of the 4th International Conference on Knowledge Discovery and Data Mining, pages 30–36, 1998.
7. S. Fortin. The graph isomorphism problem. Technical report, University of Alberta, 1996.
8. A. Inokuchi, T. Washio, and H. Motoda. An apriori-based algorithm for mining frequent substructures from graph data. In Proc. of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 13–23, 2000.
9. S. Kramer, L. De Raedt, and C. Helma. Molecular feature mining in HIV data. In Proc. of the 7th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 136–143, 2001.
10. T. Matsuda, T. Horiuchi, H. Motoda, and T. Washio. Extension of graph-based induction for general graph structured data. In Knowledge Discovery and Data Mining: Current Issues and New Applications, LNAI 1805, Springer Verlag, pages 420–431, 2000.
11. T. Matsuda, H. Motoda, and T. Washio. Graph-based induction and its applications. Advanced Engineering Informatics, 16(2):135–143, 2002.
12. R. S. Michalski. Learning flexible concepts: Fundamental ideas and a method based on two-tiered representation. In Machine Learning, An Artificial Intelligence Approach, 3:63–102, 1990.
13. S. Muggleton and L. De Raedt. Inductive logic programming: Theory and methods. Journal of Logic Programming, 19(20):629–679, 1994.
14. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81–106, 1986.
15. J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers, 1993.
16. R. C. Read and D. G. Corneil. The graph isomorphism disease. Journal of Graph Theory, 1:339–363, 1977.
17. G. G. Towell and J. W. Shavlik. Extracting refined rules from knowledge-based neural networks. Machine Learning, 13:71–101, 1993.
18. K. Yoshida and H. Motoda. CLIP: Concept learning from inference patterns. Artificial Intelligence, 75(1):63–92, 1995.