A Review of Some Bayesian Belief Network Structure Learning Algorithms

Sangeeta Mittal
Department of CS & IT, JIIT, Noida, UP, India
[email protected]

S.L. Maskara
Flat 2W, Block-G, Soura Niloy Housing Complex, Kolkata, India
[email protected]

Abstract— Bayesian Belief Networks (BBNs) are useful in modeling complex situations. Such graphical models help in giving better insight into and understanding of the situation. Many algorithms for machine learning of BBN structures have been developed. In this paper, six different algorithms are reviewed by using them to construct BBN structures for two different datasets. Some inferences, which may help in decision making, have been drawn from the results of the study.

Keywords— Bayesian Belief Network, Structure Learning, Urinary System Diseases, Hepatitis Domain Diseases

I. INTRODUCTION

A Bayesian Belief Network (BBN) is a quantitative graphical representation of a problem domain with probabilistic relationships among its variables [1]. BBNs have a variety of applications in machine learning and artificial intelligence, particularly in computational biology and bioinformatics, medical expert systems, image processing, information retrieval, decision support systems, natural language interpretation, planning, computer vision, robotics and data mining [2][3]. In this paper, some BBN construction algorithms are presented, and two diseases are taken as examples to illustrate how BBNs are constructed. A real-life practical problem can be characterized as a set of many attributes, each represented by a random variable with a large range. These attributes are taken as nodes in a BBN. In most practical situations the nodes have interrelationships, which can be described in terms of conditional probabilities. The large number of nodes, the large uncertain datasets associated with them and the probabilistic interrelationships make the construction of a BBN for a given situation a challenging task. To enable machine construction of BBNs for complex practical problems, a large number of learning algorithms have been developed over the years [14]. These algorithms can be grouped broadly into two categories: constraint based and score based. Of course, one can combine the approaches of the two categories, resulting in a hybrid technique. In this paper, the BBNs constructed using six algorithms drawn from both categories are presented, with a view to assessing their efficacy. Available datasets for two different diseases have been used to construct the BBNs.

The six structure learning algorithms for constructing BBNs are briefly described in the second section. The third section presents the actual construction of the BBN structures, and also includes the performance measures used to assess the efficacy of the algorithms. Based on the BBN graphs obtained in the third section, the performance measures of the algorithms are computed and reported in the fourth section, where inferences on their efficacy are also discussed. Some conclusions are drawn in the last section.

II. STRUCTURE LEARNING ALGORITHMS FOR BBNS

Bayesian Belief Networks are Directed Acyclic Graphs (DAGs) in which each node represents a random variable and each arc represents a direct probabilistic dependence between two variables [4][5]. A BBN encodes a joint probability distribution over its variables as the product of the local distributions of each node given its parents. The DAG captures the structure of dependencies between the nodes and constitutes the qualitative part of the BBN. Quantification consists of prior probability distributions over the variables that have no predecessors in the network and conditional probability distributions over the variables that do. These probabilities can be calculated easily from available data using maximum likelihood estimation; when no data are available, expert judgment can be used. In particular, a domain in which no independencies exist is represented correctly only by a BBN that is a complete graph.
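For reference, the joint distribution encoded by a BBN over variables $X_1, \ldots, X_n$ factorizes as

$$P(X_1, \ldots, X_n) = \prod_{i=1}^{n} P\bigl(X_i \mid Pa(X_i)\bigr)$$

where $Pa(X_i)$ denotes the set of parents of $X_i$ in the DAG.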

A. Description of various BBN Structure Learning Algorithms

Given data, several algorithms for learning the structure of a BBN that models the problem have been proposed [14]. These can be subdivided into two general approaches: methods based on conditional independence tests, and methods based on a scoring function and a search procedure. There are also hybrid algorithms that use a combination of independence-based and scoring-based methods. The algorithms based on independence tests perform a qualitative study of the dependence and independence relationships between the variables in the domain, and attempt to find a network that represents these relationships as far as possible; some algorithms based on this approach can be found in [8][11][12]. The algorithms based on a scoring function attempt to find a graph that maximizes the selected score, where the scoring function is usually defined as a measure of fit between the graph and the data. The scoring function is combined with a search method that explores the space of feasible solutions; during the exploration, the scoring function is applied to evaluate the fitness of each candidate structure. Each algorithm is characterized by the specific scoring function and search procedure used. Scoring functions are based on different principles, such as entropy, Bayesian approaches or the Minimum Description Length (MDL) [6][7]. Out of the several algorithms in each category, some representative samples of the different approaches have been used here. They are described below.

PC Algorithm [4]: Named after its inventors, Peter (Spirtes) and Clark (Glymour) [4], it is one of the most common algorithms in the independence-test category. It tests I(Xi, Xj | S), the conditional independence of variables Xi and Xj given a conditioning set S. It starts from the complete undirected graph, which it then thins by removing edges for which conditional independence relationships hold. The conditioning set need only be a subset of the variables adjacent to one or the other of the variables being tested, and it changes constantly as the algorithm progresses.

Bayesian Network Power Constructor (BNPC) [5]: Cheng et al. [5] developed a CI-based algorithm for BBN structure learning, namely the three-phase dependency analysis algorithm (TPDA). Its three phases are drafting, thickening and thinning. Drafting computes the mutual information of each pair of nodes as a measure of closeness and produces an initial set of edges based on pairwise mutual information. Thickening adds arcs when pairs of nodes cannot be d-separated according to the CI test. Thinning examines each edge of the current graph using CI tests and removes redundant edges whose two endpoints can be d-separated. An edge orientation procedure is also conducted if the node ordering is not known. Given a sufficient quantity of training data D, this algorithm performs well if the actual model is monotone DAG-faithful [5].

Minimum Weighted Spanning Tree (MWST) algorithm [8]: It works by restricting the structures to weighted trees. To assign weights to the edges, the Mutual Information (MI) between each pair of nodes A and B is calculated as in eq. 1:

$$MI(A, B) = \sum_{A,B} P(A, B) \log_2 \left[ \frac{P(A, B)}{P(A)\, P(B)} \right] \qquad (1)$$

Kruskal's or Prim's spanning tree algorithm is then used to create a maximum weighted spanning tree as the graph.
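As an illustrative sketch of this approach (assuming discrete data in a pandas DataFrame; the helper names and the use of networkx are choices made here, not part of the cited reference implementation):

```python
# Sketch of the MWST (Chow-Liu style) approach over a discrete dataset.
import itertools
import numpy as np
import pandas as pd
import networkx as nx

def mutual_information(df: pd.DataFrame, a: str, b: str) -> float:
    """Empirical mutual information (eq. 1) between columns a and b."""
    joint = pd.crosstab(df[a], df[b], normalize=True)  # P(A, B)
    pa = joint.sum(axis=1)                             # P(A)
    pb = joint.sum(axis=0)                             # P(B)
    mi = 0.0
    for i in joint.index:
        for j in joint.columns:
            p = joint.loc[i, j]
            if p > 0:
                mi += p * np.log2(p / (pa.loc[i] * pb.loc[j]))
    return mi

def mwst_skeleton(df: pd.DataFrame) -> nx.Graph:
    """Maximum weighted spanning tree over pairwise MI weights."""
    g = nx.Graph()
    for a, b in itertools.combinations(df.columns, 2):
        g.add_edge(a, b, weight=mutual_information(df, a, b))
    return nx.maximum_spanning_tree(g)  # Kruskal's algorithm by default
```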

K2 Algorithm [10]: Another metric for reducing the search space was devised by Cooper et al. in [10]. The method requires an initial node order to be provided. With respect to the given enumeration order, the search space is explored so as to maximize the probability of the structure given the data, i.e. to maximize P(G1|D)/P(G2|D), which is equivalent to maximizing P(G1,D)/P(G2,D). Initially a Bayesian score was proposed; more recently the other scores mentioned above have also been used. The algorithm requires the data to be discrete valued, independent and complete. Here three initialization orders for K2 are considered: first with MWST initialization (K2+MWST), second with its reverse (K2-MWST) and third with a random initialization.

Greedy Search (GS) [9]: The algorithm initializes the search with a DAG and tries to choose a graph in its neighbourhood (one edge insertion, deletion or reversal) with a better score; the step then loops with the new graph. The biggest drawback of greedy search is getting caught in a local maximum; having a big neighbourhood may reduce this problem to some extent.

Greedy Equivalent Search (GES) [9]: Here GS is improved by confining the search to the space of Completed Partial DAGs (CPDAGs). The CPDAG space is considerably smaller, as many DAGs are represented by the same CPDAG; moreover, the number of DAGs with the same score is also reduced.

The experiments performed in this paper used the BNT Structure Learning Package of the Bayesian Network Toolbox for Matlab [11].
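A minimal sketch of the K2 search under the original Cooper-Herskovits score follows; the data layout (a 2-D array of category codes), the `max_parents` cap and all names are assumptions made for illustration, not the BNT implementation:

```python
# Sketch of K2: greedy parent selection under a fixed node ordering.
from itertools import product
from math import lgamma
import numpy as np

def ch_log_score(data, i, parents, arity):
    """Cooper-Herskovits log marginal likelihood of node i given its parents."""
    ri = arity[i]
    score = 0.0
    for js in product(*[range(arity[p]) for p in parents]):
        mask = np.ones(len(data), dtype=bool)
        for p, v in zip(parents, js):
            mask &= data[:, p] == v
        nij = mask.sum()
        score += lgamma(ri) - lgamma(nij + ri)   # (ri-1)! / (Nij+ri-1)!
        for k in range(ri):
            nijk = (data[mask, i] == k).sum()
            score += lgamma(nijk + 1)            # Nijk!
    return score

def k2(data, order, arity, max_parents=3):
    """Greedily add, for each node, the predecessor that most improves its score."""
    parents = {i: [] for i in order}
    for pos, i in enumerate(order):
        best = ch_log_score(data, i, parents[i], arity)
        improved = True
        while improved and len(parents[i]) < max_parents:
            improved, best_cand = False, None
            for cand in order[:pos]:
                if cand in parents[i]:
                    continue
                s = ch_log_score(data, i, parents[i] + [cand], arity)
                if s > best:
                    best, best_cand, improved = s, cand, True
            if improved:
                parents[i].append(best_cand)
    return parents
```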

III. CONSTRUCTION OF BBN STRUCTURES USING VARIOUS ALGORITHMS

The applications considered here for modeling with Bayesian Belief Networks are health-related problems: real clinical datasets with symptoms as variables for typical diseases of the urinary system (D1) and the hepatitis domain (D2).

a) Urinary System (D1): The dataset was made available in the UCI Machine Learning Repository in 2009 [12]. It is a multivariate dataset with eight attributes, of which two record the presence or absence of the typical urinary system diseases acute inflammation (cystitis) ('D1') and nephritis ('D2'). The other attributes are the body temperature of the patient ('T'), occurrence of nausea ('N'), lumbar pain ('L'), urine pushing ('U'), micturition pains ('M') and burning of the urethra ('B').

Data Pre-processing: All but one of the variables are dichotomous, with values 'yes' or 'no'. The only continuous attribute, body temperature, is discretized into four classes using an initial histogram technique. In all there are 120 instances, where each instance is one patient's data.
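A sketch of this pre-processing step (equal-width binning as one simple stand-in for the histogram technique; the file name and bin labels are hypothetical):

```python
# Discretize the continuous body-temperature attribute 'T' into four classes.
import pandas as pd

df = pd.read_csv("diagnosis.csv")  # hypothetical file name for dataset D1
df["T"] = pd.cut(df["T"], bins=4, labels=["low", "normal", "raised", "high"])
```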

To obtain a true model of the underlying problem, the structure of dependencies among the variables was elicited after elaborate discussions with three domain experts, who in our case are practicing physicians. The structure obtained is shown in Figure 1(a). It is believed to model the interactions among the variables reasonably well, as in the actual domain. This graph is later used for comparison with the graphs obtained by the learning algorithms.

b) Hepatitis domain disease (D2): The dataset has twenty variables representing the following information: Age ('B'), Gender ('C'), Steroid Intake ('D'), Antivirals ('E'), Fatigue ('F'), Malaise ('G'), Anorexia ('H'), Liver Big ('I'), Liver Firm ('J'), Spleen ('K'), Spiders ('L'), Ascites ('M'), Varices ('N'), Bilirubin ('O'), Alkaline Phosphate ('P'), SGOT ('Q'), Albumin ('R'), Protime ('S') and Histology ('T'). A data instance of these 19 variables is labelled with a 'class', i.e. whether the patient died or lived, shown as node ('A') in the graph. In all, 155 such instances are available. There were a few missing values for some variables; these were filled using single imputation with the most frequent value of the variable within each class. The expert's model of dependencies among these variables is shown in Figure 1(b).
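One reading of this imputation step, sketched with pandas (the 'class' column name corresponds to node 'A' above; the helper name is illustrative):

```python
# Single imputation: fill missing values with the most frequent value
# of each column, computed separately within each class label.
import pandas as pd

def impute_by_class_mode(df: pd.DataFrame, label: str = "class") -> pd.DataFrame:
    def fill(s: pd.Series) -> pd.Series:
        mode = s.mode()
        return s.fillna(mode.iloc[0]) if not mode.empty else s
    filled = df.copy()
    for col in filled.columns.drop(label):
        filled[col] = filled.groupby(label)[col].transform(fill)
    return filled
```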

Figure 1. (a) Expert's Model of Urinary System Diseases (D1) (b) Expert's Model of Hepatitis Domain Disease (D2)

A. Structures as Obtained

After learning with the chosen algorithms, eight different networks were obtained for disease D1; these are represented in Figure 2.

Figure 2. Urinary System Disease Model constructed by (a) BNPC Algorithm (b) Greedy Search (c) PC algorithm (d) MWST (e) K2+MWST (f) K2-MWST (g) K2-Random Initialization (h) GES

The edges found by the algorithms indicate statistical association between pairs of variables. The GES algorithm found the largest number of edges and PC the smallest. Network structures similar to those in Figure 2 were obtained for the other dataset, but are not reproduced here for reasons of space. As can be seen, the various algorithms, when executed over the same set of attributes and instances, give quite different structures. This is because the selected algorithms are driven by different principles and/or metrics, so the resulting models differ somewhat in the edges they extract. The obtained networks nevertheless share a common set of arcs that can act as a starting model when the true model is unknown. This core set of arcs, common to all structures of D1 and D2, is represented in Figure 3(a) and (b) respectively. Directed/undirected edges common to all structures are drawn as solid lines, and edges common to all networks except one as dashed lines. In all, four edges are common to all networks for disease D1, and two more are common to all but the PC algorithm. Any consensus BBN may be built from this shared structure. For disease D2, seven edges were found to be extracted by all algorithms, while three others are common to all but one network structure.

Figure 3. Edges shared by all network structures (solid lines) and edges common to all except one structure (dashed lines) in (a) Disease D1 (b) Disease D2.

The aim is to compare the performance of the retrieved models on a real problem; even where this is not possible, the arcs appearing in all the learnt networks can be considered a starting point for the real representation model.

B. Performance Measures

To compare the networks obtained with the different algorithms and their fit to the given dataset, some commonly used performance metrics have been calculated for all the learnt networks. The general notation is as follows: n denotes the number of variables in the problem domain and N the number of records in dataset D. For each variable xi, where i ranges from 1 to n, ri denotes the number of possible values of xi and qi the number of possible instantiations of its parent set Pa(xi). Nij is the number of records in the dataset for which Pa(xi) takes its jth value, and Nijk the number of records for which Pa(xi) takes its jth value and xi takes its kth value. Using this notation, the quality of the network structures is calculated with the BDeu and BIC scores explained below.

a) The Bayesian-Dirichlet metric [6] with likelihood equivalence and uniform priors (BDeu) measures the marginal likelihood P(D|G) by making four assumptions on P(G,D), namely multinomial samples, parameter independence, Dirichlet distribution of the parameters and parameter modularity. Under these assumptions, the logged version of the Bayesian-Dirichlet (BD) score is defined as in eq. 2:

$$BD(G, D) = \log P(G) + \sum_{i=1}^{n} \sum_{j=1}^{q_i} \left( \log \frac{\Gamma(N'_{ij})}{\Gamma(N_{ij} + N'_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(N_{ijk} + N'_{ijk})}{\Gamma(N'_{ijk})} \right) \qquad (2)$$

where Γ is the Gamma function satisfying Γ(x+1) = x Γ(x), P(G) is the prior probability of the BBN DAG G, and the N' terms are the Dirichlet prior pseudo-counts. Specifying all the parameters of the BD score is a formidable task, hence some of its variants are more useful. The BDe score adds likelihood equivalence, so that DAGs in the same equivalence class receive the same score. Assuming a uniform prior probability for every DAG then leads to the BDeu score, which has been used here. Refer to [6] for a detailed account of the calculation of this score.
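A sketch of the per-node BDeu term (the log form of eq. 2 with uniform pseudo-counts N'_ijk = ess/(qi ri)), assuming the counts Nijk for one variable have already been tabulated; the equivalent sample size `ess` and all names are illustrative:

```python
# Per-family BDeu term for one variable, given its (qi, ri) count matrix.
from math import lgamma
import numpy as np

def bdeu_node(nijk: np.ndarray, ess: float = 1.0) -> float:
    """nijk[j, k] = count of records with Pa(xi) in state j and xi in state k."""
    qi, ri = nijk.shape
    a_ijk = ess / (qi * ri)          # uniform Dirichlet pseudo-count per cell
    a_ij = ess / qi                  # pseudo-count total per parent state
    nij = nijk.sum(axis=1)
    score = 0.0
    for j in range(qi):
        score += lgamma(a_ij) - lgamma(a_ij + nij[j])
        for k in range(ri):
            score += lgamma(a_ijk + nijk[j, k]) - lgamma(a_ijk)
    return score
```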

b) Bayesian Information Criterion (BIC) [6]: measures the likelihood P(D|G) given the CPTs of graph G, estimated using maximum likelihood. Unlike the above score, BIC includes an explicit penalty term for the complexity of the network. Assuming a uniform prior probability for all graphs, BIC is calculated as in eq. 3:

$$BIC(G, D) = \sum_{i=1}^{n} \sum_{j=1}^{q_i} \sum_{k=1}^{r_i} N_{ijk} \log \frac{N_{ijk}}{N_{ij}} - \frac{1}{2} \sum_{i=1}^{n} q_i (r_i - 1) \log N \qquad (3)$$
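The corresponding per-node BIC term of eq. 3 can be sketched in the same setting (one (qi, ri) count matrix per variable; names are illustrative):

```python
# Per-family BIC term: log-likelihood minus the complexity penalty of eq. 3.
import numpy as np

def bic_node(nijk: np.ndarray, N: int) -> float:
    qi, ri = nijk.shape
    nij = nijk.sum(axis=1, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        ll = np.where(nijk > 0, nijk * np.log(nijk / nij), 0.0).sum()
    return ll - 0.5 * qi * (ri - 1) * np.log(N)
```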

The BIC metric can also be seen as a Minimum Description Length metric. Both metrics are interpreted as: the higher the value (presented here in log form), the better the network.

c) Editing Measure [11]: For evaluating the obtained network structures, an 'editing measure' is defined as the minimal number of arc operations needed to transform an obtained graph into the one elicited from our domain experts (Figure 1(a) and (b)). Arc reversal is not counted, as the experts elicit a causal structure from disease class to causes, a direction that is not considered by the learning algorithms.
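Since arc reversals are not counted, the measure reduces to comparing undirected skeletons; a sketch under that reading:

```python
# Editing measure: arc insertions plus deletions needed to match the
# expert graph, ignoring arc direction.
def editing_measure(learnt_edges, expert_edges):
    undirected = lambda edges: {frozenset(e) for e in edges}
    a, b = undirected(learnt_edges), undirected(expert_edges)
    return len(a - b) + len(b - a)  # deletions + insertions

# Example: editing_measure([("T", "D1"), ("N", "D2")], [("D1", "T")]) -> 1
```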

IV. INFERENCE ABOUT EFFICACY OF DIFFERENT ALGORITHMS

For all the structures obtained for diseases D1 and D2, the BDeu and BIC scores and the editing measures were calculated; they are reproduced in Tables 1 and 2 below. The number in brackets is the rank of the algorithm on that metric.

Table 1: BDeu & BIC scores of the obtained structures and editing measure from the true network of Disease D1

Algorithm       BDeu Score    BIC Score     Editing Measure
PC              -530.34 (7)   -536.70 (6)   6 (2)
BNPC            -547.27 (8)   -568.90 (7)   5 (1)
MWST            -503.86 (6)   -510.00 (4)   5 (1)
K2+MWST         -375.10 (2)   -472.17 (3)   5 (1)
K2-MWST         -371.21 (1)   -514.10 (5)   8 (4)
K2+Random       -438.05 (5)   -617.87 (8)   11 (5)
GS              -397.88 (3)   -432.96 (1)   7 (3)
GES             -421.07 (4)   -475.42 (2)   11 (5)

Table 2: BDeu & BIC scores of the obtained structures and editing measure from the true network of Disease D2

Algorithm       BDeu Score     BIC Score      Editing Measure
PC              -2837.30 (8)   -22095.00 (8)  12 (1)
BNPC            -2299.50 (6)   -2396.50 (7)   15 (2)
MWST            -2309.40 (7)   -2346.10 (6)   16 (3)
K2+MWST         -2167.20 (2)   -2174.80 (1)   21 (5)
K2-MWST         -2174.20 (4)   -2191.20 (4)   17 (4)
K2+Random       -2168.20 (3)   -2183.60 (3)   22 (5)
GS              -2161.30 (1)   -2181.20 (2)   17 (4)
GES             -2199.20 (5)   -2209.40 (5)   23 (6)

The Conditional Independence based algorithms have low scores but reproduce the structure well. MWST also gives a low editing measure in this case, but since its search space of DAGs is very limited it cannot be relied upon much in general. The K2 algorithms are very sensitive to the initial enumeration order and give varying results when that order is changed. This feature can make K2 the best algorithm, since the user's prior knowledge can be encoded in the form of the initialization order and a close structure obtained. Classical Greedy Search gives good scores, and its editing distance is also fair. From Tables 1 and 2 it is evident that no single algorithm emerges as the winner on all three metrics: the GS approach gave the best scores, and BNPC and MWST the best editing measures. In Table 2, however, it can be observed that the BIC and BDeu scores mostly vary in sync with each other. The BDeu score maximizes the likelihood of the structure by adding arcs whenever a dependency is seen in the data, while BIC additionally penalizes the complexity induced by the added arcs; hence for K2-MWST and K2+R, which are the densest networks, the BIC scores are poor due to this penalty even though their BDeu scores are good. It is also observed that the true model does not have the best scores. This is because the data used to learn the BBN structures is small, and the algorithms maximize the likelihood of the structures on the given data only; this is the well-known problem of "over-fitting".

Though these results can be explained by the working of the algorithms, results were also sought on increasing data sizes to further strengthen the conclusions. As the true probability distribution of the true network was not available, a complete true BBN could not be constructed; therefore the original small datasets were expanded to larger datasets by sampling data from the BBN created by each of the algorithms. To test the influence of data size on learning performance, all three metrics were calculated over various data sizes. The results are shown in Table 3. The BDeu and BIC scores have been calculated on an additional 20000 samples of data generated from each graph. Note that the BIC score may not increase even when the editing measure is reduced, as the increase in likelihood is offset by the larger complexity penalty.

Table 3: BDeu, BIC scores and Editing Measures for varying data sizes

Dataset Size →     120        1200       2400       5000       10000      15000

PC
BDe Score          -83050.4   -82815.4   -83022.6   -83002.2   -83162.5   -83553.7
BIC Score          -83089.3   -82853.9   -83061.4   -83041.1   -83201.4   -83590.5
Editing Measure    6          4          4          4          4          4

BNPC
BDe Score          -85072.8   -85159.4   -84695.1   -85045.1   -85046.2   -85265.2
BIC Score          -85128.7   -85219.3   -84751.1   -85110.9   -85103.0   -85325.6
Editing Measure    5          5          6          6          5          5

MWST
BDe Score          -74133.6   -74133.6   -74133.6   -74133.6   -74133.6   -74133.6
BIC Score          -74154.7   -74154.7   -74154.7   -74154.7   -74154.7   -74154.7
Editing Measure    5          5          5          5          5          5

K2+T
BDe Score          -52775.2   -52524.2   -52759.3   -52777.2   -52524.2   -52833.9
BIC Score          -53122.5   -52823.2   -53040.3   -53124.7   -52823.2   -53181.5
Editing Measure    5          9          11         9          10         10

K2-T
BDe Score          -50507.8   -50428.8   -50608.6   -50483.7   -50485.7   -50506.3
BIC Score          -50835.4   -50991.0   -51076.8   -50811.4   -50813.4   -50834.0
Editing Measure    8          9          9          8          8          9

K2+R
BDe Score          -50034.4   -49890.8   -49983.2   -50079.6   -50022.2   -49992.9
BIC Score          -50783.7   -50640.1   -50732.4   -50828.7   -50771.5   -50742.2
Editing Measure    11         12         12         12         12         12

GS
BDe Score          -56958.9   -57185.6   -57306.0   -56494.2   -57172.6   -57023.7
BIC Score          -57061.9   -57284.1   -57393.3   -56588.9   -57261.2   -57121.4
Editing Measure    7          8          9          8          9          8

GES
BDe Score          -51514.0   -52824.3   -52687.0   -51416.0   -51416.0   -51416.0
BIC Score          -52963.0   -52936.2   -52804.9   -52629.0   -52629.0   -52629.0
Editing Measure    11         12         11         9          9          9
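The expansion of the small datasets by sampling from each learnt BBN, as described above, can be sketched as forward (ancestral) sampling; the `parents` and `cpts` structures below are illustrative stand-ins, not the data structures of the BNT package used in the paper:

```python
# Draw n_samples complete records from a BBN by sampling each node after its
# parents, following a topological order of the DAG.
import random

def forward_sample(order, parents, cpts, n_samples):
    """order: topological node order; cpts[node][parent_values] -> {value: prob}."""
    data = []
    for _ in range(n_samples):
        row = {}
        for node in order:
            key = tuple(row[p] for p in parents[node])
            dist = cpts[node][key]
            values, probs = zip(*dist.items())
            row[node] = random.choices(values, weights=probs, k=1)[0]
        data.append(row)
    return data
```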

On the editing measure, the algorithms do not improve with more data; rather, some induce extra edges and drift further away from the true graph. The scores stabilize to a large extent, which means that the structure and the likelihood do not keep changing with increasing data size and tend to saturate. This suggests that the networks cannot be improved much if the data is derived from the same source, which may itself contain errors. For this reason the similar results obtained for D2 are not reproduced here.

V. CONCLUSION

In this paper, datasets of two different diseases, six BBN learning algorithms and three performance measures have been considered. The dataset of the urinary system disease may be considered less complex than that of hepatitis. The editing measure indicates the extent of matching between the true graphs and the constructed graphs. From the editing measure results it is observed that some of the constructed graphs are quite similar to the true graphs, whereas others are quite far apart; this holds for both datasets. Thus one can infer that the editing measure gives a general comparison of the algorithms, but it may not make it possible to decide which algorithm is the best. In addition to the editing measure, the BDeu and BIC scores are also used for performance evaluation. It is observed that for the less complex dataset the two scores indicate dissimilar rankings, whereas for the complex dataset both scores give similar rankings. A further inference is that the rankings based on the editing measure and those based on the scores do not agree. It is therefore not possible to draw a sharp comparison of the different algorithms: either the domain expert makes the final decision based on his/her experience or intuition, or some better performance measure has to be worked out. Of course, the algorithms and the BBNs do provide greater insight into complex practical problems.

REFERENCES

[1] S. B. Kotsiantis, "Supervised Machine Learning: A Review of Classification Techniques", in Proceedings of the Conference on Emerging Artificial Intelligence Applications in Computer Engineering, 2007.
[2] D. Heckerman, A. Mamdani, and M. P. Wellman, "Real-world applications of Bayesian networks", Communications of the ACM 38(3):24-26, March 1995.
[3] E. Philippot, Y. Belaid and A. Belaid, "Bayesian Networks Learning Algorithms for Online Form Classification", in Proceedings of the 20th International Conference on Pattern Recognition (ICPR '10), 2010.
[4] P. Spirtes, C. Glymour, R. Scheines, Causation, Prediction, and Search, 2nd edn., MIT Press, Cambridge, 2000.
[5] J. Cheng et al., "Learning Bayesian networks from data: an information theory based approach", Artificial Intelligence 137:43-90, 2002.
[6] D. Heckerman, D. Geiger, D. M. Chickering, "Learning Bayesian networks: the combination of knowledge and statistical data", Machine Learning 20(3):197-243, 1995.
[7] T. Roos, T. Silander, P. Kontkanen, and P. Myllymäki, "Bayesian network structure learning using factorized NML universal models", in Proc. ITA'08, 2008.
[8] C. Chow and C. Liu, "Approximating discrete probability distributions with dependence trees", IEEE Transactions on Information Theory 14(3):462-467, 1968.
[9] D. Chickering, D. Geiger, and D. Heckerman, "Learning Bayesian networks: Search methods and experimental results", in Proceedings of the Fifth Conference on Artificial Intelligence and Statistics, pp. 112-128, 1995.
[10] G. F. Cooper, E. Herskovits, "A Bayesian method for the induction of probabilistic networks from data", Machine Learning 9(4):309-347, 1992.
[11] P. Leray, O. Francois, "BNT structure learning package: documentation and experiments", Technical Report 2004/PhLOF, PSI, LITIS Laboratory, 2004.
[12] UCI repository of machine learning databases. Available at: http://archive.ics.uci.edu/ml/
[13] B. Wemmenhove et al., "Inference in the Promedas medical expert system", in Artificial Intelligence in Medicine, volume 4594 of Lecture Notes in Computer Science, Springer, Berlin, pp. 456-460, 2007.
[14] D. Margaritis, "Learning Bayesian network model structure from data", Technical Report CMU-CS-03-153, 2003.
