Estimating Well-Performing Bayesian Networks using Bernoulli Mixtures

Geoff Jarrad
The Business Intelligence Group
CSIRO Mathematical and Information Sciences
PMB 2, Glen Osmond, South Australia 5064.
e-mail: [email protected]

Abstract

A novel method for estimating Bayesian network (BN) parameters from data is presented which provides improved performance on test data. Previous research has shown the value of representing conditional probability distributions (CPDs) via neural networks (Neal 1992), noisy-OR gates (Neal 1992, Diez 1993) and decision trees (Friedman and Goldszmidt 1996). The Bernoulli mixture network (BMN) explicitly represents the CPDs of discrete BN nodes as mixtures of local distributions, each having a different set of parents. This increases the space of possible structures which can be considered, enabling the CPDs to have finer-grained dependencies. The resulting estimation procedure induces a model that is better able to emulate the underlying interactions occurring in the data than conventional conditional Bernoulli network models. The results for artificially generated data indicate that overfitting is best reduced by restricting the complexity of candidate mixture substructures local to each node. Furthermore, mixtures of very simple substructures can perform almost as well as more complex ones. The BMN is also applied to data collected from an online adventure game with an application to keyhole plan recognition. The results show that the BMN-based model brings a dramatic improvement in performance over a conventional conditional Bernoulli BN model.

1 INTRODUCTION

The problem of estimating the parameters and structure of a Bayesian network (BN) from prior information and observed data has received a great deal of

attention in recent years (Buntine 1996 gives a review of the literature). A common approach for structure estimation is some form of Bayesian model selection (BMS), which typically selects the single BN model (or an equivalence class of such models) with the maximum likelihood (ML). Alternatively, prior weights can be chosen to penalise complex structures, and the BN model with the maximum a posteriori (MAP) likelihood can be found instead (Heckerman 1995). The BMS approach is reasonable when there is a large amount of data and the BN has a small number of nodes (Cooper and Herskovits 1992), but this condition often does not hold true (Friedman and Koller 2000), in which case the model chosen by BMS might not be representative of the underlying processes. That is, the selected model will often overfit the data and not generalise to new data. An alternative to BMS is to average over many BNs with different structures, using a mixture of Bayesian networks (an MBN; Thiesson et al. 1997). Typically, the BN parameters are integrated out to form the marginal likelihood of the data for each BN structure. Bayesian model averaging (BMA) takes this a step further by summing over the individual BN structures, which gives better average predictions (Hoeting et al. 1999). The difficulty with such approaches is that the marginal likelihood is usually difficult to compute. In particular, when data are missing, or when mixtures are used, the marginal likelihood does not have a closed form and a large-sample approximation (Kass et al. 1988) is often used. Such approximations, however, typically require very large amounts of data to be accurate. The method proposed in this paper is not concerned with structure estimation per se, but rather with the use of structure to obtain good estimates for the conditional probability distributions (CPDs) of a given Bayesian network. This single BN could be obtained from BMS, or could be specified in advance by a

domain expert. The Bernoulli mixture network (BMN) model (which is a discrete version of the more general Bayesian mixture-network model) is based on a novel idea of McMichael (1998), in which the CPD of each node in a BN is treated as a mixture of local distributions, each having a different subset of parents. Previous approaches to using local representations of CPDs have included neural networks (Neal 1992), noisy-OR gates (Neal 1992, Diez 1993) and decision trees (Friedman and Goldszmidt 1996). These approaches, including the BMN, require the CPDs to be estimated from the data, and hence they differ from methods which approximate known CPDs by other distributions (e.g. Tresp et al. 1999). There is also a special relationship between the BMN and the MBN. Recall that the MBN averages over the joint distributions of global BN structures for various node orderings, whereas the BMN averages over CPDs of local structures for a single BN. This means that the MBN can in general span a much larger space of network structures than the BMN. However, an MBN restricted to one particular ordering reduces to an equivalent BMN (see Section 2.3). Empirical evidence has verified that the performances of the restricted MBN and the BMN on complete data are the same. As a consequence, the BMN, which is stored as a single BN, is more efficient than the corresponding MBN, which is stored as a collection of BNs with some duplication of parameter estimates.

2 THE MIXTURE MODEL

2.1 BACKGROUND

A Bayesian network (BN) is a directed acyclic graph (DAG) $G = \{V, S\}$, with vertex set $V$ and acyclic structure $S$, coupled with a collection $\Lambda$ of parameters. It is assumed that $G$ has a particular node ordering $I = \{1, 2, \ldots, V\}$, with each node (or vertex or variable) labelled as $\nu_i \in V$ for index $i \in I$. Under the Markov property, $G$ is decomposable in the sense that each node $\nu_i$ depends only upon its set of parents $\mathcal{P}_i = \{\nu_j \mid \nu_j \to \nu_i\}$, where $\nu_j \to \nu_i$ denotes a directed arc from node $\nu_j$ to $\nu_i$. This in turn defines the local structure $\tilde{S}_i = \{j \in I \mid \nu_j \in \mathcal{P}_i\}$ for each node $\nu_i$, so that the overall structure is $S = (\tilde{S}_i)_{i=1}^{V}$. A similar decomposability $\Lambda = (\Lambda_i)_{i=1}^{V}$ is assumed for the BN parameters, where each node $\nu_i$ has the local parameter $\Lambda_i$. Given these definitions, the likelihood of the joint state of the BN is

$$P(\nu \mid \Lambda, S) \;=\; \prod_{i=1}^{V} P(\nu_i \mid \pi_i, \Lambda_i), \qquad (1)$$

where, in a slight abuse of notation, $\pi_i$ here represents both the ordered list of nodes in $\mathcal{P}_i$, and the arbitrary values of those nodes. This BN model is known as the conventional model, in order to distinguish it from the varieties of mixture models presented in the next two sections. For a discrete BN, in which the CPD of each node is multinomial, equation (1) further reduces to

$$P(\nu \mid \Lambda, S) \;=\; \prod_{i=1}^{V} \lambda_{i,\nu_i,\pi_i}, \qquad (2)$$

where $\Lambda_i = [[\lambda_{ijk}]_{j=1}^{Q_i}]_{k=1}^{R_i}$. For convenience, the conditional probability table (CPT) of the $i$th node will simply be represented from now on as $\Lambda_i = [\lambda_{ijk}]$.
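As a concrete illustration of equation (2), the following minimal Python sketch evaluates the joint likelihood of a small discrete BN as a product of CPT entries, one per node, indexed by that node's state and its parents' joint state. The node names, CPT layout and toy network are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of equation (2): the joint likelihood of a discrete BN is the
# product of one CPT entry per node, indexed by the node's state and the joint
# state of its parents. Node names, CPT layout and the toy network are
# illustrative assumptions.

def joint_likelihood(state, parents, cpts):
    """state: dict node -> state index; parents: dict node -> list of parents;
    cpts: dict node -> {parent-state tuple: list of P(node = j | parents)}."""
    p = 1.0
    for node, j in state.items():
        k = tuple(state[u] for u in parents[node])   # joint parent configuration
        p *= cpts[node][k][j]                        # lambda_{i, nu_i, pi_i}
    return p

# Toy two-node network A -> B, with A ternary and B binary.
parents = {"A": [], "B": ["A"]}
cpts = {
    "A": {(): [0.5, 0.3, 0.2]},
    "B": {(0,): [0.9, 0.1], (1,): [0.4, 0.6], (2,): [0.2, 0.8]},
}
print(joint_likelihood({"A": 1, "B": 1}, parents, cpts))   # 0.3 * 0.6 = 0.18
```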

2.2 A LOCAL MIXTURE MODEL

The Bayesian mixture-network model is formed from the conventional model of the previous section by decomposing the local structure $\tilde{S}_i$ into a list of $M_i$ candidate substructures, namely $S_i = (S_{i,m_i})_{m_i=1}^{M_i}$, so that $S = (S_i)_{i=1}^{V}$. Observe that in this slightly altered notation, the conventional model is now equivalent to a mixture model with a singleton list $S_i = (\tilde{S}_i)$ of substructures for each node $\nu_i$, i.e. $M_i \equiv 1$. In general, for each mixture component $m_i = 1, 2, \ldots, M_i$, the substructure $S_{i,m_i} \subseteq \tilde{S}_i$ has the corresponding BN parameter $\Lambda_{i,m_i}$, so that $\Lambda_i = (\Lambda_{i,m_i})_{m_i=1}^{M_i}$ and $\Lambda = (\Lambda_i)_{i=1}^{V}$. The mixture weights are parameterised by $\Phi = (\phi_i)_{i=1}^{V}$, where $\phi_i = (\phi_{i,m_i})_{m_i=1}^{M_i}$. The likelihood of the joint state of the BN for this mixture model is

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \prod_{i=1}^{V} P(\nu_i \mid \pi_i, \phi_i, \Lambda_i), \qquad (3)$$

where

$$P(\nu_i \mid \pi_i, \phi_i, \Lambda_i) \;=\; \sum_{m_i=1}^{M_i} P(m_i \mid \pi_i, \phi_i)\, P(\nu_i \mid \pi_i, \Lambda_i, m_i). \qquad (4)$$

The mixture weights are further assumed to depend only upon the candidate substructures, and not on the actual states of the parents. Hence, the Bayesian mixture-network model is given by

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \prod_{i=1}^{V} \sum_{m_i=1}^{M_i} P(m_i \mid \phi_i)\, P(\nu_i \mid \pi_i, \Lambda_{i,m_i}). \qquad (5)$$

The discrete form of this model is the Bernoulli mixture network (BMN) model (the term "Bernoulli" is used here in the $n$-ary or multinomial sense):

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \prod_{i=1}^{V} \sum_{m_i=1}^{M_i} \phi_{i,m_i}\, \lambda_{i,\nu_i,\pi_i,m_i}, \qquad (6)$$

which is parameterised by:

$$P(m_i \mid \phi_i) \;=\; \phi_{i,m_i}, \qquad (7)$$

$$P(\nu_i \mid \pi_i, \Lambda_{i,m_i}) \;=\; \lambda_{i,\nu_i,\pi_i,m_i}, \qquad (8)$$

where the CPT for the $m_i$th substructure of the $i$th node is $\Lambda_{i,m_i} = [\lambda_{ijk,m_i}]$. The total number of possible substructures for node $\nu_i$ is given by $M_i = 2^{P_i}$, where there are $P_i = |\mathcal{P}_i|$ parents in the conventional BN model, and any given substructure can have zero, one, two or more of the candidate parents selected. Observe that for the given node ordering $I$, the $i$th node can have a maximum of $i-1$ parents (i.e. $P_i \le i-1$). The special substructure in which all $i-1$ candidate parents are selected is known as the full substructure. Correspondingly, the BN structure with full substructure for every node is called the full structure. It is desirable, however, to limit the number of selected parents in each substructure in order to reduce overfitting (see Section 4.1). Thus, if at most $p_i \le P_i$ parents are allowed to be selected from the candidate set, then

$$M_i \;=\; \sum_{\tilde{p}=0}^{p_i} \binom{P_i}{\tilde{p}} \;\le\; 2^{P_i}, \qquad (9)$$

where equality holds only if $p_i = P_i$. Typically $p_i$ might be roughly $P_i/2$, which is the expected number of parents if any given directed arc has equal probability of being present or absent.
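To make equations (6) and (9) concrete, the sketch below (a hedged illustration, not the paper's code; the data layout is an assumption) evaluates a single node's mixture CPD by projecting the full parent state onto each substructure's parent subset, and counts the candidate substructures under a cap on the number of selected parents.

```python
# Sketch of the per-node mixture CPD of equation (6) and the substructure
# count of equation (9). Each mixture component conditions on a subset of the
# node's candidate parents; the CPD is the phi-weighted sum of component CPT
# entries. The data layout is an illustrative assumption.
from math import comb

def mixture_cpd(j, parent_state, components):
    """components: list of (phi, subset, cpt); subset is a tuple of parent names,
    cpt maps that subset's joint state tuple -> list of P(node = j | subset)."""
    total = 0.0
    for phi, subset, cpt in components:
        k = tuple(parent_state[u] for u in subset)   # project onto the subset
        total += phi * cpt[k][j]                     # phi_{i,m_i} * lambda_{i,j,k,m_i}
    return total

def num_substructures(P_i, p_i):
    """Equation (9): number of substructures when at most p_i of the P_i
    candidate parents may be selected; equals 2**P_i when p_i == P_i."""
    return sum(comb(P_i, p) for p in range(p_i + 1))

# A binary node with two binary candidate parents A and B, three substructures:
components = [
    (0.2, (), {(): [0.7, 0.3]}),                                   # no parents
    (0.5, ("A",), {(0,): [0.8, 0.2], (1,): [0.3, 0.7]}),           # parent A only
    (0.3, ("A", "B"), {(0, 0): [0.9, 0.1], (0, 1): [0.6, 0.4],
                       (1, 0): [0.4, 0.6], (1, 1): [0.1, 0.9]}),   # both parents
]
print(mixture_cpd(1, {"A": 1, "B": 0}, components))   # 0.2*0.3 + 0.5*0.7 + 0.3*0.6 = 0.59
print(num_substructures(4, 2), num_substructures(4, 4))   # 11, 16
```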

2.3 A GLOBAL MIXTURE MODEL

In contrast to the Bayesian mixture model of the previous section, the mixture of Bayesian networks (MBN) model averages over a selection of global BN structures, possibly for a variety of node orderings. In particular, the overall collection of BN parameters has the decomposition $\Lambda = (\Lambda_m)_{m=1}^{M}$ for $\Lambda_m = (\Lambda_{m,i})_{i=1}^{V}$, where $\Lambda_{m,i}$ parameterises the CPD of the $i$th node in the $m$th structure. Here $M$ is the total number of candidate structures. Likewise, the mixture parameters are decomposed as $\Phi = (\phi_m)_{m=1}^{M}$. Hence, the likelihood of the joint state of all nodes is

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \sum_{m=1}^{M} P(m \mid \Phi) \prod_{i=1}^{V} P(\nu_i \mid \pi_i, \Lambda_{m,i}). \qquad (10)$$

The discrete form of this MBN is then just

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \sum_{m=1}^{M} \phi_m \prod_{i=1}^{V} \lambda_{m,i,\nu_i,\pi_i}, \qquad (11)$$

where

$$P(m \mid \Phi) \;=\; \phi_m, \qquad (12)$$

$$P(\nu_i \mid \pi_i, \Lambda_{m,i}) \;=\; \lambda_{m,i,\nu_i,\pi_i}, \qquad (13)$$

and $\Lambda_{m,i} = [\lambda_{m,ijk}]$ is the appropriate CPT. Observe that this MBN has some resemblance to the BMN of the previous section. In fact, the MBN is a generalisation of the BMN. To see this, the explicit node ordering $I$ is first imposed upon the MBN, giving $M = 2^{V(V-1)/2}$ candidate structures. Next, the mixture weight $\phi_m$ of the $m$th structure is decomposed on a node-by-node basis as $\phi_m = \prod_{i=1}^{V} \phi_{m,i}$. Equation (11) then becomes

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \sum_{m=1}^{2^{V(V-1)/2}} \prod_{i=1}^{V} \phi_{m,i}\, \lambda_{m,i,\nu_i,\pi_i} \;=\; \prod_{i=1}^{V} \sum_{m_i=1}^{2^{i-1}} \phi_{\tilde{m}(i,m_i),i}\, \lambda_{\tilde{m}(i,m_i),i,\nu_i,\pi_i}, \qquad (14)$$

where each pair $(i, m_i)$ is mapped by $\tilde{m}$ to a unique $m$. Thus, by equating $\phi_{\tilde{m}(i,m_i),i}$ with $\phi_{i,m_i}$ and $\lambda_{\tilde{m}(i,m_i),i,\nu_i,\pi_i}$ with $\lambda_{i,\nu_i,\pi_i,m_i}$, it can be seen that the MBN with restricted ordering is equivalent to the BMN given by equation (6), where $M_i = 2^{i-1}$. Furthermore, the BMN representation is more compact than that of the MBN, in the sense that it can span the $M = 2^{V(V-1)/2}$ global structures using only $\sum_i M_i = 2^V - 1$ local mixture components, which can be collapsed to $V$ parameter estimates.
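The compactness claim at the end of this section is easy to check numerically; the short sketch below (illustrative only) compares the number of global structures spanned by the restricted MBN with the number of local mixture components the equivalent BMN needs.

```python
# For a fixed ordering over V nodes, the restricted MBN ranges over
# 2^{V(V-1)/2} global structures, while the equivalent BMN uses only
# sum_{i=1}^{V} 2^{i-1} = 2^V - 1 local mixture components.
for V in range(2, 7):
    mbn_structures = 2 ** (V * (V - 1) // 2)
    bmn_components = sum(2 ** (i - 1) for i in range(1, V + 1))   # = 2**V - 1
    print(V, mbn_structures, bmn_components)
```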

3 PARAMETER ESTIMATION

Parameters for both the conventional BN model and the BMN are estimated from a collection $X = (x_d)_{d=1}^{N}$ of $N$ observed cases, known as the training data. Missing data are handled using a collection $Z$ of uniformly distributed indicator variables, in conjunction with the expectation–maximisation (EM) algorithm (Dempster et al. 1977), which has known monotonic convergence (Boyles 1983, Wu 1983). The conditional probabilities are then obtained using maximum a posteriori (MAP) estimation (McMichael 1998), as described briefly below. Recall that the BMN model is given by equation (6), viz.

$$P(\nu \mid \Phi, \Lambda, S) \;=\; \prod_{i=1}^{V} \sum_{m_i=1}^{M_i} \phi_{i,m_i}\, \lambda_{i,\nu_i,\pi_i,m_i}. \qquad (15)$$

Allowing for missing data, the likelihood of the $d$th case is

$$P(x_d \mid \Phi, \Lambda, S, Z) \;=\; \prod_{i=1}^{V} \prod_{j=1}^{Q_i} \prod_{k=1}^{R_i} \prod_{m_i=1}^{M_i} \phi_{i,m_i}^{z_{di,m_i}}\, \lambda_{ijk,m_i}^{z_{dijk,m_i}}, \qquad (16)$$

where $z_{di,m_i} = 1$ if submodel $m_i$ is the correct choice at the $i$th node for the $d$th observed case, and $z_{dijk,m_i} = 1$ if, further, the family $(\nu_i, \pi_i)$ is in state $(j, k)$; otherwise $z_{di,m_i} = 0$ and $z_{dijk,m_i} = 0$. The expected log-likelihood is then

$$Q(\Lambda, \Phi; \Lambda^0, \Phi^0) \;=\; E_Z\left[ \log P(X \mid \Phi, \Lambda, S, Z) \mid X, \Lambda^0, \Phi^0 \right] \;=\; \sum_{i=1}^{V} \sum_{m_i=1}^{M_i} \left[ \sum_{j=1}^{Q_i} \sum_{k=1}^{R_i} N_{ijk,m_i} \log \lambda_{ijk,m_i} + A_{i,m_i} \log \phi_{i,m_i} \right], \qquad (17)$$

where

$$N_{ijk,m_i} \;=\; \sum_{d=1}^{N} P(\nu_i = j, \pi_i = k \mid x_d, \Lambda^0, \Phi^0, m_i)\, P(m_i \mid x_d, \Lambda^0, \Phi^0), \qquad (18)$$

$$A_{i,m_i} \;=\; \sum_{d=1}^{N} P(m_i \mid x_d, \Lambda^0, \Phi^0). \qquad (19)$$
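For complete data the posterior over the submodel choice at a node depends only on that node's own family, so the responsibilities behind equations (18) and (19) take a particularly simple form. The sketch below is a hedged illustration of that special case (the data layout follows the earlier toy examples and is an assumption, not the paper's implementation); the missing-data case requires the design-philosophy computation described below.

```python
# Responsibilities for COMPLETE data: with every node observed, the posterior
# over node i's submodel choice is proportional to phi_{i,m_i} times the
# component CPT entry for the observed family state, so equation (19) reduces
# to a simple per-case normalisation. Layout as in the earlier sketches.

def responsibilities(case, node, components):
    """components: list of (phi, subset, cpt); returns P(m_i | x_d) over the
    candidate substructures of `node` for a single fully observed case."""
    j = case[node]
    weights = []
    for phi, subset, cpt in components:
        k = tuple(case[u] for u in subset)
        weights.append(phi * cpt[k][j])              # phi_{i,m_i} * lambda entry
    total = sum(weights)
    return [w / total for w in weights]

def soft_counts(cases, node, components):
    """A_{i,m_i} of equation (19): responsibilities summed over the N cases."""
    A = [0.0] * len(components)
    for case in cases:
        for m, r in enumerate(responsibilities(case, node, components)):
            A[m] += r
    return A

# Node X with one candidate parent A; two substructures: {} and {A}.
components = [
    (0.5, (), {(): [0.7, 0.3]}),
    (0.5, ("A",), {(0,): [0.8, 0.2], (1,): [0.3, 0.7]}),
]
cases = [{"X": 1, "A": 1}, {"X": 0, "A": 0}]
print(soft_counts(cases, "X", components))           # entries sum to N = 2
```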

The MAP objective function $F$ now combines the expected log-likelihood with joint priors for the parameters, along with a condition $h(\Phi)$ to ensure the mixture weights sum to unity. The prior density for $\Lambda$ is

$$p(\Lambda \mid S) \;=\; a \prod_{i=1}^{V} \prod_{j=1}^{Q_i} \prod_{k=1}^{R_i} \prod_{m_i=1}^{M_i} \lambda_{ijk,m_i}^{N'_{ijk,m_i}}, \qquad (20)$$

for normalising constant $a$, and the prior density for the mixture weights is

$$p(\Phi \mid S) \;=\; b \prod_{i=1}^{V} \prod_{m_i=1}^{M_i} \phi_{i,m_i}^{\alpha_{i,m_i}}, \qquad (21)$$

for normalising constant $b$. Thus, the objective function is

$$F(\Lambda, \Phi; \Lambda^0, \Phi^0) \;=\; Q(\Lambda, \Phi; \Lambda^0, \Phi^0) + g(\Lambda) + h(\Phi) + \log p(\Lambda \mid S) + \log p(\Phi \mid S)$$
$$=\; \sum_{i=1}^{V} \sum_{m_i=1}^{M_i} \left[ \sum_{j=1}^{Q_i} \sum_{k=1}^{R_i} (N'_{ijk,m_i} + N_{ijk,m_i}) \log \lambda_{ijk,m_i} + (\alpha_{i,m_i} + A_{i,m_i}) \log \phi_{i,m_i} \right]$$
$$+\; \sum_{i=1}^{V} \sum_{k=1}^{R_i} \sum_{m_i=1}^{M_i} \gamma_{ik,m_i} \left[ \sum_{j=1}^{Q_i} \lambda_{ijk,m_i} - 1 \right] + \sum_{i=1}^{V} \delta_i \left[ \sum_{m_i=1}^{M_i} \phi_{i,m_i} - 1 \right], \qquad (22)$$

where the normalising constants $a$ and $b$ have been neglected. The MAP estimates are then found from the local maximum conditions $\partial F / \partial \lambda_{ijk,m_i} = 0$ and $\partial F / \partial \phi_{i,m_i} = 0$ to be

$$\lambda_{ijk,m_i} \;=\; \frac{N'_{ijk,m_i} + N_{ijk,m_i}}{N'_{ik,m_i} + N_{ik,m_i}}, \qquad (23)$$

$$\phi_{i,m_i} \;=\; \frac{\alpha_{i,m_i} + A_{i,m_i}}{\alpha_{i,\cdot} + N}, \qquad (24)$$

where $N_{ik,m_i} = \sum_{j=1}^{Q_i} N_{ijk,m_i}$ (and likewise for $N'_{ik,m_i}$), and $\alpha_{i,\cdot} = \sum_{m_i=1}^{M_i} \alpha_{i,m_i}$. There are two problems with this formulation. The first is that an expert can reasonably be expected to specify the prior counts $N'_{ijk}$, but cannot be expected to specify the mixture counts $N'_{ijk,m_i}$, since there will in general be too many submodel interactions with which to cope. The second problem is that equation (18) for $N_{ijk,m_i}$ is both ambiguous and intractable, for much the same reason. For instance, equation (18) specifies the use of the $m_i$th submodel at the $i$th node, but does not specify which conditional probabilities are to be used at other nodes. These problems can both be solved by following the design philosophy that the BMN reduces to a conventional BN after parameter estimation. Thus, each submodel count $N_{ijk,m_i}$ is computed from the table $[N_{ijk}]$ of family data counts by summing over the states of all parents not included in the $m_i$th submodel. Likewise, each $N'_{ijk,m_i}$ can be computed from the table $[N'_{ijk}]$ of prior family counts supplied by the expert. The family data counts themselves are computed from the conventional BN containing all candidate parents, via

$$N_{ijk} \;=\; \sum_{d=1}^{N} P(\nu_i = j, \pi_i = k \mid x_d, \bar{\Lambda}), \qquad (25)$$

where $\bar{\Lambda} = (\bar{\Lambda}_i)_{i=1}^{V}$, $\bar{\Lambda}_i = [\bar{\lambda}_{ijk}]$ and

$$\bar{\lambda}_{ijk} \;=\; P(\nu_i = j \mid \pi_i = k, \bar{\Lambda}_i) \;=\; \sum_{m_i=1}^{M_i} P(m_i \mid \phi_i)\, P(\nu_i = j \mid \pi_i = k, \Lambda_{i,m_i}) \;=\; \sum_{m_i=1}^{M_i} \phi_{i,m_i}\, \lambda_{ijk,m_i}. \qquad (26)$$
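The estimation step just described lends itself to a compact sketch: collapse the full-family counts onto each substructure, apply the MAP updates of equations (23) and (24), and then mix the component CPTs back into a single conventional CPT as in equation (26). The array layout (node-state axis first, one axis per candidate parent), the unit pseudo-counts and the example responsibility values are illustrative assumptions, not the paper's implementation.

```python
# Hedged end-to-end sketch for one node: family counts -> per-substructure
# counts (marginalising over excluded parents), MAP updates of equations
# (23)-(24), then the collapse of equation (26) back to one conventional CPT.
# Axis 0 indexes the node's states; axes 1..P index the candidate parents.
import numpy as np

def submodel_counts(family_counts, kept_parent_axes):
    """Sum the family-count table over every candidate-parent axis that the
    substructure excludes (the design philosophy described above)."""
    drop = tuple(ax for ax in range(1, family_counts.ndim)
                 if ax not in kept_parent_axes)
    return family_counts.sum(axis=drop) if drop else family_counts

def map_lambda(counts, prior_counts):
    """Equation (23): normalise (prior + data) counts over the node-state axis."""
    num = prior_counts + counts
    return num / num.sum(axis=0, keepdims=True)

def map_phi(A, alpha, N_cases):
    """Equation (24): phi_{i,m_i} = (alpha_{i,m_i} + A_{i,m_i}) / (alpha_{i,.} + N)."""
    return (alpha + A) / (alpha.sum() + N_cases)

def collapse_cpt(phis, kept_axes_list, lambdas, full_shape):
    """Equation (26): phi-weighted sum of component CPTs, each broadcast over
    the states of the parents it ignores."""
    cpt_bar = np.zeros(full_shape)
    for phi, kept, lam in zip(phis, kept_axes_list, lambdas):
        expanded = lam
        for ax in range(1, len(full_shape)):         # reinsert missing parent axes
            if ax not in kept:
                expanded = np.expand_dims(expanded, ax)
        cpt_bar += phi * np.broadcast_to(expanded, full_shape)
    return cpt_bar

# Toy node: binary, with two binary candidate parents A (axis 1) and B (axis 2).
family = np.array([[[3., 1.], [0., 2.]],
                   [[1., 2.], [4., 3.]]])            # N_{ijk}, shape (2, 2, 2)
N_cases = family.sum()                               # N = 16
substructures = [(1,), (1, 2)]                       # parent sets {A} and {A, B}
counts = [submodel_counts(family, kept) for kept in substructures]
lambdas = [map_lambda(c, np.ones_like(c)) for c in counts]
phis = map_phi(A=np.array([6., 10.]),                # illustrative A_{i,m_i}, summing to N
               alpha=np.ones(2), N_cases=N_cases)
cpt_bar = collapse_cpt(phis, substructures, lambdas, full_shape=family.shape)
print(cpt_bar.sum(axis=0))                           # each parent configuration sums to 1
```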

4 EMPIRICAL RESULTS

4.1 ARTIFICIAL EXPERIMENT


The performance of the BMN is analysed for the artificially-created true network shown in Figure 1. The $V = 4$ nodes One, Two, Three and Four have 3, 2, 2 and 3 discrete states, respectively. The conditional probability tables were randomly initialised, and the resulting true model was used to randomly generate $N_{\mathrm{train}} = 100$ complete cases of training data and $N_{\mathrm{test}} = 2000$ complete cases of testing data.
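The data-generation step can be pictured with a short ancestral-sampling sketch. The arc set below is a placeholder (the actual arcs of Figure 1 are not recoverable from this text), and the CPT initialisation is simply uniform random followed by normalisation.

```python
# Ancestral sampling from a small discrete BN with randomly initialised CPTs,
# in the spirit of the artificial experiment. The arcs are placeholders, not
# the true structure of Figure 1; the node ordering is assumed topological.
import numpy as np

rng = np.random.default_rng(0)
states = {"One": 3, "Two": 2, "Three": 2, "Four": 3}
parents = {"One": [], "Two": ["One"], "Three": ["One"], "Four": ["Two", "Three"]}

def random_cpt(node):
    shape = tuple(states[p] for p in parents[node]) + (states[node],)
    cpt = rng.random(shape)
    return cpt / cpt.sum(axis=-1, keepdims=True)     # normalise over node states

cpts = {n: random_cpt(n) for n in states}

def sample_case():
    case = {}
    for n in states:                                 # insertion order is topological here
        k = tuple(case[p] for p in parents[n])
        case[n] = int(rng.choice(states[n], p=cpts[n][k]))
    return case

train = [sample_case() for _ in range(100)]          # N_train = 100
test = [sample_case() for _ in range(2000)]          # N_test = 2000
```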

[Figure 1: The true model structure, over the nodes One, Two, Three and Four.]

Bayesian model selection (Ramoni and Sebastiani 1997) with uniform prior model weights results in the maximum-likelihood choice of the class of models with the full structure shown in Figure 2. This equivalence class contains 24 models, corresponding to the full structure for each of the $V! = 24$ possible node orderings. In total, there are $V! \cdot 2^{V(V-1)/2} = 1536$ different possible network models.
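The counts quoted above follow directly from the formulas in Section 2.3, as this quick arithmetic check (illustrative only) confirms.

```python
# With V = 4 nodes: V! = 24 orderings, 2^{V(V-1)/2} = 64 structures per
# ordering, giving the 1536 ordering/structure pairs mentioned in the text;
# the full-structure equivalence class contributes one model per ordering.
from math import factorial

V = 4
print(2 ** (V * (V - 1) // 2))                   # 64 structures per ordering
print(factorial(V) * 2 ** (V * (V - 1) // 2))    # 1536
print(factorial(V))                              # 24 models in the full-structure class
```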

[...] poorly on the testing data, which indicates overfitting. Apart from these and even worse-performing models, there is a band of models with roughly linear correspondence between the two scores, with the empty BN with independent nodes (1) at the low-performance end and the true BN (2) at the high-performance end.

[Figure 3: Performances of conventional BN models. Plot titled "Cross-validation of maximum-likelihood model selection"; axes: mean log likelihood for the training data versus mean log likelihood for the testing data; legend: all models, 1. empty model, 2. true model, 3. full model, 4. selected model.]

[Plot titled "Cross-validation of Bayesian mixture-network models"; axes: mean log likelihood for the training data versus mean log likelihood for the testing data; legend: 1. empty BN model, 2. true BN model, 3. full BN model, 4. mixture model based on full BN, 5. mixture model with restricted parents ...]
