
A Bayesian Network Classifier with Inverse Tree Structure for Voxelwise Magnetic Resonance Image Analysis

Rong Chen
Department of Radiology, University of Pennsylvania, Philadelphia, PA 19104, USA
[email protected]

Edward H. Herskovits
Department of Radiology, University of Pennsylvania, Philadelphia, PA 19104, USA
[email protected]

ABSTRACT

We propose a Bayesian-network classifier with inverse-tree structure (BNCIT) for joint classification and variable selection. The problem domain of voxelwise magnetic-resonance image analysis often involves millions of variables but only dozens of samples. Judicious variable selection may render classification tractable, avoid over-fitting, and improve classifier performance. BNCIT embeds the variable-selection process within the classifier-training process, which makes this algorithm scalable. BNCIT is based on a Bayesian-network model with inverse-tree structure, i.e., the class variable C is a leaf node, and predictive variables are parents of C; thus, the classifier-training process returns a parent set for C, which is a subset of the Markov blanket of C. BNCIT uses voxels in the parent set, and voxels that are probabilistically equivalent to them, as variables for classification of new image data. Since the data set has a limited number of samples, we use the jackknife method to determine whether the classifier generated by BNCIT is a statistical artifact. In order to enhance stability and improve classification accuracy, we model the state of the probabilistically equivalent voxels with a latent variable. We employ an efficient method for determining states of hidden variables, thus reducing dramatically the computational cost of model generation. Experimental results confirm the accuracy and efficiency of BNCIT.

Categories and Subject Descriptors
G.3 [Probability and Statistics]: Multivariate statistics; J.3 [Life and Medical Sciences]: Medical information systems

General Terms
Algorithms

Keywords
Bayesian network, Markov blanket, Classifier, Magnetic Resonance Image

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
KDD'05, August 21-24, 2005, Chicago, Illinois, USA.
Copyright 2005 ACM 1-59593-135-X/05/0008 ...$5.00.

1. INTRODUCTION

1.1 Motivation

Many domains of classification involve thousands or even millions of variables. Consider a binary supervised-learning problem with class variable C, in which an instance is mapped to an element of {+, -}. (Notation: we use X to represent a variable and x to denote a specific value assumed by that variable; we denote a set of nodes by X and an assignment of states by x.) C = + and C = - are referred to as the positive class and the negative class, respectively. A data set D with n instances has m observed (potentially) predictive variables V = {V_i} and one class variable C: D = {(V^{(i)}, C^{(i)}) : V^{(i)} \in V^m, C^{(i)} \in \{+, -\}}_{i=1}^{n}; m is the dimension of D. Given D, a central goal of classifier training is to generate a model that most accurately predicts the class variable based on a set of predictive variables. Problem domains such as voxelwise analysis of magnetic-resonance (MR) images, spatial mining of geological data, and text classification are especially challenging because in these contexts m is on the order of 10^3 to 10^8. In this manuscript, we concentrate on the voxelwise analysis of MR images, where each voxel in an image is considered to be a variable, and C is a function variable, which can either be a demographic variable, such as age, or a clinical variable reflecting performance on a neuropsychological battery of tests. The number of voxels in a brain MR image volume varies widely, but may range from 10^5 to 10^7, depending on spatial resolution. In this domain, determining the optimal classifier from D directly is infeasible because of computational cost. Variable selection, which identifies a subset R of the variables V, has the potential to ameliorate the curse of dimensionality, avoid over-fitting, reduce the classification-error rate, and reduce computation time. Another critical issue in voxelwise MR image analysis is small sample size, relative to the number of variables. A typical MR-based study may be based on only dozens of instances (subjects), perhaps on the order of 100. In this context, it is important to validate statistically the classification model in order to avoid over-fitting.

We propose a joint classifier-training and variable-selection algorithm, based on a Bayesian-network classifier with inverse-tree structure. We also describe a validation method to determine whether the classifier generated from the training data is artifactual. To capture the associations among probabilistically equivalent voxels, we employ a latent-variable model, in which the latent variable represents the state of these variables. We use the iterated conditional modes (ICM) algorithm [2] to find the maximum of the posterior marginal estimate for the latent variable. However, inferring the state of the latent variable is very computationally demanding. We propose a novel method that, in practice, dramatically reduces the computational requirements for computing the state of the latent variable.

1.2 Related Work

There are two different approaches to variable selection [25, 19]. The filter method considers variable selection to be a preprocessing step, and therefore independent of classification. For example, we can calculate the pairwise correlation coefficients between V_i and C; then sort variables by correlation coefficient; and finally select the highest-ranked variables for constructing a classifier. Regardless of the classification model, variable selection is independent of the classifier-generation process. On the other hand, the embedded approach integrates variable selection and classifier training, and therefore simultaneously selects variables and trains the classifier.

The classifier that we propose in this paper is based on a Bayesian-network representation. There exist different Bayesian-network models, such as naïve Bayes [15, 28, 14], tree-augmented naïve Bayes (TAN) [16], BN-augmented naïve Bayes (BAN) [7], and the general BN classifier (GBN) [7, 8]. The primary difference among these classifiers centers on various restrictions on allowed BN structures. The details of these classifiers are presented in Section 2.2.

The remainder of this paper is organized as follows. Section 2 provides an introduction to Bayesian networks and Bayesian-network classifiers. The details of the BNCIT algorithm are presented in Section 3. Section 4 provides experimental results to verify the accuracy of BNCIT. We conclude in Section 5.

2. BACKGROUND

2.1 Bayesian-Network Generation

A Bayesian network (BN) [29, 24, 5, 20] is a probabilistic graphical model defined by a pair B = (G, Θ). G = (V, E) is a directed acyclic graph (DAG) called the structure of the BN. Each X ∈ V represents a random variable in the problem domain, and V is the set of variables. The edge set E denotes probabilistic associations among nodes. For a given node X ∈ V, a parent of X is a node from which there exists a directed edge to X. We denote the set of parents of X as pa(X). The conditional probability θ_ijk = P(X_i = k | pa(X_i) = j) is the probability that variable X_i assumes state k when pa(X_i) assume states j. If X_i has no parents, then θ_ijk corresponds to the marginal probability of X_i. We denote the distribution of X_i with fixed parent states j by θ_ij, and we denote the conditional-probability table (CPT) of X_i by θ_i. Θ, the set of all θ_ijk, represents the parameters of a BN.

Given a data set D, we can train a BN classifier by inducing a BN B from D. This process includes determining the structure and parameters of B. The search-and-score approach to BN-structure generation poses this process as an optimization problem. It defines a metric that describes the goodness of fit of the candidate BN structure to the observed data; an optimization method is then used to find the structure that maximizes or minimizes the metric. Commonly used scores include various Bayesian scores [22, 10, 21] and information-theoretic scores [4, 27, 32].

One of the most widely used structure-learning metrics is the Bayesian score [22, 10]. To use this Bayesian score, we assume that: (1) variables are discrete and there are no missing values; (2) instances in D occur independently, given the BN model of D; and (3) the prior of the parameters (when the structure is fixed) is Dirichlet: p(θ_ij | S) ~ Dirichlet(α_ij1, α_ij2, ..., α_ijr_i). Let N_ijk be the number of samples in D for which X_i = k and pa(X_i) = j; N_ijk is a sufficient statistic for θ_ijk. Then the posterior distribution of θ_ij is also Dirichlet: p(θ_ij | S, D) ~ Dirichlet(α_ij1 + N_ij1, α_ij2 + N_ij2, ..., α_ijr_i + N_ijr_i). We can then write

p(D | G) = BDe(G, D) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} \prod_{k=1}^{r_i} \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})},    (1)

where N_ij = \sum_{k=1}^{r_i} N_{ijk} is a sufficient statistic for θ_ij and α_ij = \sum_{k=1}^{r_i} α_{ijk}. p(D | G) is called the Cooper-Herskovits scoring metric. In this paper, we assume that the {α_ijk} are equal, and we refer to this metric as the BDeu metric. Two widely used BDeu metrics are obtained by setting α_ijk = 1 and α_ijk = 1/(r_i × q_i) [21]. Having defined a metric, the next step is to select an optimization procedure for generating candidate BN structures. This search problem is NP-hard, so one must employ heuristic search methods, such as greedy search, greedy search with restarts, best-first search, or Monte-Carlo methods, to find an acceptable solution. For each variable X_i, its sufficient statistics can be listed in a table similar to the CPT of X_i. An example of a sufficient-statistics table (SST) for a binary variable Z with pa(Z) = {X, Y} is shown in Table 1.

X  Y  Z=0  Z=1
0  0   10   20
0  1   15    9
1  0    1   70
1  1   51   12

Table 1: Sufficient-statistics table for Z with pa(Z) = {X, Y}.

Once a BN structure has been heuristically selected, the maximum-likelihood (ML) [3, 33] estimate of θ, which maximizes the sample likelihood with respect to θ, is

\hat{\theta}_{ijk}^{ML} = \frac{N_{ijk} + 1}{N_{ij} + r_i}.    (2)

The maximum a posteriori (MAP) estimate of θ [3, 33], which maximizes the posterior distribution p(θ | D), is

\hat{\theta}_{ijk}^{MAP} = \frac{\alpha_{ijk} + N_{ijk}}{\alpha_{ij} + N_{ij}}.    (3)
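Equation (1) operates directly on sufficient-statistics tables such as Table 1. As an illustration only (not the authors' implementation), the following Python sketch computes the log BDe contribution of a single variable from its SST, under the equal-hyperparameter (BDeu) assumption used in the paper:

import math

def log_bde_for_variable(sst, alpha_ijk=1.0):
    """Log BDe score contribution of one variable.

    sst: list of rows, one per parent configuration j; each row holds the
         counts N_ijk over the r_i states of the variable.
    alpha_ijk: Dirichlet hyperparameter, assumed equal for all cells
               (the BDeu convention used in the paper).
    """
    score = 0.0
    for row in sst:                      # one parent configuration j
        alpha_ij = alpha_ijk * len(row)  # alpha_ij = sum_k alpha_ijk
        n_ij = sum(row)                  # N_ij = sum_k N_ijk
        score += math.lgamma(alpha_ij) - math.lgamma(alpha_ij + n_ij)
        for n_ijk in row:                # states k of the variable
            score += math.lgamma(alpha_ijk + n_ijk) - math.lgamma(alpha_ijk)
    return score

# Table 1: SST for binary Z with pa(Z) = {X, Y}; rows are (X, Y) = 00, 01, 10, 11.
sst_z = [[10, 20], [15, 9], [1, 70], [51, 12]]
print(log_bde_for_variable(sst_z, alpha_ijk=1.0))

Summing such per-variable terms over all variables of a candidate structure gives log p(D | G).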


2.2 Bayesian Network Classifiers


We can use a BN with C as one of its nodes for classification [15, 16]. After we have trained a classifier, classification is performed using Bayes' rule: the value of C is determined from C* = arg max_c P(C = c | V). P(C | V) can be calculated using BN-inference algorithms, such as the junction-tree algorithm [11]. For the case in which all variables in new instances assume known states, as is often the case for image data, only nodes in the Markov blanket mb(C) of C will affect classification, so we can simplify the classifier to include only nodes in mb(C). Let I(X; Y | Z) represent the statement that X is conditionally independent of Y given Z. The Markov blanket of node X is defined as the smallest set such that I(X; Y | mb(X)) for all Y ∈ V \ ({X} ∪ mb(X)) [29, 26]. In a DAG, the union of X's parents, its children, and the parents of its children forms mb(X). A BN classifier that is generated from data based on the approach described above differs from other classification algorithms in that it does not directly minimize some type of classification error. Instead, this BN classifier models the joint distribution of the data.

Figure 1 illustrates the structures of different BN classifiers. Naïve Bayes (Figure 1 (a)) assumes that the predictive variables {V_i} are conditionally independent of each other given C. This assumption seems unrealistic in many applications, yet often results in accurate classification in practice [16, 30]. In addition, Domingos and Pazzani [14] have shown that naïve Bayes may be optimal under zero-one loss, even when this assumption does not hold. Generating naïve Bayes classifiers, and performing inference on them, are often computationally tractable in practice, even for large data sets. TAN (Figure 1 (b)) is a BN classifier that may improve the classification performance of naïve Bayes by relaxing the conditional-independence assumption: it allows a tree structure among the child nodes of C. BAN (Figure 1 (c)) also relaxes the conditional-independence assumption of naïve Bayes, in this case by allowing an unrestricted BN structure among the child nodes. TAN and BAN are both naïve Bayes-based classifiers. As opposed to them, GBN (Figure 1 (d)) uses the Markov blanket of C for classification. In at least some domains, GBN performs better than naïve Bayes-based classifiers, achieving lower classification error rates [7], at the cost of greater computational demands.
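As the preceding paragraph notes, in a DAG the Markov blanket of a node can be read directly from the structure. A minimal Python sketch of that rule (our illustration, with the DAG supplied as a hypothetical parent dictionary; not code from the paper):

def markov_blanket(parents, x):
    """Markov blanket of node x in a DAG.

    parents: dict mapping each node to the set of its parent nodes.
    Returns the union of x's parents, x's children, and the children's
    other parents.
    """
    children = {v for v, pa in parents.items() if x in pa}
    spouses = set()
    for child in children:
        spouses |= parents[child] - {x}
    return set(parents.get(x, set())) | children | spouses

# Example: inverse-tree structure of Figure 1(e), where V1 and V2 point to C.
dag = {"C": {"V1", "V2"}, "V1": set(), "V2": set(), "V3": {"V2"}}
print(markov_blanket(dag, "C"))   # {'V1', 'V2'}

In the inverse-tree structure of Figure 1 (e), C has no children, so its Markov blanket in that structure is just its parent set.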


Figure 1: Illustration of structures of Bayesian-network classifiers. (a) Naïve Bayes. (b) TAN. (c) BAN. (d) GBN. (e) BNCIT.

2.3 Data Preprocessing

As an example of a complex data set, we describe the analysis of a study of cerebral-volume changes among different groups over time. We have MR images for two time points, t1 and t2, along with a class variable, C, for each subject in our study. These MR images are skull-stripped to remove the skull and non-brain tissues. The images are then segmented into gray matter, white matter, and cerebrospinal fluid [18]. In order to compare brain volumes acquired from different subjects and time points, we normalize these images by registering them to a standard coordinate system. For each subject, this registration process generates a RAVENS map, which is a density map defined on a stereotaxic canonical space, whose voxel values represent regional volumetric measurements for that image [13, 12, 31]; this RAVENS map forms the voxel-wise morphometric data for each subject. To correct for registration error, we apply an isotropic Gaussian smoothing kernel to each RAVENS map. Subsequently, we generate volumetric difference maps by subtracting the t1 from the t2 RAVENS map for each subject. If a region has undergone volume reduction, the voxel intensity at a particular location in the t2 RAVENS map will be lower than that at the corresponding location in the t1 RAVENS map. Therefore, we can binarize the volumetric difference map as follows: a voxel in the difference map with positive value is set to state 1 ("volume loss"); otherwise, it is set to state 0 ("no volume loss"). These binary difference maps, along with the class variable, C, constitute D, the input to our algorithm.

3. BAYESIAN NETWORK CLASSIFIERS WITH INVERSE-TREE STRUCTURE

3.1 Methods

3.1.1 Variable Selection

The BN classifiers introduced in Section 2.2 may encounter problems when applied to a data set with very high dimensionality m. Naïve Bayes-based classifiers, for example, cannot handle such large numbers of variables directly. It is often necessary, therefore, to select a subset of variables before a classifier is generated from data. Variable-selection methods, such as variable ranking based on mutual information, and variable filtering using decision trees, are often used in this setting. However, the variable-selection process is then independent of classifier learning, so the selected variables are not optimized for the particular classification approach. In contrast, for methods based on the GBN representation, variable selection is embedded in the classifier-generation process, because this process determines the Markov blanket of C and uses mb(C) as the variable set for classification. However, the computational cost of applying GBN in the setting of high dimensionality makes it impossible to find mb(C).
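To make the filter approach named above concrete, mutual-information ranking of binary voxel variables against C could be sketched as follows; this is our own illustrative baseline, not part of BNCIT:

import math
from collections import Counter

def mutual_information(x, c):
    """Empirical mutual information between two discrete sequences."""
    n = len(x)
    px, pc, pxc = Counter(x), Counter(c), Counter(zip(x, c))
    mi = 0.0
    for (xv, cv), nxc in pxc.items():
        pj = nxc / n
        mi += pj * math.log(pj / ((px[xv] / n) * (pc[cv] / n)))
    return mi

def rank_voxels(columns, c, top=10):
    """columns: dict voxel_id -> list of binary states; c: class labels."""
    scores = {v: mutual_information(col, c) for v, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top]

A classifier would then be trained on only the top-ranked voxels; as the text notes, this ranking is decoupled from the classifier, which is the limitation that BNCIT's embedded approach avoids.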


Algorithm 1 BNCIT Learning Algorithm
1:  while V ≠ ∅ and A ≠ ∅ do
2:    n = 1, pa(C) = ∅, V = V;
3:    P_old = BDeu(C, pa(C));
4:    done = false;
5:    while done == false do
6:      A = {V_i : V_i ∈ V, BDeu(C, pa(C) ∪ V_i) − P_old > 0};
7:      if A ≠ ∅ then
8:        R_n = the node in A that maximizes BDeu(C, pa(C) ∪ R_n);
9:        P_old = BDeu(C, pa(C) ∪ R_n);
10:       pa(C) = pa(C) ∪ R_n;
11:       if n = 1 then
12:         Generate the probabilistically equivalent set E from V \ A;
13:       end if
14:       V = V \ {E ∪ {R_n}};
15:       n = n + 1;
16:     else
17:       done = true;
18:     end if
19:   end while
20: end while

BNCIT is a method that integrates the selection of a subset of variables and the construction of a classifier based on these variables. The family of Bayesian networks generated by the BNCIT algorithm is shown in Figure 1 (e). First, the classifier-training process generates a parent set for C; we then use the ML or MAP method to estimate the parameters. Variables in pa(C) naturally form a subset of the variables in mb(C), which are most informative about C.

BNCIT's BN-construction process is illustrated in Algorithm 1. Let R be pa(C). BNCIT starts with an empty R. Iteratively, variables are incorporated into R: the parent set after iteration i is obtained by adding the parent node R_i selected in that iteration to the parent set from iteration i − 1. Consider δ(V), the difference between the BDeu metric of the network with edge V → C added to the current BN structure and that of the network without this edge. If δ(V) > 0, the BN with this edge is more likely to have generated the data than the BN without this edge. We search for the R_i that maximizes δ(R), and add R_i → C to R. If there is no R such that δ(R) > 0, then the network-generation process stops.

Generally, D is noisy due to the small number of samples. Furthermore, using a single voxel, such as R_i, for classification makes the classifier unstable. To correct these problems, we identify a set of variables that are probabilistically equivalent to each R_i, in the hope that the classifier will be stabilized by using these voxels jointly. Toward this end, we define variable V to be probabilistically equivalent to variable R if a similarity measure s(V, R) is large. For binary V and R, one measure of probabilistic equivalence is P(V = 0, R = 1) ≈ 0 and P(V = 1, R = 0) ≈ 0. We can then use Bayesian thresholding [23] or belief-map learning [6] to find these probabilistically equivalent variables.

An important property of this variable-selection process is that R is a subset of mb(C). Let D be a collection of independent and identically distributed samples from a probability distribution P. Let G be a DAG and G' be G with the edge X_i → X_j added. A scoring metric S(G, D) is called locally consistent if (1) S(G', D) < S(G, D) when I_P(X_i, X_j | pa_G(X_j)) holds, and (2) S(G', D) > S(G, D) when I_P(X_i, X_j | pa_G(X_j)) does not hold. The BDeu score is locally consistent [9]. In BNCIT, if a variable outside mb(C) were to be added to R, voxels inside mb(C) would be added before it, because of local consistency of the BDeu score. Furthermore, if all variables in mb(C) were in R, they would prevent other variables from being added. Thus all variables in R are inside mb(C). BNCIT may not find the complete set mb(C), but it will not add variables outside mb(C) to R. Therefore, R consists of variables that are predictive of C.

The BDeu score incorporates a tradeoff between goodness of fit and model complexity. A complicated structure that fits the data well, but has poor predictive power, will not maximize the BDeu score. The number of variables in R is referred to as the dimensionality of BNCIT. The BDeu score guarantees that BNCIT will generate a BN with low model order; however, we can also set an upper bound on model order to ensure that BNCIT generates a parsimonious model. As we show in Section 4, BNCIT's training and classification processes are efficient because of the compactness of the BN model generated by BNCIT.
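The greedy construction of pa(C) described above can be sketched in Python as follows; this is our simplified illustration (it assumes a caller-supplied BDeu-style scoring function and omits the handling of probabilistically equivalent sets in Algorithm 1):

def greedy_parent_selection(candidates, score):
    """Greedy forward selection of pa(C).

    candidates: iterable of candidate voxel variables.
    score: function mapping a parent set (frozenset) to its BDeu-style
           score for C; assumed to be supplied by the caller.
    """
    parents = frozenset()
    remaining = set(candidates)
    best = score(parents)
    while remaining:
        # Keep only candidates whose addition improves the score (the set A).
        gains = {v: score(parents | {v}) - best for v in remaining}
        improving = {v: g for v, g in gains.items() if g > 0}
        if not improving:
            break                       # no edge V -> C improves the metric
        v_star = max(improving, key=improving.get)
        parents = parents | {v_star}
        best = score(parents)
        remaining.discard(v_star)
    return parents

# Toy usage with a made-up additive score (for illustration only):
toy_gain = {"v1": 2.0, "v2": 0.5, "v3": -1.0}
print(greedy_parent_selection(toy_gain, lambda s: sum(toy_gain[v] for v in s)))

The actual algorithm additionally removes voxels probabilistically equivalent to the newly added parent from the candidate pool (lines 12-14 of Algorithm 1), which is part of what keeps the search tractable at voxel scale.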

3.1.2 Model Validation

In an MR image study, D usually contains only dozens or perhaps 100 samples. In a machine-learning problem that involves millions of variables but dozens of samples, we must be cautious in interpreting the results, as they could represent statistical artifacts. Therefore we have introduced a step to validate BNCIT's results. To validate the networks that BNCIT generates, we use the standard jackknife method, a resampling method that introduces perturbed versions of the original D, then uses statistics to describe how perturbation affects network generation. Toward this end, BNCIT generates n data sets D_i, each obtained by deleting one sample from D. Let R be the parent set generated from D, and R_i that generated from D_i. We define the frequency of a model as

f(R) = \frac{1}{n} \sum_{i=1}^{n} I_{[R = R_i]},    (4)

where I_{[R = R_i]} = 1 if R is the same as R_i, and 0 otherwise. We define the frequency of an edge as

f(E) = \frac{1}{n} \sum_{i=1}^{n} I_{[E \in R_i]}.    (5)

If a parent set R* dominates the model histogram, that is, f(R*) is significantly larger than those of any other parent sets, we infer with high statistical confidence that R* is a genuine pattern. Similarly, we can make claims regarding a voxel V such that V → C is a frequent edge in the observed models. If the parent set learned by BNCIT is not dominant, we may decrease the dimensionality of the model by 1, then repeat the validation process, until a dominant pattern is learned.
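As an illustration of this jackknife procedure (our sketch; learn_parent_set stands in for a full BNCIT training run and is an assumed, hypothetical callable):

from collections import Counter

def jackknife_frequencies(data, learn_parent_set):
    """Model and edge frequencies under leave-one-out perturbation.

    data: list of instances; learn_parent_set: callable returning the
    parent set (an iterable of voxel ids) learned from a data set.
    """
    n = len(data)
    model_counts, edge_counts = Counter(), Counter()
    for i in range(n):
        perturbed = data[:i] + data[i + 1:]          # delete one sample
        r_i = frozenset(learn_parent_set(perturbed))
        model_counts[r_i] += 1                       # for f(R), Eq. (4)
        for edge in r_i:
            edge_counts[edge] += 1                   # for f(E), Eq. (5)
    f_model = {m: c / n for m, c in model_counts.items()}
    f_edge = {e: c / n for e, c in edge_counts.items()}
    return f_model, f_edge

A dominant entry in f_model (or in f_edge) is then taken as evidence that the learned parent set (or edge) is not a statistical artifact.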

3.1.3 Latent Variable Model and Incremental Updating Algorithm

For each R ∈ R, there is a set of variables E that are probabilistically equivalent to R. Using them jointly, rather than any single variable, should stabilize the classifier and improve classification accuracy.


For cluster i, defined as {R_i, E_i}, we introduce a binary latent variable L_i to represent the state of cluster i, as shown in Figure 2. We have

p(L, R, E) = p(L) p(R | L) \prod_{i=1}^{n} p(E_i | L),

and we try to find L that maximizes p(L, R, E). Again, p(L, R, E) can be computed using BDeu.

Figure 2: Latent Variable Model.

Let Q be the data set with k instances containing only {L, R, E}. We observe the states of {R_j, E_j} for instances j = 1, 2, ..., k. The goal is to infer the state of the latent variable L for each instance in Q. There does not exist a generally applicable optimization technique that solves this problem, even for small k. The maximum of the posterior marginal (MPM) estimate of L is defined as

L_j^{MPM} = \arg\max_{L_j} BDeu(L \setminus L_j, R, E),

where L_j is the state of L for instance j. There are different techniques for obtaining the MPM estimate. Gibbs sampling is a stochastic relaxation technique that can return a globally optimal solution if an appropriate temperature schedule is chosen [17]. In contrast, ICM is a deterministic greedy-search algorithm [2], which will often return a locally optimal result. Furthermore, both of these approaches are based on computation of BDeu(L \setminus L_j, R, E), and are subsumed under the generalized-EM framework. Because Gibbs sampling is very computationally intensive, we have implemented ICM in BNCIT to obtain L^{MPM}. In each iteration, ICM sequentially updates L_j, by calculating BDeu(L \setminus L_j, R, E) for all possible states of L_j, and then selecting the state that maximizes this metric.

To infer the hidden state of L, BNCIT computes the BDeu score 2k times during each iteration, which requires a great amount of time. Consider a particular iteration. Let Q_j^u be the data set with L_j = u. The difference between Q and Q_j^u is only one instance. If we can derive an equation for ∆_j^u = BDeu(Q_j^u) − BDeu(Q), then an incremental computation algorithm for the BDeu score follows: (1) compute BDeu(Q) at the beginning of each iteration; (2) use ∆_j^u to compute BDeu(Q_j^u); (3) find L_j^{MPM} based on BDeu(Q_j^u) for each instance. After iteration i, we update Q with L^{MPM} and proceed to the next iteration. We present here the equation for ∆_j^u.

In Equation (1), the computation of the BDeu metric is carried out variable by variable. Each instance has an entry index in the SST for each variable. For a fixed binary variable X, when updating the current instance A with a new instance B, there are two different cases, as shown in Figures 3 and 4. We consider first what we call case 1 (Figure 3), in which the indices of instances A and B (I_A and I_B) are in the same row in the SST. Note that if I_A = I_B, there is no change in the BDeu score. Let a and b be the sufficient statistics for I_A and I_B, respectively. The logarithm of the BDeu score of a row in the SST is

\log \frac{\Gamma(\alpha_{ij})}{\Gamma(\alpha_{ij} + N_{ij})} + \sum_{k=1}^{r_i} \log \frac{\Gamma(\alpha_{ijk} + N_{ijk})}{\Gamma(\alpha_{ijk})}.    (6)

Figure 3: Case 1: sample A and sample B have same parent states.

Figure 4: Case 2: sample A and sample B have different parent states.

Since the sufficient statistics other than a and b do not change, the corresponding BDeu scores are the same before and after updating. The difference exists in the states in positions I_A and I_B; they change from (a, b) to (a − 1, b + 1). In case 1, therefore, the first term in Equation (6) does not change, since N_ij is unchanged. We have

∆ = \log \frac{\Gamma(\alpha_{ijk} + a - 1)}{\Gamma(\alpha_{ijk})} + \log \frac{\Gamma(\alpha_{ijk} + b + 1)}{\Gamma(\alpha_{ijk})} - \log \frac{\Gamma(\alpha_{ijk} + a)}{\Gamma(\alpha_{ijk})} - \log \frac{\Gamma(\alpha_{ijk} + b)}{\Gamma(\alpha_{ijk})}
  = \log(\alpha_{ijk} + b) - \log(\alpha_{ijk} + a - 1).    (7)

We now consider case 2, in which I_A and I_B are in different rows. Let N_{ij}^a be the N_ij for row a and N_{ij}^b be the N_ij for row b. Then

∆ = -\log \Gamma(\alpha_{ij} + N_{ij}^a - 1) + \log \frac{\Gamma(\alpha_{ijk} + a - 1)}{\Gamma(\alpha_{ijk})} - \log \Gamma(\alpha_{ij} + N_{ij}^b + 1) + \log \frac{\Gamma(\alpha_{ijk} + b + 1)}{\Gamma(\alpha_{ijk})} + \log \Gamma(\alpha_{ij} + N_{ij}^a) - \log \frac{\Gamma(\alpha_{ijk} + a)}{\Gamma(\alpha_{ijk})} + \log \Gamma(\alpha_{ij} + N_{ij}^b) - \log \frac{\Gamma(\alpha_{ijk} + b)}{\Gamma(\alpha_{ijk})}
  = \log(\alpha_{ij} + N_{ij}^a - 1) - \log(\alpha_{ijk} + a - 1) - \log(\alpha_{ij} + N_{ij}^b) + \log(\alpha_{ijk} + b).    (8)

Compared with the whole-sample method, our incremental-updating algorithm greatly reduces computational costs. For the whole-sample approach, from Equation (1), we must compute the Γ or log-Γ function 4 r_i q_i times for variable X_i. Suppose E(q_i) = q and E(r_i) = r. The total number of log-Γ operations for the whole-sample method is 4rqk.


The computational cost of the numerical approximation of log-Γ is approximately 3 times that of ∆. The reduction in computational cost of the incremental-updating algorithm relative to the whole-sample method is therefore a factor of 1/(3 × 4rqk). For a small data set with r = 2, q = 1, and k = 20, this is on the order of 10^-3.
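The two update formulas can be written directly as code. A small sketch (ours, not the authors' implementation) of the per-row deltas of Equations (7) and (8):

import math

def delta_case1(alpha_ijk, a, b):
    """Eq. (7): counts in one SST row change from (a, b) to (a-1, b+1)."""
    return math.log(alpha_ijk + b) - math.log(alpha_ijk + a - 1)

def delta_case2(alpha_ij, alpha_ijk, a, b, n_ij_a, n_ij_b):
    """Eq. (8): the updated instance leaves row a and enters row b of the SST.

    a, b: counts of the relevant state in rows a and b before the update;
    n_ij_a, n_ij_b: row totals N_ij of rows a and b before the update.
    """
    return (math.log(alpha_ij + n_ij_a - 1) - math.log(alpha_ijk + a - 1)
            - math.log(alpha_ij + n_ij_b) + math.log(alpha_ijk + b))

Within an ICM sweep, one would evaluate such deltas for both candidate states of L_j over the affected SST rows and keep the state with the larger incremental score, instead of recomputing Equation (1) from scratch.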

3.2 BNCIT for MR Image Analysis

A typical application of BNCIT to analyze MR images is as follows: a training data set is provided, containing each subject's MR images and a class variable, such as whether a subject has Alzheimer's disease; we train a classifier by applying BNCIT to the training data set; we then use the classifier to predict the probability of a new subject's having Alzheimer's disease, based on the MR images for this new subject. To use BNCIT, we first preprocess the MR images in the training data set using the protocol described in Section 2.3. The output of this step is a set of difference maps that are registered to a standard coordinate space. BNCIT then selects a set of variables using the variable-selection method described in Section 3.1.1 and the validation method in Section 3.1.2. These variables can be divided into different clusters; each cluster consists of a variable in R and its probabilistically equivalent variables. BNCIT then introduces a latent variable to represent the state of each cluster. The algorithm described in Section 3.1.3 infers the unobserved values of the latent variables. BNCIT then generates a BN classifier, with a structure such that the latent variables are the parents of the class variable. This classifier is used to label new instances. Figure 5 shows an overview of this process. When we have MR images for a new subject, after preprocessing, choosing the selected variables, and inferring the states of the latent variables, the probability that this subject is in a particular class can be predicted by the BN classifier generated by BNCIT from the training data.

Figure 5: The procedure of using BNCIT for MR image analysis.
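Once the latent states are known, labeling a new instance amounts to reading P(C | pa(C)) from the classifier's conditional-probability table, estimated for example with the MAP rule of Equation (3). A self-contained toy illustration (the counts below are invented, not taken from the paper):

def map_cpt(counts, alpha=1.0):
    """MAP estimate of P(C | pa(C)) per Eq. (3) from joint counts.

    counts: dict mapping parent-state tuples to [n_minus, n_plus].
    """
    cpt = {}
    for pa_state, (n_neg, n_pos) in counts.items():
        denom = 2 * alpha + n_neg + n_pos   # alpha_ij = 2 * alpha for binary C
        cpt[pa_state] = ((alpha + n_neg) / denom, (alpha + n_pos) / denom)
    return cpt

# Toy counts of C given two latent cluster states (L1, L2).
counts = {(0, 0): [12, 3], (0, 1): [5, 9], (1, 0): [4, 10], (1, 1): [1, 15]}
cpt = map_cpt(counts)
print(cpt[(1, 1)])   # (P(C = - | L1=1, L2=1), P(C = + | L1=1, L2=1))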

4. EXPERIMENTAL RESULTS

We present our evaluation of BNCIT in this section. The data sets we used for evaluation contain associations between cerebral morphology and age or sex, in 119 subjects; data for each subject consist of 128 × 128 × 94 = 1,540,096 voxel variables and one class variable. Age = {+, -} represents a subject's age in [60, 70] or [70, 80], respectively. Sex = {+, -} represents male or female sex, respectively. We obtained the data used in this experiment from normal elderly subjects of the Baltimore Longitudinal Study of Aging [1]. We refer to the data set GM-Age as that consisting of gray-matter morphological changes associated with the class variable age; similarly, we refer to the data set GM-Sex as that consisting of gray-matter morphological changes associated with sex, WM-Age as that consisting of white-matter morphological changes associated with age, and WM-Sex as that consisting of white-matter morphological changes associated with sex. All variables in these data sets are binary; each voxel assumes the state 1 if there is volume loss in the corresponding RAVENS-map voxel location, or 0 if there is no volume loss.

First we define metrics for evaluating a classifier's accuracy. Let {Y, N} represent the positive and negative classifications produced by a classifier, respectively. The true-positive rate (TPR) and false-positive rate (FPR) are defined as

TPR = \frac{N_{+,Y}}{N_{+}}, \qquad FPR = \frac{N_{-,Y}}{N_{-}},

where N_{+,Y} is the number of instances for which the true class is positive and the classifier label is positive, and N_{+} is the number of instances for which the true class is positive. Classification accuracy is

accuracy = \frac{N_{+,Y} + N_{-,N}}{N_{+} + N_{-}} = \frac{TPR \cdot N_{+} + (1 - FPR) \cdot N_{-}}{N_{+} + N_{-}}.    (9)

Let Φ represent classification accuracy, and let CV(n) denote n-fold cross-validation. In n-fold cross-validation, the data are randomly divided into n partitions; the learning method is applied to n − 1 partitions, and the remaining partition is used for testing. We evaluate the classifier's performance by computing the average Φ across the different folds. In order to evaluate the classifier thoroughly, we vary the fold number n. The results, shown in Table 2, demonstrate that, overall, accuracy increases as the number of training instances increases.

A receiver operating characteristic (ROC) curve is a plot of TPR against FPR. BNCIT is a probabilistic classifier; thus, for the binary class variable C, BNCIT's output assumes the form (1 − p, p), where p = P(C = + | R). To generate an ROC curve from this output, we label C as '+' if p > τ, where τ is the decision threshold.
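As an illustration of these definitions (our sketch, assuming true labels in {'+', '-'} and predicted probabilities p = P(C = + | R)):

def tpr_fpr(labels, probs, tau):
    """TPR and FPR at decision threshold tau: predict '+' when p > tau."""
    n_pos = sum(1 for y in labels if y == '+')
    n_neg = len(labels) - n_pos
    tp = sum(1 for y, p in zip(labels, probs) if y == '+' and p > tau)
    fp = sum(1 for y, p in zip(labels, probs) if y == '-' and p > tau)
    return tp / n_pos, fp / n_neg

def accuracy(labels, probs, tau=0.5):
    """Eq. (9): fraction of instances whose label matches the prediction."""
    correct = sum(1 for y, p in zip(labels, probs)
                  if (y == '+') == (p > tau))
    return correct / len(labels)

def roc_points(labels, probs, steps=20):
    """(FPR, TPR) pairs obtained by sweeping the threshold tau."""
    points = []
    for t in range(steps + 1):
        tpr, fpr = tpr_fpr(labels, probs, t / steps)
        points.append((fpr, tpr))
    return points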


                GM-Age  GM-Sex  WM-Age  WM-Sex
CV(2)             0.79    0.86    0.87    0.82
CV(3)             0.78    0.81    0.85    0.83
CV(5)             0.84    0.90    0.87    0.87
CV(10)            0.88    0.90    0.87    0.87
Leave-one-out     0.88    0.90    0.87    0.88

Table 2: Classification accuracy for GM-Age, GM-Sex, WM-Age, and WM-Sex.

Figure 6 shows aggregated ROC graphs for all data sets for five-fold cross-validation. As expected, when τ → 1, BNCIT has a very low FPR and a low TPR; when τ → 0, BNCIT has a low false-negative rate; and when τ is in the middle of this range, BNCIT balances TPR and FPR. The areas under the ROC curves (AUCs) when applying BNCIT to the GM-Age, GM-Sex, WM-Age, and WM-Sex data sets are 0.87, 0.94, 0.90, and 0.89, respectively.

Figure 6: The aggregated ROC graph of CV(5). (a) GM-Age. (b) GM-Sex. (c) WM-Age. (d) WM-Sex.

In terms of the computational cost of BNCIT, our implementation on a Silicon Graphics (Mountain View, CA) Origin workstation required approximately an hour for variable selection, and 20 hours for inferring the latent variables. The computation time is related to the number of subjects, the number of voxels in the MR images, and the dimensionality of the model. The time to label a new instance is approximately 30 minutes. In addition, we compared the accuracies Φ (Equation 9) of a multi-layer perceptron (MLP), logistic regression, a decision tree, a prism rule set, naïve Bayes, a support-vector classifier, and bagging, with that of BNCIT, based on leave-one-out cross-validation on all data sets, with the selected variables. As shown in Table 3, BNCIT's accuracy is greater than those of the other classifiers, for all data sets. Note that we did not compare the AUCs of these classifiers because some classifiers had deterministic classification outputs and therefore did not have ROC graphs.

               GM-Age  GM-Sex  WM-Age  WM-Sex
BNCIT            0.88    0.90    0.87    0.88
MLP              0.87    0.83    0.87    0.87
Logistic Reg.    0.71    0.76    0.84    0.78
Decision Tree    0.84    0.89    0.78    0.83
Prism Rule       0.70    0.80    0.57    0.60
Naïve Bayes      0.71    0.76    0.84    0.78
SVM              0.67    0.76    0.84    0.66
Bagging          0.81    0.83    0.82    0.85

Table 3: Comparison of the accuracy of BNCIT with those of other classifiers for leave-one-out cross-validation on all data sets.

Finally, we measured statistical confidence in these results for these data sets. Using the validation method described in Section 3.1.2, we generated a series of data sets and computed the frequencies of different models and associations; we consider models, and associations among variables, that were generated more frequently to be more likely to be genuine, rather than statistical artifacts. The results for these data sets are shown in Figure 7. For GM-Sex, 34 models had non-zero frequencies during validation. There exists a dominant model, whose frequency is 0.52, which is at least 5 times greater than that of the next most frequent model. The dominant model corresponds exactly to the model that BNCIT generated from D, the full GM-Sex data set. Similarly, for association frequencies, there are 64 associations with non-zero frequencies, four of these being dominant. Again, these dominant associations correspond to those in the model that BNCIT generated from D. The validation results for the other data sets are similar to those for GM-Sex. These results demonstrate that the model generated by BNCIT is valid, rather than a statistical artifact.

5. CONCLUSION AND DISCUSSION

Voxel-wise MR image analysis can be treated as a machine-learning problem with a high-dimensional variable space and a limited number of observed samples. We propose a BN-based classifier, called BNCIT, to analyze these data. BNCIT incorporates the following major strengths:

1. Variable selection is embedded in classifier learning. BNCIT is therefore capable of handling data with very high dimensionality.

2. The classifier generated by BNCIT is stable, even when trained with a small number of samples; this stability is due to our implementation of a latent-variable model. Our incremental-updating method infers the state of the latent variable much more efficiently than standard methods, which use the entire sample on every iteration.

3. The classifier-training and variable-selection processes are computationally efficient. BNCIT returns a model in a reasonable amount of time using readily available hardware. The models generated by BNCIT are also computationally efficient classifiers.

In Section 4, we compared the accuracy Φ of BNCIT with those of other classification methods, such as MLP, for leave-one-out cross-validation. These results demonstrate that BNCIT achieves higher accuracy than other methods.

We believe that this result is explained by BNCIT's joint variable selection and classifier generation. For other types of classifiers, such as naïve Bayes, accuracy could be improved by adopting a variable-selection method that is designed particularly for naïve Bayes, if this is possible for data with such high dimensionality. Regardless of which variable-selection method we use, we need to know whether the difference in classification accuracy is statistically significant, or is due to random artifact. Toward this end, we can use resampling methods, such as the bootstrap, to resample the data, and compare overall performance. We plan to implement bootstrap resampling for BNCIT.

Figure 7: Validation results for all data sets. (a) f(R) of GM-Age; (b) f(E) of GM-Age; (c) f(R) of GM-Sex; (d) f(E) of GM-Sex; (e) f(R) of WM-Age; (f) f(E) of WM-Age; (g) f(R) of WM-Sex; (h) f(E) of WM-Sex.

Acknowledgements

This work was supported by The Human Brain Project, National Institutes of Health grant R01 AG13743, which is funded by the National Institute of Aging, the National Institute of Mental Health, and the National Cancer Institute.

6. REFERENCES

[1] Baltimore Longitudinal Study of Aging home page: http://www.grc.nia.nih.gov/branches/blsa/blsa.htm.
[2] J. Besag. Spatial interaction and statistical analysis of lattice systems. J. Royal Statistical Soc. Series B, 2, 1974.
[3] J. Binder, D. Koller, S. Russell, and K. Kanazawa. Adaptive probabilistic networks with hidden variables. Machine Learning, 29:213-244, 1997.
[4] R. R. Bouckaert. Properties of Bayesian network learning algorithms. In R. L. de Mantaras and D. Poole, editors, Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence, pages 102-109. Morgan Kaufmann, 1994.
[5] W. Buntine. A guide to the literature on learning probabilistic networks from data. IEEE Trans. on Knowledge and Data Engineering, pages 195-210, 1996.
[6] R. Chen and E. H. Herskovits. Graphical-model-based morphometric analysis. Accepted, IEEE Trans. Medical Imaging, 2004.
[7] J. Cheng and R. Greiner. Comparing Bayesian network classifiers. In Proceedings of UAI-99, pages 101-108, 1999.
[8] J. Cheng and R. Greiner. Learning Bayesian belief network classifiers: algorithms and system. Lecture Notes in Computer Science, 2056, 2001.
[9] D. M. Chickering. Optimal structure identification with greedy search. Journal of Machine Learning Research, 3:507-554, 2002.
[10] G. F. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data. Machine Learning, 9:309-347, 1992.
[11] R. Cowell. Introduction to inference for Bayesian networks. In M. I. Jordan, editor, Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models, pages 9-26. Kluwer Academic Publishers, 1998.
[12] C. Davatzikos. Mapping of image data to stereotaxic spaces: Application to brain mapping. Hum. Brain Mapp., 19:334-338, 1998.
[13] C. Davatzikos, M. Vaillant, S. Resnick, J. Prince, S. Letovsky, and R. Bryan. A computerized method for morphological analysis of the corpus callosum. J. of Comp. Ass. Tomography, 20:88-97, 1996.
[14] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103-130, 1997.
[15] R. Duda and P. Hart. Pattern Classification and Scene Analysis. Wiley, New York, 1973.
[16] N. Friedman, D. Geiger, and M. Goldszmidt. Bayesian network classifiers. Machine Learning, 29:131-163, 1997.
[17] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6:721-741, 1984.
[18] A. Goldszal, C. Davatzikos, D. Pham, M. Yan, R. Bryan, and S. M. Resnick. An image processing protocol for qualitative and quantitative volumetric analysis of brain images. J. Comput. Assisted Tomogr., 22:827-837, 1998.

[19] I. Guyon and A. Elisseeff. An introduction to variable and feature selection. Journal of Machine Learning Research, 3:1157-1182, 2003.
[20] D. Heckerman. Bayesian networks for knowledge discovery. In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996.
[21] D. Heckerman. A tutorial on learning with Bayesian networks. In M. I. Jordan, editor, Proceedings of the NATO Advanced Study Institute on Learning in Graphical Models. Kluwer Academic Publishers, 1998.
[22] E. H. Herskovits. Computer-based probabilistic-network construction. PhD thesis, Stanford University, 1991.
[23] E. H. Herskovits, H. Peng, and C. Davatzikos. A Bayesian morphometry algorithm. IEEE Trans. Medical Imaging, 23, June 2004.
[24] F. Jensen. An Introduction to Bayesian Networks. Springer, 1996.
[25] R. Kohavi and G. H. John. Wrappers for feature subset selection. Artificial Intelligence, 97(1-2):273-324, 1997.
[26] D. Koller and M. Sahami. Towards optimal feature selection. In Proceedings of the 13th International Conference on Machine Learning (ML), pages 284-292, 1996.
[27] W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle. Computational Intelligence, 10:262-293, 1994.
[28] P. Langley, W. Iba, and K. Thompson. An analysis of Bayesian classifiers. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 223-228, 1992.
[29] J. Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann, 1988.
[30] I. Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, 2001.
[31] D. G. Shen and C. Davatzikos. HAMMER: Hierarchical attribute matching mechanism for elastic registration. IEEE Trans. on Medical Imaging, pages 1421-1439, 2002.
[32] J. Suzuki. A construction of Bayesian networks from databases based on an MDL scheme. In D. Heckerman and A. Mamdani, editors, Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence, pages 266-273. Morgan Kaufmann, 1993.
[33] B. Thiesson. Accelerated quantification of Bayesian networks with incomplete data. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining, pages 306-311. AAAI Press, 1995.
