Semi-Supervised Discriminative Classification with Application to Tumorous Tissues Segmentation of MR Brain Images
Yangqiu Song∗, Changshui Zhang, Jianguo Lee, Fei Wang, Shiming Xiang and Dan Zhang
State Key Laboratory on Intelligent Technology and Systems, Tsinghua National Laboratory for Information Science and Technology (TNList), Department of Automation, Tsinghua University, Beijing 100084, China.
18th December 2007
Abstract Due to the large data size of 3D MR brain images and the blurry boundaries of pathological tissues, tumor segmentation is difficult. This paper introduces a discriminative classification algorithm for semi-automated segmentation of brain tumorous tissues. The classifier uses interactive hints to learn a model that separates normal from tumorous tissues. A non-parametric Bayesian Gaussian random field is implemented in the semi-supervised mode. Our approach uses both labeled data and a subset of unlabeled data sampled from 2D/3D images to train the model. A fast algorithm is also developed. Experiments show that our approach produces satisfactory segmentation results compared with the results manually labeled by experts. Keywords: Magnetic Resonance Imaging (MRI), Brain Tumor Segmentation, Semi-Automated Segmentation, Gaussian Random Field (GRF), Gaussian Process (GP).
∗ Corresponding author. Email:
[email protected]; Tel.: +86-10-627-96-872; Fax: +8610-627-86-911.
Originality and Contribution Computer-assisted segmentation is one of the most important issues in medical image analysis, and it is the first step of quantitative analysis. Due to the large data size of 3D Magnetic Resonance (MR) brain images and the blurry boundaries of pathological tissues, tumor segmentation is difficult. This paper proposes a semi-supervised machine learning method for the MR brain image segmentation problem, focusing on the segmentation of tumorous tissues. A discriminative model is developed instead of the traditional generative and descriptive Markov random field (MRF) model. The relaxation to the continuous Gaussian random field (GRF) allows the solution to be computed with a matrix multiplication and lets the model perform induction on new test points. The approach uses both labeled data and a subset of unlabeled data sampled from 2D/3D images to classify the remaining pixels/voxels. Thus, a 3D image segmentation can be obtained by sampling rather than by segmenting 2D images sequentially. A fast algorithm that makes the training time linear in the number of unlabeled data is also implemented. For the experiments, a standard ground truth is developed based on a consensus between experts. The Missing Detection Rate (MDR) and False Detection Rate (FDR) are used for numerical comparison instead of the classification error rate. Both the segmentation images and the numerical results show that a semi-supervised method can be more accurate than the state-of-the-art methods. Early results of the research have been reported in [1].
1 Introduction
Magnetic Resonance (MR) Imaging has proven to be a useful noninvasive technique for assisting clinical diagnosis and evaluating therapy results, due to its high contrast resolution and its ability to provide rich information about human soft tissue. Segmentation of MR brain images is the first step of quantitative analysis. In the medical image analysis field [2, 3], segmentation is interesting and challenging, since both the normal and the abnormal tissues of the brain have complicated structures. While segmentation methods have been successful on normal tissues [4, 5, 6, 7, 8, 9], both theoretical and experimental work remains to be done on the segmentation of abnormal tissues. Computer-assisted brain tumor segmentation is one of the most important and hardest issues in segmenting abnormalities. There are two main problems. First, automatic measurement of the volume and variation of tumorous tissues is not easy. The distribution of normal tissue intensity is complex, and the intensities of different types of tissues overlap. In addition, the cerebral tumorous tissues in MR brain images may vary in size, shape and location. They are usually accompanied by edema. Other tissues, such as hemorrhage, necrosis and cystic components, may also appear in the tumorous region. Therefore, the boundaries of the tumorous tissues can be rather blurry. Second, 3D MR images contain a great many voxels (such as 256 × 256 × 124). Consequently, segmentation has high computational complexity and requires large memory storage. This problem can be solved by applying 2D methods sequentially, and even experts segment the images in this way. However, this loses some geometric information.
In the last decades, many methods have been proposed to segment brain tumors in MR images, such as neural networks [10, 11], support vector machines (SVM) [12], finite Gaussian mixture models [13], fuzzy C-means (FCM) [10, 14], knowledge based methods [15, 16], atlas based methods [17], active contour models [18], level set methods [19, 20] and outlier detection [21]. Here, the segmentation task is regarded as a tissue recognition problem, which means using a well-trained model to determine whether a pixel/voxel belongs to a normal or a tumorous tissue. In general, one could use supervised classification or unsupervised clustering methods. Supervised methods [22, 23, 24, 25, 26, 27] may produce good results. However, they require scrupulous labeling work by doctors or experts, which is time consuming and costly. Unsupervised methods [28, 29, 30, 31] can do the segmentation automatically, but sometimes it is difficult for them to produce a good result. Fully automated methods always incorporate some human knowledge. Therefore, the idea of combining supervised and unsupervised methods is brought forward, which is the semi-supervised method [32, 33]. One class of semi-supervised methods is based on graph or manifold analysis [34, 35, 36, 37, 38, 39, 40, 41]. The advantage of these methods is that they can reach satisfactory accuracy with only a few labeled and many unlabeled examples. Some approaches have applied graph-based methods to image segmentation and MR image analysis. However, they are unsupervised [42] or transductive [43] (see footnote 1). Thus, both of them have to do the segmentation slice by slice, due to the large amount of data and the computational complexity. When the tumorous tissues are very small in the image, the unsupervised graph-based methods will attempt to segment the normal tissues, such as gray matter (GM), white matter (WM) and the cerebrospinal fluid (CSF) [44, 45]. An intuitive way to address these problems is to label only some data in one image and let the computer classify all the others.
Footnote 1: Semi-supervised methods could be either transductive or inductive [32, 33]. While a transductive method only works on the observed labeled and unlabeled training data, an inductive method can naturally handle unseen data that are not in the training set [33].
Figure 1: Graphical models. (a) MRF. (b) GRF and GP.
In this paper, we propose a discriminative method based on graph regularization. It is a semi-supervised inductive method, which uses labeled data in one image and a subset of unlabeled data sampled from 2D/3D images to classify the remaining data. It can directly segment 3D data by sampling the unlabeled data from 3D images rather than by segmenting 2D images sequentially. Early results of the research have been reported in [1]. This paper is organized as follows. In Section 2, we review some related work and compare the models. Section 3 gives the technical details of our semi-supervised inductive method. Experimental results on real MR images are given in Section 4. Finally, we conclude in Section 5.
2 Generative Model and Discriminative Model
2.1 Semi-Supervised Learning Setup
We denote one input data point as a feature vector $x_i$, and $X_N \triangleq \{x_i\}_{i=1}^{N}$ is the observed data set including both labeled and unlabeled data. Note that for our model the input data consist of multi-dimensional spatial coordinates and multi-channel intensity values. For the semi-supervised problem, we attempt to extend the labels of the labeled data to the unlabeled data, whose labels are set to zero initially. The label of $x_i$ is given by $t_i$ $(i = 1, 2, ..., N)$, and $\mathbf{t}_N = (t_1, t_2, ..., t_N)^T$. We also denote the training data set as $D = \{x_i, t_i\}_{i=1}^{N}$. For inductive problems, we want to estimate the label $t_{N+1}$ of a new point $x_{N+1}$. Most of the algorithms in the Bayesian framework focus on the joint distribution $P(x, t)$. For the algorithms that do not need to predict for the unobserved data, there are mainly two ways to model this probability:
(a): $P(X_N, \mathbf{t}_N) = P(X_N \mid \mathbf{t}_N)\, P(\mathbf{t}_N)$
(b): $P(X_N, \mathbf{t}_N) = P(\mathbf{t}_N \mid X_N)\, P(X_N)$ (1)
This leads to the models described below (see also [46, 47]).
2.2 Descriptive Model
The first equation in (1) is called the generative model or the descriptive model (seen as a special case of the generative model) [47]. Here, we discuss the descriptive model since it is more closely related to our work. The descriptive model is constructed from statistical descriptions of natural images, such as intensity, texture, textons, 2D curves and so on [47]. The Markov Random Field (MRF) model [48], the active contour and deformable models [49, 50], and the level set method [51] can all be regarded as descriptive models. We take the typical descriptive model, MRF, as an example. MRF first models the probability $P(\mathbf{t}_N)$ as a Gibbs distribution, and then formulates the probability $P(X_N \mid \mathbf{t}_N)$ as Gaussian or another distribution. The graphical model of MRF is shown in Fig. 1 (a). Boykov et al. [52] develop a semi-supervised graph cut method, which interactively separates an object of interest from the background. Their approach has a clear cost function that can be interpreted in the context of MAP (maximum a posteriori) estimation [53]. Other interactive methods based on graph cut, such as [54, 55], are all related to the MRF model, and their main differences are the manner of interaction and the form of the prior.
2.3 Discriminative Model
Descriptive models are efficient for segmenting images. However, when the boundary of the object is blurry, a discriminative model may be better. The basis of discriminative models is that they classify each point based on its features. While a descriptive model defines $P(\mathbf{t}_N)$ and $P(X_N \mid \mathbf{t}_N)$ in the Bayesian framework (1), the discriminative method first models the probability $P(\mathbf{t}_N \mid X_N)$, and then models the probability $P(X_N)$ as a regularization term. Note that, in discriminative models, we call $P(X_N)$ the prior and $P(\mathbf{t}_N \mid X_N)$ the likelihood. The supervised SVM method [56] and the transductive graph based method [43] are both discriminative. Discriminative models are faster than generative models at predicting new test data. It has been shown that discriminative models can achieve better classification results in many cases [57]. The method proposed in this paper is also discriminative. Instead of the direct process between $x_i$ and $t_i$ used in MRF, we use a latent variable $y_i$ to generate the process $x_i \to y_i \to t_i$. Note that $y_i = y(x_i)$ is a function of $x_i$, and $\mathbf{y}_N = (y_1, y_2, ..., y_N)^T$ is the latent variable vector of the input data. Therefore, we can model the latent variable $y$ as a Gaussian Random Field (GRF) [58, 39] and use $P(\mathbf{y}_N \mid \Theta)$ as the data-dependent prior instead of $P(X_N)$, where the hyper-parameters are denoted as $\Theta$. This random field of the latent variable is continuous, as opposed to the discrete random field in MRF. The advantages of this “relaxation” to the continuous case are: (1) the solution can be computed with a matrix multiplication; (2) for the multi-class problem, we can get a closed form instead of an approximation algorithm for the NP-hard problem [52, 39]; (3) it is easy to find an induction formulation for new test points. For predicting new test data, the efficient Gaussian process (GP) model can be used. GP for classification was introduced by Neal [59] and by Williams and Barber [60]. An introduction to GP is given in [61]. The graphical model of GRF/GP is shown in Fig. 1 (b).
3 Method: Semi-Supervised Learning
3.1 Problem Formulation
In our discriminative model, each voxel $x_i = (x_i^{(c)}, x_i^{(p)})$ in 3D MR images is represented by a six-dimensional feature vector: three spatial features and three intensity features from the T1, T2 and PD weighted images. $x^{(p)}$ is the coordinate vector $(X, Y, Z)$ of the voxel and $x^{(c)}$ is the vector of intensities $(I_{T1}, I_{T2}, I_{PD})$ of the T1, T2 and PD images. We label some of the pixels/voxels as the labeled points, which are the hints obtained by very simple human-computer interaction. The unlabeled data are obtained by random sampling from the 2D/3D image data. We use the training data, which include both labeled and unlabeled data, to train a classification model. In the following, the model is a GRF. For all the image data, the classification model predicts whether a pixel/voxel belongs to tumorous or normal tissue. The segmentation is done by classifying all the image data. In the following parts, we first present the training phase, which includes how to model the prior $P(\mathbf{y}_N \mid \Theta)$ and the probability $P(\mathbf{t}_N \mid \mathbf{y}_N)$, and the Laplace approximation algorithm used to estimate the latent variable. Then, we discuss the hyper-parameter estimation problem. After that, we present the induction formulation for classifying new voxels. Finally, we introduce how to speed up the training procedure.
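The data preparation described above can be sketched as follows. This is a minimal illustration on a synthetic volume, not the authors' code: the helper names `voxel_features` and `sample_unlabeled`, the toy volume size, and the labeled indices are all hypothetical, and the per-channel scaling to [0, 1] follows the normalization mentioned in Section 4.1.

```python
import numpy as np

def voxel_features(t1, t2, pdw):
    """Return an (n_voxels, 6) array of (I_T1, I_T2, I_PD, X, Y, Z), each scaled to [0, 1]."""
    def unit(a):
        a = a.astype(float)
        span = a.max() - a.min()
        return (a - a.min()) / span if span > 0 else np.zeros_like(a)
    coords = np.indices(t1.shape)                 # three coordinate grids, one per axis
    channels = [unit(t1), unit(t2), unit(pdw)] + [unit(c) for c in coords]
    return np.stack([c.ravel() for c in channels], axis=1)

def sample_unlabeled(n_voxels, labeled_idx, n_samples, rng):
    """Draw unlabeled voxel indices at random, excluding the labeled ones."""
    pool = np.setdiff1d(np.arange(n_voxels), labeled_idx)
    return rng.choice(pool, size=n_samples, replace=False)

rng = np.random.default_rng(0)
shape = (8, 8, 4)                                 # toy volume, not real MR data
t1, t2, pdw = (rng.random(shape) for _ in range(3))
X = voxel_features(t1, t2, pdw)

labeled_idx = np.array([0, 5, 17])                # hypothetical interactive hints
t_labeled = np.array([1, -1, 1])                  # +1 tumorous, -1 normal
unlabeled_idx = sample_unlabeled(X.shape[0], labeled_idx, 50, rng)

# Training set: labeled points keep their labels, unlabeled points get t = 0.
train_idx = np.concatenate([labeled_idx, unlabeled_idx])
t_train = np.concatenate([t_labeled, np.zeros(unlabeled_idx.size, dtype=int)])
```

Encoding unlabeled points with the label 0 matches the setup of Section 2.1, where the labels of unlabeled data are initialized to zero.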
3.2 Training GRF
We estimate $\mathbf{y}_N$ by the mode of the posterior $P(\mathbf{y}_N \mid \mathbf{t}_N, \Theta) = P(\mathbf{y}_N, \mathbf{t}_N \mid \Theta)/P(\mathbf{t}_N)$. Taking the negative logarithm of the posterior, we define the objective function of $\mathbf{y}_N$ as:
$\Psi(\mathbf{y}_N) \approx -\log P(\mathbf{t}_N \mid \mathbf{y}_N) - \log P(\mathbf{y}_N \mid \Theta)$ (2)
where $P(\mathbf{t}_N)$ is omitted since it is a constant with respect to $\mathbf{y}_N$. $P(\mathbf{y}_N \mid \Theta)$ is the prior of the latent variables, which is modeled based on graph regularization. $P(\mathbf{t}_N \mid \mathbf{y}_N)$ is the conditional probability, which is modeled by the new Extended Bernoulli Model.
3.2.1 Graph Based Data-Dependent Prior
The graph-based methods consider a weighted undirected graph $G = (V, E)$ established on the data points in some feature space. The nodes $V$ correspond to the points, and the weights on the edges $E$ are functions of the pairwise connected nodes. In grouping or segmentation problems, the nodes need to be partitioned into disjoint sets. Here, we adopt the graph regularization based data-dependent prior. A regularization term of a graph restricts a function to be sufficiently smooth on the graph. The prior probability is given by:
$P(\mathbf{y}_N \mid \Theta) = \frac{1}{Z_r} \exp\left\{ -\frac{S(\mathbf{y}_N)}{\mu} \right\}$ (3)
where $Z_r$ is the normalization constant, and the smoothness function $S(\mathbf{y}_N)$ is:
$S(\mathbf{y}_N) = \frac{1}{2} \mathbf{y}_N^T (\mathbf{I} - \mathbf{S}) \mathbf{y}_N$ (4)
where $\mathbf{S} = \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}$, $\mathbf{D} = \mathrm{diag}(D_{11}, ..., D_{NN})$ and $(\mathbf{W})_{ij} = W_{ij}$. $D_{ii} = \sum_j W_{ij}$, and $W_{ij}$ is the weight function associated with the edges of the graph, which satisfies $W_{ij} \geq 0$ and $W_{ij} = W_{ji}$. It can be viewed as a symmetric similarity measure between $x_i$ and $x_j$. A simple example of $W_{ij}$ is $W_{ij} = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$. If different dimensions have different weights, we set the hyper-parameters as $\Theta = \{\sigma_1, \sigma_2, ..., \sigma_d\}$. The matrix $\Delta = \mathbf{I} - \mathbf{S}$ is called the normalized graph Laplacian in spectral graph theory [62]. Moreover, the gradient and Hessian of $-\log P(\mathbf{y}_N \mid \Theta)$ are:
$\mathbf{g}_N = \nabla_{\mathbf{y}_N} (-\log P(\mathbf{y}_N \mid \Theta)) = \mathbf{K}_N^{-1} \mathbf{y}_N$
$\mathbf{K}_N^{-1} = \nabla\nabla_{\mathbf{y}_N} (-\log P(\mathbf{y}_N \mid \Theta)) = \Delta = \mathbf{I} - \mathbf{S}$ (5)
The matrix $\mathbf{K}_N = \Delta^{-1}$ is the covariance matrix of the prior probability, and it is the inverse of the normalized graph Laplacian. Therefore, the covariance between two points depends on all the other training data, both labeled and unlabeled. In contrast, most traditional Gaussian processes adopt a covariance based on “local” distance information, so that the covariance between two points in the prior depends only on their own coordinates.
3.2.2 Extended Bernoulli Model
In this section, we present a new noise function of the process $y \to t$, which is called the Extended Bernoulli Model (EBM):
$P(t_i \mid y_i) = \frac{1 - \lambda}{1 + \exp(-t_i y_i)}\, \delta_{\{t_i \neq 0\}} + \lambda\, \delta_{\{t_i = 0\}}$ (6)
where $\delta$ is the indicator function: $\delta_{\{t_i = 0\}} = 1$ if $t_i = 0$, and $\delta_{\{t_i = 0\}} = 0$ if $t_i \neq 0$. (For more details, see [1] and Appendix A.) Due to the graphical model of GRF, we have $P(\mathbf{t}_N \mid \mathbf{y}_N) = \prod_{i=1}^{N} P(t_i \mid y_i)$. Then the gradient vector and Hessian matrix of $-\log P(\mathbf{t}_N \mid \mathbf{y}_N)$ are:
$\boldsymbol{\alpha}_N = \nabla_{\mathbf{y}_N} (-\log P(\mathbf{t}_N \mid \mathbf{y}_N)) = \left( -\frac{t_i}{1 + \exp(t_i y_i)} \right)_i$
$\boldsymbol{\Pi}_N = \nabla\nabla_{\mathbf{y}_N} (-\log P(\mathbf{t}_N \mid \mathbf{y}_N)) = \mathrm{diag}\left( \frac{t_i^2 \exp(t_i y_i)}{(1 + \exp(t_i y_i))^2} \right)$ (7)
Note that the parameter $\lambda$ does not affect the gradient vector and the Hessian matrix, which is consistent with the analysis above. In the training phase, the elements of the gradient vector $\boldsymbol{\alpha}_N$ and the Hessian matrix $\boldsymbol{\Pi}_N$ that correspond to unlabeled points are zero.
3.2.3 Training Procedure
For training a GRF, which estimates the latent variables from the posterior probability $P(\mathbf{y}_N \mid \mathbf{t}_N, \Theta)$, we make use of the Laplace approximation [60, 63] of the posterior probability with fixed hyper-parameters. By differentiating (2) with respect to $\mathbf{y}_N$, we have:
$\nabla\Psi = \boldsymbol{\alpha}_N + \mathbf{g}_N, \quad \nabla\nabla\Psi = \boldsymbol{\Pi}_N + \mathbf{K}_N^{-1}$ (8)
To find the mode of the posterior, i.e. the minimum of $\Psi$ in (2), the Newton-Raphson iteration is adopted:
$\mathbf{y}_N^{new} = \mathbf{y}_N - (\nabla\nabla\Psi)^{-1} \nabla\Psi$ (9)
Since $\nabla\nabla\Psi$ is always positive definite (see footnote 2), (2) is a convex problem. When the iteration converges to an optimal $\hat{\mathbf{y}}_N$, $\nabla\Psi$ is a zero vector. The posterior probability $P(\mathbf{y}_N \mid \mathbf{t}_N, \Theta)$ is approximated as a Gaussian centered at the estimated $\hat{\mathbf{y}}_N$. The inverse of the posterior covariance is $\nabla\nabla\Psi$.
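The training iteration (7)-(9) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than the authors' implementation: the function names `ebm_grad_hess` and `train_grf`, the one-dimensional toy data, the kernel width 0.3 and the jitter value 1e-6 are all hypothetical choices made for the example.

```python
import numpy as np

def ebm_grad_hess(y, t):
    """alpha_N and the diagonal of Pi_N from (7); both vanish where t_i = 0."""
    s = 0.5 * (1.0 - np.tanh(0.5 * t * y))    # stable form of 1 / (1 + exp(t_i y_i))
    return -t * s, t**2 * s * (1.0 - s)

def train_grf(K_inv, t, n_iter=100, tol=1e-10):
    """Newton-Raphson iteration (9) on Psi, starting from y = 0."""
    y = np.zeros(t.shape[0])
    for _ in range(n_iter):
        alpha, pi = ebm_grad_hess(y, t)
        grad = alpha + K_inv @ y               # gradient in (8)
        hess = np.diag(pi) + K_inv             # Hessian in (8)
        step = np.linalg.solve(hess, grad)
        y = y - step
        if np.max(np.abs(step)) < tol:
            break
    return y

# Toy data: two mirrored clusters on a line, one labeled point in each.
a = np.array([0.0, 0.05, 0.10, 0.15, 0.20])
X = np.concatenate([a, 1.0 - a])
t = np.zeros(10)
t[0], t[5] = 1.0, -1.0                         # all other points are unlabeled (t = 0)

W = np.exp(-(X[:, None] - X[None, :])**2 / (2 * 0.3**2))
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))                # D^{-1/2} W D^{-1/2}
jitter = 1e-6                                  # assumed value, cf. footnote 2
K_inv = (1.0 + jitter) * np.eye(10) - S        # I - S plus jitter

y_hat = train_grf(K_inv, t)
alpha_hat, _ = ebm_grad_hess(y_hat, t)
residual = alpha_hat + K_inv @ y_hat           # gradient of Psi at the solution
```

In this toy run the unlabeled points inherit the sign of the labeled point in their own cluster, which is the label-propagation effect that the graph prior is meant to provide.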
3.3 Hyper-Parameters Estimation
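In semi-supervised learning, labeled data are usually few, so standard methods such as cross validation may fail on the small labeled set. Based on the Laplace approximation for estimating the latent variables $\mathbf{y}_N$, we can estimate the hyper-parameters in the Bayesian inference framework. The hyper-parameter estimation mainly follows [60, 40]. It maximizes the marginal likelihood, i.e. it minimizes:
$J(\Theta) \equiv -\log P(\mathbf{t}_N \mid \Theta) \approx \Psi(\mathbf{y}_N) + \frac{1}{2} \log \left| \mathbf{K}_N^{-1} + \boldsymbol{\Pi}_N \right|$
$= \sum_{i=1}^{N} \log\left(1 + \exp(-t_i y_i)\right) + \frac{1}{2} \log \left| \mathbf{K}_N \boldsymbol{\Pi}_N + \mathbf{I} \right| + \frac{1}{2} \mathbf{y}_N^T \mathbf{K}_N^{-1} \mathbf{y}_N$ (10)
We can update the hyper-parameters by gradient search (see Appendix B for details):
$\frac{\partial J(\Theta)}{\partial \Theta} = \boldsymbol{\alpha}_N^T \frac{\partial \mathbf{y}_N}{\partial \Theta} + \frac{1}{2} \mathrm{tr}\left[ (\mathbf{I} + \mathbf{K}_N \boldsymbol{\Pi}_N)^{-1} \frac{\partial \mathbf{K}_N \boldsymbol{\Pi}_N}{\partial \Theta} \right] + \frac{1}{2} \left[ 2 \frac{\partial \mathbf{y}_N^T}{\partial \Theta} \mathbf{K}_N^{-1} \mathbf{y}_N + \mathbf{y}_N^T \frac{\partial \mathbf{K}_N^{-1}}{\partial \Theta} \mathbf{y}_N \right]$ (11)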
3.4 Prediction
Deciding whether a voxel that is not in the observed training set is tumorous or normal is equivalent to estimating the label $t_{N+1}$ of a new point $x_{N+1}$. Given the estimated hyper-parameters and labels, the prediction is computed by integrating over the latent variable of the new point:
$P(t_{N+1} \mid D, \Theta) = \int P(t_{N+1} \mid y_{N+1})\, P(y_{N+1} \mid D, \Theta)\, dy_{N+1}$ (12)
The second factor in (12) can be obtained by further integration:
$P(y_{N+1} \mid D, \Theta) = \int P(y_{N+1} \mid \mathbf{y}_N, \Theta)\, P(\mathbf{y}_N \mid \mathbf{t}_N, \Theta)\, d\mathbf{y}_N$ (13)
Calculating this high-dimensional integral is not easy. Since the posterior probability can be approximated as Gaussian by using the Laplace approximation method [60], we first approximate $P(\mathbf{y}_N \mid \mathbf{t}_N, \Theta)$ as Gaussian and estimate $\hat{\mathbf{y}}_N$. Then, we use the estimate $\hat{\mathbf{y}}_N$ instead of the integral. To calculate $y_{N+1}$, we define the function:
$\Psi(\mathbf{y}_N, y_{N+1}) = -\log P(\mathbf{y}_{N+1})$ (14)
which is minimized only with respect to $y_{N+1}$. We minimize (14) to compute the latent function $y_{N+1}$ of a new point $x_{N+1}$. The objective function (14) can be rewritten as:
$\Psi(\mathbf{y}_N, y_{N+1}) = \frac{1}{2} \mathbf{y}_{N+1}^T \mathbf{K}_{N+1}^{-1} \mathbf{y}_{N+1}$ (15)
where $\mathbf{K}_{N+1}$ is the $(N+1) \times (N+1)$ covariance matrix of the vector $\mathbf{y}_{N+1} = (\mathbf{y}_N^T, y_{N+1})^T$, and $y_{N+1}$ is the latent variable of a new given point in the test set.
Following the derivation given in Appendix C, we have:
$\hat{y}_{N+1} = \mathbf{k}^T \mathbf{K}_N^{-1} \hat{\mathbf{y}}_N = \mathbf{k}^T (\mathbf{I} - \mathbf{S}) \hat{\mathbf{y}}_N$ (16)
where $k_i = W_{N+1,i} = \exp(-\|x_{N+1} - x_i\|^2 / 2\sigma^2)$ is the approximate covariance between the new point and the $i$th training point. Note that after we minimize (2), the gradient in (8) becomes $\nabla\Psi = \hat{\boldsymbol{\alpha}}_N + \mathbf{K}_N^{-1} \hat{\mathbf{y}}_N = 0$. Thus, (31) is given by:
$\hat{y}_{N+1} = -\mathbf{k}^T \hat{\boldsymbol{\alpha}}_N$ (17)
where $\hat{\boldsymbol{\alpha}}_N = -\mathbf{K}_N^{-1} \hat{\mathbf{y}}_N$. According to Appendix A, the margin is defined as the range where $P(t_i = 0 \mid y)$ is larger than both $P(t_i = 1 \mid y)$ and $P(t_i = -1 \mid y)$. According to (7) and (17), if the latent variable of a point in the training set satisfies $y_i > 0$ and the point is outside the margin, it has a positive weight $-\alpha_i k_i$ in the prediction function. Conversely, if a point outside the margin satisfies $y_i < 0$, the weight is negative. Moreover, $\alpha_i$ tends to zero for very large $y_i$, and is near $\pm 1/2$ when $y_i$ is just outside the margin. Finally, the points falling inside the margin have zero weight (according to (7)). Therefore, these points do not affect the classification result in the prediction phase.
Footnote 2: For the positive semi-definite case, we can add extra regularization as the jitter noise [59].
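The induction formula (17) amounts to a kernel-weighted vote of the training points, which the following sketch makes concrete. It is an illustration only: the function name `predict_latent` is hypothetical, the values in `y_hat` are made up rather than the output of a trained model, and thresholding the latent value at zero to obtain a label is an assumption of this example.

```python
import numpy as np

def predict_latent(X_new, X_train, alpha_hat, sigma):
    """Latent values of new points via (17): y_new = -k^T alpha_hat,
    with k_i = exp(-||x_new - x_i||^2 / (2 sigma^2))."""
    d2 = np.sum((X_new[:, None, :] - X_train[None, :, :])**2, axis=2)
    k = np.exp(-d2 / (2 * sigma**2))           # shape (n_new, N)
    return -k @ alpha_hat

X_train = np.array([[0.0], [0.1], [0.9], [1.0]])
t = np.array([1.0, 0.0, 0.0, -1.0])            # two labeled, two unlabeled points
y_hat = np.array([1.2, 0.9, -0.9, -1.2])       # illustrative values, not a trained result
alpha_hat = -t / (1.0 + np.exp(t * y_hat))     # (7); zero for the unlabeled points

X_new = np.array([[0.05], [0.95]])
y_new = predict_latent(X_new, X_train, alpha_hat, sigma=0.3)
t_new = np.where(y_new > 0, 1, -1)             # assumed decision rule: sign of the latent value
```

Because the entries of `alpha_hat` are zero wherever $t_i = 0$, only the labeled training points carry weight in this sum, so the cost of predicting one voxel grows with the size of the training set and not with the size of the image.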
3.5 Speeding Up the Training Algorithm
Although we have a clear and compact formulation to realize the semi-supervised induction, the computational complexity of training still scales as $O(N^3)$, where $N$ is the number of training data. A simple way to reduce the $O(N^3)$ computational requirement and the $O(N^2)$ memory requirement is to express the solution in terms of examples sampled from the training set. A successful sampling approach has been proposed by Williams and Seeger [64], which is called the Nyström method. It is also used, in a modified way [42], to speed up the normalized cut algorithm [65]. We follow these works to speed up our semi-supervised training phase. For the theoretical explanation, the reader is referred to [42, 64]. Suppose we have $N$ points in the training data set. The goal of the Nyström method is to use a random partition into $M$ ($M \ll N$) and $(N - M)$ points to approximate the original $N \times N$ matrix. It has been proved that the matrix $\mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}$ can be decomposed as $\mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$, where $\boldsymbol{\Lambda}$ is an $M \times M$ matrix (see Appendix D and [42] for details). In the training phase, the computer has to invert $\nabla\nabla\Psi$ in each Newton-Raphson iteration (9) to find the minimum of $\Psi$. We rewrite the Hessian matrix as $\nabla\nabla\Psi = \boldsymbol{\Pi}_N + \mathbf{K}_N^{-1} = \boldsymbol{\Pi}_N + \mathbf{I} - \hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{W}} \hat{\mathbf{D}}^{-\frac{1}{2}} = \boldsymbol{\Pi}_N + \mathbf{I} - \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^T$. Therefore, by applying the Woodbury formula [66], we have:
$(\nabla\nabla\Psi)^{-1} = (\boldsymbol{\Pi}_N + \mathbf{I})^{-1} + (\boldsymbol{\Pi}_N + \mathbf{I})^{-1} \mathbf{U} \left( \mathbf{I} - \boldsymbol{\Lambda} \mathbf{U}^T (\boldsymbol{\Pi}_N + \mathbf{I})^{-1} \mathbf{U} \right)^{-1} \boldsymbol{\Lambda} \mathbf{U}^T (\boldsymbol{\Pi}_N + \mathbf{I})^{-1}$ (18)
Thus, the computational requirement of the Newton-Raphson iteration is $O(M^2 N)$, because it only needs to compute the SVD of an $M \times M$ sub-matrix. The memory usage is $O(MN)$, since it only stores: (1) the weight matrix of the sampled points, and (2) the weights between the sampled points and the remaining points in the training set. In summary, the flowchart of our algorithm is shown in Table 1.
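The saving in (18) comes from the fact that $\boldsymbol{\Pi}_N + \mathbf{I}$ is diagonal, so only an $M \times M$ system has to be solved. The sketch below applies (18) to a vector and checks it against a dense solve. It is an illustration under stated assumptions: `solve_hessian` is a hypothetical helper, and `U` and `lam` are synthetic stand-ins for the Nyström factors of Appendix D, which are not computed here.

```python
import numpy as np

def solve_hessian(pi_diag, U, lam, g):
    """Apply (18) to a vector: returns (Pi + I - U Lam U^T)^{-1} g
    without forming any N x N matrix."""
    a_inv = 1.0 / (pi_diag + 1.0)               # diagonal of (Pi + I)^{-1}
    AU = a_inv[:, None] * U                     # (Pi + I)^{-1} U, shape (N, M)
    core = np.eye(U.shape[1]) - lam @ (U.T @ AU)    # I - Lam U^T (Pi + I)^{-1} U
    return a_inv * g + AU @ np.linalg.solve(core, lam @ (AU.T @ g))

rng = np.random.default_rng(0)
N, M = 8, 3
U, _ = np.linalg.qr(rng.normal(size=(N, M)))    # synthetic orthonormal columns
lam = np.diag([0.9, 0.5, 0.2])                  # synthetic M x M matrix
pi_diag = 0.25 * rng.random(N)                  # diagonal of Pi_N
g = rng.normal(size=N)                          # e.g. the gradient of Psi

fast = solve_hessian(pi_diag, U, lam, g)
dense = np.diag(pi_diag) + np.eye(N) - U @ lam @ U.T
expected = np.linalg.solve(dense, g)
```

Returning the product with a vector, rather than the explicit inverse, keeps the memory footprint at $O(MN)$, in line with the storage argument above.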
Table 1: The flowchart of semi-automated segmentation.
1. Input: Training set examples $D = \{x_i, t_i\}_{i=1}^{N}$. Each pixel/voxel of the MR images is represented by a six-dimensional feature vector $x_i = (x_i^{(c)}, x_i^{(p)})$: three intensity features from the T1, T2 and PD weighted images $(I_{T1}, I_{T2}, I_{PD})$ and three spatial features $(X, Y, Z)$. Labeled points are obtained by interactive human hints. Unlabeled points are obtained by random sampling. Set the hyper-parameters $\sigma_c$ and $\sigma_p$ to random initial values or to empirically selected values.
2. Training Phase:
(a) Construct the graph $G = (V, E)$ based on the data points. The edge weight $W_{ij}$ between voxels $i$ and $j$ is computed as: $W_{ij} = \exp\left( \frac{-\|x_i^{(c)} - x_j^{(c)}\|^2}{2\sigma_c^2} \right) * \exp\left( \frac{-\|x_i^{(p)} - x_j^{(p)}\|^2}{2\sigma_p^2} \right)$.
(b) Hyper-parameter estimation (optional). Compute the gradient $\frac{\partial J(\Theta)}{\partial \Theta}$ in (11) to do line search.
(c) Train.
(c.1) Compute the gradient $\nabla\Psi$ using (8).
(c.2) Compute the Hessian $\nabla\nabla\Psi$ using (8) and (18).
(c.3) Use the Newton-Raphson iteration $\mathbf{y}_N^{new} = \mathbf{y}_N - (\nabla\nabla\Psi)^{-1} \nabla\Psi$ to find the estimate $\hat{\mathbf{y}}_N$.
3. Prediction: For any point $x_{N+1}$, use $\hat{y}_{N+1} = \mathbf{k}^T (\mathbf{I} - \mathbf{S}) \hat{\mathbf{y}}_N$ to find the corresponding $t_{N+1}$.
4. The segmentation result is obtained by classifying all the unlabeled points (pixels/voxels).
4 Experimental Results
In this section we present experimental results of segmenting real MR brain images. We apply our algorithm to the real MR images of three patients [56]. Each patient sequence consists of 124 slices of 256 × 256 pixels, with a voxel size of 0.94 × 0.94 × 1.5 mm³ for T1 and 0.47 × 0.47 × 5 mm³ for T2 and PD. The 3D T1, T2 and PD images have been preprocessed by registration, and the extra-cranial tissues have been removed [56, 67, 68]. The intensity inhomogeneity effect has also been handled in the pre-processing phase. The background pixels are ignored in the computation. Typical T1, T2 and PD images of “Patient 1” are shown in Fig. 2 (a)-(c).
Figure 2: “Patient 1” segmentation results. (a) T1 weighted image of slice 60. (b) T2 weighted image. (c) PD weighted image. (d) Hand_min. (e) Hand_mid. (f) Hand_max.
4.1 Graph Construction
We make use of both intensity features and spatial features to construct the edges of a graph $G = (V, E)$. We take each voxel as a node and define the edge weight $W_{ij}$ between voxels $i$ and $j$ as:
$W_{ij} = \exp\left( \frac{-\|x_i^{(c)} - x_j^{(c)}\|^2}{2\sigma_c^2} \right) * \exp\left( \frac{-\|x_i^{(p)} - x_j^{(p)}\|^2}{2\sigma_p^2} \right)$ (19)
where $x^{(c)}$ is the vector of intensities of the T1, T2 and PD images, $x^{(p)}$ is the coordinate vector of the voxel, and $\sigma_c$ and $\sigma_p$ are the hyper-parameters of our algorithm. All the vectors have been normalized to lie in [0, 1].
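As a concrete reading of (19), the following sketch builds the weight matrix from separate intensity and position features and then forms the normalized graph Laplacian $\Delta = \mathbf{I} - \mathbf{D}^{-1/2} \mathbf{W} \mathbf{D}^{-1/2}$ of Section 3.2.1. The function names, the random toy features and the kernel widths are hypothetical. Keeping the self-weights $W_{ii} = 1$ is an assumption of this example, since the text does not say how the diagonal is treated.

```python
import numpy as np

def edge_weights(Xc, Xp, sigma_c, sigma_p):
    """Edge weights (19): product of an intensity kernel and a spatial kernel."""
    def sqdist(A):
        diff = A[:, None, :] - A[None, :, :]
        return np.sum(diff * diff, axis=2)
    return np.exp(-sqdist(Xc) / (2 * sigma_c**2) - sqdist(Xp) / (2 * sigma_p**2))

def normalized_laplacian(W):
    """Delta = I - D^{-1/2} W D^{-1/2}, with D_ii = sum_j W_ij."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.eye(W.shape[0]) - S

rng = np.random.default_rng(0)
Xc = rng.random((6, 3))        # toy intensities (T1, T2, PD) in [0, 1]
Xp = rng.random((6, 3))        # toy coordinates (X, Y, Z) in [0, 1]
W = edge_weights(Xc, Xp, sigma_c=0.5, sigma_p=0.5)
Delta = normalized_laplacian(W)
eigvals = np.linalg.eigvalsh(Delta)
```

The smallest eigenvalue of `Delta` is zero up to rounding, so the matrix is only positive semi-definite. This is the situation in which the jitter regularization mentioned in footnote 2 becomes relevant.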
4.2 The Standard Ground Truth
To evaluate the segmentation results, we introduce the standard ground truth as the segmentation reference. For tumorous tissues in MR brain images, it is hard to give an exact true boundary, since gliomas are usually accompanied by other pathological tissues. In practice, different experts will give different results, because the boundary is blurred. As manual labeling of the region is subject to error, even a single expert will give slightly different results at different times. Therefore, we combine the results of several experts to generate the silver standard. Suppose we have $N$ experts who give independent segmentations by hand, denoted as $A_i$, $i = 1, 2, ..., N$. We use the following three methods to get the standard ground truth [56]:
• Minimum Area: We denote it as Hand_min, and $A_{Hand_{min}} = A_1 \cap A_2 \cap ... \cap A_N$.
• Maximum Area: We denote it as Hand_max, and $A_{Hand_{max}} = A_1 \cup A_2 \cup ... \cup A_N$.
• Majority Voting: If most of the experts judge that the voxel belongs to a pathological region, we regard it as pathological tissue, which is denoted by Hand_mid.
Fig. 2 shows a typical slice of “Patient 1”, together with the three types of standard ground truth. The region of Hand_mid lies between the other two.
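The three consensus rules above reduce to element-wise operations on boolean masks. The sketch below is illustrative: `consensus_masks` is a hypothetical helper, the three one-dimensional expert masks are made up, and reading "most of the experts" as a strict majority is an assumption of this example.

```python
import numpy as np

def consensus_masks(expert_masks):
    """Return (Hand_min, Hand_mid, Hand_max) from a list of boolean expert masks."""
    stack = np.stack([np.asarray(m, dtype=bool) for m in expert_masks])
    votes = stack.sum(axis=0)
    hand_min = stack.all(axis=0)                 # intersection of all experts
    hand_max = stack.any(axis=0)                 # union of all experts
    hand_mid = votes > stack.shape[0] / 2        # strict majority (assumed rule)
    return hand_min, hand_mid, hand_max

experts = [
    [1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 1, 0],
]
hand_min, hand_mid, hand_max = consensus_masks(experts)
```

By construction Hand_min is contained in Hand_mid, which is contained in Hand_max, matching the remark that the region of Hand_mid lies between the other two.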
4.3 Performance Evaluation
In order to make the results comparable, we use the same indices as [56] to evaluate them, which are called the Missing Detection Rate (MDR) and the False Detection Rate (FDR):
$MDR = \frac{\#\,False\ Negatives}{\#\,True\ Positives + \#\,False\ Negatives}$
$FDR = \frac{\#\,False\ Positives}{\#\,True\ Positives + \#\,False\ Negatives}$ (20)
where # True Positives is the number of positive instances classified as positive; # False Negatives is the number of positive instances classified as negative; # True Negatives is the number of negative instances classified as negative; and # False Positives is the number of negative instances classified as positive. In the following, we compare all the experiments with respect to these two criteria. A low MDR means that the result covers most of the region of the silver standard. A low FDR means that the result lies mainly inside the region of the silver standard. Together, these two criteria force the result to be consistent with the standard ground truth.
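A small worked example may help with (20), in particular with the fact that both rates share the denominator # True Positives + # False Negatives, i.e. the size of the reference region. The helper name `mdr_fdr` and the two flattened masks are hypothetical.

```python
def mdr_fdr(predicted, reference):
    """Missing and False Detection Rates (20) for two flat sequences of 0/1 labels."""
    tp = fn = fp = 0
    for p, r in zip(predicted, reference):
        if r and p:
            tp += 1          # tumorous voxel found
        elif r and not p:
            fn += 1          # tumorous voxel missed
        elif p and not r:
            fp += 1          # normal voxel marked as tumorous
    positives = tp + fn      # size of the reference region
    if positives == 0:
        raise ValueError("reference contains no positive voxels")
    return fn / positives, fp / positives

reference = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0]
mdr, fdr = mdr_fdr(predicted, reference)   # 1 miss and 1 false alarm over 4 positives
```

Since the FDR is normalized by the reference region and not by the number of negatives, it can exceed 1 when a segmentation marks more normal voxels than the reference contains tumorous ones.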
4.4 2D Evaluation
The algorithm is first applied to the 2D image of slice 60 of “Patient 1”, slice 52 of “Patient 2” and slice 80 of “Patient 3”. All of the numerical results are the averages obtained by performing the tests ten times. The labeled points of object and background are fixed, and the unlabeled points are randomly selected from 2D images. The top left of each sub-figure in Fig. 5 shows the interactive human hints for the segmentation.
4.4.1 Hyper-Parameters
We test the MDRs and FDRs of the three patients when the hyper-parameters $\sigma_c$ and $\sigma_p$ vary, and compare the hyper-parameters estimated in the first step with the empirically selected ones. We make use of 2000 randomly sampled unlabeled data, and the labeled information is the same as the one in each sub-figure (A) in Fig. 5. Fig. 3 shows how the MDRs and FDRs, which are computed against the standard ground truth Hand_mid, vary with the two hyper-parameters. We can see that the MDRs and FDRs move in opposite directions when the hyper-parameters change. Table 2 shows one of the estimated results based on specific initial values. The corresponding segmentation results are shown in Table 4. We also test the results with fixed hyper-parameters, which are $\sigma_c^{(1)} = \sigma_p^{(1)} = \sigma_c^{(2)} = \sigma_p^{(2)} = \sigma_c^{(3)} = \sigma_p^{(3)} = 0.08$. The results show that the estimated hyper-parameters are competitive with the empirically fixed ones. The shortcoming is that the estimation is time-consuming.
Table 2: Estimated hyper-parameters and empirically selected parameters.
              Initial Values                Estimated Results
“Patient 1”   σc = 0.10, σp = 0.10          σc = 0.0486, σp = 0.0605
“Patient 2”   σc = 0.08, σp = 0.08          σc = 0.0432, σp = 0.0481
“Patient 3”   σc = 0.06, σp = 0.06          σc = 0.0614, σp = 0.0383
4.4.2 Unlabeled Data
We test the algorithm with 200, 500, 1000, 2000, 5000 and 8000 additional unlabeled data of “Patient 1”. Table 3 shows the average MDR and FDR results with respect to the standard ground truths Hand_min, Hand_mid and Hand_max. The MDRs computed against Hand_max are larger than the ones computed against Hand_min and Hand_mid, and the FDRs are smaller. This is because the region of Hand_max is larger than the other two. The MDR reaches its minimum at 2000 unlabeled data, and the FDR reaches its minimum at 5000. This is because we use the same fixed hyper-parameters $\sigma_c = 0.1$ and $\sigma_p = 0.1$. When we change the number of unlabeled data, $\sigma_p$ might no longer be a good discriminative value.
Table 3: The MDR and FDR of “Patient 1” with SSGPI.
Unlabeled Data    200      500      1000     2000     5000     8000
Hand_min MDR      0.187    0.167    0.177    0.161    0.182    0.202
Hand_mid MDR      0.254    0.238    0.246    0.231    0.253    0.258
Hand_max MDR      0.317    0.306    0.298    0.297    0.322    0.327
Hand_min FDR      0.0296   0.0074   0.0112   0.0120   0.0005   0.0019
Hand_mid FDR      0.0239   0.0051   0.0079   0.0085   0.0001   0.0009
Hand_max FDR      0.0143   0.0013   0.0037   0.0015   0        0
4.4.3 Running Time
We test the running time of the Nyström method that speeds up our training phase. The result is shown in Fig. 4. Both the training time and the prediction time are linear with respect to the training data size $N$. The labeled set contains 48 positive and 123 negative points. With 2000 additional unlabeled data, the training time is 1.13 seconds and the prediction time is 39.99 seconds, using MATLAB code on a Pentium IV 2.4 GHz CPU.
Figure 3: The MDR and FDR change with different hyper-parameters under the Hand_mid standard ground truth. Each panel plots MDR or FDR against σc and σp. (a) “Patient 1” MDR. (b) “Patient 1” FDR. (c) “Patient 2” MDR. (d) “Patient 2” FDR. (e) “Patient 3” MDR. (f) “Patient 3” FDR.
Figure 4: Training time and prediction time. Panels: (a) Training Time, (b) Prediction Time. Each panel plots the time in seconds against the number of unlabeled data (0 to 8000).
4.4.4
Comparison
We first compare the segmentation results with those of the following methods: SVM, spectral clustering, graph cut and active contour. Then, we give a numerical comparison using the MDR and FDR criteria. In addition, to remove the speckles in our segmentation results, morphological open and close operators are included as post-processing techniques. Specifically, we use a two-pixel open operation and a three-pixel close operation. These operators filter out the isolated regions and merge the connected regions. First, the SSGPI (Semi-Supervised Gaussian Process Induction) and post-processed SSGPI segmentation results are shown in (C) and (D) of each sub-figure in Fig. 5. We also test the GVF (Gradient Vector Flow) Snake algorithm [50]. The result after 80 iterations is shown in (E) of each sub-figure. We adopt a different kind of human interactive hint for the GVF Snake algorithm, and the result contains many falsely detected voxels. This is because the tumorous tissue diffuses into the normal tissues, and active contour based methods have problems when the boundary of the object is not clear. Moreover, the result of the graph cut based Lazy snapping [54] is shown in (F). The result of SVM using only the interactive hints as training data is shown in (G). Using only the same labeled information is unfair to Lazy snapping and SVM, since our method additionally uses unlabeled data. With more human hints and post-processing modifications, Lazy snapping and SVM would give better results. Nevertheless, this shows that our method can do better than Lazy snapping and SVM with fewer hints. Finally, the result of spectral clustering using the Nyström method [42] is shown in (H). The unsupervised spectral clustering algorithm tends to classify normal tissues as pathological, which makes the FDR very high. Our approach is closely related to spectral analysis, and overcomes this problem by using some human hints in a semi-automated mode. The results are more robust and accurate.

For numerical comparison, we choose the following methods: (1) The supervised method SVM [56]. Different from the result in Fig. 5, SVM uses one or more typical slices of MR images to segment the other slices. (2) The unsupervised method spectral clustering (SC), which uses the Nyström method [42]. (3) The interactive GVF Snake algorithm [50], which uses the same interactive hints as in sub-figure (E) of Fig. 5. (4) The SSGPI, which is proposed in this paper. The results are shown in Table 4. The hyper-parameters of spectral clustering for the three patients are $\sigma_c^{(1)} = \sigma_p^{(1)} = 0.15$, $\sigma_c^{(2)} = 0.1$, $\sigma_p^{(2)} = 0.15$ and $\sigma_c^{(3)} = \sigma_p^{(3)} = 0.1$. The hyper-parameters of SSGPI for the three patients are either fixed or estimated as described in section 4.4.1. All of these methods are compared with the standard ground truth Handmid. Since the spectral clustering algorithm and ours use sampling methods, we run these two methods ten times. Thus, the values are the average MDR and FDR indexes of the post-processed results. We can see that, for each patient, the lowest FDR is obtained by one of the SSGPI variants. This means the segmentation results of SSGPI lie mainly inside the region of the hand-guided results. The MDR of our algorithm is also competitive with the other methods. For “Patient 2” and “Patient 3”, GVF Snake and spectral clustering, respectively, achieve low MDRs. However, the corresponding FDRs are very high. The reason is that, for “Patient 2” and “Patient 3”, the boundaries of the tumorous tissues are more blurry. For the spectral clustering algorithm tested on “Patient 3” in particular, the region of the tumorous tissues is very small, and we find that the spectral clustering algorithm tries to segment the normal tissues rather than the pathological ones.

Figure 5: Segmentation Results. Panels: (a) “Patient 1”, (b) “Patient 2”, (c) “Patient 3”. (A) T1 weighted image of slice 60 and the human interactive hints of object and background. (B) Ground truth Handmid. (C) The SSGPI result without post-processing. (D) The post-processed SSGPI result. (The morphological post-processing techniques are a two-pixel open operation and a three-pixel close operation.) (E) The result of GVF Snake (Initial contour: magenta dotted line. Final result: blue solid line). (F) The result of Lazy snapping. (G) The result of SVM. (H) The result of spectral clustering using the Nyström method.

Table 4: Comparison with other methods. See the text of section 4 for the meaning of the abbreviations. For SC-Nyström and SSGPI we run them ten times and report their means and standard deviations (mean±std).
Patient, Slice        Criterion   SVM      GVF Snake   SC-Nyström      SSGPI (Fixed)   SSGPI (BayesEst)
Patient 1, Slice 60   MDR         0.2744   0.2898      0.3383±0.3477   0.2452±0.0266   0.2667±0.0337
Patient 1, Slice 60   FDR         0.0084   0.0361      0.1192±0.0586   0.0105±0.0164   0.0076±0.0131
Patient 2, Slice 52   MDR         0.2433   0.0981      0.2394±0.2673   0.1802±0.0179   0.0805±0.0416
Patient 2, Slice 52   FDR         0.0492   0.2087      0.1432±0.0343   0.0240±0.0242   0.3740±0.2502
Patient 3, Slice 81   MDR         0.2333   0.5373      0.1914±0.2859   0.3819±0.2588   0.2500±0.0576
Patient 3, Slice 81   FDR         0.0460   0.0294      4.9087±1.7557   0.0085±0.0201   0.0844±0.0421
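The post-processing and evaluation steps used in this comparison can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the MATLAB code used in the experiments. We assume square structuring elements whose half-widths (2 and 3) stand in for the "two-pixel" and "three-pixel" operations, and we assume MDR = |truth \ seg| / |truth| and FDR = |seg \ truth| / |truth|. The normalization of the FDR by the ground-truth size is our guess, motivated by the FDR above 1 reported in Table 4. All function names are hypothetical.

```python
import numpy as np

def dilate(mask, r):
    """Binary dilation with a (2r+1) x (2r+1) square; outside the image is background."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.zeros_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out

def erode(mask, r):
    """Binary erosion with a (2r+1) x (2r+1) square; outside the image is background."""
    mask = np.asarray(mask, dtype=bool)
    h, w = mask.shape
    padded = np.pad(mask, r, mode="constant", constant_values=False)
    out = np.ones_like(mask)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out &= padded[dy:dy + h, dx:dx + w]
    return out

def post_process(seg, open_r=2, close_r=3):
    """Opening removes isolated speckles, closing fills small holes (radii are assumptions)."""
    opened = dilate(erode(seg, open_r), open_r)
    return erode(dilate(opened, close_r), close_r)

def mdr_fdr(seg, truth):
    """Assumed definitions: missed and falsely detected pixels, both divided by |truth|."""
    seg, truth = np.asarray(seg, dtype=bool), np.asarray(truth, dtype=bool)
    truth_size = truth.sum()
    missed = np.logical_and(truth, ~seg).sum()
    false_detected = np.logical_and(seg, ~truth).sum()
    return missed / truth_size, false_detected / truth_size

# Toy example: a square "tumor" with one missing pixel and one isolated speckle.
truth = np.zeros((40, 40), dtype=bool)
truth[10:30, 10:30] = True
seg = truth.copy()
seg[15, 15] = False   # hole inside the object
seg[0, 0] = True      # isolated false detection

mdr_before, fdr_before = mdr_fdr(seg, truth)
mdr_after, fdr_after = mdr_fdr(post_process(seg), truth)
print(mdr_before, fdr_before, mdr_after, fdr_after)
```

In this toy case the opening removes the speckle and the closing fills the hole, which mirrors the role the two operators play in panels (C) and (D) of Fig. 5.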
4.5
3D Segmentation Results
We also test our proposed method on the 3D segmentation problem. Fig. 6 shows the slices of three patients. The left columns of the three sub-figures are the original T1 weighted images. The middle columns are the hand-guided standard ground truths Handmid. The fixed hyper-parameters used for the three patients are $\sigma_c^{(1)} = \sigma_p^{(1)} = 0.1$, $\sigma_c^{(2)} = \sigma_p^{(2)} = 0.08$ and $\sigma_c^{(3)} = \sigma_p^{(3)} = 0.06$. To segment the 3D data, we multiply the Z axis by an extra weight, which is equal to 0.1. The labeled hints are the same as the ones in Fig. 5, which are in slice 60 of “Patient 1”, slice 52 of “Patient 2” and slice 81 of “Patient 3”. The unlabeled data are randomly selected from the 3D images. The results are shown in the right columns of the sub-figures in Fig. 6. This shows that our inductive approach can directly handle 3D image data. The advantage is that we only need to label one image slice to obtain the whole 3D image segmentation.
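The sampling of unlabeled voxels and the down-weighting of the Z axis can be sketched as follows. This is a minimal illustration under assumptions of ours: the function names are hypothetical, the volume is indexed as (z, y, x), intensities are rescaled to [0, 1], and positions are divided by the volume size before the Z weight of 0.1 is applied. The paper itself only states that the Z axis is multiplied by 0.1.

```python
import numpy as np

def sample_unlabeled(volume_shape, n, rng):
    """Draw n distinct voxel coordinates (z, y, x) uniformly from the whole volume."""
    flat = rng.choice(int(np.prod(volume_shape)), size=n, replace=False)
    return np.stack(np.unravel_index(flat, volume_shape), axis=1)

def voxel_features(volume, coords, z_weight=0.1):
    """Return intensity features and position features for the given voxels.

    The Z coordinate is multiplied by z_weight, so that distances across
    slices count less than distances inside a slice.
    """
    coords = np.asarray(coords)
    z, y, x = coords[:, 0], coords[:, 1], coords[:, 2]
    lo, hi = float(volume.min()), float(volume.max())
    scale = hi - lo if hi > lo else 1.0
    intensity = (volume[z, y, x].astype(float) - lo) / scale
    depth, height, width = volume.shape
    position = np.stack([x / width, y / height, z_weight * z / depth], axis=1)
    return intensity[:, None], position

rng = np.random.default_rng(0)
volume = rng.random((8, 16, 16))           # toy stand-in for a 3D MR volume
coords = sample_unlabeled(volume.shape, 50, rng)
intensity, position = voxel_features(volume, coords)
print(intensity.shape, position.shape)
```

The labeled hints from the single annotated slice would be passed through the same feature function, so that labeled and unlabeled points share one feature space.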
5
Conclusion
In this paper, we solve the inductive problem with a non-parametric discriminative model. It can predict labels for unseen data outside the training set. This makes 3D semi-automated segmentation possible, which is done by labeling some data in one image and randomly selecting unlabeled data from the 3D images. The training of our algorithm is fast. The prediction time is also acceptable; however, it is not yet suitable for interactive use. A method for hyper-parameter estimation has also been developed. When compared against published work, our algorithm is very competitive, producing similar or better results. Our approach is conservative in judging whether a voxel belongs to pathological tissue. The false detection rate is low and the missing detection rate is acceptable. One advantage is that it does not attempt to demarcate a large “useless region”, which could confuse the user in the interaction. Our approach only gives the user the most certain region. In the future, we plan to develop a fast sparse algorithm and a C++ implementation, which could make our algorithm applicable in an interactive way. Moreover, we think that using tensor techniques to process the data without destroying the structural information is quite interesting [69, 70].

Figure 6: 3D Segmentation Results. Panels: (a) Patient 1, (b) Patient 2, (c) Patient 3. The left columns of each sub-figure are the original T1 weighted images. The middle columns are the hand-guided standard ground truth Handmid. The right columns are the results of SSGPI.
Appendix A: Explanation of EBM
For the semi-supervised problem, we initially set the labels of the unlabeled data to zero. Thus, if ti = 0, the probability P(ti = 0|yi) ≡ λ. The factor λ makes the function P(ti|yi) a probability with respect to ti, which means P(ti = 1|yi) + P(ti = −1|yi) + P(ti = 0|yi) ≡ 1. As Fig. 7 shows, this model can be considered as a degenerate Ordered Category Model (OCM) [71], where the variance of the probability P(ti = 0|yi) is infinite. We define the margin as the range where P(ti = 0|yi) is larger than both P(ti = 1|yi) and P(ti = −1|yi). Inside the margin, the difference between P(ti = 1|yi) and P(ti = −1|yi) is smaller than outside the margin. Therefore, the margin of EBM represents the more uncertain labels. The parameter λ controls this margin. Moreover, Fig. 7 (b)-(d) show the relationship between the prior and the posterior probability of the latent variable. In GP and GRF, we can assume that each latent variable is also conditionally Gaussian:
$$ P(y_i \mid \mathbf{y}_{N-\{i\}}, X_N) = \mathcal{N}(\mu_i, \sigma_i) \triangleq P(y_i) \tag{21} $$
where µi and σi are related to the input points and labels (see the graphical model in Fig. 1 (b)). As Fig. 7 (b) and (c) show, for ti = 1 and ti = −1, the mean and the variance of the posteriors P(yi|ti = 1) and P(yi|ti = −1) are related to the likelihood P(ti|yi) and the prior P(yi). If µi is near zero, the posterior of the latent variable yi is affected by the label ti: the estimated yi is positive when the label is 1, and negative when the label is −1. If the label is ti = 0, by the Bayesian formulation we have P(ti|yi)P(yi) = P(yi|ti)P(ti). The probabilities P(ti) and P(ti|yi) are both constant for ti = 0, so the posterior probability P(yi|ti = 0) only depends on the prior P(yi). If µi is still near zero, we will get a zero estimate of yi by maximizing the posterior probability. This is why we choose a graph regularization based prior. As mentioned previously, each covariance between two points of this prior is related to all the training data. Thus, if there is a small amount of labeled data in the training set, µi of an unlabeled xi will be affected by the labeled data more than it would be under the traditional prior. Then µi is non-zero, which leads to a non-zero estimate of yi (shown in Fig. 7 (c)). Furthermore, by comparing Fig. 7 (c) and (d), we can see that the margins do not affect the estimation of the latent variable. The estimated yi remains the same in spite of different margin models being imposed on the process y → t. However, any point whose latent variable yi falls inside the margin will be labeled zero, so it remains unlabeled. Such points do not contribute to the prediction function (see the prediction phase). Therefore, the classification boundary will be changed.
Figure 7: Illustration of Extended Bernoulli Model. Panels: (a) OCM, (b) λ = 0.4, µy = 0 (hard margin), (c) λ = 0.4, µy = 1.5 (hard margin), (d) λ = 1/3, µy = 1.5 (soft margin). Panel (a) plots p(t|y) for t = −1, 0, 1; panels (b)-(d) plot P(t|y) and P(y|t) against y, with the margin marked.
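The dependence of the margin on λ can be made concrete with a small sketch. The form of the likelihood is not restated in this appendix, so we assume here that P(t = ±1|y) = (1 − λ)/(1 + exp(−t y)). This choice satisfies the normalization P(t = 1|y) + P(t = −1|y) + P(t = 0|y) ≡ 1 and appears consistent with Fig. 7, where λ = 0.4 gives a visible margin and λ = 1/3 gives none. Under this assumption the margin is |y| < ln(λ/(1 − 2λ)) for 1/3 < λ < 1/2. The function names are hypothetical.

```python
import math

def ebm_likelihood(t, y, lam):
    """Assumed EBM likelihood: constant lam for t = 0, scaled logistic for t = +1 or -1."""
    if t == 0:
        return lam
    return (1.0 - lam) / (1.0 + math.exp(-t * y))

def margin_half_width(lam):
    """Largest |y| for which P(t=0|y) still dominates both P(t=1|y) and P(t=-1|y)."""
    if lam <= 1.0 / 3.0:
        return 0.0          # the zero label never dominates: no margin
    if lam >= 0.5:
        return math.inf     # the zero label dominates everywhere
    return math.log(lam / (1.0 - 2.0 * lam))

for lam in (1.0 / 3.0, 0.4, 0.45):
    print(lam, margin_half_width(lam))
```

For λ = 0.4 the half-width is ln 2 ≈ 0.69, so under the assumed likelihood a point with |yi| below this value would keep the label zero.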
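Appendix B: Derivation of Hyper-Parameter Estimation
The hyper-parameters are the standard deviations Θ = {σc, σp} of the exponential weight, which is shown in (19). We take σc as an example. The hyper-parameter is estimated by gradient search, which minimizes the negative logarithmic likelihood (10). According to equations (5) and (8), using the fact that ∇Ψ = 0 and taking derivatives on both sides, we have:
$$ \frac{\partial \mathbf{y}_N}{\partial \sigma_c} = (\mathbf{I} + \mathbf{K}_N \mathbf{\Pi}_N)^{-1} \frac{\partial \mathbf{K}_N}{\partial \sigma_c} (-\boldsymbol{\alpha}_N) \tag{22} $$
And according to equation (4) we have:
$$ \frac{\partial \mathbf{K}_N^{-1}}{\partial \sigma_c} = -\mathbf{D}^{-\frac{1}{2}} \frac{\partial \mathbf{W}}{\partial \sigma_c} \mathbf{D}^{-\frac{1}{2}} + \mathbf{D}^{-1} \left( \frac{\partial \sqrt{\mathbf{D}}}{\partial \sigma_c} \mathbf{W} \sqrt{\mathbf{D}} + \sqrt{\mathbf{D}} \mathbf{W} \frac{\partial \sqrt{\mathbf{D}}}{\partial \sigma_c} \right) \mathbf{D}^{-1} \tag{23} $$
where:
$$ \frac{\partial W_{ij}}{\partial \sigma_c} = W_{ij} \, \| x_c(i) - x_c(j) \|^2 / \sigma_c^3, \qquad \frac{\partial \sqrt{D_{ii}}}{\partial \sigma_c} = \frac{1}{2\sqrt{D_{ii}}} \sum_k W_{ik} \, \| x_c(i) - x_c(k) \|^2 / \sigma_c^3 \tag{24} $$
Furthermore, the derivatives of K_N and K_N Π_N are given by:
$$ \frac{\partial \mathbf{K}_N}{\partial \sigma_c} = -\mathbf{K}_N \frac{\partial \mathbf{K}_N^{-1}}{\partial \sigma_c} \mathbf{K}_N \tag{25} $$
and
$$ \frac{\partial (\mathbf{K}_N \mathbf{\Pi}_N)}{\partial \sigma_c} = \frac{\partial \mathbf{K}_N}{\partial \sigma_c} \mathbf{\Pi}_N + \mathbf{K}_N \frac{\partial \mathbf{\Pi}_N}{\partial \sigma_c} \tag{26} $$
where:
$$ \frac{\partial \mathbf{\Pi}_N}{\partial \sigma_c} = \frac{\partial \mathbf{\Pi}_N}{\partial \mathbf{y}_N} \frac{\partial \mathbf{y}_N}{\partial \sigma_c} = \mathrm{diag}\!\left( \frac{t_i^3 \exp(t_i y_i) \left( 1 - \exp(t_i y_i) \right)}{\left( 1 + \exp(t_i y_i) \right)^3} \right) \mathrm{diag}\!\left( \frac{\partial \mathbf{y}_N}{\partial \sigma_c} \right) \tag{27} $$
Therefore, the derivative of the objective function can be given by (11).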
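The derivatives (23) and (24) can be checked numerically against central finite differences. The sketch below is only a sanity check under assumptions of ours, since equations (4) and (19) are not restated here: we assume W_ij = exp(−‖xc(i) − xc(j)‖²/2σc² − ‖xp(i) − xp(j)‖²/2σp²) and K_N^{-1} = I − D^{-1/2} W D^{-1/2}. Any additional constant term in (4) would not change the derivative. The data, function names and the value σc = σp = 0.3 are arbitrary illustrations.

```python
import numpy as np

def weight_matrix(xc, xp, sigma_c, sigma_p):
    """Assumed form of (19): Gaussian weight in intensity (xc) and position (xp)."""
    dc = ((xc[:, None, :] - xc[None, :, :]) ** 2).sum(axis=-1)  # ||xc(i) - xc(j)||^2
    dp = ((xp[:, None, :] - xp[None, :, :]) ** 2).sum(axis=-1)  # ||xp(i) - xp(j)||^2
    W = np.exp(-dc / (2 * sigma_c**2) - dp / (2 * sigma_p**2))
    return W, dc

def inverse_covariance(W):
    """Assumed form of (4): K^{-1} = I - D^{-1/2} W D^{-1/2}."""
    d = W.sum(axis=1)
    return np.eye(len(W)) - W / np.sqrt(np.outer(d, d))

def analytic_derivatives(xc, xp, sigma_c, sigma_p):
    W, dc = weight_matrix(xc, xp, sigma_c, sigma_p)
    dW = W * dc / sigma_c**3                      # first part of (24)
    sqrt_d = np.sqrt(W.sum(axis=1))
    dsqrt_d = dW.sum(axis=1) / (2 * sqrt_d)       # second part of (24)
    D_inv_half = np.diag(1 / sqrt_d)
    D_inv = np.diag(1 / sqrt_d**2)
    sqrt_D, dsqrt_D = np.diag(sqrt_d), np.diag(dsqrt_d)
    dK_inv = (-D_inv_half @ dW @ D_inv_half
              + D_inv @ (dsqrt_D @ W @ sqrt_D + sqrt_D @ W @ dsqrt_D) @ D_inv)  # (23)
    return dW, dsqrt_d, dK_inv

def central_difference(f, sigma, h=1e-6):
    return (f(sigma + h) - f(sigma - h)) / (2 * h)

rng = np.random.default_rng(0)
xc = rng.uniform(size=(6, 1))   # toy intensity features
xp = rng.uniform(size=(6, 2))   # toy position features
sigma_c, sigma_p = 0.3, 0.3

dW, dsqrt_d, dK_inv = analytic_derivatives(xc, xp, sigma_c, sigma_p)
num_dW = central_difference(lambda s: weight_matrix(xc, xp, s, sigma_p)[0], sigma_c)
num_dsqrt_d = central_difference(
    lambda s: np.sqrt(weight_matrix(xc, xp, s, sigma_p)[0].sum(axis=1)), sigma_c)
num_dK_inv = central_difference(
    lambda s: inverse_covariance(weight_matrix(xc, xp, s, sigma_p)[0]), sigma_c)
print(np.abs(dW - num_dW).max(), np.abs(dK_inv - num_dK_inv).max())
```

Equation (25) is the standard identity for the derivative of a matrix inverse and is not checked here, because the assumed K_N^{-1} above is singular without the extra terms of (4).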
Appendix C: Derivation of Prediction Function
For predicting new test points, we first distinguish between different sizes of the covariance matrix K with a subscript: (1) K_N is the covariance matrix of the N input training data, and (2) K_{N+1} is the (N + 1) × (N + 1) covariance matrix of the vector $(\mathbf{y}_N^T, y_{N+1})^T$, where y_{N+1} is the latent variable of a new given point in the test set. However, it is difficult to compute K_{N+1} from K_N in an explicit expression, since K_{N+1} itself depends on all the training data and each new point³. We make use of $k_i = W_{N+1,i} = \exp(-\| x_{N+1} - x_i \|^2 / 2\sigma^2)$ as the covariance between a new given point and the i-th training point. Therefore, K_{N+1} can be given by:
$$ \mathbf{K}_{N+1} = \begin{pmatrix} \mathbf{K}_N & \nu \mathbf{k} \\ \nu \mathbf{k}^T & k^* \end{pmatrix} \tag{28} $$
where ν is a scale factor to make k compatible with K_N. Note that the covariance matrix K_N depends on the global distance information, while the covariances between the new point and the training points depend only on the local distance information. Then, $\mathbf{K}_{N+1}^{-1}$ is given by:
$$ \mathbf{K}_{N+1}^{-1} = \begin{pmatrix} \mathbf{M} & \mathbf{m} \\ \mathbf{m}^T & \mu \end{pmatrix} \tag{29} $$
By using the partitioned inverse equations [61], we have:
$$ \mu = (k^* - \nu^2 \mathbf{k}^T \mathbf{K}_N^{-1} \mathbf{k})^{-1}, \qquad \mathbf{m} = -\mu \nu \mathbf{K}_N^{-1} \mathbf{k}, \qquad \mathbf{M} = \mathbf{K}_N^{-1} + \frac{1}{\mu} \mathbf{m} \mathbf{m}^T \tag{30} $$
Since the optimal $\hat{\mathbf{y}}_N$ has been estimated, we only need to minimize (15) with respect to y_{N+1}: $\mu y_{N+1} + \mathbf{m}^T \mathbf{y}_N = 0$, which leads to:
$$ \hat{y}_{N+1} = \mathbf{k}^T \mathbf{K}_N^{-1} \hat{\mathbf{y}}_N = \mathbf{k}^T (\mathbf{I} - \mathbf{S}) \hat{\mathbf{y}}_N \tag{31} $$
Here the scale factor ν can be omitted.

³ Namely, if we want to induce Δ_{N+1} from Δ_N directly, we need to compute D_ii for each new given point. This is very time consuming.
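The partitioned inverse (30) and the resulting prediction rule can be verified numerically with a short sketch. The matrices below are random toy data, not values from the experiments, and the variable names are ours. We assume that the part of (15) that depends on y_{N+1} is the quadratic form in K_{N+1}^{-1}, so that its stationary point satisfies µ y_{N+1} + mᵀ y_N = 0 as stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
G = rng.normal(size=(N, N))
K_N = G @ G.T + N * np.eye(N)          # toy symmetric positive definite covariance
k = rng.uniform(0.1, 1.0, size=N)      # covariances between the new point and training points
k_star, nu = 1.0, 0.5                  # toy self-covariance and scale factor

# Equation (28): bordered covariance matrix of (y_N, y_{N+1}).
K_full = np.block([[K_N, nu * k[:, None]],
                   [nu * k[None, :], np.array([[k_star]])]])

# Equation (30): blocks of the inverse, computed from K_N^{-1} only.
K_inv = np.linalg.inv(K_N)
mu = 1.0 / (k_star - nu**2 * k @ K_inv @ k)
m = -mu * nu * K_inv @ k
M = K_inv + np.outer(m, m) / mu

# Equation (29): assemble the inverse from its blocks.
K_full_inv = np.block([[M, m[:, None]],
                       [m[None, :], np.array([[mu]])]])

# Stationary point mu * y_new + m^T y_hat = 0, compared with nu * k^T K_N^{-1} y_hat.
y_hat = rng.normal(size=N)
y_new = -(m @ y_hat) / mu
print(y_new, nu * k @ K_inv @ y_hat)
```

Because ν is a positive scalar, it rescales the predicted latent value without changing its sign, which is why it can be dropped when only the predicted label is needed.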
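Appendix D: Explanation of The Decomposition
The authors in [42] proved that the weight matrix W can be approximately decomposed as:
$$ \hat{\mathbf{W}} = \mathbf{V} \mathbf{\Lambda}_S \mathbf{V}^T \tag{32} $$
where
$$ \mathbf{V} = \begin{pmatrix} \mathbf{A} \\ \mathbf{B}^T \end{pmatrix} \mathbf{A}^{-1/2} \mathbf{U}_S \mathbf{\Lambda}_S^{-1/2} \tag{33} $$
A is the sub-block of the weight matrix generated by the sampling points, and B is the weight matrix between the sampling points and the rest. We assume that W is written as $\mathbf{W} = \begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{C} \end{pmatrix}$, where C is the weight matrix of the remaining points. The matrices U_S and Λ_S contain the eigenvectors and eigenvalues of the matrix $\mathbf{S} = \mathbf{A} + \mathbf{A}^{-1/2} \mathbf{B} \mathbf{B}^T \mathbf{A}^{-1/2} = \mathbf{U}_S \mathbf{\Lambda}_S \mathbf{U}_S^T$, where $\mathbf{A}^{-1/2}$ denotes the symmetric positive definite square root⁴ of A. Then we can find that the constraint $\mathbf{V}^T \mathbf{V} = \mathbf{I}$ is satisfied automatically and the approximated weight matrix is given by:
$$ \hat{\mathbf{W}} = \begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{B}^T & \mathbf{B}^T \mathbf{A}^{-1} \mathbf{B} \end{pmatrix} \tag{34} $$
The difference between C and $\mathbf{B}^T \mathbf{A}^{-1} \mathbf{B}$ is just the Schur complement. In addition, for the purpose of our accelerated algorithm, we need to approximate the matrix $\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{W}} \hat{\mathbf{D}}^{-\frac{1}{2}}$. This has been exploited by [42], which replaces the matrices A and B as (with n = M sampled points and m = N − M remaining points):
$$ A_{ij} \leftarrow \frac{A_{ij}}{\sqrt{\hat{d}_i \hat{d}_j}} \quad (i, j = 1, \ldots, n) \tag{35} $$
$$ B_{ij} \leftarrow \frac{B_{ij}}{\sqrt{\hat{d}_i \hat{d}_{j+n}}} \quad (i = 1, \ldots, n;\ j = 1, \ldots, m) \tag{36} $$
The vector $\hat{\mathbf{d}}$ can be evaluated by:
$$ \hat{\mathbf{d}} = \hat{\mathbf{W}} \mathbf{1} = \begin{pmatrix} \mathbf{A} \mathbf{1}_M + \mathbf{B} \mathbf{1}_{N-M} \\ \mathbf{B}^T \mathbf{1}_M + \mathbf{B}^T \mathbf{A}^{-1} \mathbf{B} \mathbf{1}_{N-M} \end{pmatrix} $$
where 1 is the column vector of ones. Thus, we can rewrite the decomposition of $\hat{\mathbf{D}}^{-\frac{1}{2}} \hat{\mathbf{W}} \hat{\mathbf{D}}^{-\frac{1}{2}}$ as $\mathbf{U} \mathbf{\Lambda} \mathbf{U}^T$, where Λ is an M × M matrix.

⁴ The weight matrix is nearly positive semi-definite, so in practice we use the pseudo-inverse or add extra regularization to find the square root of A.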
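The decomposition (32)-(34) can be reproduced on a toy problem with the following sketch. It is an illustration under assumptions of ours rather than the implementation used in the paper: the Gaussian weight with σ = 0.5, the four fixed sampling points and the eigenvalue floor used for the inverse square root (one way to realize the regularization mentioned in footnote 4) are all hypothetical choices.

```python
import numpy as np

def sym_inv_sqrt(A, eps=1e-10):
    """Symmetric inverse square root of A via its eigendecomposition (eigenvalues floored at eps)."""
    vals, vecs = np.linalg.eigh(A)
    vals = np.maximum(vals, eps)
    return (vecs / np.sqrt(vals)) @ vecs.T

def nystrom(A, B):
    """Return V and the eigenvalues of S such that V diag(lam) V^T approximates W, as in (32)-(33)."""
    A_isqrt = sym_inv_sqrt(A)
    S = A + A_isqrt @ B @ B.T @ A_isqrt
    lam, U = np.linalg.eigh(S)
    V = np.vstack([A, B.T]) @ A_isqrt @ U / np.sqrt(lam)
    return V, lam

rng = np.random.default_rng(0)
samples = np.array([[0.1, 0.1], [0.9, 0.1], [0.1, 0.9], [0.9, 0.9]])  # sampled points
rest = rng.uniform(size=(8, 2))                                       # remaining points
X = np.vstack([samples, rest])
sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
W = np.exp(-sq_dist / (2 * 0.5**2))

n, m = len(samples), len(rest)
A, B = W[:n, :n], W[:n, n:]
V, lam = nystrom(A, B)
W_hat = (V * lam) @ V.T

# Equation (34) and the row sums d_hat = W_hat 1, both computed from A and B only.
expected = np.block([[A, B], [B.T, B.T @ np.linalg.solve(A, B)]])
d_hat = np.concatenate([A @ np.ones(n) + B @ np.ones(m),
                        B.T @ np.ones(n) + B.T @ np.linalg.solve(A, B @ np.ones(m))])
print(np.abs(W_hat - expected).max(), np.abs(V.T @ V - np.eye(n)).max())
```

Only the n × n block A is ever decomposed, which is what keeps the cost linear in the number of remaining points.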
Acknowledgment
This work is funded by the Basic Research Foundation of Tsinghua National Laboratory for Information Science and Technology (TNList). We would like to thank the anonymous reviewers for their valuable suggestions. We would also like to give special thanks to Qian Wu, Weibei Dou and Yonglei Zhou for providing us with their detailed experimental data and code.
References
[1] Song, Y., Zhang, C., Lee, J., Wang, F. (2006) A discriminative method for semi-automated tumorous tissues segmentation of MR brain images. In: Proc. of CVPR Workshop on Mathematical Methods in Biomedical Image Analysis (MMBIA). pp 79
[2] Pham, D.L., Xu, C., Prince, J.L. (2000) Current methods in medical image segmentation. Annual Review of Biomedical Engineering 2 pp 315–337
[3] Liew, A.W.C., Yan, H. (2006) Current methods in the automatic tissue segmentation of 3D magnetic resonance brain images. Current Medical Imaging Reviews 2(1) pp 91–103
[4] Leemput, K.V., Maes, F., Vandermeulen, D., Suetens, P. (1999) Automated model-based tissue classification of MR images of the brain. IEEE Trans. Med. Imag. 18(10) pp 897–908
[5] Pham, D., Prince, J. (1999) Adaptive fuzzy segmentation of magnetic resonance images. IEEE Trans. Med. Imag. 18(9) pp 737–752
[6] Zhang, Y., Brady, M., Smith, S.M. (2001) Segmentation of brain MR images through a hidden Markov random field model and the expectation maximization algorithm. IEEE Trans. Med. Imag. 20(1) pp 45–57
[7] Marroquín, J.L., Vemuri, B.C., Botello, S., Calderón, F., Fernández-Bouzas, A. (2002) An accurate and efficient Bayesian method for automatic segmentation of brain MRI. IEEE Trans. Med. Imag. 21(8) pp 934–945
[8] Liew, A.W.C., Yan, H. (2003) An adaptive spatial fuzzy clustering algorithm for 3D MR image segmentation. IEEE Trans. Med. Imag. 22(9) pp 1063–1075
[9] Prastawa, M., Gilmore, J.H., Lin, W., Gerig, G. (2004) Automatic segmentation of neonatal brain MRI. In: Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp 10–17
[10] Hall, L., Bensaid, A., Clarke, L., Velthuizen, R., Silbiger, M., Bezdek, J. (1992) A comparison of neural network and fuzzy clustering techniques in segmenting magnetic resonance images of the brain. IEEE Trans. Med. Imag. 3(5) pp 672–682
[11] Sammouda, R., Niki, N., Nishitani, H. (1996) A comparison of Hopfield neural network and Boltzmann machine in segmenting MR images of the brain. IEEE Trans. Nucl. Sci. 43(6) pp 3361–3369
[12] Zhou, J., Chan, K.L., Chong, V.F.H., Krishnan, S.M. (2005) Extraction of brain tumor from MR images using one-class support vector machine. In: Proc. of 27th Annual Int’l Conf. of the IEEE Engineering in Medicine and Biology Society (EMBS). pp 6411–6414
[13] Moon, N., Bullitt, E., Leemput, K.V., Gerig, G. (2002) Automatic brain and tumor segmentation. In: Proc. of 5th Int’l Conf. on Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp 372–379
[14] Shen, S., Sandham, W., Granat, M., Sterr, A. (2005) MRI fuzzy segmentation of brain tissue using neighborhood attraction with neural-network optimization. IEEE Trans. Med. Imag. 9(3) pp 459–467
[15] Li, C., Goldgof, D., Hall, L. (1993) Knowledge-based classification and tissue labeling of MR images of human brain. IEEE Trans. Med. Imag. 12(4) pp 740–750
[16] Clark, M., Hall, L., Goldgof, D., Velthuizen, R., Murtagh, F., Silbiger, M. (1998) Automatic tumor segmentation using knowledge-based techniques. IEEE Trans. Med. Imag. 17(2) pp 187–201
[17] Cuadra, M., Pollo, C., Bardera, A., Cuisenaire, O., Villemure, J.G., Thiran, J.P. (2004) Atlas-based segmentation of pathological MR brain images using a model of lesion growth. IEEE Trans. Med. Imag. 23(10) pp 1301–1314
[18] Zhu, Y., Yan, Z. (1997) Computerized tumor boundary detection using a Hopfield neural network. IEEE Trans. Med. Imag. 16(1) pp 55–67
[19] Droske, M., Meyer, B., Rumpf, M., Schaller, C. (2001) An adaptive level set method for medical image segmentation. In: Proc. of 17th Int’l Conf. Information Processing in Medical Imaging (IPMI), Davis, CA, USA pp 416–422
[20] Lefohn, A.E., Cates, J.E., Whitaker, R.T. (2003) Interactive, GPU-based level sets for 3D segmentation. In: Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI), Montreal, Que., Canada, Springer Verlag pp 564–572
[21] Prastawa, M., Bullitt, E., Ho, S., Gerig, G. (2004) Robust estimation for brain tumor segmentation. In: Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp 10–17
[22] Guermeur, Y. (2002) Combining discriminant models with new multi-class SVMs. Pattern Analysis and Applications 5(2) pp 168–179
[23] Tortorella, F. (2004) Reducing the classification cost of support vector classifiers through an ROC-based reject rule. Pattern Analysis and Applications 7(2) pp 128–143
[24] Debnath, R., Takahide, N., Takahashi, H. (2004) A decision based one-against-one method for multi-class support vector machine. Pattern Analysis and Applications 7(2) pp 164–175
[25] Sánchez, J.S., Mollineda, R.A., Sotoca, J.M. (2007) An analysis of how training data complexity affects the nearest neighbor classifiers. Pattern Analysis and Applications 10(3) pp 189–201
[26] Abe, S. (2007) Sparse least squares support vector training in the reduced empirical feature space. Pattern Analysis and Applications 10(3) pp 203–214
[27] Herrero, J.R., Navarro, J.J. (2007) Exploiting computer resources for fast nearest neighbor classification. Pattern Analysis and Applications 10(4) pp 265–275
[28] Tyree, E.W., Long, J.A. (1998) A Monte Carlo evaluation of the moving method, k-means and two self-organising neural networks. Pattern Analysis and Applications 1(2) pp 79–90
[29] Chou, C.H., Su, M.C., Lai, E. (2004) A new cluster validity measure and its application to image compression. Pattern Analysis and Applications 7(2) pp 205–220
[30] Frigui, H. (2005) Unsupervised learning of arbitrarily shaped clusters using ensembles of Gaussian models. Pattern Analysis and Applications 8(1-2) pp 32–49
[31] Omran, M.G.H., Salman, A., Engelbrecht, A.P. (2006) Dynamic clustering using particle swarm optimization with application in image segmentation. Pattern Analysis and Applications 8(4) pp 332–344
[32] Seeger, M. (2001) Learning with labeled and unlabeled data. Technical report, Institute for ANC, Edinburgh, UK http://www.dai.ed.ac.uk/~seeger/papers.html.
[33] Zhu, X. (2005) Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison http://www.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf.
[34] Belkin, M., Niyogi, P. (2003) Using manifold structure for partially labeled classification. In: Proc. of Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, MIT Press pp 929–936
[35] Belkin, M., Niyogi, P., Sindhwani, V. (2006) Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 1(1) pp 1–48
[36] Krishnapuram, B., Williams, D., Xue, Y., Hartemink, A., Carin, L., Figueiredo, M. (2005) On semi-supervised classification. In: Proc. of Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, MIT Press pp 721–728
[37] Zhou, D., Bousquet, O., Lal, T.N., Weston, J., Schölkopf, B. (2003) Learning with local and global consistency. In: Proc. of Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, MIT Press pp 321–328
[38] Zhou, D., Schölkopf, B. (2005) Regularization on discrete spaces. In: Proc. of Pattern Recognition, 27th DAGM Symposium (DAGM-Symposium). Volume 3663 of Lecture Notes in Computer Science., Vienna, Austria, Springer pp 361–368
[39] Zhu, X., Ghahramani, Z., Lafferty, J.D. (2003) Semi-supervised learning using Gaussian fields and harmonic functions. In: Proc. of Twentieth Int’l Conf. of Machine Learning (ICML), Washington, DC, USA, AAAI Press pp 912–919
[40] Zhu, X., Lafferty, J., Ghahramani, Z. (2003) Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical Report CMU-CS-03-175, Computer Sciences, Carnegie Mellon University http://www.cs.cmu.edu/~zhuxj/publications.html.
[41] Sindhwani, V., Chu, W., Keerthi, S.S. (2007) Semi-supervised Gaussian process classifiers. In: Proc. of International Joint Conferences on Artificial Intelligence (IJCAI). pp 1059–1064
[42] Fowlkes, C., Belongie, S., Chung, F., Malik, J. (2004) Spectral grouping using the Nyström method. IEEE Trans. Pattern Anal. Machine Intell. 26(2) pp 214–225
[43] Grady, L., Funka-Lea, G. (2004) Multi-label image segmentation for medical applications based on graph-theoretic electrical potentials. In: Proc. of ECCV Workshops on CVAMIA and MMBIA. pp 230–245
[44] Suri, J.S., Singh, S., Reden, L. (2002) Computer vision and pattern recognition techniques for 2-D and 3-D MR cerebral cortical segmentation (part i): A state-of-the-art review. Pattern Analysis and Applications 5(1) pp 46–76
[45] Suri, J.S., Singh, S., Reden, L. (2002) Computer vision and pattern recognition techniques for 2-D and 3-D MR cerebral cortical segmentation (part i): A state-of-the-art review. Pattern Analysis and Applications 5(1) pp 77–98
[46] Liang, F., Mukherjee, S., West, M. (2006) Understanding the use of unlabelled data in predictive modelling. Statistical Science in press.
[47] Zhu, S. (2003) Statistical modeling and conceptualization of visual patterns. IEEE Trans. Pattern Anal. Machine Intell. 25(6) pp 691–712
[48] Geman, S., Geman, D. (1984) Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. Pattern Anal. Machine Intell. 6(6) pp 721–742
[49] McInerney, T., Terzopoulos, D. (1996) Deformable models in medical image analysis: A survey. Medical Image Analysis 1(2) pp 91–108
[50] Xu, C., Prince, J.L. (1998) Snakes, shapes and gradient vector flow. IEEE Trans. Image Processing 7(3) pp 359–369
[51] Malladi, R., Sethian, J., Vemuri, B. (1995) Shape modeling with front propagation: A level set approach. IEEE Trans. Pattern Anal. Machine Intell. 17(2) pp 158–175
[52] Boykov, Y., Jolly, M.P. (2001) Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In: Proc. of IEEE Int’l Conf. On Computer Vision (ICCV). Volume I., Vancouver, B. C., Canada, IEEE Computer Society pp 105–112
[53] Boykov, Y., Veksler, O., Zabih, R. (2001) Fast approximate energy minimization via graph cuts. IEEE Trans. Pattern Anal. Machine Intell. 23(11) pp 1222–1239
[54] Li, Y., Sun, J., Tang, C.K., Shum, H.Y. (2004) Lazy snapping. ACM Trans. Graph. 23(3) pp 303–308
[55] Rother, C., Kolmogorov, V., Blake, A. (2004) “Grab cut” - interactive foreground extraction using iterated graph cuts. ACM Trans. Graph. 23(3) pp 309–314
[56] Wu, Q., Dou, W., Chen, Y., Constans, J. (2005) Fuzzy segmentation of cerebral tumorous tissues in MR images via support vector machine and fuzzy clustering. In: Proc. of World Congress of Int’l Fuzzy Systems Association (IFSA), Beijing, China, Tsinghua University Press
[57] Ulusoy, I., Bishop, C. (2005) Generative versus discriminative methods for object recognition. In: Proc. of Computer Vision and Pattern Recognition (CVPR). Volume 2. pp 258–265
[58] Abrahamsen, P. (1997) A review of Gaussian random fields and correlation functions, 2nd edition. Technical Report 917, Norwegian Computing Center
[59] Neal, R.M. (1997) Monte Carlo implementation of Gaussian process models for Bayesian regression and classification. Technical Report CRG-TR-97-2, Dept. of Computer Science, University of Toronto http://www.cs.toronto.edu/~radford/papers-online.html.
[60] Williams, C., Barber, D. (1998) Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Machine Intell. 20(12) pp 1342–1351
[61] MacKay, D.J.C. (1998) Introduction to Gaussian processes. Volume 168 of NATO ASI. Springer, Berlin pp 133–165
[62] Chung, F. (1997) Spectral Graph Theory. Number 92 in CBMS Regional Conference Series in Mathematics. American Mathematical Society
[63] Seeger, M. (1999) Relationships between Gaussian processes, support vector machines and smoothing splines. Technical report, Institute for ANC, Edinburgh, UK http://www.dai.ed.ac.uk/~seeger/papers.html.
[64] Williams, C.K.I., Seeger, M. (2001) Using the Nyström method to speed up kernel machines. In: Proc. of Advances in Neural Information Processing Systems (NIPS), Cambridge, MA, MIT Press pp 682–688
[65] Shi, J., Malik, J. (2000) Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Machine Intell. 22(8) pp 888–905
[66] Press, W., Teukolsky, S., Vetterling, W., Flannery, B. (1992) Numerical Recipes in C. 2nd edn. Cambridge University Press, Cambridge, UK
[67] Dou, W., Ruan, S., Chen, Y., Bloyet, D., Constans, J.M. (2007) A framework of fuzzy information fusion for the segmentation of brain tumor tissues on MR images. Image and Vision Computing 25(2) pp 164–171
[68] Dou, W., Ren, Y., Wu, Q., Ruan, S., Chen, Y., Bloyet, D., Constans, J.M. (2007) Fuzzy kappa for the agreement measure of fuzzy classifications. Neurocomputing 70(4-6) pp 726–734
[69] Tao, D., Li, X., Wu, X., Maybank, S.J. (2007) General tensor discriminant analysis and Gabor features for gait recognition. IEEE Trans. Pattern Anal. Machine Intell. 29(10) pp 1700–1715
[70] Tao, D., Li, X., Hu, W., Maybank, S.J., Wu, X. (2007) Supervised tensor learning. Knowledge and Information Systems
[71] Lawrence, N.D., Jordan, M.I. (2005) Semi-supervised learning via Gaussian processes. In: Proc. of Advances in Neural Information Processing Systems (NIPS 17), Cambridge, MA, MIT Press pp 753–760