IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 16, NO. 3, MAY 2012
Abnormality Segmentation in Brain Images Via Distributed Estimation
Evangelia I. Zacharaki and Anastasios Bezerianos
Abstract—The aim of this paper is to introduce a novel semisupervised scheme for abnormality detection and segmentation in medical images. Semisupervised learning does not require pathology modeling and, thus, allows a high degree of automation. In abnormality detection, a vector is characterized as anomalous if it does not comply with the probability distribution obtained from normal data. The estimation of the probability density function, however, is usually not feasible due to large data dimensionality. In order to overcome this challenge, we treat every image as a network of locally coherent image partitions (overlapping blocks). We formulate and maximize a strictly concave likelihood function estimating abnormality for each partition and fuse the local estimates into a globally optimal estimate that satisfies the consistency constraints, based on a distributed estimation algorithm. The likelihood function consists of a model term and a data term and is formulated as a quadratic programming problem. The method is applied for automatically segmenting brain pathologies, such as simulated brain infarction and dysplasia, as well as real lesions in diabetes patients. The assessment of the method using receiver operating characteristic analysis demonstrates improvement in image segmentation over two-group analysis performed with Statistical Parametric Mapping (SPM).
Index Terms—Abnormality detection, brain pathology, distributed estimation, image segmentation, statistical modeling.
Manuscript received April 18, 2011; revised September 19, 2011; accepted November 17, 2011. Date of publication December 7, 2011; date of current version May 4, 2012. This work was supported by the 7th European Community Framework Programme under Grant Marie Curie International Reintegration. The authors are with the School of Medicine, University of Patras, Rio 26504, Achaia, Greece (e-mail: [email protected]; [email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITB.2011.2178422
I. INTRODUCTION
The recent information explosion, together with technological advances that allow access to electronic content, provides an opportunity to collect and process large amounts of data, such as medical images. The collection of data of the same kind allows the construction and exploitation of application-specific statistical models that optimally summarize knowledge. Such models can be incorporated into frameworks for anomaly or outlier detection and model-based segmentation. Most methods developed for segmenting anomalies use labeled samples to characterize the abnormal objects. However, since abnormalities are usually rare, or there may even be no data that describe specific pathologic conditions, supervised classification methods might be unsuccessful due to highly unbalanced data. Also, outlining and labeling representatives of all types of anomalies in an accurate way requires substantial human effort and is often prohibitively expensive. Semisupervised anomaly detection offers a solution to this problem by modeling normal data and then using a distance measure and thresholding to determine abnormality [1]. A recent review in the area of anomaly detection can be found in [2]. The best description of normality, in the statistical sense, is the unconditional probability density function (pdf) for the normal data [3]. If subsequently a test vector belongs to a region of input space for which the probability density is below a predetermined threshold, then that vector is considered to be novel or abnormal.
The present approach goes beyond the standard anomaly detection techniques in that it not only characterizes the data vector as normal or abnormal, but also locates which part of the vector includes the anomaly. It is used in a scenario where the data vary for the most part according to an expected or predictable distribution that can be statistically modeled from a set of normal data, and also vary in small abnormal areas that cannot be explained by the same statistical model. Such abnormalities might be due to structural or morphological differences beyond the expected morphological variability and might indicate damage, disease, or any kind of artifact. The method is applied for automatically segmenting pathologies in brain images, such as lesions, brain infarction, or dysplasia.
Fig. 1. Axial MR slices of normal brain illustrating the remaining variability after spatial normalization.
Examples of training data (coregistered normal brain images) used to build a statistical model are shown in Fig. 1, whereas a test brain with white matter lesions is shown in Fig. 5. From every spatially normalized image, a feature vector $x$, such as the voxel-wise intensities, is first extracted. A new feature vector $\hat{x}$ is then synthesized, aiming to be similar to $x$ but with the anomalies removed. A spatial abnormality score map is subsequently created by voxel-wise subtraction, $|x - \hat{x}|$. Thresholding this score map gives the segmentation of image anomalies.
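The last two steps (score map and thresholding) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the toy intensities, and the threshold value are assumptions chosen for the example, and the reconstruction $\hat{x}$ is simply given rather than computed.

```python
def abnormality_score_map(x, x_hat):
    """Voxel-wise absolute difference |x - x_hat| between an image and its reconstruction."""
    return [abs(a - b) for a, b in zip(x, x_hat)]


def segment(score_map, threshold):
    """Binary mask: 1 where the abnormality score exceeds the threshold, 0 elsewhere."""
    return [1 if s > threshold else 0 for s in score_map]


# Toy 1-D "image" with intensities in [0, 1]; the third voxel carries a bright anomaly.
x = [0.20, 0.25, 0.90, 0.30]
# Hypothetical reconstruction in which the anomaly has been removed.
x_hat = [0.21, 0.24, 0.28, 0.31]

scores = abnormality_score_map(x, x_hat)
mask = segment(scores, threshold=0.5)   # threshold value is an arbitrary choice here
```

Only the voxel whose intensity cannot be explained by the reconstruction receives a high score, so the mask isolates it.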
In contrast to most brain lesion segmentation methods based on outlier detection [4]–[6], the proposed method is generic. It does not consider single voxels independently and makes no assumption about shape or intensity profile of the abnormality. Hence, as a stand-alone method, it is expected to have low statistical sensitivity but it can be used as a preprocessing step
1089-7771/$31.00 © 2012 IEEE
for locating all possible candidate regions before applying a specialized segmentation algorithm. A major issue when synthesizing valid images $\hat{x}$ is that the dimensionality (number of voxels) is significantly larger than the number of training data usually available (number of subjects). Most anomaly detection methods in image processing extract a few discriminatory features from regions of interest, assuming data stationarity, in order to classify anomaly [3], [7], [8] rather than segment anomaly. In such cases, dimensionality is not the main challenge. In contrast, the current approach extracts voxel-wise features, so dimensionality increases tremendously. Specifically, we concatenate all voxel-wise intensities into a long feature vector with the aim of detecting any kind of structural or shape abnormalities (caused by tissue type change or tissue deformation) in nonstationary images. This is a main difference from methods that model intensity-based or deformation-based distributions (see [4] and [5] for review). In order to deal with the large dimensionality, we partition the images into subspaces, i.e., locally coherent overlapping blocks. It is assumed that for each location the blocks are generated from a Gaussian distribution and located in a tight cluster. These subspaces are then modeled by a linear method, such as principal component analysis (PCA) [9]. Although this assumption is oversimplistic for high-dimensional and complex datasets, it seems reasonable for local image partitions. High-dimensional data exhibit distributions that are highly sparse and can be represented by lower dimensional manifolds. As suggested also in other studies, such generally nonlinear manifolds can be approximated by locally linear subspaces [10]. Similarly, some anomaly detection methods, referred to as spectral techniques in [2], also determine subspaces (embeddings, projections, etc.). 
These methods are based on the assumption that normal and abnormal instances can be separated more easily in a lower dimensional embedding space [11], whereas other methods project into a subspace constructed by normal samples (referred to as the normal subspace) and measure the degree of anomaly based on the residual component, i.e., the part that is not explained by the model. Our method differs from these methods in that 1) the embedding space is not constructed from the whole data, but from image partitions, and 2) it seeks the separation within both the normal subspace and the residual subspace, i.e., it retains all dimensions of the original space (for each block).
This study makes two fundamental contributions in discovering abnormality. First, an objective function is defined that evaluates the probability of the test data according to a statistical model of normal data in a lower dimensional space, and also exploits similarity with the model representation as well as similarity with the original data. The objective function minimization is formulated as a quadratic optimization problem. Second, the curse of dimensionality is tackled by proposing a scheme where an image is partitioned into a set of overlapping blocks at various locations, similarly to [12]. The objective function is optimized for each local subspace and then the local subspace estimates are fused into a globally optimal estimate that satisfies coupling constraints. Data fusion is performed by applying a distributed estimation algorithm based on dual decomposition [13], which was developed for solving large-scale problems. The proposed approach is comprehensively evaluated using receiver operating characteristic (ROC) analysis.
The paper is organized as follows. Section II is devoted to methodology. In Section III, data and experiments are described, including brain image simulations with three types of pathology, as well as real magnetic resonance (MR) images (FLAIR sequence) of patients with diabetes. The results are presented in Section IV, followed by some concluding remarks in Section V.
II. METHODS
The methodology for abnormality segmentation uses 1) a set of pathology-free images in order to calculate an objective function measuring similarity to a healthy brain and 2) a test image (with abnormalities) for which the objective function is optimized. All images are coregistered and the mean image is calculated and subtracted from them. The solution is based on partitioning the spatial domain into overlapping, equally sized blocks in random locations. The algorithmic steps are the following. First, the test image is scanned and a random block is selected (among the locations not already scanned). By concatenating the image intensities in the block, the test vector $x_0 \in \mathbb{R}^d$ is constructed, where $d$ is the number of dimensions (e.g., number of voxels in the block). The same block is then extracted from all pathology-free images, forming the $n \times d$ matrix of training vectors $V$, where $n$ is the number of subjects. The training set $V$ is used to calculate an objective function $l(x)$, the optimization of which gives a new vector $\hat{x} \in \mathbb{R}^d$ that is "less abnormal" and also as similar as possible to the original vector $x_0$. However, since the blocks are overlapping, the solutions cannot be independently calculated for each block. Instead, an iterative algorithm proposed in [13] and shown in Fig. 3 is used.
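The block extraction step can be sketched as below. This is an illustrative sketch under simplifying assumptions: images are small 2-D lists of already mean-subtracted intensities, the helper names are hypothetical, and the bookkeeping of which locations were already scanned is omitted.

```python
import random


def extract_block(image, top, left, size):
    """Concatenate the intensities of a size x size block into one flat vector (row-major)."""
    return [image[r][c] for r in range(top, top + size) for c in range(left, left + size)]


def random_block_origin(height, width, size, rng):
    """Pick the top-left corner of a block that fits entirely inside the image."""
    return rng.randrange(height - size + 1), rng.randrange(width - size + 1)


def build_block_data(test_image, training_images, top, left, size):
    """Return the test vector x0 (length d) and the training matrix V (n rows, d columns)."""
    x0 = extract_block(test_image, top, left, size)
    V = [extract_block(img, top, left, size) for img in training_images]
    return x0, V


# Toy data: one 4 x 4 test image and two 4 x 4 training images.
test_image = [[r * 4 + c for c in range(4)] for r in range(4)]
training_images = [[[0.0] * 4 for _ in range(4)], [[1.0] * 4 for _ in range(4)]]

rng = random.Random(0)
top, left = random_block_origin(4, 4, 2, rng)
x0, V = build_block_data(test_image, training_images, top, left, 2)
```

The same `(top, left)` location is used for the test image and for every training image, which is what makes the rows of `V` comparable to `x0` after coregistration.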
After merging the solutions of all blocks, a spatial abnormality score map is calculated for the whole image by subtracting the reconstructed image from the original one. Section II-A describes the definition of the local objective function, which consists of three terms ($E_1$, $E_2$, $E_3$), and Section II-B presents its formulation as a quadratic optimization problem. Sections II-C and II-D briefly illustrate the solution of the maximum likelihood estimation problem with overlapping blocks.
A. Formulation of the Objective Function
Since anomalies are defined as points with low probability density, a natural way to estimate $\hat{x}$ is to maximize the pdf obtained for the normal data. However, if the vector is high dimensional, the estimation of the pdf is not feasible. Therefore, we maximize the pdf in a lower dimensional space, $p(u)$, where $u$ is the representation of $x$ in a basis $W$:
$$u = W^T x. \tag{1}$$
Here, $x$ is a column vector assumed to be centered at the origin and $T$ denotes the matrix transpose. If the Karhunen–Loève (KL) transform (or PCA) is applied, the basis $W$ is formed by the $(d \times d)$ matrix of the eigenvectors of the covariance matrix $C$ of
the training set $V$, i.e., $C = \frac{1}{n-1} V^T V$. The KL transform can be inverted as follows: $x = Wu$. Assuming that $x$ follows a multivariate Gaussian distribution, the density of $u$ is the multivariate Gaussian density:
$$p(u) = \frac{1}{(2\pi)^{k/2} |D|^{1/2}} \, e^{-\frac{1}{2} u^T D^{-1} u} \tag{2}$$
where $k$ is the dimension of $u$ and $D = W^T C W = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_d)$ is a $(d \times d)$ diagonal matrix of eigenvalues, assumed to be sorted in descending order. Typically, the number of samples is significantly smaller than the dimensionality, in which case the eigenvalues $\lambda_t$ with $t \ge n$ are zero and the corresponding eigenvectors in $W$ are ignored. If all other eigenvectors are retained, $u \in \mathbb{R}^{n-1}$. According to (2), $p(u)$ is maximized when $\frac{1}{2} u^T D^{-1} u$ is minimized. Based on (1), maximization of the density with respect to $x$ is then equivalent to minimizing the following term:
$$E_1(x) = \frac{1}{2} x^T (W D^{-1} W^T) x. \tag{3}$$
Since $u$ is lower dimensional than $x$, there exist an infinite number of data points $x \in \mathbb{R}^d$ with the same cost function value in (3). In order to reduce the solution space, we use an additional term that constrains the solution to remain close to the subspace spanned by $W$. If the test vector is $x_0$, then its projection on $W$ is
$$x_{0W} = W u = W W^T x_0.$$
The second energy term expresses the distance to the projected point $x_{0W}$:
$$E_2(x; x_0) = \|x - x_{0W}\|^2 = (x - W W^T x_0)^T (x - W W^T x_0) \tag{4}$$
where $\|\cdot\|$ denotes the $L_2$-norm. If $x = x_0$, this term expresses the reconstruction error or residual [14]. Since $x_0$ does not necessarily lie within the subspace spanned by $W$, this term is larger than zero in this setting. This happens mainly because the abnormal vector $x_0$ is inconsistent with the normal data building the basis $W$. Generally, by minimizing $E_2$, we infer that $x$ becomes sufficiently linearly dependent on the current dictionary (normal data) and represents normal behavior. The first two terms statistically model normality and are used to make the image look as if the abnormality were removed. The final term is used to constrain the reconstructed image to be as similar as possible to the original image $x_0$, based on the assumption that the majority of the voxels in the test image are normal. If all voxels are equally likely to be abnormal, then the distance from $x_0$ can be used as dissimilarity criterion:
$$E_3(x; x_0) = \|x - x_0\|^2 = \sum_{j=1}^{d} \left(x(j) - x_0(j)\right)^2$$
where $j$ indexes the voxels in the image. If prior knowledge exists on spatial locations of possible abnormality, then weights can be incorporated to penalize the dissimilarity in those locations less. Since this method is unsupervised for the abnormal class and aims to generalize to any kind of abnormality, we do not incorporate a prior for the abnormal areas. However, we focus on the normal class and introduce a confidence measure on the estimation ability of the calculated statistical model. Regions with large variability are much more difficult to model than uniform areas. A confidence map or vector shows the degree of certainty we have in the reconstruction of each parameter $x(j)$. Parameters with high uncertainty in estimation should not deviate significantly from their original values $x_0(j)$. This is achieved by penalizing any change in those parameters more than in other parameters. By incorporating an uncertainty vector $a \in \mathbb{R}^d$, the third term becomes
$$E_3(x; x_0) = (x - x_0)^T A (x - x_0) \tag{5}$$
where $A$ is a $(d \times d)$ diagonal matrix with normalized elements $a(j) / \sum_{j=1}^{d} a(j)$ on the main diagonal. The uncertainty vector $a$ is calculated as the average reconstruction error at each location over all training images, obtained by leave-one-out cross validation:
$$a = \frac{1}{n} \sum_{t=1}^{n} x_t^T (I - W_t W_t^T)^2 x_t$$
where $W_t$ is the basis formed without using training image $t$. The previous three terms are combined into a single objective function $l(x)$ by using different weights, as follows:
$$\hat{x} = \arg\min_{x} l(x), \quad \text{where} \quad l(x) = w_1 E_1(x) + w_2 E_2(x; x_0) + w_3 E_3(x; x_0) \tag{6}$$
and $0 \le w_1, w_2, w_3 \le 1$ and $w_1 + w_2 + w_3 = 1$. According to the values of the weights, we balance between the model term (including $E_1$ and $E_2$), controlling the similarity with the training set consisting of normal data, and the data term ($E_3$), controlling the similarity with the original vector. The weights depend on the confidence we have in the statistical model, as well as on the dominance of novelty or anomaly over the data. The larger the anomaly, the smaller the contribution of the data term should be. The model term, on the other hand, should always contribute significantly to the solution since it guides toward normality. The weights can be empirically determined by maximizing segmentation accuracy through cross validation on labeled data. Once the optimization problem is solved, the final reconstructed image is created by recentering to the original space, i.e., by adding the mean image to the result.
B. Optimization of the Objective Function
The objective function can be written in the form of a quadratic programming problem without any general linear (equality or inequality) constraints, subject only to bounds on $x$:
$$\hat{x} = \arg\min_{x} l(x) = \arg\min_{x} \frac{1}{2} x^T H x + f^T x \quad \text{subject to} \quad b_l \le x \le b_u \tag{7}$$
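where $b_l$, $b_u$ are lower and upper bounds on $x$, $H$ is a $(d \times d)$ positive semidefinite symmetric matrix, and $f$ is a $d$-element column vector. Expanding the three terms of (6), collecting the quadratic and linear parts in $x$, dropping the terms that do not depend on $x$, and substituting $w_3 = 1 - w_1 - w_2$ gives
$$H = 2(1 - w_1 - w_2) A + 2 w_2 I + w_1 W_r D_r^{-1} W_r^T \tag{8}$$
$$f = -2\left[(1 - w_1 - w_2) A + w_2 W_r W_r^T\right] x_0 \tag{9}$$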
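As a numerical sanity check on (8) and (9), the sketch below builds $H$ and $f$ for a tiny block and verifies that $\frac{1}{2} x^T H x + f^T x$ differs from $l(x)$ in (6) only by a constant. It is an illustration under assumptions, not the authors' code: the helper names, the two-voxel toy block, the single retained eigenvector, and the weight values are all made up for the example, and dense Python lists are only practical for very small $d$.

```python
def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def build_qp(W_r, lam, a, x0, w1, w2):
    """H and f of (8) and (9). W_r is d x r (list of rows); lam holds the r retained eigenvalues."""
    d, r = len(x0), len(lam)
    w3 = 1.0 - w1 - w2
    total = sum(a)
    A = [aj / total for aj in a]  # normalised diagonal of A
    # P = W_r W_r^T (projection), Q = W_r D_r^{-1} W_r^T
    P = [[sum(W_r[i][t] * W_r[j][t] for t in range(r)) for j in range(d)] for i in range(d)]
    Q = [[sum(W_r[i][t] * W_r[j][t] / lam[t] for t in range(r)) for j in range(d)] for i in range(d)]
    H = [[w1 * Q[i][j] + (2 * w2 + 2 * w3 * A[i] if i == j else 0.0) for j in range(d)]
         for i in range(d)]
    Px0 = matvec(P, x0)
    f = [-2.0 * (w3 * A[i] * x0[i] + w2 * Px0[i]) for i in range(d)]
    return H, f, A, P, Q


def objective(x, x0, A, P, Q, w1, w2):
    """l(x) of (6), evaluated term by term from (3), (4), and (5)."""
    w3 = 1.0 - w1 - w2
    e1 = 0.5 * dot(x, matvec(Q, x))
    proj = matvec(P, x0)
    e2 = sum((xi - pi) ** 2 for xi, pi in zip(x, proj))
    e3 = sum(Ai * (xi - x0i) ** 2 for Ai, xi, x0i in zip(A, x, x0))
    return w1 * e1 + w2 * e2 + w3 * e3


def quadratic(x, H, f):
    return 0.5 * dot(x, matvec(H, x)) + dot(f, x)


# Toy block with d = 2 voxels and one retained eigenvector (all values are arbitrary).
W_r, lam = [[1.0], [0.0]], [2.0]
a, x0, w1, w2 = [1.0, 3.0], [0.4, -0.2], 0.2, 0.3
H, f, A, P, Q = build_qp(W_r, lam, a, x0, w1, w2)
gaps = [objective(x, x0, A, P, Q, w1, w2) - quadratic(x, H, f)
        for x in ([0.0, 0.0], [0.7, -1.3])]
```

The two entries of `gaps` agree, which confirms that the quadratic form and $l(x)$ share the same minimizer; the constant offset is $w_2 \|W_r W_r^T x_0\|^2 + w_3 x_0^T A x_0$.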
Fig. 2. Illustration of private and public variables. The public variables for block 1 are all in constraint C1, whereas the public variables for block 2 are in constraint C1 or in constraint C2.
where $I$ is the $(d \times d)$ identity matrix, $D_r^{-1} = \mathrm{diag}(1/\lambda_1, 1/\lambda_2, \ldots)$ is the inverse diagonal matrix of the largest eigenvalues retained, and $W_r$ is the matrix of the corresponding retained eigenvectors. In this study, we retained 95% of the variation. Optimization is performed by solving the quadratic programming problem with $x_0$ as the initial estimate.
C. Distributed Estimation
The maximum likelihood estimation problem in a distributed setting is solved using dual decomposition based on the algorithm presented in [13], which is described here briefly for completeness. Let us assume that $k$ blocks (partitions) are extracted from an image and that the $k$ blocks are coupled through $n_c$ consistency constraints that require the image intensities in overlapping voxels to be equal. The variables that are constrained to be equal across different blocks are denoted as public variables. The variables that are local to each block and are not shared with other blocks are denoted as private variables. An example is shown in Fig. 2. Let us assume that $s_i \in \mathbb{R}^{q_i}$ and $y_i \in \mathbb{R}^{p_i}$ are the unknown private and public variables (image intensities) of block $i$, respectively. If we concatenate $s_i$ and $y_i$, we get the vector
$$x_i = \begin{bmatrix} s_i \\ y_i \end{bmatrix}$$
indicating all variables (private and public) in block $i$. For each block, a local (strictly) concave log-likelihood function, $l_i(x_i)$ or $l_i(s_i, y_i)$, is calculated by (7). The public variables for all blocks are collected together into one vector variable $y = (y_1, \ldots, y_k) \in \mathbb{R}^p$, where $p = p_1 + \cdots + p_k$ is the total number of public variables. A vector $z \in \mathbb{R}^{n_c}$ is introduced to give the common values of the public variables in each consistency constraint. The constraints are expressed as $y = Ez$, where $E \in \mathbb{R}^{p \times n_c}$ specifies the set of coupling constraints for the given block interaction:
$$E_{ij} = \begin{cases} 1 & \text{if } (y)_i \text{ is in constraint } j \\ 0 & \text{otherwise.} \end{cases}$$
Lagrange multipliers $v \in \mathbb{R}^p$ are introduced for the coupling constraints and a projected subgradient method is used to solve the dual master problem. Using these dual variables, optimization is independently performed in each block, and later on, the net variables (the entries of $z$) are updated using the optimal values of the public variables of the blocks adjacent to that net. The dual variables
Fig. 3. Algorithm solving the maximum likelihood estimation problem in a distributed setting using a subgradient method.
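The iteration described in this subsection (independent block solves, averaging of the public variables on each net, dual update) can be illustrated with a toy problem. The sketch below is not the algorithm of Fig. 3 verbatim: it assumes a single consistency constraint shared by all blocks, one scalar public variable per block, hypothetical quadratic block costs $\frac{1}{2} a_i (y_i - c_i)^2$ in place of the block objectives, and a fixed step size.

```python
def solve_block(a, c, v):
    """Minimise 0.5 * a * (y - c)**2 + v * y over the scalar public variable y (closed form)."""
    return c - v / a


def dual_decomposition(blocks, step=0.5, iterations=200):
    """blocks: list of (a, c) pairs, all coupled by one constraint y_1 = ... = y_k = z."""
    v = [0.0] * len(blocks)  # one dual variable per public variable
    y, z = [], 0.0
    for _ in range(iterations):
        # 1) each block is optimised independently, given its current dual variable
        y = [solve_block(a, c, vi) for (a, c), vi in zip(blocks, v)]
        # 2) the net variable is the average of the local copies on that net
        z = sum(y) / len(y)
        # 3) dual update, pushing the local copies toward consistency
        v = [vi + step * (yi - z) for vi, yi in zip(v, y)]
    residual = max(abs(yi - z) for yi in y)  # consistency constraint residual
    return z, y, residual


# Two blocks that disagree about the shared voxel: block 1 prefers 1.0, block 2 prefers 3.0.
z, y, residual = dual_decomposition([(1.0, 1.0), (3.0, 3.0)])
```

For this toy problem the consensus value is the curvature-weighted mean $(1 \cdot 1 + 3 \cdot 3)/(1 + 3) = 2.5$, and the residual shrinks geometrically with the chosen step size. In the method itself each block solve is a quadratic program rather than a closed-form scalar update.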
are then updated in a way that brings the local copies of public variables into consistency. The algorithm is summarized in Fig. 3. A measure of the inconsistency of the current values of the public variables (consistency constraint residual) is given by the norm of the vector computed in the last step, $\|E\hat{z} - y^*\|$. For more details, we refer to [13].
D. Implementation
The optimization for each block in (10) can be written as a quadratic programming problem with respect to $x_i$, where the log-likelihood function is given by the negative objective function of (6). In order to extract $y_i$ from $x_i$, we define the matrix
$$M = \begin{bmatrix} O_{q_i \times p_i} \\ I_{p_i \times p_i} \end{bmatrix}$$
where $O$ is composed of zeros and $I$ is the identity matrix, such that $y_i = M^T x_i$. Then, the block problem becomes
$$(s_i^*, y_i^*) = \arg\min_{s_i, y_i} \left(-l_i(s_i, y_i) + v_i^T y_i\right)$$
$$\Rightarrow \; x_i^* = \arg\min_{x_i} \left(-l_i(x_i) + v_i^T M^T x_i\right)$$
$$\Rightarrow \; x_i^* = \arg\min_{x_i} \frac{1}{2} x_i^T H_i x_i + f_i^T x_i \tag{11}$$
where $H_i = H$ is calculated from (8) for block $i$, and $f_i = f + M v_i$ with $f$ given by (9) for block $i$.
III. EXPERIMENTS
The method is assessed on simulated and real MR brain images with pathology. The simulated data include three types of pathology: 1) white matter and periventricular lesions, 2) infarcts, and 3) dysplasia, whereas the real data include mostly pathology of the first type. White matter lesion patterns are very heterogeneous, ranging from punctate lesions in the deep white matter to large confluent periventricular lesions. Brain lesions might be associated with several diseases, such as multiple sclerosis, traumatic brain injury, or some cardiovascular disease. Although the lesions present different characteristics depending on the cause, we aimed to simulate the main common patterns and create ground truth for assessing the method. Many methods in the literature have been developed for the detection of brain lesions using MRI. They usually apply supervised voxel-wise classification in which the desired segmentation is
known (expert manual delineation) and used as a training set to build the segmentation model [15]. Those methods can benefit from the proposed method by combining the abnormality score map calculated by this method with their own segmentation results in order to reduce false positives.
In the second type of pathology, brain infarction, an area of necrosis is formed in the brain, usually occurring when the blood supply to that area is interrupted. Since major infarcts may be easily discovered on CT or MRI scans, we concentrated on small infarcts that may go undetected clinically. The presence of necrosis is a main clue for differentiating white matter lesions from infarcts. Lesion segmentation methods based on intensity characteristics [15] will fail in the segmentation of the abnormal necrotic area, since its intensity profile is the same as that of the cerebrospinal fluid. Moreover, infarcts close to the cortical gray matter are difficult to segment since intensities are similar and boundaries are not clear. Thus, methods providing prior knowledge of cortical shape variation in normal brains, such as the proposed one, are also valuable in this type of pathology.
Similarly, the third type of pathology refers to brain malformation, such as cortical dysplasia. Focal cortical dysplasia (FCD) manifests as thickened cortex and is commonly associated with many forms of epilepsy. Detection of FCD lesions is fundamental for treatment planning and is challenging since the lesions might be subtle and also affect unique portions of the brain in each individual [16]. The creation of the simulated data and the preprocessing of the real data are described in more detail in the next paragraphs.
A. Simulated Data
A set of images (axial slices) with dimension 128 × 128 voxels is created and used for constructing the statistical model of normal brain variability. 
The slices consist of a simplified model of white matter and ventricles, whereas gray matter is not included in the model. Variability in the dataset is introduced by randomly selecting the shape of head and ventricles. The shape of the head is parameterized by the location and size of an ellipse, while the shape of the ventricles is parameterized by the curvature, thickness, distance between left and right ventricle, and asymmetry in the anterior–posterior direction. All these parameters are selected randomly, assuming that they follow a Gaussian distribution. Contrast is added to the two shapes as a sum of a uniform gray level and white Gaussian noise. Subsequently, pathology is simulated on a small set of those images and used for assessing the method. For the first and second type of pathology, the lesions are elliptic in shape with a hyperintense Gaussian profile. Some are located in the white matter while others surround the ventricles (periventricular lesions). The infarcts also include a necrotic core. Dysplasia, on the other hand, is simulated as malformation of the ventricles instead of the cortex, because the simulated data do not include a model of the cortex. The aim is to detect the normal boundary in order to segment the part beyond the normal boundary. It is expected that the method will perform similarly in the detection of a thickened cortex, although we are aware that the cortex is
Fig. 4. Six examples of 2-D simulated brain images with added pathology (in first and second row) and removal of pathology with the proposed method (in third row). Three types of pathology are simulated, left: white matter and periventricular lesions, middle: infarcts, and right: dysplasia. The third row shows the images of the second row as reconstructed (without pathology) by the proposed method. The abnormality score map for the same images is illustrated at the bottom.
a much more complicated structure than the ventricles. Some examples of simulated brain images with pathology are shown in Fig. 4 (in first and second row). The large variability in shape and location of head and ventricles is evident.
B. Real Data
The real data consist of axial FLAIR scans obtained from elderly individuals with diabetes [17]. They are acquired with a 3-mm slice thickness, no slice gap, a 240 × 240 mm FOV, and a 256 × 256 scan matrix. The data preprocessing steps include image smoothing, skull-stripping to extract the brain region [18], inhomogeneity correction [19], intensity normalization based on histogram matching, and deformable registration to a common space (template image). Intensity normalization is an important preprocessing step that standardizes the histograms of all training and test images. To this end, a linear transformation (translation and scaling) is calculated that minimizes the $L_2$-norm of the histogram difference between the transformed image and the template image. The histograms are first smoothed and the bin representing the background is excluded from the least-squares error minimization. Spatial normalization is performed by a deformable registration algorithm designed for brain images
Fig. 5. Segmentation of white matter lesions on a diabetes patient. Top row from left to right: MR image, expert-defined lesion mask (in red) overlaid on the MR image, and reconstructed image without abnormalities by the proposed method. Bottom row: calculated abnormality score map in color scale, segmentation mask (in dark red) after thresholding the abnormality score map (overlaid on the MR image).
[20]. Assessment is performed using a lesion mask, manually delineated by a trained expert for each subject. An example of a test image with the corresponding lesion mask is shown in Fig. 5 (first row).
IV. RESULTS
The data are linearly scaled to the interval [0, 1] and centered by subtracting the mean image calculated from the training data. After optimization, the final reconstructed image is recentered to the original space. In the case of simulated data, the method is assessed using a set of 50 brain images for extracting the basis $W$ (training set) and another set of 15 images with added pathology for testing the algorithm. In the case of real MRI data, the training set consists of 73 healthy elderly subjects and the test set consists of 33 subjects. The method is assessed on a single axial scan that includes pathology, similarly to the simulated data.
The calculated spatial abnormality score map, $|x_0 - \hat{x}|$, is compared against the region of simulated pathology for the simulated data and the expert-defined lesion mask for the real data. By thresholding the abnormality score map, a binary segmentation can be obtained. The results are assessed using ROC performance curves, created by changing the threshold level from 1 to 0. The area under the curve (AUC) gives an overall measure of sensitivity versus specificity and is used as an independent evaluation criterion. The obtained AUC values are compared against voxel-wise statistics, commonly applied for brain lesion identification. For example, in [21], the images of controls and patients are normalized to the same template, spatially smoothed, and then compared statistically voxel-by-voxel to identify abnormal regions that differ between patients and the normal range established by the controls. Accordingly, in order to quantify deviation from the normal range, we use the voxel-wise standard score, $z\text{-score} = |x_0 - \mu| / \sigma$, where $\mu$ and $\sigma$ are the mean and standard deviation of the normal population for each voxel. Besides the z-score, which actually expresses the t-statistic, the proposed method is not compared against other methods developed for brain lesion segmentation, since most of them either require manual interaction, such as sample labeling [15], [22] or seed selection [23], apply a priori knowledge [24], [25], or target specific tissue (and not structural) abnormalities [26]. Such methods are expected to have higher specificity since they are based on pathology modeling, in contrast to our method, which is solely based on modeling normal data. The aim of this study was mainly to avoid the tedious training phase and explore the potential of an automated tool for pathology segmentation. Our method can either be used as a stand-alone tool that highlights potential foci of pathology, thus being useful for expediting the screening process, or it can be applied in combination with a supervised scheme. In such scenarios we should assess the potential gain, regarding time to diagnosis or increase in segmentation accuracy, respectively.
The method is developed in MATLAB with some routines written in C. The processing time on a Windows machine (Intel Core2 Duo CPU, 2 GHz) varied from a few minutes to several hours, depending mostly on block dimensionality.
A. Simulated Data
We examined the sensitivity of abnormality segmentation to the choice of parameters by averaging the AUC values over all 15 simulated test images. Initially, we did not incorporate any uncertainty vector, i.e., we set $A = I$, and experimented with varying block sizes, such that the ratio of image dimensionality to block dimensionality was equal to 2, 4, 6, or 8, respectively. 
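The block sizes implied by these ratios can be worked out with a few lines of arithmetic. The sketch below reads the ratio as a ratio of side lengths and rounds down; both choices are assumptions, although they agree with the 21 × 21 blocks reported in this section for a ratio of 6 on the 128 × 128 simulated images.

```python
def block_side(image_side, ratio):
    """Side length of a square block when the image side is `ratio` times the block side."""
    return image_side // ratio  # rounding down is an assumption


# Simulated images are 128 x 128 voxels.
sides = {ratio: block_side(128, ratio) for ratio in (2, 4, 6, 8)}
dims = {ratio: side * side for ratio, side in sides.items()}  # voxels per block (d)
```

Each halving of the block side cuts the per-block dimensionality $d$ by roughly a factor of four, which is what makes the local statistical models estimable from a few dozen training images.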
A larger number of blocks ($k$) was selected for smaller block sizes in order to retain approximately the same number of public variables (overlapping voxels). Using different sets of weighting values in (6), we determined the block dimensionality having the highest average AUC value to be 21 × 21 voxels (image dimensionality/block dimensionality = 6). The corresponding average AUC values varied between 0.981 and 0.993, showing that for all tested sets of weighting values the proposed method performed significantly better than the segmentation based on the z-score (for which AUC = 0.933). Subsequently, an uncertainty map was introduced, calculated by leave-one-out cross validation on the normal population. The uncertainty map (with values smaller than 1) suppresses the contribution of the data term, and therefore, the term $E_3$ was weighted more heavily in this case. For the simulated data, the highest average AUC value (0.993) was obtained when removing the contribution of the first term ($w_1 = 0$) and increasing $w_3$ up to 0.95. This possibly indicates that the data do not follow a Gaussian distribution or are too sparsely distributed (in this high-dimensional space), such that the estimated model likelihood is poor. In contrast, the second model term, representing the distance to the data modeled by PCA, proves to be significant. The reduction of the corresponding weight $w_2$ from
0.05 to 0.001 causes a decrease in the average AUC value (to 0.987), showing that the contribution of this term is important. Some examples of image reconstruction by the presented method are visualized in Fig. 4. The images in the second row are reconstructed without pathology, as shown in the third row.

Finally, the performance of the method was compared to a global scheme that optimizes the objective function on the whole image without implementing distributed estimation. The results demonstrate that, for the previously determined optimal weights (w2 = 0.05, w3 = 0.95), the AUC value drops by approximately 0.2% under the global scheme for the datasets examined. Thus, for the current simulated data, the proposed method has higher segmentation accuracy than a global scheme. Generally, the optimal partitioning cannot be determined beforehand, since it depends on how well the model can be estimated from the data (data variability versus number of available training samples). However, the availability of a method that solves the optimization in a distributed setting is generally important, since a global scheme is nothing other than a special case of image partitioning (i.e., for k = 1).

B. Real Data

The performance of the method in the case of real MR data was assessed on 33 subjects by selecting an axial scan (out of the 3-D image volume) that included a considerable amount of brain lesions. A single scan was selected to reduce the amount of data to process and to focus on scans that include pathology. The best performance was obtained for a block size of 42 × 42 voxels and weighting values w1 = 0.98, w2 = 0.01, w3 = 0.01. The AUC value was 0.986. Here, global structure seems to be represented quite satisfactorily by the model likelihood term E1, whereas local structure variability is not sufficiently well captured using this small number of available training samples. This is attributed mainly to the preprocessing of the MR images, which included registration by a deformable transformation.
The warped images, used to build the statistical model, differ mainly in their local structure because of residual intersubject variability. This is not the case for the simulated data, where structural differences are more global, as can be seen in Fig. 4. These experiments show that the results cannot be generalized and that the optimal weights are not the same for all kinds of data. Thus, we decided to weight all terms equally, using w1 = w2 = w3 = 1/3. We achieved AUC = 0.981 and 0.973 for the simulated and real data, respectively. The average (over all subjects) AUC value obtained by the voxel-wise standard score was significantly smaller (0.905), which indicates better performance of the current approach compared to univariate statistics. A segmentation example is illustrated in Fig. 5 and shows good lesion detection but poor performance in reconstructing the left lateral ventricle. A segmentation mask is created by thresholding the abnormality score map. This mask shows very good agreement with the expert-defined lesion mask.

C. Assessment With SPM

We used the SPM8 software [27] to perform a two-group analysis of the MR scan of each patient (Group 1) against the group of scans of normal controls (Group 2) for both simulated and real datasets. First, we extracted the statistical parametric map of t statistics under the null hypothesis of no difference between patient and control group, and calculated the corresponding AUC. The values are shown in Fig. 6 together with the AUC of the abnormality score map calculated by the proposed method and the z-score. The results show that the proposed method performs similarly to or better than the group analysis with SPM, for both simulated and real cases. Especially in the case of simulated dysplasia (data 11–15), the difference in the calculated AUC is significant.

Fig. 6. Assessment of the proposed method (with w1 = w2 = w3) on simulated (top) and real data (bottom).

V. DISCUSSION

This study combines ideas from anomaly detection and decomposition in order to segment anomalous regions in spatially normalized brain images. Abnormality detection without pathology modeling allows a high degree of automation, since labeling of training images is avoided. It also allows the simultaneous detection of pathologies that are heterogeneous by nature, without the need to obtain a specialized model characterizing each type of pathology. Segmentation is performed by incorporating a priori knowledge about characteristics of normal data based on semisupervised learning. Specifically, an objective function is defined that evaluates the probability of the test data according to a statistical model of normal data in a lower dimensional space, and also exploits similarity with the model representation as well as similarity with the original data. The objective function is optimized locally (in image blocks) and the local estimates are fused gradually into a globally optimal estimate that satisfies coupling constraints.

The method has been applied for segmenting white matter lesions in diabetes patients. Although the results were inferior to using an intensity model [15] on the same data, the
current method could also be successfully applied for segmenting pathological shape in simulated data. Unlike studies that aim to segment only pathological brain tissue based on outlier detection using intensity models [6], [28], this study is, to the best of our knowledge, the only one that also aims to segment anomalous shape and thus cannot rely on intensity models alone. Specifically, in [6] and [28], after taking into account the spatial content through the use of registered brain atlases, an intensity model is created in which images are treated as a “bag of pixels,” without regard to the considerable spatial structure. Therefore, the outliers are detected in a low-dimensional space, where the dimensions represent the applied modalities (e.g., T1-weighted, T2-weighted). Similarly, in some studies aiming to identify masses in mammograms [3] or single pixel anomalies in general [29] using a novelty detection framework, dimensionality is equal to a small number of statistical features extracted either from presegmented closed regions in the image [3] or from image partitions covering the whole image (with or without overlap) [29]. This is based on the assumption that images are locally stationary, i.e., that local appearance does not vary over the image. This is not true for brain images, which contain complex cortical folding patterns.

Methods that face the problem of high dimensionality are the ones that perform image synthesis [7], [30]. As in our approach, in [30] each image (mammogram) is warped to a canonical shape and the shape-normalized appearance is modeled using a multivariate Gaussian and PCA. In order to bypass the “curse of dimensionality,” the problem of modeling entire images is decomposed by combining an active appearance model with a spatially ergodic wavelet-based texture model. However, after the creation of synthetic mammograms, the model is not applied in a novelty detection scenario.
We, on the contrary, not only synthesize new normal brain images but also perform anomaly detection by maximizing model likelihood. In [7], dimensionality is reduced by developing a probability model that employs a pyramid representation to factor images across scale and a tree-structured set of hidden variables to capture long-range spatial dependences.

Moreover, this study differs from the well-known shape or appearance models [31] in that the region to be segmented is unknown in advance and is detected through the data residual obtained by modeling the normal anatomy. In such models [31], sparse representations can be achieved by representing only boundaries (in the case of shape models) or by sampling profiles normal to the boundary (in the case of appearance models). Here, a voxel-wise representation is sought, and thus the number of variables is equal to the original data dimensionality.

Methods that express data similarity by least-squares error minimization are often susceptible to outlying samples, such as voxels with anomaly. Since the anomalous image part is not known beforehand, in such cases all voxels are equally weighted during error minimization. For robustness to outliers, robust PCA or subspace learning has been introduced, which assigns different weights to the vector elements in order to suppress the effect of untrusted data [32], [33]. We explored this idea and
similarly introduced an uncertainty map in the current scheme (5) in order to increase robustness.

The construction of the statistical model requires the spatial registration of the data with a template image. For the real MR images, this is performed by applying a deformable registration method designed for normal brain images [20]. Registration is achieved by hierarchically finding correspondences on edge voxels with high (attribute-based) similarity. Voxel pairs with low similarity, such as the ones around pathology, are not included in the deformation process. Therefore, although the algorithm is designed for images without pathology, it has been observed to be quite robust when the images include small localized abnormalities. The presence of pathology is even less crucial when the pathology is located in homogeneous regions, such as in the case of deep white matter lesions. Alternatively, if required, the deformation could be restricted to the range of normal variation. The limits of normal variation can be determined from the deformations calculated during registration of the normal images. In such a case, it can be ensured that the generated anatomy is similar to the anatomies in the training class [34].

In this study, we do not examine the effect of the registration parameters, such as the regularization (warp smoothness) [35] or the choice of reference subject (template). The template was chosen randomly and we have not removed any bias of the statistical model toward the particular anatomy. Several solutions for removing the bias have been suggested in the past [36], [37] and might be incorporated in this framework in the future in order to increase robustness. Also, we used a predetermined, fixed smoothness constraint, the same for all training and test subjects.
It has been observed (in the context of cortical surface parcellation) that the optimal warp smoothness is mostly constant across subjects and that good segmentation results can be obtained for a range of smoothness levels [35].

The current methodology is based on voxel-wise features extracted from a single modality, such as FLAIR. In the future, we plan to explore multimodality features, which have been shown to minimize ambiguities in identifying abnormalities in many imaging applications, and especially in medical research [38]. Within the proposed mathematical framework, applying the method in clinical studies with multiple sequences is straightforward, the only difference being the further increase in dimensionality. We also plan to investigate wavelet-based or other multiscale features that reduce spatial and spectral redundancy and therefore allow a more compact representation of the data, thus providing a means of efficient dimensionality reduction.
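The idea of optimizing locally and fusing the local estimates under coupling constraints can be illustrated with a deliberately small toy problem. The sketch below is not the objective function of this paper; it assumes two blocks that share a single public variable (one overlapping voxel), each with a scalar quadratic cost, and reconciles them by dual decomposition in the spirit of [13]. The function name `fuse_by_dual_decomposition`, the confidence weights `a1` and `a2`, the step size, and the iteration count are hypothetical choices for the example.

```python
def fuse_by_dual_decomposition(d1, a1, d2, a2, step=0.5, iters=60):
    """Fuse two local estimates of one shared (public) variable.

    Block i holds a local estimate d_i with confidence weight a_i > 0 and
    minimizes a_i * (x_i - d_i)**2. The coupling constraint is x_1 == x_2.
    """
    price = 0.0  # Lagrange multiplier on the constraint x_1 - x_2 = 0
    x1, x2 = d1, d2
    for _ in range(iters):
        # Local steps: each block minimizes its own cost plus the price
        # term, independently of the other block (closed form here).
        x1 = d1 - price / (2.0 * a1)
        x2 = d2 + price / (2.0 * a2)
        # Coordination step: adjust the price in proportion to the
        # remaining disagreement between the two blocks.
        price += step * (x1 - x2)
    return x1, x2


# Block 1 believes the shared voxel is 1.0 (weight 1); block 2 believes
# it is 3.0 (weight 3). Both values are hypothetical.
x1, x2 = fuse_by_dual_decomposition(d1=1.0, a1=1.0, d2=3.0, a2=3.0)
print(round(x1, 6), round(x2, 6))
```

For this toy cost, the two copies of the public variable converge to the confidence-weighted mean of the local estimates, (a1·d1 + a2·d2)/(a1 + a2), which is what a single global minimization would give. With only one block (k = 1) there is no coupling constraint and the scheme reduces to a global optimization.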
VI. CONCLUSION

Image partitioning combined with a distributed estimation algorithm has been applied to deal with the high-dimensional problem of statistical modeling. The constructed statistical model of normal brain images has been applied to the segmentation of brain pathologies. The assessment of the method based on ROC analysis demonstrated segmentation improvement over univariate statistics and two-group analysis performed with SPM and
illustrated the potential of semisupervised learning in such applications.

ACKNOWLEDGMENT

The authors would like to thank Dr. G. Erus and Dr. C. Davatzikos at the Section of Biomedical Image Analysis, University of Pennsylvania, for helpful discussions.

REFERENCES

[1] M. Markou and S. Singh, “Novelty detection: A review. Part 1: Statistical approaches,” Signal Process., vol. 83, pp. 2481–2497, 2003.
[2] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Comput. Surveys, vol. 41, no. 3, pp. 1–72, 2009.
[3] L. Tarassenko, P. Hayton, N. Cerneaz, and M. Brady, “Novelty detection for the identification of masses in mammograms,” in 4th Int. Conf. Artif. Neural Netw., 1995, vol. 4, pp. 442–447.
[4] M. L. Seghier, A. Ramlackhansingh, J. Crinion, A. P. Leff, and C. J. Price, “Lesion identification using unified segmentation-normalisation,” NeuroImage, vol. 41, pp. 1253–1266, 2008.
[5] S. Shen, A. J. Szameitat, and A. Sterr, “Detection of infarct lesions from single MRI modality using inconsistency between voxel intensity and spatial location—A 3-D automatic approach,” IEEE Trans. Inf. Technol. Biomed., vol. 12, no. 4, pp. 532–540, Jul. 2008.
[6] K. Van Leemput, F. Maes, D. Vandermeulen, A. Colchester, and P. Suetens, “Automated segmentation of multiple sclerosis lesions by model outlier detection,” IEEE Trans. Med. Imag., vol. 20, no. 8, pp. 677–688, Aug. 2001.
[7] C. Spence, L. Parra, and P. Sajda, “Detection, synthesis and compression in mammographic image analysis with a hierarchical image probability model,” in Proc. IEEE Workshop Math. Methods Biomed. Image Anal., 2001, pp. 3–10.
[8] I. Steinwart, D. Hush, and C. Scovel, “Density level detection is classification,” Neural Inf. Process. Syst., vol. 17, pp. 1337–1344, 2005.
[9] I. T. Jolliffe, Principal Component Analysis (Statistics Series), 2nd ed. New York, NY: Springer, 2002.
[10] N.-X. Lian and C. Davatzikos, “Groupwise morphometric analysis based on morphological appearance manifold,” in Proc. IEEE Comput. Soc. Conf. Comput. Vision Pattern Recog. Workshops, 2009, pp. 133–140.
[11] M.-L. Shyu, S.-C. Chen, K. Sarinnapakorn, and L. Chang, “A novel anomaly detection scheme based on principal component classifier,” in Proc. 3rd IEEE Int. Conf. Data Mining, 2003, pp. 353–365.
[12] G. Erus, E. I. Zacharaki, R. N. Bryan, and C. Davatzikos, “Learning high-dimensional image statistics for abnormality detection on medical images,” in Proc. IEEE Comput. Soc. Workshop Math. Methods Biomed. Image Anal., 2010, pp. 139–145.
[13] S. Samar, S. Boyd, and D. Gorinevsky, “Distributed estimation via dual decomposition,” in Proc. Eur. Control Conf., 2007, pp. 1511–1516.
[14] A. Lakhina, M. Crovella, and C. Diot, “Mining anomalies using traffic feature distributions,” in Proc. Conf. Appl., Technol., Architectures, Protocols Comput. Commun., 2005, pp. 217–228.
[15] E. I. Zacharaki, S. Kanterakis, R. N. Bryan, and C. Davatzikos, “Measuring brain lesion progression with a supervised tissue classification system,” in Medical Image Computing and Computer-Assisted Intervention (Lecture Notes Computer Science Series), vol. 11. Berlin, Germany: Springer, 2008, pp. 620–627.
[16] O. Colliot, T. Mansi, N. Bernasconi, V. Naessens, D. Klironomos, and A. Bernasconi, “Segmentation of focal cortical dysplasia lesions using a feature-based level set,” in Medical Image Computing and Computer-Assisted Intervention (Lecture Notes Computer Science), vol. 3749. Berlin, Germany: Springer, 2005, pp. 375–382.
[17] J. D. Williamson, M. E. Miller, R. N. Bryan, R. M. Lazar, L. H. Coker, J. Johnson, T. Cukierman, K. R. Horowitz, A. Murray, and L. J. Launer, “The action to control cardiovascular risk in diabetes memory in diabetes study (ACCORD-MIND): Rationale, design, and methods,” Amer. J. Cardiol., vol. 99, no. 12, pp. S112–S122, 2007.
[18] S. M. Smith, “Fast robust automated brain extraction,” Human Brain Mapping, vol. 17, no. 3, pp. 143–155, 2002.
[19] J. Sled, A. Zijdenbos, and A. Evans, “A nonparametric method for automatic correction of intensity nonuniformity in MRI data,” IEEE Trans. Med. Imag., vol. 17, no. 1, pp. 87–97, Feb. 1998.
[20] D. Shen and C. Davatzikos, “HAMMER: Hierarchical attribute matching mechanism for elastic registration,” IEEE Trans. Med. Imag., vol. 21, no. 11, pp. 1421–1439, Nov. 2002.
[21] E. A. Stamatakis and L. K. Tyler, “Identifying lesions on structural brain images—Validation of the method and application to neuropsychological patients,” Brain Lang., vol. 94, pp. 167–177, 2005.
[22] N. Batmanghelich, X. Wu, E. I. Zacharaki, C. E. Markowitz, C. Davatzikos, and R. Verma, “Multiparametric tissue abnormality characterization using manifold regularization,” in Proc. SPIE Med. Imag.: Comput.-Aided Diagnosis, 2008, vol. 6915, no. 1, article 16, pp. 1–6.
[23] J. Lecoeur, S. Morissey, J. C. Ferre, D. Arnold, D. Collins, and C. Barillot, “Multiple sclerosis lesions segmentation using spectral gradient and graph cuts,” in Proc. MIAMS Workshop, 2008, pp. 92–103.
[24] D. García-Lorenzo, J. Lecoeur, D. L. Arnold, D. L. Collins, and C. Barillot, “Multiple sclerosis lesion segmentation using an automatic multimodal graph cuts,” in Medical Image Computing and Computer-Assisted Intervention (Lecture Notes Computer Science), vol. 5762. Berlin, Germany: Springer, 2009, pp. 584–591.
[25] M. Rouaïnia, M. S. Medjram, and N. Doghmane, “Brain MRI segmentation and lesions detection by EM algorithm,” World Acad. Sci., Eng. Technol., vol. 24, pp. 139–142, 2006.
[26] R. B. Dubey, M. Hanmandlu, S. K. Gupta, and S. K. Gupta, “Semiautomatic segmentation of MRI brain tumor,” ICGST Int. J. Graphics, Vision Image Process., vol. 9, no. 4, pp. 33–40, 2009.
[27] K. Friston, A. Holmes, K. Worsley, J. Poline, C. Frith, and R. Frackowiak, “Statistical parametric maps in functional imaging: A general linear approach,” Human Brain Mapping, vol. 2, pp. 189–210, 1995.
[28] M. Prastawa, E. Bullitt, S. Ho, and G. Gerig, “A brain tumor segmentation framework based on outlier detection,” Med. Image Anal., vol. 8, pp. 275–283, 2004.
[29] J. Theiler and L. Prasad, “Overlapping image segmentation for context-dependent anomaly detection,” Proc. SPIE, vol. 8048, 2011.
[30] C. J. Rose and C. J. Taylor, “A generative statistical model of mammographic appearance,” in Proc. Med. Image Understanding Anal., 2004, pp. 89–92.
[31] T. F. Cootes and C. J. Taylor, “Anatomical statistical models and their role in feature extraction,” The Br. J. Radiol., vol. 77, pp. S133–S139, 2004.
[32] L. Xu and A. L. Yuille, “Robust principal component analysis by self-organizing rules based on statistical physics approach,” IEEE Trans. Neural Netw., vol. 6, no. 1, pp. 131–143, 1995.
[33] Y. Li, “On incremental and robust subspace learning,” Pattern Recog., vol. 37, pp. 1509–1518, 2004.
[34] D. Rueckert, A. F. Frangi, and J. A. Schnabel, “Automatic construction of 3D statistical deformation models using non-rigid registration,” in Medical Image Computing and Computer-Assisted Intervention (Lecture Notes Computer Science), vol. 2208. Berlin, Germany: Springer, 2001, pp. 77–84.
[35] B. T. Yeo, M. R. Sabuncu, R. Desikan, B. Fischl, and P. Golland, “Effects of registration regularization and atlas sharpness on segmentation accuracy,” Med. Image Anal., vol. 12, pp. 603–615, 2008.
[36] T. Rohlfing, R. Brandt, R. Menzel, and C. R. Maurer, Jr., “Evaluation of atlas selection strategies for atlas-based image segmentation with application to confocal microscopy images of bee brains,” NeuroImage, vol. 21, no. 4, pp. 1428–1442, 2004.
[37] P. Aljabar, R. A. Heckemann, A. Hammers, J. V. Hajnal, and D. Rueckert, “Multi-atlas based segmentation of brain images: Atlas selection and its effect on accuracy,” NeuroImage, vol. 46, no. 3, pp. 726–738, 2009.
[38] R. Verma, E. I. Zacharaki, Y. Ou, H. Cai, S. Chawla, R. Wolf, S.-K. Lee, E. R. Melhem, and C. Davatzikos, “Multi-parametric tissue characterization of brain neoplasms and their recurrence using pattern classification of MR images,” Acad. Radiol., vol. 15, pp. 966–977, 2008.
Authors’ photographs and biographies not available at the time of publication.