Deep Learning of Image Features from Unlabeled Data for Multiple Sclerosis Lesion Segmentation

Youngjin Yoo¹,²,³, Tom Brosch¹,²,³, Anthony Traboulsee³, David K.B. Li³,⁴, and Roger Tam²,³,⁴

¹ Department of Electrical and Computer Engineering, ² Biomedical Engineering Program, ³ Division of Neurology, ⁴ Department of Radiology, University of British Columbia, Vancouver, BC, Canada

Abstract. A new automatic method for multiple sclerosis (MS) lesion segmentation in multi-channel 3D MR images is presented. The main novelty of the method is that it learns the spatial image features needed for training a supervised classifier entirely from unlabeled data. This is in contrast to other current supervised methods, which typically require the user to preselect or design the features to be used. Our method can learn an extensive set of image features with minimal user effort and bias. In addition, by separating the feature learning from the classifier training that uses labeled (pre-segmented) data, the feature learning can take advantage of the typically much larger pool of available unlabeled data. Our method uses deep learning for feature learning and a random forest for supervised classification, but potentially any supervised classifier can be used. Quantitative validation is carried out using 1450 T2-weighted and PD-weighted pairs of MRIs of MS patients, with 1400 pairs used for feature learning (100 of those also labeled for supervised training) and 50 for testing. The results demonstrate that the learned features are highly competitive with hand-crafted features in terms of segmentation accuracy, and that segmentation performance increases with the amount of unlabeled data used, even when the number of labeled images is fixed.

Keywords: Multiple sclerosis lesions, MRI, machine learning, segmentation, deep learning, random forests.

1 Introduction

Multiple sclerosis (MS) is a chronic, inflammatory, demyelinating disease of the brain and spinal cord. Lesions are a hallmark of MS pathology and are primarily visible in white matter (WM) on conventional magnetic resonance imaging (MRI) scans. Manual segmentation by expert users is a common way to determine the extent of MS lesions, but it is time-consuming and can suffer from intra- and inter-expert variability. Automatic segmentation is an attractive alternative, but it is a challenging task and remains an open problem [1].


Many automatic approaches have been proposed over the last two decades, and they fall into two main categories: supervised and unsupervised. Supervised methods learn from previously segmented training images and use user-selected image features to discriminate between lesions and healthy tissue (e.g. [2]). The availability of representative labeled images and the choice of image features are important considerations and may be difficult to optimize. Some methods use a very large starting set of features and select the more discriminative ones through labeled training (e.g. [3]). Unsupervised methods do not require labeled training data, but instead typically use an intensity clustering method to model tissue distributions, and rely on an expert's a priori knowledge of MRI and anatomy to reduce false positives (e.g. [4]). While both supervised and unsupervised approaches have had some success, supervised methods that can automatically learn useful spatial features from unlabeled images are an attractive alternative that remains under-investigated. The amount of unlabeled data typically far exceeds that of labeled data, and using a large database to build a feature set has the potential to improve robustness and generalizability over current supervised methods.

We present a new method that automatically learns image features for MS lesion segmentation from unlabeled images. We train our model on a large batch of unlabeled images to identify common patterns, then add labels to a subset of the training images so that the features and labels can be used in a supervised learning method to perform the segmentation. To our knowledge, this is the first attempt to automatically learn discriminative 3D image features from unlabeled images for MS lesion segmentation. Previous papers have proposed advanced feature selection methods, such as those based on modifications of random forests [5,6], but the features were still pre-determined, and were filtered using relatively small sets of labeled data to identify the more discriminative ones. The main difference is that our method automatically learns data-driven features from unlabeled images, without the potential bias of predefined features or of features learned from labeled data. This allows large data sets to be used to generate broadly representative feature sets. We show that the learned features enable segmentation performance that is competitive with hand-crafted features, and that increasing the amount of unlabeled data improves segmentation performance, even when the amount of labeled data is fixed.

2 Materials and Methods

Our data set consists of the image data from 581 MS patients scanned at multiple time points. The total number of cases, where a case consists of a pair of T2-weighted and proton density (PD) weighted scans, is 1450. Each T2/PD pair was acquired using a dual-echo MR sequence, so the two scans are inherently co-registered. The data set was collected from 48 sites, each using a different scanner, as part of a clinical trial in MS. All the images have the same resolution, 256 × 256 × 50, and the same voxel size, 0.936 × 0.936 × 3.000 mm³. We divided the data set into independent training and test sets.


Fig. 1. A training algorithm for our MS lesion segmentation framework. A large number of unlabeled images and a smaller number of labeled images are used in a deep learning framework to generate the feature vectors used to train a random forest classifier.

The training set consists of 1400 cases from 531 patients, and the test set contains 50 cases from 50 patients. Within the training set, 100 cases from 100 patients have expert segmentations that we used for supervised training. For preprocessing, N3 inhomogeneity correction [7] is first applied. Then, the entire set of T2-weighted (and, independently, PD-weighted) images is intensity-normalized to produce a mean of 0 and a standard deviation of 1. Skull-stripping is then performed with the brain extraction tool [8].
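For concreteness, the intensity normalization can be sketched as follows; this is a minimal illustration rather than the authors' code, assuming each scan is a 3D numpy array and that the mean and standard deviation are pooled over the entire set for each modality (per-image statistics would also be consistent with the description above):

import numpy as np

def normalize_set(images):
    """Normalize one modality's scans to zero mean and unit standard deviation.

    Sketch: `images` is a list of 3D numpy arrays (e.g., all T2 scans);
    the statistics are pooled over the whole set so that intensities are
    comparable across scanners and sites.
    """
    pooled = np.concatenate([img.ravel() for img in images])
    mean, std = pooled.mean(), pooled.std()
    return [(img - mean) / std for img in images]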

2.1 Algorithm Overview

Our algorithm for learning image features from unlabeled data is built using restricted Boltzmann machines (RBMs), which are two-layer, undirected networks, each consisting of a visible layer and a hidden layer, where the activations of the hidden units capture patterns in the visible units. RBMs can be stacked to form a deep belief network (DBN) for learning more abstract features. Our model (Fig. 1) consists of two RBMs, one for the T2 images and the other for the PD images, that learn smaller-scale features. In addition, two DBNs are used, again separately for the T2 and PD images, to learn larger-scale features. After training with unlabeled data, the model can be used to identify, in a probabilistic sense, the learned features in any given image. The model is then applied to a subset of the training data that has lesion labels. The activations of the learned features are then fed, along with the labels, into a random forest, which is used to build a voxel-wise probabilistic classifier to find lesion voxels in unseen images.

2.2 Unsupervised Feature Learning Using RBMs and Deep Learning

To target features at different scales, we extract image patches of two different sizes at the same locations from each image. To make feature learning on large batches of data feasible, we extract 100 uniformly spaced, non-overlapping patches at each scale, and treat those patches as a mini-batch. The spacing and patch sizes allow complete coverage of the whole brain in most images. For the smaller scale, we use a patch size of 9 × 9 × 3 and convert the image values to one-dimensional vectors $v_1, \ldots, v_{100} \in \mathbb{R}^D$ with $D = 243$. For the larger-scale features, we use a 3D patch size of 15 × 15 × 5, for $D = 1125$. We learn features from each 3D image patch using a Gaussian-Bernoulli RBM model [9], with a set of binary hidden random units $h$ of dimension $K$ ($K = 500$ for the smaller scale, $K = 1000$ for the larger), a set of real-valued visible random units $v$ of dimension $D$ ($D = 243$ for the smaller scale, $D = 1125$ for the larger), and symmetric connections between these two layers represented by a weight matrix $W \in \mathbb{R}^{D \times K}$. We follow a published guide [10] for choosing the number of hidden units to avoid severe overfitting. We minimize the energy function [9]:

$$E(v, h) = \frac{1}{2} \sum_{i=1}^{D} (v_i - c_i)^2 - \sum_{j=1}^{K} b_j h_j - \sum_{i=1}^{D} \sum_{j=1}^{K} v_i W_{ij} h_j, \tag{1}$$

where $b_j$ are hidden unit biases ($b \in \mathbb{R}^K$) and $c_i$ are visible unit biases ($c \in \mathbb{R}^D$). The units of the binary hidden layer, conditioned on the visible layer, are independent Bernoulli random variables, $P(h_j = 1 \mid v) = \sigma\left(\sum_i W_{ij} v_i + b_j\right)$, where $\sigma(s) = \frac{1}{1 + \exp(-s)}$ is the sigmoid function. The visible units, conditioned on the hidden layer, are independent Gaussians with diagonal covariance, $P(v_i \mid h) = \mathcal{N}\left(\sum_j W_{ij} h_j + c_i,\, 1\right)$. We use the contrastive divergence approximation [11] to update the weights and biases during training. In order to capture a higher-level representation of local brain structures, another layer of hidden units is stacked on top of the larger-scale RBM to form a deep belief network. Hinton et al. [11] showed that greedily training each pair of layers (from lowest to highest) as an individual RBM, using the previous layer's activations as input, is an efficient approach for training DBNs. Our DBN has a layer of real-valued visible units $v$ of dimension $D = 1125$ and two layers of $K = 1000$ binary hidden units $h$.
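Below is a minimal numpy sketch of the patch extraction and of one Gaussian-Bernoulli RBM update under CD-1. It is an illustration, not the authors' code: the grid spacing is simplified, and the momentum, weight decay, and learning-rate schedules recommended in [10] are omitted.

import numpy as np

rng = np.random.default_rng(0)

def extract_patches(volume, patch_shape=(9, 9, 3), n_patches=100):
    """Extract uniformly spaced, non-overlapping patches and flatten them.

    Sketch: patches are taken on a regular grid; the paper chooses the
    spacing so that the patches cover the whole brain.
    """
    px, py, pz = patch_shape
    patches = [volume[x:x + px, y:y + py, z:z + pz].ravel()
               for x in range(0, volume.shape[0] - px + 1, px)
               for y in range(0, volume.shape[1] - py + 1, py)
               for z in range(0, volume.shape[2] - pz + 1, pz)]
    return np.array(patches[:n_patches])      # mini-batch of D-dim row vectors

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

class GaussianBernoulliRBM:
    """Gaussian-Bernoulli RBM with unit-variance visibles, trained by CD-1."""

    def __init__(self, n_visible, n_hidden, lr=1e-3):
        self.W = rng.normal(0.0, 0.01, size=(n_visible, n_hidden))
        self.b = np.zeros(n_hidden)           # hidden biases b_j
        self.c = np.zeros(n_visible)          # visible biases c_i
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b)   # P(h_j = 1 | v)

    def cd1_update(self, v0):
        """One contrastive-divergence step on a mini-batch v0 of shape (n, D)."""
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = h0_sample @ self.W.T + self.c    # mean of the Gaussian visibles
        h1 = self.hidden_probs(v1)
        n = v0.shape[0]
        self.W += self.lr * (v0.T @ h0 - v1.T @ h1) / n   # positive - negative phase
        self.b += self.lr * (h0 - h1).mean(axis=0)
        self.c += self.lr * (v0 - v1).mean(axis=0)

For the smaller scale this gives an RBM with D = 243 and K = 500; training a second 1000-unit RBM of the same form on the hidden activations of the larger-scale model yields the two-layer DBN described above.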

2.3 Feature Vector Construction for Supervised Learning

To train a random forest, we use the labeled set of training images and construct feature vectors by applying our trained RBM/DBN model to 200 image patches within the lesion mask and 3800 image patches from normal-appearing tissue in each T2 and PD image. The patches are extracted in the same way as described above. The activations of the RBM/DBN model represent the strength of the learned features present in the labeled images. We define $x$ as a voxel location and let $v^{s_1}(x)$ denote a one-dimensional vector reformatted from a 3D image patch of size 9 × 9 × 3 centered at $x$, and $v^{s_2}(x)$ a one-dimensional vector reformatted from a 3D image patch of size 15 × 15 × 5 centered at $x$. We let $I_{T2}(x)$ and $I_{PD}(x)$ denote the intensity values at voxel position $x$ of a T2 image and a PD image, respectively. A feature vector $g \in \mathbb{R}^L$ with $L = 5002$ is constructed for a given voxel in a pair of T2/PD images by concatenating the intensity values and the activations of the learned features:

$$g(x) = \{\, I_{T2}(x),\ T^{s_1}_{1:500}(v^{s_1}(x)),\ T^{s_2,1}_{1:1000}(v^{s_2}(x)),\ T^{s_2,2}_{1:1000}(v^{s_2}(x)),\ I_{PD}(x),\ J^{s_1}_{1:500}(v^{s_1}(x)),\ J^{s_2,1}_{1:1000}(v^{s_2}(x)),\ J^{s_2,2}_{1:1000}(v^{s_2}(x)) \,\}, \tag{2}$$

where $T^{s_1}_k$ is the activation of the $k$-th hidden unit of the trained RBM when the image patch $v^{s_1}(x)$ from the T2 image is used as input. Similarly, $T^{s_2,n}_k$ is the activation of the $k$-th hidden unit of the $n$-th layer of the trained DBN when the image patch $v^{s_2}(x)$ from the T2 image is used as input. $J^{s_1}_k$ and $J^{s_2,n}_k$ are the analogous activations computed from the PD image.
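A sketch of the construction in (2), reusing the RBM class above; patch_at is a small helper defined here for illustration, and the DBN's activations method (returning the two layers' 1000-dimensional activation vectors) is a hypothetical API:

import numpy as np

def patch_at(img, x, shape):
    """Hypothetical helper: flattened patch of `shape` centered at voxel x."""
    hx, hy, hz = (s // 2 for s in shape)
    return img[x[0] - hx:x[0] + hx + 1,
               x[1] - hy:x[1] + hy + 1,
               x[2] - hz:x[2] + hz + 1].ravel()

def feature_vector(x, t2, pd, rbm_t2, dbn_t2, rbm_pd, dbn_pd):
    """Build the L = 5002 feature vector of Eq. (2) for voxel location x."""
    t_l1, t_l2 = dbn_t2.activations(patch_at(t2, x, (15, 15, 5)))  # hypothetical API
    j_l1, j_l2 = dbn_pd.activations(patch_at(pd, x, (15, 15, 5)))
    return np.concatenate([
        [t2[x]], rbm_t2.hidden_probs(patch_at(t2, x, (9, 9, 3))), t_l1, t_l2,
        [pd[x]], rbm_pd.hidden_probs(patch_at(pd, x, (9, 9, 3))), j_l1, j_l2,
    ])  # 2 × (1 + 500 + 1000 + 1000) = 5002 entries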

2.4 Random Forest Training and Prediction

We have chosen to use a random forest [12] for supervised classification because random forests have been successfully used for MS lesion segmentation with hand-crafted features [2], and because random forests are able to provide information on the relative importance of the features used. We construct a random forest consisting of 30 randomized binary decision trees with a maximum depth of 20. We use the same structure for the random forest as in previous work on MS lesion segmentation [2], which may not be optimal for our learned features, but should be sufficient for a proof of concept. As described above, we collect feature vectors from image patches inside and outside of the lesion mask of each labeled image. The information gain is used to measure the quality of a split. To segment the lesions in a new image, a feature vector for each voxel is computed using (2), and voxel-wise classification is performed by propagating the computed feature vectors through all the trees by successive application of the relevant binary tests. The final posterior probability is estimated by averaging the posteriors stored at the leaf nodes reached in all trees.
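A sketch of the supervised stage, using scikit-learn's random forest as a stand-in implementation (the paper does not name one); the tree count, maximum depth, and information-gain (entropy) criterion follow the text, and the stand-in data shapes mirror the 200 lesion plus 3800 non-lesion patches sampled per labeled image:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
# Stand-in data: one Eq.-(2) feature vector per sampled patch, label 1 = lesion.
X_train = rng.random((4000, 5002))
y_train = rng.integers(0, 2, size=4000)

forest = RandomForestClassifier(n_estimators=30, max_depth=20,
                                criterion="entropy", n_jobs=-1)
forest.fit(X_train, y_train)

# Voxel-wise prediction: each voxel's feature vector is routed through all 30
# trees and the leaf posteriors are averaged; the paper thresholds at 0.4.
X_new = rng.random((100, 5002))
lesion_prob = forest.predict_proba(X_new)[:, 1]
lesion_mask = lesion_prob > 0.4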

3 Experiments and Results

To evaluate the segmentation performance of the automatically learned features, we used a validation procedure in which we varied the amount of unlabeled data (100, 400, 700, 1000, and 1400 cases) used for training the RBMs and DBNs, while keeping the labeled set (100 cases) and test set (50 cases) fixed, and compared the automatic probabilistic segmentations to the binary segmentations produced by the experts. The parameters for training the RBMs and DBNs were kept consistent across all experiments. We used three measures for comparing segmentations: the Dice similarity coefficient (DSC), the true positive rate (TPR), and the positive predictive value (PPV) [1,13]. To produce binary segmentations, we thresholded the probabilistic segmentations using a visually derived value of 0.4. Since relative segmentation accuracy generally increases with lesion load, we stratified the cases into 5 lesion load categories for interpreting the results. An example of a segmentation result with a larger lesion load is shown in Fig. 2.
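All three measures reduce to simple overlap counts between the thresholded automatic mask and the expert mask; a minimal sketch (assuming boolean numpy arrays and non-empty masks):

import numpy as np

def segmentation_scores(pred, truth):
    """Return DSC, TPR, and PPV in percent for two binary masks."""
    tp = np.logical_and(pred, truth).sum()          # true positive voxels
    dsc = 200.0 * tp / (pred.sum() + truth.sum())   # Dice similarity coefficient
    tpr = 100.0 * tp / truth.sum()                  # sensitivity
    ppv = 100.0 * tp / pred.sum()                   # precision
    return dsc, tpr, ppv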


Fig. 2. Probabilistic segmentation example. (a) T2 input image, (b) PD input image, (c) probabilistic segmentation result, (d) ground truth. DSC = 73.15%.

Table 1. DSC results (%) calculated on 50 T2/PD test pairs. Ten T2/PD pairs were used for each lesion load range, and average scores were computed. The number of unlabeled images used for feature learning was varied, while the supervised training set was fixed at 100 T2/PD pairs. There is an apparent trend toward improved accuracy with a greater number of unlabeled training images.

Number of cases                     Lesion load (1000×mm³)
(number of patients)    0.0–4.0   4.0–7.8   7.8–14.7   14.7–28.5   28.5+
100 (45)                  12.8      32.7      45.2        51.2       51.4
400 (152)                 12.1      34.2      48.1        54.8       55.3
700 (264)                 12.8      35.5      49.0        56.3       56.5
1000 (384)                12.2      34.2      47.8        55.1       55.8
1400 (532)                14.0      36.2      49.6        55.7       55.4

Table 1 summarizes the segmentation performance as measured by the DSC. For all of the lesion load categories, there is an apparent trend toward greater accuracy with an increase in the number of unlabeled training images. The improvement is monotonic up to 700 cases, except for a slight aberration in the lowest lesion load category. However, in all categories, the DSC decreased slightly when using 1000 cases as compared to 700 cases. This may stem from incidental similarities between some of the 700 unlabeled cases, the labeled cases, and the test images, leading to over-fitting; further experiments with multiple randomizations would be needed to confirm this.

To compare fairly with other state-of-the-art methods [2,13,14], we selected 38 cases from our test set so that the range in lesion load (128 mm³ to 20695 mm³) is similar to that of the data set used for evaluation in [2,13,14] (105 mm³ to 22542 mm³). Table 2 shows the performance statistics for the other methods and our own, using the features learned from 1400 cases, and demonstrates that our method is highly competitive in accuracy, although the use of different data sets allows only an indirect comparison. The lower PPV suggests that our model tends to over-segment the lesions compared to the other methods.

Finally, we examined the training results of the random forest to determine which sets of features (intensity, RBM, DBN first layer, DBN second layer) and which MR channel were the most important. Table 3 shows the relative discriminative power of each category of features, as represented by the percentage of nodes in which that category was selected by the random forest.


Table 2. Average TPR/PPV/DSC results (%). Our method is compared to three state-of-the-art methods (2008: Souplet [14], 2011: Geremia [2], 2013: Weiss [13]). Note that DSC measures were not available in [14,2], and our data set was different, which allows only an indirect comparison.

Souplet [14]        Geremia [2]         Weiss [13]                  Our Method
TPR      PPV        TPR      PPV        TPR      PPV      DSC       TPR      PPV      DSC
19±14    30±16      39±18    40±20      33±18    37±19    29±13     58±17    35±24    38±19

Table 3. Relative discriminative power (%) of the features used for voxel-wise classification, as determined by the random forest. The percentages indicate the relative frequency with which each category of features was selected when training the random forest using the features learned from 1400 unlabeled cases.

T2 intensity   T2 RBM   T2 DBN Layer 1   T2 DBN Layer 2   PD intensity   PD RBM   PD DBN Layer 1   PD DBN Layer 2
    1.3         14.9         17.7             19.0             1.9         14.3         15.0             16.0

These results suggest that spatial features were much more important (96.9%) than intensity features (3.1%) for distinguishing between lesion and non-lesion voxels. The spatial features computed from the second layer of the DBNs were selected slightly more often (35.0%) than those from the first layer (32.7%) and the RBMs (29.2%), but the RBM contribution still appears substantial. The features learned from the T2 images were selected slightly more often (51.6%) than those learned from the PD images (45.3%).
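The Table 3 analysis can be reproduced from a trained scikit-learn forest by counting, per feature category, how often a feature from that category is chosen as a split variable across all tree nodes; a sketch, assuming the forest from the earlier sketch and the feature layout of Eq. (2):

import numpy as np

# Index ranges of each feature category within the 5002-entry vector of Eq. (2).
bounds = {"T2 intensity": (0, 1),         "T2 RBM": (1, 501),
          "T2 DBN layer 1": (501, 1501),  "T2 DBN layer 2": (1501, 2501),
          "PD intensity": (2501, 2502),   "PD RBM": (2502, 3002),
          "PD DBN layer 1": (3002, 4002), "PD DBN layer 2": (4002, 5002)}

# Internal nodes store the split feature index; leaves are marked with -2.
split_features = np.concatenate(
    [tree.tree_.feature[tree.tree_.feature >= 0] for tree in forest.estimators_])
for name, (lo, hi) in bounds.items():
    share = 100.0 * np.mean((split_features >= lo) & (split_features < hi))
    print(f"{name}: {share:.1f}% of split nodes")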

4 Conclusion and Future Work

We have presented a new MS lesion segmentation method based on automatic feature learning from unlabeled images. Using a multi-scale RBM/DBN framework, we showed that the automatically learned features can be highly competitive with hand-crafted features for subsequent use in the supervised training of random forests, and that adding more unlabeled images generally increases segmentation performance, with the main advantage that minimal manual effort is involved. The main current limitation is the high dimensionality of the feature vectors used for training the random forest, which is the reason we only used 4000 patches per labeled image. This limitation is more critical than the small number of patches used for RBM and DBN training, because far fewer labeled images are typically available. Future work would include improvements in training efficiency for the RBMs, DBNs, and random forests in order to use a greater number of sample patches in both the unsupervised and supervised stages. Another limitation is that although we have shown that increasing the amount of unlabeled training data generally increases segmentation performance, the interactions between the unlabeled, labeled, and test data are poorly characterized and deserve further investigation (for example, by varying the amount of labeled data). In addition, our model can likely be further optimized, for instance by tuning the deep learning and random forest parameters and adding more layers to the network. Despite these limitations, we believe we have demonstrated the potential of unsupervised feature learning for MS lesion segmentation.

Acknowledgements. This work was supported by the MS/MRI Research Group at the University of British Columbia, the Natural Sciences and Engineering Research Council of Canada, the MS Society of Canada, and the Milan and Maureen Ilich Foundation.

References

1. García-Lorenzo, D., et al.: Review of automatic segmentation methods of multiple sclerosis white matter lesions on conventional magnetic resonance imaging. Med. Image Anal. 17(1), 1–18 (2013)
2. Geremia, E., et al.: Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images. NeuroImage 57(2), 378–390 (2011)
3. Morra, J., et al.: Automatic segmentation of MS lesions using a contextual model for the MICCAI grand challenge. In: MS Lesion Segmentation Challenge (MICCAI Workshop), pp. 1–7 (2008)
4. Shiee, N., et al.: A topology-preserving approach to the segmentation of brain images with multiple sclerosis lesions. NeuroImage 49(2), 1524–1535 (2010)
5. Montillo, A., Shotton, J., Winn, J., Iglesias, J.E., Metaxas, D., Criminisi, A.: Entangled decision forests and their application for semantic segmentation of CT images. In: Székely, G., Hahn, H.K. (eds.) IPMI 2011. LNCS, vol. 6801, pp. 184–196. Springer, Heidelberg (2011)
6. Yaqub, M., Javaid, M.K., Cooper, C., Noble, J.A.: Improving the classification accuracy of the classic RF method by intelligent feature selection and weighted voting of trees with application to medical image segmentation. In: Suzuki, K., Wang, F., Shen, D., Yan, P. (eds.) MLMI 2011. LNCS, vol. 7009, pp. 184–192. Springer, Heidelberg (2011)
7. Sled, J.G., et al.: A nonparametric method for automatic correction of intensity nonuniformity in MRI data. IEEE Trans. Med. Imaging 17(1), 87–97 (1998)
8. Smith, S.M.: Fast robust automated brain extraction. Human Brain Mapping 17(3), 143–155 (2002)
9. Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images. Tech. Rep., University of Toronto (2009)
10. Hinton, G.: A practical guide to training restricted Boltzmann machines. Tech. Rep., University of Toronto (2010)
11. Hinton, G., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation 18(7), 1527–1554 (2006)
12. Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
13. Weiss, N., Rueckert, D., Rao, A.: Multiple sclerosis lesion segmentation using dictionary learning and sparse coding. In: Mori, K., Sakuma, I., Sato, Y., Barillot, C., Navab, N. (eds.) MICCAI 2013, Part I. LNCS, vol. 8149, pp. 735–742. Springer, Heidelberg (2013)
14. Souplet, J.C., et al.: An automatic segmentation of T2-FLAIR multiple sclerosis lesions. In: MS Lesion Segmentation Challenge (MICCAI Workshop) (2008)