
Multilevel Image Recognition using Discriminative Patches and Kernel Covariance Descriptor

Le Lu, Jianhua Yao, Evrim Turkbey, Ronald M. Summers

Clinical Image Processing Service and Imaging Biomarkers and Computer-Aided Diagnosis Laboratory, Department of Radiology and Imaging Sciences, National Institutes of Health Clinical Center, Bethesda, MD

ABSTRACT Computer-aided diagnosis of medical images has emerged as an important tool to objectively improve the performance, accuracy and consistency of clinical workflow. To computerize the medical image diagnostic recognition problem, three fundamental problems must be addressed: where to look (i.e., where is the region of interest within the whole image/volume), image feature description/encoding, and similarity metrics for classification or matching. In this paper, we explore the motivation, implementation and performance evaluation of task-driven iterative, discriminative image patch mining; a covariance matrix based descriptor built from intensity, gradient and spatial layout; and a log-Euclidean distance kernel for support vector machines, to address these three aspects respectively. To cope with the often visually ambiguous image patterns of the region of interest in medical diagnosis, discovery of multilabel selective discriminative patches is desired. The covariance of several image statistics summarizes their second order interactions within an image patch and has proved to be an effective image descriptor, with low dimensionality compared with joint statistics and fast computation regardless of the patch size. We extensively evaluate two extended Gaussian kernels, using the affine-invariant Riemannian metric or the log-Euclidean metric, with support vector machines (SVM) on two medical image classification problems: degenerative disc disease (DDD) detection on cortical shell unwrapped CT maps and colitis detection on CT key images. The proposed approach is validated with promising quantitative results on these challenging tasks. Our experimental findings and discussion also offer insights on the covariance feature composition with or without spatial layout for classification and retrieval, and on different kernel constructions for SVM. This will also shed some light on future work using covariance features and kernel classification for medical image analysis.

KEYWORDS Computer-Aided Detection, Multilevel Image Recognition, Discriminative Image Patch Mining, Covariance Descriptor, Kernel Distance Metric, Support Vector Machine

1. INTRODUCTION

There has been much progress in structured, model based 3D organ segmentation [7,8,9,10] in medical imaging over the last decade. Solid organs can be robustly located using boosting [11,12] or generalized Hough transform [10] based object detection schemes. The segmented model is then recovered by local boundary detection and statistical shape model fitting. However, the mainstream organ segmentation methods cannot be directly generalized to semi-structured or non-structured image pattern detection and segmentation with desirable accuracy. As the illustrative examples of colitis disease patterns in CT images in Fig. 1 show, there are two major challenges: ambiguous visual appearance and the lack of a structured shape model. Many diseased image patterns [13] and soft organs [14] share similar characteristics, such as lesions, tumors, inflammation, bowel and so on.

In this paper, we describe a complete solution to three essential problems of unstructured image pattern detection: where to look (i.e., where is the region of interest within the whole image/volume), image feature description and encoding, and similarity metrics for classification or matching. Contextual attachment of mesenteric vasculature has been used to seed 3D region growing for labeling or segmenting small bowel [14], but this is not generally applicable. Given any input CT volume, a generic, trainable, high performance bounding-box based image detector is needed to detect the target tissue under scanning windows. Training such a detector is formulated as a rare event detection problem, since the target occupies only a small portion of the whole image. The target image patterns are labelled by manual annotation and treated as positive class samples. The construction of representative negative class instances is therefore critical to the effectiveness of the final detector. We propose an iterative hard negative mining process via discriminative classifiers to solve this problem.

Figure 1. Illustrative examples of colitis disease pattern in 2D (left) and 3D (right) CT images.

Partially inspired by the hierarchical scene classification literature [6,18], we adopt a two level approach of direct patch level labeling and spatial image level aggregation, for classifying whether a CT image contains the target disease and providing the segmented support mask. For image feature description, we exploit the covariance matrix based descriptor on intensity, gradient and spatial layout [1,13] channels. The affine-invariant metric [2] and the log-Euclidean distance kernel [4,3] of covariance descriptors on a Riemannian manifold are presented and compared as extended Gaussian kernels for support vector machines. Multi-level image patch learning in the medical domain has mostly been explored for lung image processing [19,20], where there is reasonably good intensity contrast between lung nodules/tumors and parenchyma. In this paper, we propose a generic approach to solve more ambiguous image pattern detection tasks in a discriminative framework. Our method is tested and validated using two clinical applications: the prior mask detection of degenerative disc disease (DDD) on cortical shell unwrapped CT maps and colitis detection on CT key images. The rest of the paper is organized as follows. The complete algorithms and pipeline of our approach are described in Section 2. We provide an experimental evaluation of the proposed method in Section 3, with both visually and quantitatively validated results. The paper is concluded and future work is discussed in Section 4.

2. METHOD

Our method has three main algorithm components: (1) image region covariance descriptor, (2) task-driven discriminative image patch mining, (3) kernel similarity measurement using the affine-invariant metric or the log-Euclidean distance, combined with a support vector machine.

2.1 Region Covariance Descriptor

A region can be represented by the covariance matrix of image features, such as spatial location, intensity, higher order derivatives, etc. [1]. It captures the second order co-statistics among different feature channels over an image region. Because the covariance matrix is computed by averaging the feature co-statistics, it is a robust local image descriptor and demonstrates good sensitivity and specificity in modeling texture and object (e.g., pedestrian) appearance. It is applicable to both detection and retrieval tasks when proper distance measures are employed, since covariance matrices lie on a Riemannian manifold. An efficient computation method exists via integral images. In our problems, the region covariance descriptor is first adopted to capture the statistical appearance correlations among osteophyte regions and to compute the confidence of osteophyte occurrence. Taking DDD detection on cortical shell unwrapped CT maps [21] as an example, the operation is conducted on the mean density map ($U_3$). We compute the regional covariance coefficients within an image patch over 11 feature channels: the 2D spatial location $(z, \varphi)$, the intensity $U_3(z, \varphi)$, and the first and second order spatial gradients along $z$ and $\varphi$ of both the original $U_3(z, \varphi)$ and its Gaussian smoothed version $U_3^s(z, \varphi)$, as demonstrated in Fig. 2. Thus the feature vector for each point $(z, \varphi)$ on the map is

\[
F(z,\varphi) = \left[\, z,\ \varphi,\ U_3(z,\varphi),\ |\partial_z U_3|,\ |\partial_\varphi U_3|,\ |\partial_z^2 U_3|,\ |\partial_\varphi^2 U_3|,\ |\partial_z U_3^s|,\ |\partial_\varphi U_3^s|,\ |\partial_z^2 U_3^s|,\ |\partial_\varphi^2 U_3^s| \,\right].
\]

Figure 2. Illustrative examples of the 11 channels of feature maps for the regional covariance descriptor, using spatial coordinates, intensity from the mean intensity map of the cortical shell unwrapped CT image, and the first and second order directional gradients on the original and smoothed intensity maps.
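As an illustration of the descriptor, the following minimal NumPy sketch stacks the 11 feature channels for a 2D map and computes the covariance matrix of one square patch. It is a sketch under stated assumptions rather than the implementation used in this work: the function names, the Gaussian scale `sigma=1.5`, and the finite-difference gradients are all illustrative choices, and the covariance is computed directly per patch instead of through integral images.

```python
import numpy as np

def gaussian_smooth(img, sigma=1.5):
    """Separable Gaussian blur with NumPy only (sigma is an assumed value)."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    k = np.exp(-0.5 * (x / sigma) ** 2)
    k /= k.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode="valid"), 0, rows)

def feature_stack(u3, sigma=1.5):
    """Return an (H, W, 11) array: z, phi, U3, then |gradients| of U3 and smoothed U3."""
    u3 = np.asarray(u3, dtype=float)
    z, phi = np.mgrid[0:u3.shape[0], 0:u3.shape[1]].astype(float)
    channels = [z, phi, u3]
    for img in (u3, gaussian_smooth(u3, sigma)):
        dz, dphi = np.gradient(img)           # first order derivatives
        dzz = np.gradient(dz, axis=0)         # second order along z
        dphiphi = np.gradient(dphi, axis=1)   # second order along phi
        channels += [np.abs(dz), np.abs(dphi), np.abs(dzz), np.abs(dphiphi)]
    return np.stack(channels, axis=-1)

def region_covariance(features, top, left, size):
    """Covariance of the feature vectors inside one size x size window."""
    patch = features[top:top + size, left:left + size]
    return np.cov(patch.reshape(-1, features.shape[-1]), rowvar=False)
```

Because the matrix is symmetric, an 11×11 descriptor carries 11·12/2 = 66 distinct entries regardless of the window size.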

The resulting feature covariance matrix $S$ is 11×11 and has 66 independent parameters due to symmetry. From the manual markers (Fig. 3), we can obtain two sets of positive and negative samples for training. In Fig. 3 (bottom), the radiologist's annotations of degenerative disc disease spots are shown as white dots/clicks unwrapped from the 3D CT volume [21], where the different colors represent four spatially grouped DDD lesions. The sampled positive and negative image windows are randomly drawn from the image and labelled as (+/−) based on their overlapping ratios $R$ against the reference spatial occupancy of DDD lesions. If $R \ge 0.7$, the window is defined as +; if $R \le 0.2$, the window is defined as −. The subset of sampled windows with $0.2 < R < 0.7$ is not used for training, to avoid confusion.
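The window labelling rule can be sketched as follows. This is a minimal illustration, assuming that the overlapping ratio R is the fraction of window pixels covered by the lesion mask and that both thresholds are inclusive; the function names and the return code 0 for discarded windows are hypothetical.

```python
import numpy as np

def label_window(mask, top, left, size, hi=0.7, lo=0.2):
    """Return +1 (positive), -1 (negative) or 0 (ambiguous, discarded)."""
    window = mask[top:top + size, left:left + size]
    r = float(np.mean(window > 0))   # assumed definition of the overlap ratio R
    if r >= hi:
        return 1
    if r <= lo:
        return -1
    return 0

def sample_windows(mask, size, n, rng):
    """Draw n random windows and keep only the confidently labelled ones."""
    h, w = mask.shape
    samples = []
    for _ in range(n):
        top = int(rng.integers(0, h - size + 1))
        left = int(rng.integers(0, w - size + 1))
        label = label_window(mask, top, left, size)
        if label != 0:
            samples.append((top, left, label))
    return samples
```

Discarding the intermediate band keeps partially overlapping windows from contributing contradictory labels to either class.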

2.2 Discriminative Image Patch Mining

For DDD detection, we can directly obtain two sets of positive $\{S^+\}$ and negative $\{S^-\}$ samples by random sampling and an overlap check against the colored ground-truth masks, as shown in Fig. 3 (bottom). This is because DDD patterns often occupy a significant portion of the whole unwrapped image and appear against a relatively simple background (the single anatomy of the vertebral column). One-loop sampling of the negative set $\{S^-\}$ is sufficiently representative to build an effective classifier.

Figure 3. (Top Left) An abdomen CT key image shows colitis affecting the cecum (arrows); (Top Middle) Its associated colitis image patch classification response value map (normalized class-conditional probability from the support vector machine); (Top Right) The detected image region with per patch probability value > 0.5. No spatial smoothing is applied on the response and detection maps. (Bottom) The annotation as white dots/clicks unwrapped from the 3D CT volume; the color DDD masks and sampled positive and negative image windows.

In CT abdomen key images, colitis regions normally appear weak or not very distinctive compared with the background. More importantly, there can be many visually confusing image patterns due to the presence of bowel, kidney and other organs, as seen in Fig. 4. The thickened colon wall and pericolonic fat inflammation regions are color-coded by a radiologist so that both are identified as the foreground class (+) to be detected. To build an effective discriminative model for detecting colitis, even under mild colitis conditions, the hard-to-distinguish negatives need to be discovered and enforced during classifier training [5,15]. We propose an iterative discriminative negative patch mining method that trains layers of SVM classifiers $\{C^{i}_{SVM}\}$, $i = 1, 2, \ldots$ The positive samples $\{S^+\}$ are drawn similarly, by checking for high overlapping ratios with the foreground supporting masks, and are fixed during iteration. Initially, when $i = 1$, the negatives $\{S^-_{i=1}\}$ are sampled randomly and uniformly. After $C^{1}_{SVM}$ is trained, a randomly resampled set $\{S^-\}$ is evaluated by $C^{1}_{SVM}$, and only the hard negatives (i.e., highly confusing negatives with a relatively high SVM confidence of belonging to the colitis class, above 0.75) are kept as $\{S^-_{i=2}\}$. In general, the hard negatives found at layer $i$ are merged with the previous negative set, $\{S^-_i\} \leftarrow \{S^-_i\} \cup \{S^-_{i-1}\}$, to train the next layer SVM $C^{i}_{SVM}$ from the adapted $\{S^-_i\}$ and the fixed $\{S^+\}$. This process converges when the change from set $\{S^-_{i-1}\}$ to $\{S^-_i\}$ is negligible. In our experiments, the mining procedure converges within 5-6 iterations. In summary, an SVM classifier is finally trained to model the classification boundary precisely, where hard negatives are the critical negative samples near the decision boundary. More sophisticated analytical solutions for mining covariance descriptor samples [16] or mode-seeking/clustering based principles [17] are left for future work.
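The mining loop above can be sketched in a classifier-agnostic way. The code below is a minimal sketch, not the implementation used in this work: `sample_negatives`, `train` and `confidence` are hypothetical hooks (for example, a kernel SVM with probability outputs could stand behind `train` and `confidence`), and the stopping rule "fewer than `min_new` new hard negatives" is an assumed way of expressing that the change in the negative set is negligible.

```python
import numpy as np

def mine_hard_negatives(positives, sample_negatives, train, confidence,
                        threshold=0.75, max_iters=6, min_new=1):
    """Iteratively grow the negative set with confusing negatives.

    positives        : array (n_pos, d), fixed across iterations
    sample_negatives : callable() -> array (n, d) of freshly sampled negatives
    train            : callable(pos, neg) -> model        (hypothetical hook)
    confidence       : callable(model, X) -> P(positive)  (hypothetical hook)
    """
    negatives = sample_negatives()            # i = 1: random, uniform negatives
    model = train(positives, negatives)
    for _ in range(max_iters - 1):
        candidates = sample_negatives()       # resample, score with current layer
        hard = candidates[confidence(model, candidates) >= threshold]
        if len(hard) < min_new:               # negative set barely changes: stop
            break
        negatives = np.vstack([negatives, hard])
        model = train(positives, negatives)   # train the next layer classifier
    return model, negatives
```

Keeping the positives fixed and only enlarging the negative set concentrates successive classifiers on the negatives that lie closest to the decision boundary.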

Figure 4. Illustrative examples of the original CT key image (left) and its color-coded disease regions of thickened colon wall (green) and pericolonic fat inflammation (yellow) (right).

2.3 Similarity Measurements on Covariance Descriptor

The distance between any pair of covariance matrix descriptors $(S_i, S_j)$ is defined on a Riemannian manifold. To preserve the manifold distance metric, two main types of distance have been proposed and exploited in the computer vision literature for object matching and detection problems. The affine-invariant metric is defined as

\[
d(S_i, S_j) = \sqrt{\sum_{k=1}^{11} \ln^2 \lambda_k(S_i, S_j)}
\]

where $\lambda_k$ are the generalized eigenvalues of $(S_i, S_j)$ [2]. The log-Euclidean metric is formulated as

\[
d(S_i, S_j) = \left\| \mathrm{logm}(S_i) - \mathrm{logm}(S_j) \right\|
\]

where $\mathrm{logm}(S_i)$ is the matrix logarithm of $S_i$ [3]. To integrate the distance metric into an SVM classifier, an extended Gaussian kernel can be defined as

\[
K(S_i, S_j) = \exp\left(-d(S_i, S_j) / \sigma\right)
\]

where $d(S_i, S_j)$ can be either of the two distances above and $\sigma$ is selected as the mean of $d(S_i, S_j)$ over the training datasets $\{S^+\}$ and $\{S^-\}$. With the log-Euclidean distance, $K(S_i, S_j)$ is a Mercer kernel, which is theoretically required by SVM to guarantee good performance, as recently addressed in [4]. The affine-invariant metric can still be used empirically, but it lacks theoretical support. In our experimental evaluation, the affine-invariant metric kernel indeed yields inferior classification accuracy compared with the log-Euclidean distance kernel, even though significantly more support vectors are selected by SVM training.

At runtime, we extract $N \times N$ scanning windows ($N = 25$) centered every 3 pixels in the x and y directions and classify each with the trained SVM. In this way, a CT key image is densely scanned and every image window or patch has a statistically sufficient number of pixels to compute its covariance matrix descriptor. Fig. 3 (top-middle) shows an example of the response map of SVM confidence values.

For colitis, the local patch confidences of detections are spatially aggregated to form image-level statistical features, so that we can classify the key images as a binary problem of containing colitis (+) or not (-). The image level features are simply the zero and first order statistics of the strength values of the response map [6], representing the summed response weight and the area of the spatial span of the underlying disease regions.
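The two distances and the extended Gaussian kernel of this section can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: descriptors are taken to be strictly positive definite (no regularization is applied), the affine-invariant distance is written in the Förstner-Moonen form with a square root, the kernel is taken as exp(-d/sigma) with sigma the mean pairwise training distance, and all function names are illustrative.

```python
import numpy as np

def spd_logm(S):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(S)
    return (V * np.log(w)) @ V.T

def log_euclidean_distance(Si, Sj):
    """Frobenius norm of the difference of matrix logarithms."""
    return np.linalg.norm(spd_logm(Si) - spd_logm(Sj), ord="fro")

def affine_invariant_distance(Si, Sj):
    """sqrt(sum_k ln^2 lambda_k), lambda_k the generalized eigenvalues of (Si, Sj)."""
    # Whitening by Sj^(-1/2) turns the generalized problem into a symmetric one.
    w, V = np.linalg.eigh(Sj)
    inv_sqrt = (V / np.sqrt(w)) @ V.T
    lam = np.linalg.eigvalsh(inv_sqrt @ Si @ inv_sqrt)
    return np.sqrt(np.sum(np.log(lam) ** 2))

def gram_matrix(descriptors, dist):
    """Return K = exp(-D / sigma) and sigma, the mean pairwise distance."""
    n = len(descriptors)
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            D[i, j] = D[j, i] = dist(descriptors[i], descriptors[j])
    sigma = D[np.triu_indices(n, k=1)].mean()
    return np.exp(-D / sigma), sigma
```

A Gram matrix built this way can be handed to any SVM implementation that accepts a precomputed kernel; test-time kernel values must then reuse the `sigma` estimated on the training set.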

3. RESULTS

Twenty-two cases of cortical shell unwrapped CT maps (with a total of 65 DDD lesions) are employed for the study. We obtain a total of 3572 patches (one covariance descriptor $S_i$ per patch), injecting randomness during spatial sampling for robustness. Under two-fold cross-validation, we achieve 90.9% accuracy for the affine-invariant metric kernel and 98.7% for the log-Euclidean metric kernel. Thus we observe a substantial performance gain from using a Mercer kernel under the extended RBF-SVM. For DDD, the positive and negative class populations $\{S^+\}$ and $\{S^-\}$ are balanced, whereas they are strongly unbalanced for the colitis task. This imbalance is the main motivation for the mining procedure of Section 2.2. Examples of detection response maps for DDD are illustrated in Fig. 5.

Figure 5. Two examples of the radiologist's clicks of DDD spots on the mean intensity cortical shell unwrapped CT map [21], the color-coded groups of DDD lesions, and the final detection response map. More sophisticated classifiers are used to identify DDD lesions, following the initial detection shown above.

Figure 6. Examples of image level colitis detection of true positive detections (the first row) showing the original CT key image and its colitis response map, false positive detections (the second row left) and false negative detections (the second row right).

Forty-three key CT images extracted from 43 patients are used for the initial study of colitis detection. The mean wall thickness at colitis segments was 10.3 mm (range: 4.2-20.2 mm), whereas it was 2.3 mm (range: 1.2-3.2 mm) at normal colon segments. The image patch level classification accuracy is 94.1% in testing using the log-Euclidean metric (representing 68.8% sensitivity and 99.97% specificity), but the missed positive patches generally have noticeably lower overlapping ratios with the ground-truth colitis masks. The response map still shows good coverage of the colitis region and very few false positives (the first row of Fig. 6). When the affine-invariant metric is employed, the accuracy drops to 75.2%, even though the trained SVM model uses more than 2.5 times the number of support vectors. The per-image classification accuracy is 92.8% (versus 83.3% with the affine-invariant metric). For colitis patients, the sensitivity is 94.1% (16 out of 17). The missed case has been verified to have no contrast, a condition unseen in training (Fig. 6, the second row right). 23 out of 25 healthy subjects are classified correctly, for a specificity of 92%. The false positive case has another disease unseen in training (Fig. 6, the second row left). Fig. 7 demonstrates the image patch and key image level classification results, respectively.

[Figure 7 plots. Left panel: per-sample patch classification confidence, with legend entries Predict Label, Training Label and Confidence. Right panel: Per-Image ROC Curve, True Positive Rate versus False Positive Rate.]

Figure 7. The image patch classification confidence plot on a subset of training data samples (left), showing good classification accuracy at the per covariance matrix descriptor level using log-Euclidean RBF-SVM; the image level ROC curve of colitis detection (right).
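For reference, the per-image sensitivity and specificity quoted above follow directly from the reported counts (16 of 17 colitis patients detected, 23 of 25 healthy subjects correctly classified). The helper below is only an illustrative sketch; its name is hypothetical.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts."""
    return tp / (tp + fn), tn / (tn + fp)

# Counts taken from the per-image colitis results reported in the text.
sens, spec = sensitivity_specificity(tp=16, fn=1, tn=23, fp=2)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
# prints "sensitivity = 94.1%, specificity = 92.0%"
```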

4. CONCLUSION

In summary, a novel classification system using the covariance descriptor and a log-Euclidean kernel SVM has been presented to address two computer-aided detection problems. Promising quantitative results are obtained for 22 DDD patients and 43 colitis and control patients. Our system can provide important and useful information for further study on highly accurate computer-aided detection of various disease image patterns, by adaptive data sampling and descriptor training. For future work, we plan to exploit explicit dictionary learning and sparse coding principles coupled with the covariance descriptor for large, scalable medical image recognition.

5. ACKNOWLEDGMENTS

This research was supported by the Intramural Research Program of the NIH Clinical Center.

REFERENCES

1. O. Tuzel, F. Porikli, P. Meer: Region Covariance: A Fast Descriptor for Detection and Classification. European Conference on Computer Vision, 2006.
2. W. Förstner, B. Moonen: A Metric for Covariance Matrices. Tech. Report of the Department of Geodesy and Geoinformatics, Stuttgart University, 1999.
3. V. Arsigny, P. Fillard, X. Pennec, N. Ayache: Log-Euclidean metrics for fast and simple calculus on diffusion tensors. Magnetic Resonance in Medicine, 56(2): 411-421, 2006.
4. S. Jayasumana, R. Hartley, M. Salzmann, H. Li, M. Harandi: Kernel Methods on the Riemannian Manifold of Symmetric Positive Definite Matrices. IEEE Conference on Computer Vision and Pattern Recognition, Portland, USA, 2013.
5. S. Singh, A. Gupta, A.A. Efros: Unsupervised Discovery of Mid-Level Discriminative Patches. European Conference on Computer Vision, Firenze, Italy, 2012.
6. L. Lu, K. Toyama, G. Hager: A Two Level Approach for Scene Recognition. IEEE Conference on Computer Vision and Pattern Recognition, (1) 2005: 688-695.
7. Y. Zheng, A. Barbu, B. Georgescu, M. Scheuering, D. Comaniciu: Four-Chamber Heart Modeling and Automatic Segmentation for 3-D Cardiac CT Volumes Using Marginal Space Learning and Steerable Features. IEEE Trans. Med. Imaging 27(11): 1668-1681, 2008.
8. H. Ling, S. Kevin Zhou, Y. Zheng, B. Georgescu, M. Sühling, D. Comaniciu: Hierarchical, learning-based automatic liver segmentation. IEEE Conference on Computer Vision and Pattern Recognition, 2008.
9. J. Ma, L. Lu: Hierarchical segmentation and identification of thoracic vertebra using learning-based edge detection and coarse-to-fine deformable model. Journal of Computer Vision and Image Understanding, 117(9): 1072-1083, 2013.
10. O. Ecabert, J. Peters, H. Schramm, C. Lorenz, J. von Berg, M. Walker, M. Vembar, M. E. Olszewski, K. Subramanyan, G. Lavi, J. Weese: Automatic Model-Based Segmentation of the Heart in CT Images. IEEE Trans. Med. Imaging 27(9): 1189-1201, 2008.
11. Z. Tu: Probabilistic Boosting-Tree: Learning Discriminative Models for Classification, Recognition, and Clustering. IEEE International Conference on Computer Vision, 2005: 1589-1596.
12. P. Viola, M. Jones: Rapid Object Detection using a Boosted Cascade of Simple Features. IEEE CVPR, (1) 2001: 511-518.
13. J. Yao, H. Munoz, J. Burns, L. Lu, K. Kurdziel, R. Summers: Computer Aided Detection of Spinal Degenerative Osteophytes on Sodium Fluoride PET/CT. Computational Methods and Clinical Applications for Spine Imaging Workshop, MICCAI 2013.
14. W. Zhang, J. Liu, J. Yao, A. Louie, T. Nguyen, S. Wank, W. Nowinski, R. Summers: Mesenteric Vasculature-Guided Small Bowel Segmentation on 3-D CT. IEEE Trans. Med. Imaging 32(11): 2006-2021, 2013.
15. M. Juneja, A. Vedaldi, C. V. Jawahar, A. Zisserman: Blocks that Shout: Distinctive Parts for Scene Classification. IEEE Conference on Computer Vision and Pattern Recognition, 2013.
16. J. Henriques, J. Carreira, R. Caseiro, J. Batista: Beyond hard negative mining: efficient detector learning via block-circulant decomposition. IEEE Conference on Computer Vision and Pattern Recognition, 2013.
17. C. Doersch, A. Gupta, A. Efros: Mid-level visual element discovery as discriminative mode seeking. Neural Information Processing Systems 26, 494-502, Lake Tahoe, USA, 2013.
18. S. Lazebnik, C. Schmid, J. Ponce: Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. CVPR (2) 2006: 2169-2178.
19. Y. Tao, L. Lu, M. Dewan, A. Chen, J. Corso, J. Xuan, M. Salganicoff, A. Krishnan: Multi-level Ground Glass Nodule Detection and Segmentation in CT Lung Images. MICCAI (1) 2009: 715-723.
20. Y. Song, W. Cai, S. Eberl, M. Fulham, D. Feng: Discriminative Pathological Context Detection in Thoracic Images Based on Multi-level Inference. MICCAI (3) 2011: 191-198.
21. J. Yao, J. Burns, H. Munoz, R. Summers: Detection of Vertebral Body Fractures Based on Cortical Shell Unwrapping. MICCAI (3) 2012: 509-516.