Deep-learning Derived Features for Lung Nodule Classification with Limited Datasets

P. Thammasorn*a, W. Wub,c, L. A. Piercec, S. N. Pipavathc, P. D. Lamped, A. M. Houghtond, D. R. Haynorc, W. A. Chaovalitwongsea, P. E. Kinahanc

a Department of Industrial Engineering, University of Arkansas, Fayetteville, AR, USA 72701
b Department of Radiology, Tongji Medical College of Huazhong University of Science and Technology, Wuhan, China
c Department of Radiology, University of Washington, Seattle, WA, USA 98195
d Fred Hutchinson Cancer Research Center, Seattle, WA, USA 98109

ABSTRACT

Only a few percent of the indeterminate nodules found in lung CT images are cancerous. However, enabling earlier diagnosis is important to spare patients with benign nodules invasive procedures or long-term surveillance. We are evaluating a classification framework using radiomics features derived with a machine learning approach from a small data set of indeterminate CT lung nodule images. We used a retrospective analysis of 194 cases with pulmonary nodules in CT images, with or without contrast enhancement, from lung cancer screening clinics. The nodules were contoured by a radiologist and texture features of the lesion were calculated. In addition, semantic features describing shape were categorized. We also explored a Multiband network, a feature derivation path that uses a modified convolutional neural network (CNN) with a triplet network. This was trained to create discriminative feature representations useful for variable-sized nodule classification. The diagnostic accuracy was evaluated for multiple machine learning algorithms using texture, shape, and CNN features. In the CT contrast-enhanced group, the texture or semantic shape features yielded an overall diagnostic accuracy of 80%. Use of a standard deep learning network in the framework for feature derivation yielded features that substantially underperformed compared to texture and/or semantic features. However, the proposed Multiband approach to feature derivation produced results similar in diagnostic accuracy to the texture and semantic features. While the Multiband feature derivation approach did not outperform the texture and/or semantic features, its equivalent performance indicates promise for future improvements to increase diagnostic accuracy. Importantly, the Multiband approach adapts readily to lesions of different sizes without interpolation, and performed well with a relatively small amount of training data.

Keywords: Lung nodules, texture, radiomics, quantitative imaging, deep learning, early detection

1. INTRODUCTION

Lung cancer is the leading cause of cancer-related death1. Improvements in diagnosis at an early, potentially curable stage would have a major impact on human health. The National Lung Screening Trial (NLST) showed that low-dose screening CT for the appropriate patient population reduced death rates7. However, of the 4-12 mm indeterminate pulmonary nodules detected in the NLST, only 3.6% developed into cancer during the period of study. Clinical options for patients in whom small indeterminate nodules are found are typically either a biopsy of the lung nodule, if feasible, or continued surveillance by CT to monitor growth. Improved assessment of the probability that an otherwise indeterminate pulmonary nodule is cancerous would likely improve patient outcomes and reduce the cost of care. One approach is 'radiomics', the use of quantitative texture or shape features derived from the nodule in the CT image to provide diagnostic information2,3,4,5,6. This approach is motivated by the recognition that trained radiologists can estimate the likelihood of malignancy from the shape and texture of nodules in CT images. This approach is also of interest for prognosis or prediction of response to therapy for known lung cancers. In parallel with the growing development of radiomics has been the increased exploration of deep learning methods, which have been shown to exceed the performance of standard machine learning classification methods for some detection tasks9. There are an increasing number of studies applying these methods to radiology classification tasks, including the use of convolutional neural networks (CNNs)12,13.

Despite these successful applications, a downside of deep learning methods is the requirement for thousands or millions of images for proper training. In contrast, datasets for many medical imaging problems may contain only hundreds of images. In addition, these datasets are often unbalanced, which creates a strong risk of over-fitting the limited data. Thus, applying deep learning to the medical domain may require modifications to the methods used outside radiology. A prominent approach is deep representation learning, in which a deep learning architecture is used to extract suitable features instead of performing end-to-end detection. The learned features are then used in training for the target classification task. An example of work along this line is from Tajbakhsh et al.14, who used a deep network pre-trained for large-scale image classification as a feature extractor for medical image classification. While the approach is useful, many pre-trained networks still require images to be interpolated to the same fixed size used to train the network. This has the disadvantage that interpolation changes the noise correlations in the data14, and thus the utility of the classification is in question. In the case of limited imaging data, previous approaches have augmented the data by adding random noise and applying sets of affine transformations to create larger datasets with more variability. However, augmentation introduces the risk of generating invalid data. Another approach is a triplet network10,11, which learns discriminative features for later classification. For triplet networks, training is done using combinations of same-class and different-class instances expanded from the existing data. In this work, we use a representation learning approach to cope with the limited dataset size in a tumor classification problem. Additionally, we propose a further refinement of the feature-extraction network to cope with inputs of variable size and orientation.

2. METHODS

The workflow is outlined in Figure 1. Images are acquired via computed tomography (CT) with or without iodine-based contrast agents. These are volumetric images that cover large parts of the body. To narrow down the search region for analysis, sub-volume delineation is performed by expert radiologists, as illustrated in Figure 2.

[Figure 1 workflow blocks: image acquisition; image sub-volume delineation; analytical/semantic feature extraction; CNN-based feature definition (using training data); ML-based rule building; predictions on prognosis, response, or correlates.]

Figure 1. Two approaches for feature extraction or definition. The deep-learning feature derivation path uses a CNN trained to derive features (i.e. metrics) useful for nodule classification. ML = Machine Learning. CNN = convolutional neural network.

Figure 2. Examples of lung-nodule segmentation used in this study. Radiologist defined regions are shown in magenta.

After the sub-volumes are obtained, features are typically extracted from each sub-volume using analytical/semantic feature extraction. The feature data are then used to create classification rules. The alternative approach we are testing is to use deep learning methods to learn a suitable feature representation for classification, i.e., CNN-based feature definition. We follow an approach similar to Hoffer et al.10 and Balntas et al.11 in learning the features using a triplet network (Figure 3), which allows flexibility in controlling the input sizes and the number of parameters, and allows the feature extraction process to be modified to fit the nature of the input data. This flexible control over the architecture makes the approach useful for many types of training datasets, particularly when applying deep learning to limited and unbalanced data.

Figure 3. Overview of the Triplet Network Approach.

The triplet network architecture consists of an extractor network and a comparator network. The extractor network converts raw image data into a lower-dimensional feature vector. During training, the comparator network adjusts the feature values so that features from different classes become more distinguishable. Figure 4 illustrates the typical and the proposed deep network architectures for the tumor classification problem. In our application, the convolutional neural network (CNN) uses 3D filters in its convolutional layers, because the input data are volumetric tumor images from 3D medical imaging. With the typical CNN, the input must be interpolated to a fixed size before it is filtered, subsampled, and reduced to a vector output. As noted above, this has the disadvantage that interpolation changes the noise correlations in the data, and thus the utility of the classification may be impacted by the type of interpolation.
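To make the extractor/comparator split concrete, the following minimal sketch (written in PyTorch purely for illustration; the paper does not specify an implementation framework) embeds an anchor, a same-class, and a different-class sub-volume with one shared extractor and applies a max-margin (hinge) comparison of the resulting Euclidean distances. The stand-in extractor and the margin value are illustrative placeholders, not the authors' exact network.

```python
# Minimal triplet-network sketch: shared extractor + distance-based max-margin comparator.
# The extractor and margin here are placeholders for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

def triplet_margin_cost(extractor, anchor, same, different, margin=1.0):
    """Embed all three inputs with the shared extractor and penalize triplets where the
    same-class pair is not closer than the different-class pair by at least `margin`."""
    fa, fs, fd = extractor(anchor), extractor(same), extractor(different)
    d_pos = F.pairwise_distance(fa, fs)   # anchor vs. same-class example
    d_neg = F.pairwise_distance(fa, fd)   # anchor vs. different-class example
    return F.relu(d_pos - d_neg + margin).mean()

# Toy usage with a stand-in extractor; a real one would be the 3D CNN of Figure 4.
extractor = nn.Sequential(nn.Flatten(), nn.Linear(16 * 16 * 16, 42))
a, p, n = (torch.randn(8, 1, 16, 16, 16) for _ in range(3))
loss = triplet_margin_cost(extractor, a, p, n)
loss.backward()
```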

[Figure 4 panels: Standard (Std) convolutional neural network: image sub-volume interpolated to 40x40x40 -> (convolutional layer + subsampling/(max) pooling layer) x3 -> fully-connected layer -> feature output. Proposed Multiband (MB) convolutional neural network: variable-sized (non-interpolated) sub-volume -> Multiband processing -> (convolutional layer + subsampling/(max) pooling layer) x2 -> summarization -> fully-connected layer -> feature output.]

Figure 4. Illustration of the deep-learning approaches for feature derivation. Top: A standard convolutional neural network (CNN). Bottom: Our proposed Multiband approach for variable-sized image sub-volumes. The image sub-volume is the portion of the CT image containing the nodule, not the entire dataset. Note that there are repeated pairs of convolutional and subsampling/pooling layers, as described in the text.

To work with variably-sized lesions, we propose using 'Multiband' processing with 3D convolutional extractors. This modification is motivated by the Inception network15 and the Fully Convolutional Network (FCN)16. Neither approach requires the input image to be a fixed size. In our Multiband approach, we assume that smaller images contain a more limited range of frequencies, so repeated convolution operations produce monotonic results more quickly than they do for larger images. Input images are sent to different sets (called 'depths') of convolutional layers with an increasing number of repeated convolution operations. The images from each depth are sent to two separate processes. The first process is a summarization that reduces the output images to a single feature value per image channel. A log-sum-exponent function (a differentiable approximation to the maximum function17) is used in the summarization process; it can be seen as a compromise between taking the maximum and the average of the values in each image channel. The second process, also shown in Figure 4, is a stacking of the images from the different depths into one multi-channel image that is sent to the next convolutional layer followed by a pooling/subsampling layer. After repeated operations, each remaining image is summarized into one feature value per channel, instead of being flattened into a vector as in a typical CNN. At the end of the extraction process, with its Multiband processing blocks and convolutional layers, the feature values from all the summarization steps are concatenated into a feature vector and input to a fully-connected layer to create a feature representation of the input image. This summarization approach provides a way to produce a consistent number of candidate feature values regardless of the input sub-volume size. The parameters for each step in the architecture are listed in Figure 5.
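As a rough illustration of the two key ideas (bands of increasing convolution depth, and per-channel log-sum-exponent summarization), the sketch below (again in PyTorch, assumed only for illustration) produces a fixed-length feature vector from sub-volumes of arbitrary size. Filter counts, band depths, and kernel sizes are placeholders rather than the parameters of Figure 5, and the stacking branch that feeds later convolutional layers is omitted for brevity.

```python
# Hedged sketch of the Multiband idea: bands with increasing numbers of repeated 3D
# convolutions, each summarized per channel with log-sum-exp (a smooth maximum), so the
# feature count is fixed regardless of the input sub-volume size. Parameters are illustrative.
import torch
import torch.nn as nn

def channel_summarize(x):
    """Reduce (batch, C, D, H, W) to (batch, C) with log-sum-exp over all voxels."""
    return torch.logsumexp(x.flatten(2), dim=2)

class MultibandExtractor(nn.Module):
    def __init__(self, in_ch=1, ch=8, n_bands=3, out_dim=42):
        super().__init__()
        # Band k applies k stacked convolutions (deeper bands see broader context).
        self.bands = nn.ModuleList()
        for depth in range(1, n_bands + 1):
            layers, c = [], in_ch
            for _ in range(depth):
                layers += [nn.Conv3d(c, ch, kernel_size=3, padding=1), nn.ReLU()]
                c = ch
            self.bands.append(nn.Sequential(*layers))
        self.fc = nn.Linear(n_bands * ch, out_dim)

    def forward(self, x):                        # x: (batch, 1, D, H, W), any D/H/W
        summaries = [channel_summarize(band(x)) for band in self.bands]
        return self.fc(torch.cat(summaries, dim=1))

# Works for different sub-volume sizes without interpolation:
net = MultibandExtractor()
for size in [(24, 24, 24), (31, 40, 28)]:
    print(net(torch.randn(2, 1, *size)).shape)   # torch.Size([2, 42]) in both cases
```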

Figure 5. Parameter details used in this study.

For the standard approach (top row of Figure 5), the lesion sub-volume image is interpolated to 40x40x40 voxels, and the network consists of 3 repeated pairs of convolutional and subsampling/pooling layers. These are followed by a fully-connected layer. Each convolutional layer uses 5x5x5 filters before max-pooling. The network uses 16, 16, and 8 filter channels in the three convolutional layers, respectively. Finally, the fully-connected layer takes 1000 input values from the previous layers and outputs 42 feature values (similar to the number of texture features). For the triplet network training, the same CNN is connected to the comparator network. The Multiband approach (bottom row of Figure 5) applies a series of 5 bands, where the number of sequential filters is the depth or band number, i.e., 5 sequential filters for a depth of 5. As the size of the lesion decreases, the number of bands with unique information decreases, allowing feature values to be derived with an appropriate number of filters for different lesion sizes. Outputs from the bands are sent to the paired convolutional and pooling layer network described above. The number and size of the filters in each step varies as shown in Figure 5. The values from the Multiband and final pooling layers are also sent to a summarization step. For the summarization step we applied a log-sum-exponent function (i.e., a differentiable approximation to the maximum value function) to each channel output. All image channels from the Multiband processing and convolutional layers are summarized into 84 values, which are then used as input to the final fully-connected layer and reduced to 42 values. These values are assembled into a vector and sent to the comparator network.
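For concreteness, a sketch of the standard path as we read the description above is given below: three pairs of 5x5x5 convolution and 2x max-pooling with 16, 16, and 8 channels, acting on a 40x40x40 input so that 40 -> 20 -> 10 -> 5 and 5x5x5x8 = 1000 values enter the fully-connected layer. The 'same' padding and ReLU activations are assumptions, since the paper does not state them.

```python
# Sketch of the standard (interpolated-input) CNN path as described in the text.
# Padding/activation choices are assumptions; dimensions follow the stated 1000 -> 42.
import torch
import torch.nn as nn

standard_cnn = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool3d(2),   # 40 -> 20
    nn.Conv3d(16, 16, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool3d(2),  # 20 -> 10
    nn.Conv3d(16, 8, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool3d(2),   # 10 -> 5
    nn.Flatten(),                       # 8 channels * 5*5*5 voxels = 1000 values
    nn.Linear(1000, 42),                # 42 feature values, similar to the texture count
)

features = standard_cnn(torch.randn(4, 1, 40, 40, 40))   # interpolated sub-volumes
print(features.shape)                                     # torch.Size([4, 42])
```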

Both of the above architectures were trained with the comparator network in the triplet-network manner11, using a max-margin cost function with the margin value set to 1000. A summary of all the different methods implied by Figure 1 is given in Table 1. Machine learning algorithms were evaluated using 10-fold cross-validation to estimate the diagnostic accuracy of the radiomics features in classifying NSCLC cases. The diagnostic accuracy was calculated as (TP+TN)/(P+N) from the 2x2 confusion tables generated by the machine learning classification rules. We applied these methods in a retrospective analysis of cases referred to our lung cancer early detection and prevention clinic to evaluate the utility of this approach.

Table 1. Methods used to select the features used as input to the machine learning algorithms.

Texture: Standard Haralick texture features as implemented in the PORTS library [8].
Shape: Categorized descriptors provided by experienced radiologists.
Texture + Shape: Texture and shape features combined.
Deep Learning Std: Deep learning convolutional neural network (CNN) used to derive features, with 3D images interpolated to a standard size of 40x40x40 voxels.
Deep Learning MB: Deep learning CNN used to select features, with a tiered multiband (MB) approach used in the reduction layer to avoid image interpolation.
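As a hedged sketch of the evaluation step (not the authors' exact pipeline), the snippet below estimates diagnostic accuracy, equivalent to (TP+TN)/(P+N) averaged over folds, with 10-fold cross-validation in scikit-learn. The random-forest classifier, feature matrix, and labels are placeholders for whichever feature set and algorithm from Tables 1 and 3 is being evaluated.

```python
# Hedged sketch of the evaluation: 10-fold cross-validated diagnostic accuracy,
# i.e. (TP + TN) / (P + N) averaged over folds. Classifier and data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(194, 42))          # e.g., 42 derived feature values per case
y = rng.integers(0, 2, size=194)        # 1 = NSCLC, 0 = benign control (placeholder labels)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(RandomForestClassifier(n_estimators=200, random_state=0),
                         X, y, cv=cv, scoring="accuracy")
print(f"10-fold diagnostic accuracy: {scores.mean():.1%} +/- {scores.std():.1%}")
```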

3. DATA

This was a retrospective study with 103 pathology-confirmed non-small cell lung cancer (NSCLC) cases and 91 control cases (Table 2). The diagnostic criteria were as follows: 1) NSCLCs (non-small cell lung cancer): a definitive diagnosis of the lesion was confirmed by histopathologic examination of tissue obtained via surgical resection of the lesion, or by cytopathological examination of biopsy samples, a mediastinal lymph node, or a distant metastatic lesion. 2) Controls (benign): a definitive benign lesion was confirmed by histopathologic examination of tissue obtained via surgical resection of the lesion or by cytopathological examination of biopsy samples, or the lesion was found to be stable radiographically over at least 2 years of follow-up or resolved under CT surveillance.

Table 2. Patient cases and CT imaging contrast method.

                              NSCLC   Controls   Total
With contrast enhancement       52        40        92
Without contrast enhancement    51        51       102
Total                          103        91       194

The CT scans were clinical acquisitions collected at multiple centers with non-uniform protocols across patients, including standard non-contrast CT of the thorax (without injection of any intravenous contrast agent) and intravenous contrast-enhanced CT. There were 92 non-contrast and 102 contrast-enhanced cases. For each patient, only the largest nodule confirmed by histopathologic examination of tissue obtained via surgical resection, or by cytopathological examination of biopsy samples, was chosen for further analysis. Contours defining the nodule region were manually delineated slice-by-slice on the transaxial CT images by a chest radiologist using the commercial software MIM (MIM Software Inc., Cleveland, OH) (Figure 2). All the segmented images and the corresponding image masks were read into Matlab 2016a (MathWorks, Natick, MA, USA), and the PET Oncology Radiomics Test Suite (PORTS) feature extraction algorithm was applied. In total, 41 quantitative radiomics texture features, covering the gray-level co-occurrence matrix (GLCM) and the gray-level histogram, were automatically derived from the segmented regions. In addition, two radiologists who were blinded to clinical and histologic findings evaluated consensus categorical shape features, including the following aspects: the maximum long-axis and short-axis diameters and the heights, and the visually determined shape and density type of the nodule (shape characteristics including smooth, irregular, scalloped, spiculated, and corona radiata; density descriptions including solid, semi-solid, ground glass, cavitary, and pseudo-cavitary). In total, 13 shape features from the CT images were extracted for the analysis.
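The PORTS texture calculation itself is a Matlab implementation; as a loose, hedged illustration of the GLCM portion of such features, the Python sketch below uses scikit-image (graycomatrix/graycoprops, the names used in scikit-image 0.19 and later). The HU windowing, 32-level quantization, offsets, and the handful of properties shown are illustrative choices and do not reproduce the 41-feature PORTS set.

```python
# Hedged sketch of GLCM-based texture features on a masked 2D CT slice using scikit-image.
# This is not the PORTS implementation; quantization and offsets are illustrative only.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(slice_hu, mask, levels=32, hu_window=(-1000.0, 400.0)):
    """Quantize HU values inside the nodule mask and compute a few GLCM properties."""
    lo, hi = hu_window
    q = np.clip((slice_hu - lo) / (hi - lo), 0, 1)            # window and normalize
    q = (q * (levels - 1)).astype(np.uint8)
    q[~mask] = 0                                               # crude masking for illustration
    glcm = graycomatrix(q, distances=[1], angles=[0, np.pi / 2],
                        levels=levels, symmetric=True, normed=True)
    return {p: float(graycoprops(glcm, p).mean())
            for p in ("contrast", "homogeneity", "energy", "correlation")}

# Toy usage with a synthetic slice and a circular nodule mask.
slice_hu = np.random.uniform(-1000, 400, size=(64, 64))
yy, xx = np.mgrid[:64, :64]
mask = (yy - 32) ** 2 + (xx - 32) ** 2 < 15 ** 2
print(glcm_features(slice_hu, mask))
```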

4. RESULTS

The results are summarized in Table 3. In the CT contrast-enhanced group, the texture and/or semantic features yielded a maximum overall diagnostic accuracy of roughly 80% (depending on the combination of methods) in classifying NSCLCs vs. controls. In the non-contrast group, this dropped to roughly 73%, again depending on the combination of methods. We note that simply combining different kinds of features does not, in general, improve diagnostic accuracy. This suggests that there is no single feature set suitable for all scenarios.

Table 3. Diagnostic accuracy (%) of different combinations of feature selection and machine learning algorithms for the two categories of CT images, without and with contrast enhancement. Numbers shown in bold are the top performing algorithms in each of the two categories, as discussed in the text.

Feature Method      nF | CT images without contrast enhancement | CT images with contrast enhancement
                       | CTr    BM     K-NN   RF     ALR        | CTr    BM     K-NN   RF     ALR
Texture             41 | 54.9%  61.8%  55.9%  57.8%  55.9%      | 65.9%  69.2%  74.7%  81.3%  73.6%
Shape               15 | 73.5%  -      65.7%  72.5%  67.6%      | 70.3%  -      69.2%  78.0%  71.4%
Texture + Shape     55 | 64.7%  -      55.9%  59.8%  65.7%      | 68.1%  -      74.7%  80.2%  76.9%
Deep Learning Std   42 | 60.8%  49.0%  59.8%  57.8%  60.8%      | 68.1%  72.6%  75.8%  72.6%  68.1%
Deep Learning MB    42 | 65.7%  66.7%  67.6%  71.6%  70.6%      | 80.2%  80.2%  72.5%  79.1%  72.5%

Machine learning algorithm (Alg): CTr = Classification Tree, BM = Bayesian Model, K-NN = K-Nearest Neighbor, RF = Random Forest, ALR = Adaptive Logistic Regression. Other abbreviations: nF = number of features used.

The proposed Multiband CNN performed on par with the semantic features in the non-contrast group and with the texture features in the contrast group, achieving maximum diagnostic accuracies of 71% and 80%, respectively. Even though the maximum accuracies are slightly worse than the texture and shape baselines, the results from the modified CNN tend to be more consistent across the different machine learning algorithms and are better than those of texture and shape on average. On the other hand, features from the standard CNN underperformed on the classification task compared to the baselines, yielding maximum accuracies of 60% and 75% in the non-contrast and contrast groups, respectively. The underperformance may be due to the interpolation to a fixed-size volume and an inadequate amount of training data for end-to-end classification.

5. DISCUSSION

The use of a standard deep learning network (i.e., with interpolation) as a feature derivation approach substantially underperformed relative to the diagnostic accuracy of the analytical texture and/or semantic features. However, the proposed Multiband approach to feature derivation produced results similar in diagnostic accuracy to the texture and/or semantic features. While the Multiband feature selection approach did not outperform the texture and/or semantic features in terms of maximum accuracy, its improvement across a range of classifiers and its equivalent maximum performance with a relatively small amount of data indicate promise for future improvements to increase diagnostic accuracy. Importantly, the Multiband approach adapts readily to 3D objects of different sizes without using interpolation. Consistent performance compared to the analytical and expert-defined baseline results also suggests that the method performs well across different conditions. For the indeterminate nodules or masses found in lung cancer screening and pulmonary nodule clinics, the combined use of shape and texture features from contrast-enhanced and non-enhanced CT images has a diagnostic accuracy that warrants further study to evaluate the potential to impact patient management. Additional information, such as plasma biomarkers, may also improve diagnostic accuracy.

6. ACKNOWLEDGMENTS

This work was supported in part by NIH Grants U01CA185097 and U01CA148131.

REFERENCES

[1] R. L. Siegel, et al., "Cancer statistics, 2016," CA: A Cancer Journal for Clinicians, vol. 66, no. 1, pp. 7-30, Jan. (2016).
[2] S. Hawkins, et al., "Predicting Malignant Nodules from Screening CT Scans," J Thorac Oncol, vol. 11, pp. 2120-2128, Dec. (2016).
[3] B. W. Carter, et al., "Predicting Malignant Nodules from Screening CTs," J Thorac Oncol, vol. 11, no. 12, pp. 2045-2047, Dec. (2016).
[4] M. B. Schabath, et al., "Differences in Patient Outcomes of Prevalence, Interval, and Screen-Detected Lung Cancers in the CT Arm of the National Lung Screening Trial," PLoS ONE, vol. 11, no. 8, p. e0159880, (2016).
[5] H. J. W. L. Aerts, et al., "Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach," Nat Commun, vol. 5, p. 4006, Jan. (2014).
[6] P. Lambin, et al., "Radiomics: extracting more information from medical images using advanced feature analysis," Eur. J. Cancer, vol. 48, no. 4, pp. 441-446, Mar. (2012).
[7] D. R. Aberle, et al., "Reduced lung-cancer mortality with low-dose computed tomographic screening," N Engl J Med, vol. 365, no. 5, pp. 395-409, Aug. (2011).
[8] M. J. Nyflot, F. Yang, D. Byrd, S. R. Bowen, G. A. Sandison, and P. E. Kinahan, "Quantitative radiomics: impact of stochastic effects on textural feature analysis implies the need for standards," J. Med. Imag., vol. 2, no. 4, pp. 1-13, Oct. (2015).
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, pp. 1097-1105, (2012).
[10] E. Hoffer and N. Ailon, "Deep metric learning using triplet network," International Workshop on Similarity-Based Pattern Recognition, pp. 84-92, Springer, (2015).
[11] V. Balntas, E. Riba, D. Ponsa, and K. Mikolajczyk, "Learning local feature descriptors with triplets and shallow convolutional neural networks," British Machine Vision Conference, vol. 1, no. 2, p. 3, (2016).
[12] Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen, "Medical image classification with convolutional neural network," 13th International Conference on Control Automation Robotics & Vision (ICARCV), pp. 844-848, IEEE, Dec. (2014).
[13] H. R. Roth, L. Lu, A. Seff, K. M. Cherry, J. Hoffman, S. Wang, J. Liu, E. Turkbey, and R. M. Summers, "A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations," International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 520-527, Springer, Cham, Sept. (2014).
[14] N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang, "Convolutional neural networks for medical image analysis: Full training or fine tuning?," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1299-1312, (2016).
[15] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi, "Inception-v4, Inception-ResNet and the impact of residual connections on learning," arXiv:1602.07261 [cs.CV], (2016).
[16] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431-3440, (2015).
[17] P. O. Pinheiro and R. Collobert, "From image-level to pixel-level labeling with convolutional networks," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1713-1721, (2015).