Mass detection in mammograms using pre-trained deep learning models

Richa Agarwal¹, Oliver Diaz¹, Xavier Lladó¹, Robert Martí¹

¹ VICOROB Research Institute, Universitat de Girona, Girona, Spain
{richa.agarwal, oliver.diaz, xavier.llado, robert.marti}@udg.edu

ABSTRACT. Mammography is the gold standard imaging modality widely used for breast cancer screening. With recent advances in the field of deep learning, the use of deep convolutional neural networks (CNNs) in medical image analysis has become very encouraging. The aim of this study is to exploit CNNs for mass detection in mammograms using pre-trained networks. We use the resnet-50 CNN architecture, pre-trained with the ImageNet database, to perform mass detection on two publicly available image datasets: CBIS-DDSM and INbreast. We demonstrate that a CNN model pre-trained on a natural image database (ImageNet) can be effectively fine-tuned to yield better results than randomly initialized models. Further, the benefit of applying transfer learning to a smaller dataset is demonstrated by using the best model obtained from CBIS-DDSM training to fine-tune on the INbreast database. We analyze the adaptability of fine-tuning only the CNN's last fully connected (FC) layer versus all its convolutional layers for detecting masses. The results show a testing accuracy of 0.92 and an area under the receiver operating characteristic curve (AUC) of 0.98 for the model fine-tuned on all convolutional layers, compared to a testing accuracy of 0.86 and an AUC of 0.93 when the model is trained only on the last FC layer.

Keywords: Mammogram, deep learning, breast cancer, mass detection, resnet-50, transfer learning

1. INTRODUCTION

Breast cancer is the most common cancer in women. As shown in the World Health Organization's cancer report1, of all cancers diagnosed in women worldwide, 25.1% developed in the breast, and 14.7% of the cases led to death. In the USA, approximately 12.4% of women are expected to be diagnosed with breast cancer during their lifetime2. There is therefore a strong social interest in continuing to investigate this disease. Furthermore, breast cancer survival depends on the stage at which the cancer is diagnosed: the sooner it is detected, the better.

X-ray mammography is the gold standard imaging modality used for breast cancer screening. It is a rapid, relatively cheap and effective modality for screening a large population. However, it has limitations in detecting lesions in women with dense breasts, which reduces its sensitivity and results in a higher number of missed cancers3. Recently, computer-aided detection or diagnosis (CAD) systems have been gaining importance in the field due to their high accuracy in diagnosis tasks. CAD systems can analyze a mammogram and enhance suspicious areas which may be relevant to radiologists.

LeCun et al.4 define deep learning as learning methods with multiple levels of representation, obtained by composing simple but non-linear models that transform the representation at one level (starting with the raw input) into a representation at a higher level. In recent years, deep learning has gained a lot of interest in various fields including object detection5-7, image recognition8-11, natural language processing12,13 and speech recognition14,15. Generally, training deep convolutional neural networks (CNNs) requires a large number of annotated samples to avoid overfitting to the training dataset. Researchers frequently address this issue with transfer learning, i.e. retraining (fine-tuning) publicly available models pre-trained on large databases in combination with smaller datasets16. The lack of large publicly available datasets has restricted the exploitation of deep learning in medical imaging. In spite of this, some researchers in the medical field have exploited these approaches, especially for breast screening mammography17-21. Traditionally, CAD systems for mammograms relied on hand-engineered features, but with the advancement of deep learning methods and high-performance Graphics Processing Units (GPUs), deep learning-based CAD systems can automatically learn which features are most relevant for an accurate diagnosis. In this line, a large-scale challenge (the DREAM Challenge, https://www.synapse.org/#!Synapse:syn4224222/wiki/401743, accessed 26/12/2017) was recently organized to develop algorithms for breast cancer diagnosis using a large database of digital mammograms, which attracted attention from many researchers in the breast cancer community.

The purpose of this work is to exploit deep learning frameworks for the mass detection problem in mammography and to compare the model's performance when transferring learning from digitized screen-film to full-field digital mammograms (FFDM). We focus on investigating the effect of weight initialization on the resnet-5022 CNN by testing different patch configurations generated from two publicly available datasets: CBIS-DDSM23 and INbreast24. The aim of this work is two-fold: (1) to show that fine-tuning resnet-50 pre-trained on the ImageNet dataset can be used to perform mass detection in a mammogram dataset; and (2) to show that the model fine-tuned in the digitized domain can also be used to detect masses in the digital image dataset and then be further fine-tuned to perform better detection.

The paper is organized as follows: in Section 2, the methodology for extracting input patches is presented, followed by a brief description of the CNN network used and its training strategy. In Section 3, we describe the two datasets used, followed by the experiments carried out to evaluate the methodology. The paper finishes with the discussion and conclusions.

2. METHODOLOGY

In this section, we describe the methodology used for generating input patches and the strategy to train the CNN network. We first convert all DICOM mammograms into PNG image format and resize them from their original resolution to a fixed image size of 1152×896 pixels using bilinear interpolation, to reduce the computational time. We then prepare a patch-level dataset containing patches of size 224×224 pixels, the standard input size for the resnet-50 CNN. The approach used for training is similar to that presented by Li Shen25, where the author evaluated the performance of different networks for breast cancer diagnosis by first training a patch classifier and then using it for whole-mammogram classification. In this work, we concentrate on evaluating the effect of different weight initialization strategies on the resnet-50 CNN with the CBIS-DDSM dataset, and on then fine-tuning the trained CNN on the smaller INbreast dataset. Further, we use a different patch extraction strategy and modify the network to produce two-class outputs, i.e. positive (containing mass) and negative (normal).

2.1. Input patch extraction

Since the CBIS-DDSM23 database is composed of scanned screen-film images, a segmentation step using Otsu's method26 is used to differentiate between breast and non-breast tissue areas. The INbreast images are fully digital, so no such pre-processing is required. A sliding window approach is then followed to extract all the possible patches from the image, where the total number is controlled by the stride (s × s), which also determines the overlap between adjacent patches. A patch is labelled as positive (patch containing mass) if its central pixel lies inside an annotated lesion area; otherwise it is labelled as negative (patch without mass).
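The following minimal sketch illustrates this patch extraction procedure; the function and variable names (extract_patches, lesion_mask) are illustrative and not taken from the original code.

```python
import numpy as np

PATCH = 224   # patch side in pixels, the standard resnet-50 input size
STRIDE = 28   # stride s used in the experiments (see Section 3.3)

def extract_patches(image, lesion_mask, patch=PATCH, stride=STRIDE):
    """Slide a patch x patch window over a resized mammogram.

    A patch is labelled positive (1) if its central pixel falls inside
    an annotated lesion in lesion_mask, otherwise negative (0).
    """
    patches, labels = [], []
    h, w = image.shape
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            cy, cx = y + patch // 2, x + patch // 2
            patches.append(image[y:y + patch, x:x + patch])
            labels.append(int(lesion_mask[cy, cx] > 0))
    return np.array(patches), np.array(labels)
```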

2.2. CNN network: Resnet-50

The resnet-5022 CNN architecture consists of convolutional layers, pooling layers and multiple residual layers, each containing several bottleneck blocks (a stack of three convolutional layers). The resnet-50 structure has 4 residual layers comprising 3, 4, 6 and 3 bottleneck blocks, respectively, from bottom to top. The input to resnet-50 is an image, i.e. a patch, of size 224×224×3 pixels, where the third dimension represents the red, green and blue color channels of the pre-trained model. Since a mammogram patch contains only one channel (gray level), the three color channels are filled with the same information.
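A sketch of how such a two-class patch classifier could be assembled in Keras-2 follows; the exact classification head is not detailed in the text, so a global average pooling plus a single softmax FC layer is assumed here.

```python
import numpy as np
from keras.applications.resnet50 import ResNet50
from keras.layers import Dense, GlobalAveragePooling2D
from keras.models import Model

# Pre-trained resnet-50 backbone without the 1000-class ImageNet head.
base = ResNet50(weights='imagenet', include_top=False,
                input_shape=(224, 224, 3))
features = GlobalAveragePooling2D()(base.output)
output = Dense(2, activation='softmax')(features)  # positive vs negative patch
model = Model(inputs=base.input, outputs=output)

# Gray-level patches are replicated across the three color channels so
# they match the RGB input expected by the pre-trained weights.
def to_rgb(gray_patches):
    # (N, 224, 224) -> (N, 224, 224, 3)
    return np.repeat(gray_patches[..., np.newaxis], 3, axis=-1)
```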

2.3. CNN training and testing

To optimize the weights of the CNN, the training data is split into training and validation sets. The training set is used to train the CNN and update its weights accordingly, while the validation set is used to measure the performance of the trained CNN after each epoch (one pass of the CNN over the entire training set). For training, all the positive patches are generated from an image along with an equal number of negative patches randomly selected from the same image. Keras27 is used as the deep learning framework in this work. We use Adam28 as the optimizer and set the batch size to 128. Early stopping based on the validation loss is employed, with a patience of 10 epochs.

For random weight initialization, the CNN is trained for a maximum of 100 epochs with a learning rate of 10⁻³, while for ImageNet weight initialization the CNN is trained for a maximum of 50 epochs with a learning rate of 10⁻⁴. For fine-tuning the model trained on the CBIS-DDSM dataset on the INbreast dataset, a lower learning rate of 10⁻⁵ is used. This is done in order to update the weights in smaller steps, as the model has already been trained on similar images. Data augmentation, a common technique in deep learning, is used to generate more samples from the existing data. In this work, the data samples are augmented on-the-fly using horizontal flipping, rotation of up to 30°, and an additional rescaling by a factor chosen randomly between 0.75 and 1.25, for all the training samples, both positive and negative. Once the CNN model is trained, it is used to detect masses in the testing images (from different patients). For each image, input patches are generated using the sliding window approach described in Section 2.1.

2.4. Computational environment

The computations are performed on a Linux workstation with 12 CPU cores and a single NVIDIA Titan X Pascal GPU with 12 GB of memory. The deep learning framework used is Keras-2 with Tensorflow as backend.
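A minimal sketch of the training configuration of Section 2.3, assuming the `model` from the previous sketch and NumPy arrays X_train, y_train, X_val, y_val of RGB patches and one-hot labels (all placeholder names); the random rescaling is approximated here with Keras' zoom_range.

```python
from keras.callbacks import EarlyStopping
from keras.optimizers import Adam
from keras.preprocessing.image import ImageDataGenerator

# Learning rate: 1e-3 for random initialization, 1e-4 for ImageNet
# initialization, 1e-5 when fine-tuning on INbreast (Section 2.3).
model.compile(optimizer=Adam(lr=1e-4),
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# On-the-fly augmentation: horizontal flips, rotations up to 30 degrees
# and random rescaling between 0.75 and 1.25.
augmenter = ImageDataGenerator(horizontal_flip=True,
                               rotation_range=30,
                               zoom_range=(0.75, 1.25))

# Early stopping on the validation loss with a patience of 10 epochs.
early_stop = EarlyStopping(monitor='val_loss', patience=10)

model.fit_generator(augmenter.flow(X_train, y_train, batch_size=128),
                    steps_per_epoch=len(X_train) // 128,
                    epochs=50,
                    validation_data=(X_val, y_val),
                    callbacks=[early_stop])
```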

3. RESULTS

3.1. Datasets

3.1.1. CBIS-DDSM

The Digital Database for Screening Mammography (DDSM)29 contains digitized images from scanned films compressed with lossless JPEG encoding. In this work, we use a more recent version of the database, CBIS-DDSM23, containing a subset of the DDSM images in the standard DICOM format. The database was downloaded on October 10, 2017 from the CBIS-DDSM website (https://wiki.cancerimagingarchive.net/display/Public/CBIS-DDSM) and contains 3,061 mammograms of 1,597 patients. It includes the craniocaudal (CC) and mediolateral oblique (MLO) views for most of the screened breasts. The CBIS-DDSM database contains pixel-level annotations for the regions of interest (ROIs) and their pathology-confirmed labels: benign or malignant. There is a total of 1,698 confirmed masses in 1,592 images from 891 patients. The total numbers of images in the training and testing sets are 1,231 (77%) and 361 (23%), respectively. Here the testing set is also used as the validation set.

3.1.2. INbreast

INbreast24 contains FFDM images in the standard DICOM format acquired from 115 patients, with CC and MLO views of the breast. Of these, 90 patients have images of each view of both breasts, while the remaining 25 patients have two views of only one breast, leading to a total of 410 images. There is a total of 116 masses in 107 images from 50 patients. Further, to test the performance of the network, all the images without mass from the dataset are included in the testing set. The total numbers of images in the training, validation and testing sets are 73, 12 and 325, respectively. In this work, the testing set, composed of 7% images with mass and 93% images without any mass, is used to evaluate the trained network and to analyze the false positive detections made by the network.

3.2. Evaluation metric

The evaluation metrics used in this work are the testing accuracy of the model and the receiver operating characteristic (ROC) curve score. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR):

TPR = TP / (TP + FN),   FPR = FP / (FP + TN),

where TP is the number of patches correctly identified as mass (true positives), FN the patches incorrectly identified as negatives (false negatives), FP the patches incorrectly identified as mass (false positives) and TN the negative patches correctly identified as negatives (true negatives). The area under the ROC curve (AUC) quantifies how well the CNN can distinguish between a mass patch and a normal patch. Bear in mind that during testing the number of negative patches is much larger than the number of positive patches (40:1), so the accuracy and AUC measures can be biased towards the majority class, i.e. negatives. Therefore, to evaluate the performance of the network for mass detection, the confusion matrix (at a prediction threshold of 0.5), including the numbers of TP, FP, FN and TN, is also shown.
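As an illustration, these metrics could be computed with scikit-learn as sketched below, assuming probs holds the predicted mass probability for each test patch and y_true the ground-truth patch labels (1 = mass, 0 = normal); both variable names are placeholders.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

preds = (probs >= 0.5).astype(int)      # prediction threshold of 0.5
accuracy = accuracy_score(y_true, preds)
auc = roc_auc_score(y_true, probs)      # area under the ROC curve
tn, fp, fn, tp = confusion_matrix(y_true, preds).ravel()
tpr = tp / float(tp + fn)               # true positive rate
fpr = fp / float(fp + tn)               # false positive rate
```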


3.3. Experiment details

In all experiments, the input patches for training the CNN are generated using a stride (s) of 28 pixels (i.e. 28 × 28). The stride value was selected as a trade-off between the computational requirements and a sufficiently large number of training samples. This resulted in a total of 60,000 patches for the CBIS-DDSM dataset and 6,000 patches for the INbreast dataset.

3.3.1. Experiment #1

The CNN is trained on the CBIS-DDSM dataset using the pre-trained weights obtained from the ImageNet database and compared against the randomly initialized network. Table 1 shows the accuracy and AUC obtained.

Table 1: Experiment #1. Results for CBIS-DDSM using different weight initializations for resnet-50.

Dataset      Weights    Testing accuracy on CBIS-DDSM    AUC
CBIS-DDSM    random     0.75                             0.88
CBIS-DDSM    ImageNet   0.84                             0.92
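In Keras, the two initializations compared in this experiment amount to a one-argument change when building the backbone; a brief sketch, assuming the two-class head from Section 2.2 is attached in both cases:

```python
from keras.applications.resnet50 import ResNet50

# Random initialization: trained for up to 100 epochs with lr = 1e-3.
random_base = ResNet50(weights=None, include_top=False,
                       input_shape=(224, 224, 3))

# ImageNet initialization: trained for up to 50 epochs with lr = 1e-4.
imagenet_base = ResNet50(weights='imagenet', include_top=False,
                         input_shape=(224, 224, 3))
```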

3.3.2. Experiment #2

Since both INbreast and CBIS-DDSM are mammographic datasets, the feature space of the CNN learned on one may also be relevant to the other. We therefore fine-tuned the resnet-50 CNN on the INbreast dataset using the weights from the CBIS-DDSM (ImageNet-initialized) training. First, we test the INbreast dataset directly with the CBIS-DDSM-trained model without any fine-tuning. We then fine-tune the trained model with the INbreast dataset in two steps: (1) training only the last fully connected (FC) layer while freezing all convolutional layers, and (2) training all convolutional layers along with the FC layer. The results are presented in Table 2 and the ROC curves are depicted in Figure 1; a sketch of the two-step fine-tuning is given after Figure 1.

Table 2. Experiment #2. Results for transfer learning from CBIS-DDSM to INbreast (total positives: 231, total negatives: 10,663).

Training     Testing accuracy on INbreast    AUC     TP     FN     TN       FP
CBIS-DDSM    0.81                            0.81    149    82     8,719    1,944
FC layer     0.86                            0.93    186    45     9,233    1,430
All layers   0.92                            0.98    214    17     9,820    843

Figure 1. ROC curve analysis on the INbreast test set with the three different trained models.
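A minimal sketch of the two-step fine-tuning, assuming model is the CBIS-DDSM-trained patch classifier from the earlier sketches (layer indices follow that sketch, not the authors' code):

```python
from keras.optimizers import Adam

# Step 1: freeze all layers except the last FC layer and fine-tune it
# on INbreast with the lower learning rate of 1e-5 (Section 2.3).
for layer in model.layers[:-1]:
    layer.trainable = False
model.compile(optimizer=Adam(lr=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit_generator(...) on INbreast patches

# Step 2: unfreeze everything and fine-tune all convolutional layers
# along with the FC layer (recompiling applies the new trainable flags).
for layer in model.layers:
    layer.trainable = True
model.compile(optimizer=Adam(lr=1e-5),
              loss='categorical_crossentropy', metrics=['accuracy'])
# model.fit_generator(...) again on INbreast patches
```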

4. DISCUSSION AND CONCLUSIONS

In this work, we presented a study investigating the effects of different weight initializations on the resnet-50 CNN. The results of training the resnet-50 CNN in Experiment #1 (Table 1) demonstrate that the network adapts more easily to different image features when trained from ImageNet weights (accuracy = 0.84, AUC = 0.92). Table 2 shows that the initial testing on the INbreast dataset with the CNN trained on CBIS-DDSM produced a large number of FPs (1,944) and FNs (82). Updating the CNN weights by fine-tuning only the CNN's last FC layer improves the network's performance, resulting in fewer FPs (1,430) and FNs (45). In addition, when updating the entire set of CNN weights by fine-tuning all the convolutional layers with the INbreast dataset, the model's performance is further improved in terms of the number of FPs (843) and FNs (17).

In order to fully analyze the proposed CNN models for mass detection on whole images, future work will include a further evaluation on full images (FROC curve), k-fold cross-validation and variance analysis. Furthermore, the methodology presented in this work will be applied to mass classification and detection on other large mammography databases. In addition, analysis of the performance of pre-trained models on digital breast tomosynthesis (3D mammography) databases will be considered.

5. ACKNOWLEDGEMENTS

This work is partially supported by the SMARTER project funded by the Ministry of Economy and Competitiveness of Spain, under project reference DPI2015-68442-R. R.A. is funded by the Secretariat of Universities and Research, Ministry of Economy and Knowledge, Government of Catalonia, Ref. ECO/1794/2015 FIDGR-2016. The authors gratefully acknowledge the support of the NVIDIA Corporation with their donation of the Titan X PASCAL GPUs used in this research.

6. REFERENCES

[1] Stewart, B. and Wild, C., "World cancer report 2014," World Health Organisation (2014).
[2] Ries, L. A., Harkins, D., Krapcho, M., et al., "SEER cancer statistics review, 1975-2003," (2006).
[3] Birdwell, R. L., Ikeda, D. M., O'Shaughnessy, K. F., et al., "Mammographic characteristics of 115 missed cancers later detected with screening mammography and the potential utility of computer-aided detection," Radiology 219(1), 192-202 (2001).
[4] LeCun, Y., Bengio, Y. and Hinton, G., "Deep learning," Nature 521(7553), 436-444 (2015).
[5] Szegedy, C., Toshev, A. and Erhan, D., "Deep neural networks for object detection," Advances in Neural Information Processing Systems, 2553-2561 (2013).
[6] Girshick, R., Donahue, J., Darrell, T., et al., "Rich feature hierarchies for accurate object detection and semantic segmentation," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 580-587 (2014).
[7] Ren, S., He, K., Girshick, R., et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," Advances in Neural Information Processing Systems, 91-99 (2015).
[8] Krizhevsky, A., Sutskever, I. and Hinton, G. E., "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, 1097-1105 (2012).
[9] Szegedy, C., Liu, W., Jia, Y., et al., "Going deeper with convolutions," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1-9 (2015).
[10] Farabet, C., Couprie, C., Najman, L., et al., "Learning hierarchical features for scene labeling," IEEE Trans. Pattern Anal. Mach. Intell. 35(8), 1915-1929 (2013).
[11] Tompson, J. J., Jain, A., LeCun, Y., et al., "Joint training of a convolutional network and a graphical model for human pose estimation," Advances in Neural Information Processing Systems, 1799-1807 (2014).
[12] Collobert, R., Weston, J., Bottou, L., et al., "Natural language processing (almost) from scratch," Journal of Machine Learning Research 12(Aug), 2493-2537 (2011).
[13] Bordes, A., Chopra, S. and Weston, J., "Question answering with subgraph embeddings," arXiv preprint arXiv:1406.3676 (2014).
[14] Mikolov, T., Deoras, A., Povey, D., et al., "Strategies for training large scale neural network language models," 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), 196-201 (2011).
[15] Hinton, G., Deng, L., Yu, D., et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Process. Mag. 29(6), 82-97 (2012).
[16] Yosinski, J., Clune, J., Bengio, Y., et al., "How transferable are features in deep neural networks?" Advances in Neural Information Processing Systems, 3320-3328 (2014).
[17] Dhungel, N., Carneiro, G. and Bradley, A. P., "Automated mass detection in mammograms using cascaded deep learning and random forests," 2015 International Conference on Digital Image Computing: Techniques and Applications (DICTA), 1-8 (2015).
[18] Becker, A., Marcon, M., Ghafoor, S., et al., "Deep learning in mammography: Diagnostic accuracy of a multipurpose image analysis software in the detection of breast cancer," Invest. Radiol. 52(7), 434-440 (2017).
[19] Lotter, W., Sorensen, G. and Cox, D., "A multi-scale CNN and curriculum learning strategy for mammogram classification," 169-177 (2017).
[20] Dhungel, N., Carneiro, G. and Bradley, A. P., "Fully automated classification of mammograms using deep residual neural networks," 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), 310-314 (2017).
[21] Kooi, T., Litjens, G., van Ginneken, B., et al., "Large scale deep learning for computer aided detection of mammographic lesions," Med. Image Anal. 35, 303-312 (2017).
[22] He, K., Zhang, X., Ren, S., et al., "Deep residual learning for image recognition," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770-778 (2016).
[23] Lee, R. S., Gimenez, F., Hoogi, A., et al., "Curated breast imaging subset of DDSM," The Cancer Imaging Archive (2016).
[24] Moreira, I. C., Amaral, I., Domingues, I., et al., "INbreast: toward a full-field digital mammographic database," Acad. Radiol. 19(2), 236-248 (2012).
[25] Shen, L., "End-to-end training for whole image breast cancer diagnosis using an all convolutional design," arXiv preprint arXiv:1708.09427 (2017).
[26] Otsu, N., "A threshold selection method from gray-level histograms," IEEE Trans. Syst. Man Cybern. 9(1), 62-66 (1979).
[27] Chollet, F., "Keras," Keras documentation (2017).
[28] Kingma, D. P. and Ba, J. L., "Adam: A method for stochastic optimization," Proceedings of the 3rd International Conference on Learning Representations (ICLR) (2015).
[29] Heath, M., Bowyer, K., Kopans, D., et al., "The digital database for screening mammography," Proceedings of the 5th International Workshop on Digital Mammography, 212-218 (2000).
