Computer-Aided Diagnosis Scheme for Determining Histological Classification of Breast Lesions on Ultrasonographic Images Using Convolutional Neural Network

Akiyoshi Hizukuri * and Ryohei Nakayama

Department of Electronic and Computer Engineering, Ritsumeikan University, Kusatsu, Shiga 525-8577, Japan; [email protected]
* Correspondence: [email protected]; Tel.: +81-77-561-2706

Received: 28 May 2018; Accepted: 23 July 2018; Published: 25 July 2018
Abstract: It can be difficult for clinicians to accurately discriminate among histological classifications of breast lesions on ultrasonographic images. The purpose of this study was to develop a computer-aided diagnosis (CADx) scheme for determining histological classifications of breast lesions using a convolutional neural network (CNN). Our database consisted of 578 breast ultrasonographic images, including 287 malignant lesions (217 invasive carcinomas and 70 noninvasive carcinomas) and 291 benign lesions (111 cysts and 180 fibroadenomas). A CNN constructed from four convolutional layers, three batch-normalization layers, four pooling layers, and two fully connected layers was employed to distinguish among the four histological classifications. The classification accuracies for histological classifications with our CNN model were 83.9–87.6%, substantially higher than those with our previous method (55.7–79.3%) using hand-crafted features and a classifier. The area under the curve with our CNN model was 0.976, whereas that with our previous method was 0.939 (p = 0.0001). Our CNN model would be useful as a diagnostic aid in the differential diagnosis of breast lesions.

Keywords: convolutional neural network; histological classification; computer-aided diagnosis; breast lesion; ultrasonographic image
1. Introduction

Breast lesions, which appear as hypoechoic masses on ultrasonographic images, are important indicators of breast cancer. However, it can be difficult for clinicians to accurately distinguish between benign and malignant lesions, and unusual ones, such as malignant-looking benign lesions or benign-looking malignant lesions, often appear in clinical practice. Some studies have found that the positive predictive value, i.e., the ratio of the number of breast cancers found to the number of biopsies, is rather low [1,2]. To improve the positive predictive value of breast ultrasonography, many investigators have developed computer-aided diagnosis (CADx) schemes for distinguishing malignant lesions from benign ones [3–5], and these studies reported high sensitivity and specificity. In clinical practice, experienced clinicians evaluate not only the likelihood of malignancy, but also the likelihood of each histological classification when determining patient management. Thus, in our previous study [6] we developed a CADx scheme for determining histological classifications of breast lesions on ultrasonographic images using hand-crafted features and a classifier. Nine objective features were extracted from lesions by considering specific image features commonly used by clinicians for describing lesions.
A multiple-discriminant analysis with the nine features was employed to distinguish among the four different types of histological classifications. We also confirmed in an observer study [7] that the clinicians' classification performances were substantially improved by using our CADx scheme. However, the unsatisfactory classification accuracy of our CADx scheme made it unfit for clinical practice.

Many studies have reported that convolutional neural networks (CNN) achieve outstanding performance in the segmentation, detection, and classification of lesions in medical images [8–16]. In lesion classification, the CNN, which can extract complex multi-level features from input images owing to its self-learning ability, significantly improved classification accuracy compared with conventional methods that use a combination of hand-crafted features and a classifier [16]. Therefore, we considered that the classification accuracies of the CADx scheme for histological classifications could be improved by the use of a CNN.

In this study, we developed a CADx scheme for determining histological classifications of breast lesions using a CNN. Our CNN model was constructed from four convolutional layers, three batch-normalization layers, four pooling layers, and two fully connected layers. We then evaluated the classification accuracy by applying our CNN model to 578 breast lesions on ultrasonographic images.

2. Materials and Methods

The use of the following database was approved by the institutional review board at Mie University Hospital. Informed consent was obtained from all observers, and the database was stripped of all patient identifiers.

2.1. Materials

Our database consisted of 578 breast ultrasonographic images. The lesion sizes in our database were 4–25 mm.
These images were obtained from 566 patients using an ultrasound diagnostic system (APLIO™ XG SSA-790A, Toshiba Medical Systems Corp., Otawara, Tochigi Prefecture, Japan) with a 12 MHz linear-array transducer (PLT-1204AT) at Mie University Hospital in 2010. All cases had already been pathologically proven. After the diagnosis of benign cases was confirmed by fine-needle aspiration, the patients were examined again 6 to 12 months after the initial diagnosis. To avoid the influence of artifacts in the CNN analysis, cases that had undergone a vacuum-assisted needle biopsy or excisional biopsy, or that had received medication, were excluded from this study. The sizes of the images were 716 × 537 pixels with 8-bit grayscale. The database included 287 malignant lesions (217 invasive carcinomas and 70 noninvasive carcinomas) and 291 benign lesions (111 cysts and 180 fibroadenomas). The histological classifications of all lesions were proven by pathologic diagnosis. Regions of interest (ROIs), each including a whole lesion, were selected from the ultrasonographic images by experienced clinicians. Those ROIs were used as the input of our CNN model. Figure 1 shows an example of lesions with four different types of histological classifications.
Figure 1. Four lesions with different types of histological classifications.
2.2. Data Augmentation

Although the training of a CNN needs a sufficient amount of training data, the number of training data in this study was limited. Because a small number of training data can sometimes cause the learning to diverge or the CNN to overfit, we augmented the training data by transforming it using a horizontal flip, parallel translation, and gamma correction [17].
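As a concrete illustration of this step, the sketch below applies the three transforms to a single ROI in Python. The shift range (±10 pixels) and gamma range (0.7–1.3) are assumptions for illustration; the paper does not report the exact parameter values.

```python
# Augmentation sketch: horizontal flip, parallel translation, and gamma
# correction applied to one 8-bit grayscale ROI (Section 2.2).
import numpy as np
from scipy.ndimage import shift

def augment(roi, rng):
    out = roi.astype(np.float32)
    # Horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = out[:, ::-1]
    # Parallel translation by up to +/-10 pixels (assumed range),
    # filling the exposed border with zeros.
    dy, dx = rng.integers(-10, 11, size=2)
    out = shift(out, (dy, dx), order=0, mode="constant", cval=0.0)
    # Gamma correction with gamma drawn from an assumed range [0.7, 1.3].
    gamma = rng.uniform(0.7, 1.3)
    out = 255.0 * (out / 255.0) ** gamma
    return out.astype(np.uint8)

rng = np.random.default_rng(0)
roi = np.zeros((224, 224), dtype=np.uint8)  # stand-in for a real ROI
augmented = [augment(roi, rng) for _ in range(10)]
```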
2.3. Architecture of Convolutional Neural Networks Model

Figure 2 shows the architecture of the CNN model used in this study. Our CNN model was constructed from four convolutional layers, three batch-normalization layers, four pooling layers, and two fully connected layers. Each convolutional layer was followed by a rectified linear unit (ReLU). ROIs with lesions were first resized to 224 × 224 pixels and then given to the input layer of our CNN model. The first convolutional layer generated 64 feature maps with 112 × 112 pixels, using 64 filters with 7 × 7 kernels at stride 2. The generated feature maps passed through the first batch-normalization layer and then the first max pooling layer with a window size of 3 × 3 at stride 2. The second convolutional layer used 192 filters with 5 × 5 kernels at stride 1 (Figure 2) and generated 192 feature maps with 55 × 55 pixels. These feature maps likewise passed through the second batch-normalization layer and then the second max pooling layer with a window size of 3 × 3 at stride 2. Subsequently, the third and fourth convolutional layers, each with 256 filters of 3 × 3 kernels at stride 1, generated 256 feature maps with 27 × 27 pixels and 13 × 13 pixels, respectively. The batch-normalization layer was applied only after the third convolutional layer. Max pooling layers with window sizes of 3 × 3 at stride 2 were employed after both the third and fourth convolutional layers. The feature maps generated by the fourth convolutional layer were merged by two fully connected layers. Finally, the output layer, using the softmax function, outputted the likelihoods of the four histological classifications (invasive carcinoma, noninvasive carcinoma, fibroadenoma, and cyst).
[Figure 2: Input → Conv. (7 × 7, 64 filters, stride 2) + ReLU → Batch Normalization → Max pooling (3 × 3, stride 2) → Conv. (5 × 5, 192 filters, stride 1) + ReLU → Batch Normalization → Max pooling (3 × 3, stride 2) → Conv. (3 × 3, 256 filters, stride 1) + ReLU → Batch Normalization → Max pooling (3 × 3, stride 2) → Conv. (3 × 3, 256 filters, stride 1) + ReLU → Max pooling (3 × 3, stride 2) → Fully connected 1 (4096) → Fully connected 2 (4096) → Soft-max (4) → Output: likelihood of each histological classification]
Figure 2. Architecture of our convolutional neural networks (CNN) model.
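A minimal Keras sketch of this architecture is given below; it is not the authors' original code. The padding modes are chosen so that the stated feature-map sizes (112 × 112, 55 × 55, 27 × 27, and 13 × 13) are reproduced, and the placement of dropout after the fully connected layers is an assumption (the paper lists the dropout rate only as a hyper-parameter).

```python
# Sketch of the CNN in Figure 2, built with tf.keras.
from tensorflow.keras import layers, models

def build_model(num_classes=4, dropout_rate=0.5):
    return models.Sequential([
        # Conv 1 -> 64 maps, 112 x 112.
        layers.Conv2D(64, 7, strides=2, padding="same", activation="relu",
                      input_shape=(224, 224, 1)),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=2),   # -> 55 x 55
        # Conv 2 -> 192 maps, 55 x 55.
        layers.Conv2D(192, 5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=2),   # -> 27 x 27
        # Conv 3 -> 256 maps, 27 x 27 (the last batch normalization).
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling2D(pool_size=3, strides=2),   # -> 13 x 13
        # Conv 4 -> 256 maps, 13 x 13 (no batch normalization).
        layers.Conv2D(256, 3, padding="same", activation="relu"),
        layers.MaxPooling2D(pool_size=3, strides=2),   # -> 6 x 6
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(dropout_rate),                  # assumed placement
        layers.Dense(4096, activation="relu"),
        layers.Dropout(dropout_rate),                  # assumed placement
        # Output: likelihood of each histological classification.
        layers.Dense(num_classes, activation="softmax"),
    ])
```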
2.4. Training of Convolutional Neural Networks Model

Our CNN model was developed with the open-source library Keras [18] on Windows 7 Professional (Intel Core i7-6700K processor with 32 GB of RAM) and accelerated by a graphics processing unit (NVIDIA GeForce 1070 with 8 GB of memory).

A k-fold cross-validation method [19] with k = 3 was used for the training and testing of our CNN model. The 566 patients were randomly divided into three groups so that the number of each histological classification was approximately equal in each group (73 patients for invasive carcinoma, 24 patients for noninvasive carcinoma, 60 patients for fibroadenoma, and 37 patients for cyst). One group was used as a test dataset. To assess the possibility of overfitting of the parameters in our CNN model, the remaining two groups were divided into a training dataset and a validation dataset at a 90%:10% ratio. This process was repeated three times until every group had been used as the test dataset.
In this study, the number of ROIs for each histological classification in each training dataset was unified to about 2000 by using data augmentation. Table 1 shows the number of training images before and after augmentation in each dataset.

Table 1. Number of training images before and after augmentation in each dataset.
Pathological Diagnosis    k = 1 Before / After    k = 2 Before / After    k = 3 Before / After
Invasive carcinoma        144 / 2017              145 / 2030              145 / 2030
Noninvasive carcinoma     48 / 2160               46 / 2070               46 / 2070
Fibroadenoma              120 / 2160              120 / 2160              120 / 2160
Cyst                      74 / 2220               74 / 2220               74 / 2220
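A minimal sketch of this patient-level splitting is given below, assuming scikit-learn; patient_ids and patient_labels are hypothetical per-patient arrays, and the stratified splitters stand in for the randomization procedure, which the paper does not detail.

```python
# Patient-level 3-fold cross-validation sketch (Section 2.4): patients,
# not images, are partitioned, and the two training folds are further
# split 90%/10% into training and validation sets.
import numpy as np
from sklearn.model_selection import StratifiedKFold, train_test_split

def make_folds(patient_ids, patient_labels, seed=0):
    skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    folds = []
    for train_idx, test_idx in skf.split(patient_ids, patient_labels):
        tr_idx, val_idx = train_test_split(
            train_idx, test_size=0.10, random_state=seed,
            stratify=patient_labels[train_idx])
        folds.append((patient_ids[tr_idx],     # training patients
                      patient_ids[val_idx],    # validation patients
                      patient_ids[test_idx]))  # test patients
    return folds
```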
To select appropriate hyper-parameters, twelve different combinations of hyper-parameters, such as the learning rate, mini-batch size, epoch number, and dropout rate, were tested in our CNN model. We employed the combination of hyper-parameters with the highest classification accuracy among the 12 combinations.

2.5. Evaluation of Classification Performance

The classification accuracy of our CNN model was evaluated by using the ensemble average over the test datasets of the 3-fold cross-validation. The sensitivity [20], specificity [20], positive predictive value (PPV) [20], and negative predictive value (NPV) [20] were defined as:
$$\mathrm{Sensitivity} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{Specificity} = \frac{TN}{TN + FP} \tag{2}$$

$$\mathrm{PPV} = \frac{TP}{TP + FP} \tag{3}$$

$$\mathrm{NPV} = \frac{TN}{TN + FN} \tag{4}$$
Here, TP (true positive) was the number of malignant lesions (invasive and noninvasive carcinomas) correctly identified as positive, whereas TN (true negative) was the number of benign lesions (cysts and fibroadenomas) correctly identified as negative. FP (false positive) was the number of benign lesions incorrectly identified as positive, and FN (false negative) was the number of malignant lesions incorrectly identified as negative. Note that in our results the denominators of sensitivity and PPV were coincidentally the same (TP + FN = TP + FP), as were the denominators of specificity and NPV (TN + FP = TN + FN). Receiver operating characteristic (ROC) analysis [21] was used to analyze classification performance. In the ROC analysis, the likelihood of malignancy for each lesion was determined by adding the output values for invasive and noninvasive carcinomas from each computerized method. We also calculated the area under the curve (AUC). The statistical significance of the difference in AUC between two computerized methods was tested with the Dorfman–Berbaum–Metz method [22].
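The sketch below implements Equations (1)–(4) and the malignancy likelihood used in the ROC analysis (the sum of the softmax outputs for invasive and noninvasive carcinoma); the column ordering of probs is an assumption.

```python
# Evaluation sketch: binary metrics (Eqs. 1-4) and the two-class AUC.
# `probs` is a hypothetical (n_lesions, 4) array of softmax outputs
# ordered [invasive, noninvasive, fibroadenoma, cyst]; `is_malignant`
# and `predicted_malignant` are boolean arrays.
import numpy as np
from sklearn.metrics import roc_auc_score

def binary_metrics(is_malignant, predicted_malignant):
    tp = np.sum(predicted_malignant & is_malignant)
    tn = np.sum(~predicted_malignant & ~is_malignant)
    fp = np.sum(predicted_malignant & ~is_malignant)
    fn = np.sum(~predicted_malignant & is_malignant)
    return {"sensitivity": tp / (tp + fn),  # Eq. (1)
            "specificity": tn / (tn + fp),  # Eq. (2)
            "ppv": tp / (tp + fp),          # Eq. (3)
            "npv": tn / (tn + fn)}          # Eq. (4)

def malignancy_auc(probs, is_malignant):
    likelihood = probs[:, 0] + probs[:, 1]  # invasive + noninvasive
    return roc_auc_score(is_malignant, likelihood)
```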
3. Results

The classification accuracy was highest when the mini-batch size was 128 ROIs, the learning rate was 10^−3, the maximum epoch number was 15, and the dropout rate was 0.5. Therefore, these hyper-parameters were used in our CNN model.

Table 2 shows the classification results of the four histological classifications using our CNN model. The classification accuracy of our CNN model was 87.6% (190/217) for invasive carcinomas, 85.7% (60/70) for noninvasive carcinomas, 83.9% (151/180) for fibroadenomas, and 85.6% (95/111) for cysts. The sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) were 93.0% (267/287), 93.1% (271/291), 93.0% (267/287), and 93.1% (271/291), respectively. The classification accuracies for histological classifications with our CNN model were substantially higher than those with our previous method (71.2% for invasive carcinomas, 55.7% for noninvasive carcinomas, 70.0% for fibroadenomas, and 79.3% for cysts) using hand-crafted features and a classifier [6]. This result indicates the usefulness of the CNN in determining histological classifications. Figure 3 shows the comparison between the ROC curves of our CNN model and our previous method. The AUC of our CNN model was 0.976 (standard error = 0.0052), greater than that of our previous method (AUC = 0.939, standard error = 0.0094, p = 0.0001).

Table 2. Classification results of four histological classifications.

Pathological Diagnosis        Output of Our CNN Model
                              Invasive Carcinoma   Noninvasive Carcinoma   Fibroadenoma   Cyst
Invasive carcinoma (217)      190 (87.6%)          10 (4.6%)               13 (6.0%)      4 (1.8%)
Noninvasive carcinoma (70)    7 (10.0%)            60 (85.7%)              2 (2.9%)       1 (1.4%)
Fibroadenoma (180)            13 (7.2%)            6 (3.3%)                151 (83.9%)    10 (5.6%)
Cyst (111)                    1 (0.9%)             0 (0.0%)                15 (13.5%)     95 (85.6%)
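The hyper-parameter selection of Section 2.4 can be sketched as a simple search loop. Only the winning combination below comes from the paper; the alternative candidate, the Adam optimizer, and build_model (from the Section 2.3 sketch) are assumptions, and the dummy arrays stand in for the real ROI datasets.

```python
# Hyper-parameter selection sketch: train one model per candidate and
# keep the combination with the best validation accuracy.
import numpy as np
from tensorflow.keras.optimizers import Adam

# Dummy stand-ins for the real training/validation ROIs (4 classes).
x_train = np.zeros((32, 224, 224, 1), dtype="float32")
y_train = np.eye(4, dtype="float32")[np.zeros(32, dtype=int)]
x_val, y_val = x_train[:8], y_train[:8]

candidates = [
    {"batch": 128, "lr": 1e-3, "epochs": 15, "dropout": 0.5},   # reported winner
    {"batch": 64,  "lr": 1e-4, "epochs": 30, "dropout": 0.25},  # illustrative
    # ... ten further combinations in the actual experiment ...
]

best = None
for c in candidates:
    model = build_model(dropout_rate=c["dropout"])  # Section 2.3 sketch
    model.compile(optimizer=Adam(learning_rate=c["lr"]),
                  loss="categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(x_train, y_train, batch_size=c["batch"],
                        epochs=c["epochs"],
                        validation_data=(x_val, y_val), verbose=0)
    val_acc = max(history.history["val_accuracy"])
    if best is None or val_acc > best[0]:
        best = (val_acc, c)
```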
[Figure 3: ROC curves, true positive fraction vs. false positive fraction. Our CNN model (AUC = 0.976); our previous method (AUC = 0.939).]
Figure 3. Comparison of the receiver operating characteristic (ROC) curves between our CNN model and our previous method.
4. Discussion

Figure 4 shows the training curves of our CNN model for each dataset. The loss curves for the training dataset and the validation dataset tended to be similar in the three training phases. Because in the case of overfitting the loss for the validation dataset would be much larger than that for the training dataset [23], we believe that the possibility of overfitting in our CNN model was low.
[Figure 4: three panels (k = 1, k = 2, k = 3), each plotting loss(train), loss(val), and accuracy(val) against epochs 1–15.]
Figure 4. Training curves of our CNN model in each dataset.
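Curves like those in Figure 4 can be read straight from the Keras History object; a minimal sketch, assuming history was returned by the model.fit call in the previous sketch.

```python
# Overfitting check behind Figure 4: compare training and validation
# loss per epoch; a validation loss far above the training loss would
# indicate overfitting.
import matplotlib.pyplot as plt

train_loss = history.history["loss"]
val_loss = history.history["val_loss"]
epochs = range(1, len(train_loss) + 1)

plt.plot(epochs, train_loss, label="loss(train)")
plt.plot(epochs, val_loss, label="loss(val)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.legend()
plt.show()

gap = val_loss[-1] - train_loss[-1]
print(f"final val-train loss gap: {gap:.3f}")  # small gap -> little overfitting
```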
To clarify the usefulness of our CNN architecture, we also evaluated the classification performances of an AlexNet without pre-training [24] and an AlexNet with pre-training [23]. The AlexNet with pre-training had been trained with ImageNet. The hyper-parameters of both AlexNets were given based on a previous study [23]. Both AlexNets consisted of the same architecture and had 5 convolutional, 2 normalization, and 3 max pooling layers. Each convolutional layer was followed by a ReLU [24]. In the last layers, all units were fully connected to output probabilities for the 4 classes using the softmax function. The AlexNet without pre-training was trained with our database, whereas the AlexNet with pre-training was fine-tuned on our database. The classification accuracy of the AlexNet without pre-training was 76.5% (166/217) for invasive carcinomas, 64.3% (45/70) for noninvasive carcinomas, 76.7% (138/180) for fibroadenomas, and 76.6% (85/111) for cysts. On the other hand, the classification accuracy of the AlexNet with pre-training was 82.9% (180/217) for invasive carcinomas, 64.3% (45/70) for noninvasive carcinomas, 76.7% (138/180) for fibroadenomas, and 86.5% (96/111) for cysts. The classification performance of the AlexNet with pre-training was higher than that of the AlexNet without pre-training. Because those results were improved substantially by using our CNN architecture, we believe that our CNN architecture was more appropriate for distinguishing between the four different types of histological classifications of breast lesions. The ROC analysis was also employed to compare our CNN model with the AlexNet without pre-training and the AlexNet with pre-training. Figure 5 shows the comparison of the ROC curves for our CNN model, the AlexNet without pre-training, and the AlexNet with pre-training.
The AUC of our CNN model was 0.976 (standard error = 0.0052), greater than both that of the AlexNet without pre-training (AUC = 0.925, standard error = 0.011, p < 0.0001) and that of the AlexNet with pre-training (AUC = 0.958, standard error = 0.007, p = 0.0019).
[Figure 5: ROC curves, true positive fraction vs. false positive fraction. Our CNN model (AUC = 0.976); AlexNet with pre-training (AUC = 0.958); AlexNet without pre-training (AUC = 0.925).]
Figure 5. Comparison of the ROC curves for our CNN model, AlexNet without pre-training, and AlexNet with pre-training.
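The fine-tuning pattern used for the pre-trained AlexNet can be sketched as follows. AlexNet is not bundled with Keras, so VGG16 stands in purely to illustrate the pattern (load ImageNet weights, replace the head with a 4-class softmax, fine-tune on the ROIs); the unfrozen backbone and the three-channel input are assumptions.

```python
# Fine-tuning sketch: a pre-trained ImageNet backbone with a new
# 4-class softmax head. Grayscale ROIs would be replicated to three
# channels to match the ImageNet input format.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

base = VGG16(weights="imagenet", include_top=False,
             input_shape=(224, 224, 3))
base.trainable = True  # fine-tune the whole backbone (assumption)

finetuned = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4096, activation="relu"),
    layers.Dense(4, activation="softmax"),  # four histological classes
])
```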
To investigate the adequacy of the number of convolutional layers used in our CNN model, we compared the classification performances when the number of convolutional layers was changed from 3 to 5. Figure 6 shows the architectures of two different CNN models, with three and five convolutional layers. Table 3 shows the classification results when the number of convolutional layers was set to 3, 4, and 5. Our proposed CNN model with four convolutional layers showed the highest classification accuracy among the three CNN models. Although the number of convolutional layers in CNNs for classification problems has tended to increase, we believe that four convolutional layers were appropriate for determining the histological classification.

To evaluate the validity of the number of filters in each convolutional layer, we compared the classification performances when the number of filters was changed from 0.5 to 1.5 times in our CNN model. Table 4 shows the classification results when the number of filters in each convolutional layer was set to 0.5 times, 1 time, and 1.5 times. The classification accuracy of the CNN model with the combination of the numbers of filters given in our CADx scheme was higher than those with the other combinations. However, there was no significant change in the classification accuracies due to the combination of the numbers of filters.
[Figure 6: layer diagrams of the CNN models with three convolutional layers (Conv. 7 × 7/64, Conv. 5 × 5/192, Conv. 3 × 3/256) and five convolutional layers (Conv. 7 × 7/64, Conv. 5 × 5/192, Conv. 3 × 3/256, Conv. 3 × 3/256, Conv. 3 × 3/256), each convolution followed by ReLU and max pooling (3 × 3, stride 2), ending in Fully connected 1 (4096) → Fully connected 2 (4096) → Soft-max (4).]
Figure 6. Architecture of two CNN models with three convolutional layers or five convolutional layers.
This study has some limitations. One limitation is that the hyper-parameters, such as the number of layers, the number of filters, the learning rate, the mini-batch size, the epoch number, and the dropout rate in our CNN model, may not have been the best combination for determining histological classifications of breast lesions. Because the number of possible combinations of hyper-parameters in a CNN is infinite, we evaluated the classification accuracy using several different combinations; the results in this study could be improved by more optimal combinations of hyper-parameters. A second limitation concerns the network depth. We considered that a shallow CNN architecture would achieve a higher classification accuracy than a deep CNN architecture, because the amount of clinical data available for training the model in this study was limited. To clarify the usefulness of the shallow CNN architecture, we compared our CNN with an AlexNet, which is a deep CNN architecture. Although the classification performance of our CNN architecture was greater than that of AlexNet, a deeper CNN architecture might achieve a higher classification accuracy than our proposed one when a larger database becomes available for training. A third limitation is that the size of the ROI was
dependent on the size of the breast mass. Because the ROI size and location were given subjectively by experienced clinicians, some variation among the sizes and locations of the ROIs can be expected, and the classification performance of our CNN model might vary as the ROI is altered. A final limitation is that our CNN model evaluates the likelihood of only four histological classifications; we would need to make our CNN model applicable to a wider variety of histological classifications in the future.

Table 3. Comparison of results when the number of convolutional layers was set to 3, 4, and 5.
Pathological Diagnosis        Num. of Convolutional Layers in CNN
                              3             4             5
Invasive carcinoma (217)      189 (87.1%)   190 (87.6%)   195 (89.9%)
Noninvasive carcinoma (70)    58 (82.9%)    60 (85.7%)    51 (72.9%)
Fibroadenoma (180)            143 (79.4%)   151 (83.9%)   151 (83.9%)
Cyst (111)                    96 (86.5%)    95 (85.6%)    89 (80.2%)
Total (578)                   486 (84.1%)   496 (85.8%)   486 (84.1%)
Table 4. Comparison of results when the number of filters in each convolutional layer was set to 0.5 times, 1 time, and 1.5 times.
Pathological Diagnosis        Change Ratio of the Number of Filters in Each Convolutional Layer
                              0.5 Times     1.0 Time      1.5 Times
Invasive carcinoma (217)      190 (87.6%)   190 (87.6%)   196 (90.3%)
Noninvasive carcinoma (70)    52 (74.3%)    60 (85.7%)    50 (71.4%)
Fibroadenoma (180)            146 (81.1%)   151 (83.9%)   150 (83.3%)
Cyst (111)                    102 (91.9%)   95 (85.6%)    97 (87.4%)
Total (578)                   490 (84.8%)   496 (85.8%)   493 (85.3%)
5. Conclusions

In this study, we developed a CADx scheme for determining histological classifications of breast lesions on ultrasonographic images using a CNN. Our CNN model showed a high classification accuracy for histological classification and may be useful as a diagnostic aid in the differential diagnosis of breast lesions on ultrasonographic images.

Author Contributions: A.H. contributed to the design of this study and the development and evaluation of our CNN model, and wrote this manuscript; R.N. contributed to data collection and the evaluation of our CNN model, and revised this manuscript. All authors interpreted the analysis results and drafted the manuscript.

Funding: This study received no external funding.

Acknowledgments: We are grateful to Tomoko Ogawa and Yumi Kashikura, at Mie University Hospital, for their help and valuable suggestions.

Conflicts of Interest: The authors declare no conflict of interest.
References

1. Brem, R.F.; Lenihan, M.J.; Lieberman, J.; Torrente, J. Screening breast ultrasound: Past, present, and future. Am. J. Roentgenol. 2015, 204, 234–240.
2. Berg, W.A.; Blume, J.D.; Cormack, J.B.; Mendelson, E.B.; Lehrer, D.; Böhm-Vélez, M.; Mahoney, M.C. Combined screening with ultrasound and mammography vs. mammography alone in women at elevated risk of breast cancer. JAMA 2008, 299, 2151–2163.
3. Chen, D.R.; Huang, Y.L.; Lin, S.H. Computer-aided diagnosis with textural features for breast lesions in sonograms. Comput. Med. Imaging Graph. 2011, 35, 220–226.
4. Joo, S.; Yang, Y.S.; Moon, W.K.; Kim, H.C. Computer-aided diagnosis of solid breast nodules: Use of an artificial neural network based on multiple sonographic features. IEEE Trans. Med. Imaging 2004, 23, 1292–1300.
5. Shin, X.; Cheng, H.D.; Hu, L.; Ju, W.; Tian, J. Detection and classification of masses in breast ultrasound images. Digit. Signal Process. 2010, 20, 824–836.
6. Hizukuri, A.; Nakayama, R.; Kashikura, Y.; Takase, H.; Kawanaka, H.; Ogawa, T.; Tsuruoka, S. Computerized determination scheme for histological classification of breast mass using objective features corresponding to clinicians' subjective impressions on ultrasonographic images. J. Digit. Imaging 2013, 26, 958–970.
7. Kashikura, Y.; Nakayama, R.; Hizukuri, A.; Noro, A.; Nohara, Y.; Nakamura, T.; Ito, M.; Kimura, H.; Yamashita, M.; Hanamura, N.; et al. Improved differential diagnosis of breast masses on ultrasonographic images with a computer-aided diagnosis scheme for determining histological classifications. Acad. Radiol. 2013, 20, 471–477.
8. LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
9. Greenspan, H.; Van Ginneken, B.; Summers, R.M. Guest editorial deep learning in medical imaging: Overview and future promise of an exciting new technique. IEEE Trans. Med. Imaging 2016, 35, 1153–1159.
10. Litjens, G.; Kooi, T.; Bejnordi, B.E.; Setio, A.A.A.; Ciompi, F.; Ghafoorian, M.; van der Laak, J.A.; Van Ginneken, B.; Sánchez, C.I. A survey on deep learning in medical image analysis. Med. Image Anal. 2017, 42, 60–88.
11. Dalmış, M.U.; Litjens, G.; Holland, K.; Setio, A.; Mann, R.; Karssemeijer, N.; Gubern-Mérida, A. Using deep learning to segment breast and fibroglandular tissue in MRI volumes. Med. Phys. 2017, 44, 533–546.
12. Mohamed, A.A.; Berg, W.A.; Peng, H.; Luo, Y.; Jankowitz, R.C.; Wu, S. A deep learning method for classifying mammographic breast density categories. Med. Phys. 2018, 45, 314–321.
13. Suzuki, K. Overview of deep learning in medical imaging. Radiol. Phys. Technol. 2017, 10, 257–273.
14. Tajbakhsh, N.; Suzuki, K. Comparing two classes of end-to-end machine-learning models in lung nodule detection and classification: MTANNs vs. CNNs. Pattern Recognit. 2017, 63, 476–486.
15. Shi, Z.; Hao, H.; Zhao, M.; Feng, Y.; He, L.; Wang, Y.; Suzuki, K. A deep CNN based transfer learning method for false positive reduction. Multimed. Tools Appl. 2018, 1–17.
16. Ginneken, B.V. Fifty years of computer analysis in chest imaging: Rule-based, machine learning, deep learning. Radiol. Phys. Technol. 2017, 10, 23–32.
17. Gonzales, R.C.; Woods, R.E. Digital Image Processing, 2nd ed.; Addison-Wesley: Boston, MA, USA, 1992; pp. 567–643.
18. Chollet, F. Deep Learning with Python; Manning Publications: New York, NY, USA, 2017.
19. Wernick, M.N.; Yang, Y.; Brankov, J.G.; Yourganov, G.; Strother, S.C. Machine learning in medical imaging. IEEE Signal Process. Mag. 2010, 27, 25–38.
20. Langlotz, C.P. Fundamental measures of diagnostic examination performance: Usefulness for clinical decision making and research. Radiology 2003, 228, 3–9.
21. Metz, C.E. ROC methodology in radiologic imaging. Investig. Radiol. 1986, 21, 720–733.
22. Dorfman, D.D.; Berbaum, K.S.; Metz, C.E. ROC rating analysis: Generalization to the population of readers and cases with the jackknife method. Investig. Radiol. 1992, 27, 723–731.
23. Lakhani, P.; Sundaram, B. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology 2017, 284, 574–582.
24. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. In Proceedings of the 25th International Conference on Neural Information Processing Systems, Lake Tahoe, NV, USA, 3–6 December 2012; pp. 1097–1105.