Brain Tumor Grading Based on Neural Networks and Convolutional Neural Networks

Yuehao Pan, Weimin Huang, Zhiping Lin, Wanzheng Zhu, Jiayin Zhou, Jocelyn Wong, Zhongxiang Ding

Abstract—This paper studies brain tumor grading using multiphase MRI images and compares the results across various configurations of a deep learning structure and baseline Neural Networks. The MRI images are fed directly into the learning machine, with some combination operations applied between the multiphase MRIs. Compared to other studies, which require additional effort to design and choose feature sets, the approach in this paper leverages the learning capability of the deep learning machine. We present the grading performance on the testing data measured by sensitivity and specificity. The results show a maximum improvement of 18% in the grading performance of Convolutional Neural Networks over Neural Networks, based on sensitivity and specificity. We also visualize the kernels trained in different layers and display some of the self-learned features obtained from the Convolutional Neural Networks.

I. INTRODUCTION

Studies on brain tumor detection and segmentation have become increasingly popular over the past few years. With the increasing power of computing, Magnetic Resonance Imaging (MRI), a major imaging technology for clinical brain tumor detection, can now be used more efficiently. Because it involves no ionizing radiation, details of important body organs, such as the brain, can be shown clearly. However, brain tumor detection on MRIs has its limitations: because the shape and size of brain tumors vary across patients, predictions on a tumor can be very difficult. The results of the Multimodal Brain Tumor Image Segmentation Benchmark (BRATS), organized jointly with the MICCAI 2012 and 2013 conferences, show that quantitative assessments uncovered significant disagreement between the human raters in segmenting different tumor sub-regions, and the Dice scores, the measure of the results, range from 74% to 85% [1].

One of the major difficulties in brain tumor detection and segmentation is feature selection. There are many innovative methods of processing features, which have been shown to increase the accuracy of brain tumor detection and segmentation. Huang et al. [2] achieved an accuracy of 74.75% in tumor segmentation on the mean Overlapped Volume using subspace feature mapping.

Yuehao Pan, Zhiping Lin and Wanzheng Zhu are with the School of EEE, Nanyang Technological University, Singapore (email: {i110004, ezplin, zhuw0006}@ntu.edu.sg). Weimin Huang and Jiayin Zhou are with the Institute for Infocomm Research, 1 Fusionopolis Way, #21-01 Connexis, Singapore 138632 (tel: +65-6408-2516; fax: +65-6408-2000; email: {wmhuang, jzhou}@i2r.a-star.edu.sg). Jocelyn Wong is with the Department of Diagnostic Imaging, National University Hospital, Singapore (email: [email protected]). Zhongxiang Ding is with the Department of Radiology, Zhejiang Provincial People's Hospital, Hangzhou, Zhejiang, China (email: [email protected]).
Different models, such as the Support Vector Machine (SVM) and the Neural Network (NN), have been widely used in previous studies. These models show good performance on tumor classification; however, manual selection of features is normally carried out before classification. In research on brain tumor grading, Soltaninejad et al. [3] classified tumors of different grades with an SVM using a feature set of 38 first-order and second-order statistical measurements. Their results showed over 80% accuracy on 21 patients across different grading combinations. The procedure takes segmented tumor slices as training examples, and the features are carefully selected before training. Zacharaki et al. [4] reported a better result of 87% accuracy on two-class neoplasm classification using SVM-RFE. Sudha et al. [5] reported an accuracy of 96.7% in classifying abnormal tumors against normal ones using Back Propagation Neural Networks. That method also involved computing features such as Low Gray-Level Run Emphasis, and the optimal features were carefully selected using a fuzzy entropy measure.

Current state-of-the-art deep learning models, such as the Convolutional Neural Network, show good performance on object classification. Moreover, such models learn the features of an object directly from the input, rather than relying on hand-designed feature sets. Le et al. [6], in research based on the ImageNet dataset, introduced a high-performance object detector built on the concepts of Convolutional Neural Networks. The detector obtained an accuracy of 15.8% on the ImageNet dataset, a 70% relative improvement over previous work. The visualization of the kernel from the cat-face detector showed a rough figure of a cat's head that concisely captures the facial features of a cat.

In our experiment, we compare the grading performance of a Back Propagation Neural Network and a Convolutional Neural Network. The training and testing data are downloaded from BRATS 2014, which consists of MRI images of 213 patients. A quantitative comparison between the best results obtained from different NN and CNN structures is made based on sensitivity and specificity. The kernels learned by the CNN are visualized in this paper to present the features learned by the unsupervised learning procedure.

II. RELATED WORKS

A. Convolutional Neural Networks

The Convolutional Neural Network (CNN) is a deep-structured learning model that has advanced considerably since the late twentieth century. It imitates the structure of the human brain in perceiving objects, and CNN-based algorithms show good performance, especially in the classification of 2D data.
For example, LeNet-5, a model with a CNN structure trained on the MNIST database of handwritten digits, achieves a test error rate of less than 1% [7]. For this reason, CNN is widely used in the field of image processing.
Figure 1. A typical 2-layer CNN structure
Figure 1 shows a typical structure of a Convolutional Neural Network. It comprises two layers, in which convolution and subsampling are performed, followed by a classifier. The beauty of Convolutional Neural Networks is that the kernels in the different layers are learned spontaneously, so that no features need to be designed beforehand. In this sense, CNN is considered an unsupervised deep-learning model, and because of that, the number of training examples becomes critical: the MNIST dataset used for LeNet-5 comprises 60,000 training examples and 10,000 testing examples.
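To make the structure of Figure 1 concrete, the following is a minimal sketch of such a two-layer CNN. The paper does not publish its implementation, so the framework (PyTorch), the activation functions and the classifier head are assumptions; the kernel counts and sizes (16 kernels of 21*21, then 8 of 11*11) are borrowed from the two-layer configuration evaluated in Figure 6.3 below.

```python
import torch
import torch.nn as nn

class TwoLayerCNN(nn.Module):
    """Sketch of the Figure 1 structure: conv + average subsampling, twice,
    then a classifier. Not the authors' code; layer choices are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=21),  # layer 1: 16 kernels of 21*21 -> 40*40 maps
            nn.Sigmoid(),
            nn.AvgPool2d(2),                   # 2*2 average subsampling -> 20*20
            nn.Conv2d(16, 8, kernel_size=11),  # layer 2: 8 kernels of 11*11 -> 10*10
            nn.Sigmoid(),
            nn.AvgPool2d(2),                   # -> 5*5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(8 * 5 * 5, 1),           # classifier on the flattened feature maps
            nn.Sigmoid(),                      # grade score in [0, 1], thresholded later
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# One 60*60 single-channel tumor slice, as prepared in Section III below.
score = TwoLayerCNN()(torch.rand(1, 1, 60, 60))
```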
III. BRAIN TUMOR GRADING USING NN AND CNN

A. Input Data Cleaning

The MRI data are downloaded from BRATS 2014, which provides image samples of 213 patients for training. Of all the training data, only 25 samples are labeled low-grade; the remaining 188 are high-grade. For each training sample, one ground truth image and four MR images are given. The ground truth images contain zero and non-zero values, where zeros mark normal pixels and non-zero values mark tumor cells. Among the training data obtained from BRATS 2014, we examined 100 of the high-grade tumors to learn the clinical criteria for grading, and removed 18 of the examined cases because of incomplete or inappropriate feature disclosure. After this deletion, 195 patients' data, consisting of 170 high-grade and 25 low-grade cases, are selected. From each selected case, the block that contains the whole tumor is extracted: based on the ground truth image, the starting and ending planes tangent to the tumor along the x, y and z axes are found, and these six planes define a unique cuboid in which the entire tumor is fully contained.

B. Data Selection

In this way, 195 tumor blocks are obtained. This is, however, relatively insufficient for a deep learning machine. Worse, a major problem that could cause the CNN to fail is the uneven distribution of the training classes: only about 12.8% of the data (25 out of 195) belong to the low-grade class. To address both the imbalance and the insufficiency of the data, we select data differently for the high-grade (Class one) and low-grade (Class two) tumors. For the high-grade cases, only the central slice of the block, which normally contains the largest tumor cross-section, is selected from each of the three views, which expands our Class one data to 3*170 = 510 images. Class two data, on the other hand, are selected from the two slices offset by 8% from the center of each view, giving 6 images per patient and 6*25 = 150 Class two images in total. Nevertheless, the training data are still very biased, since the Class two set is only about half the size of Class one. Considering also that tumor features are not obvious at the initial phase, when tumors are normally graded as low, we enhance the features by rotation: since tumor cells proliferate randomly in all directions, we rotate each Class two image to three different angles (90, 180 and 270 degrees). After rotation, we have 3*150 = 450 low-grade examples, and the ratio of the two classes is 510:450, not perfectly even but sufficiently balanced. All slices are resized to the same size of 60*60 pixels, and the intensity values of each image are normalized between 0 and 1. Of the data we have, 300 of the 450 Class two images are chosen as low-grade training data, and another 300 Class one images are selected as high-grade training data in order to balance the training set. In this manner, a total of 600 training images and 360 testing images are used in the experiment.
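As a sketch of this preparation pipeline (the paper publishes no code, so the function names, the NumPy/SciPy tooling, and details such as tie-breaking at block centers are assumptions), the cuboid extraction, slice selection, rotation augmentation and normalization could look as follows.

```python
import numpy as np
from scipy.ndimage import zoom

def tumor_cuboid(volume, truth):
    """Crop the tightest cuboid containing all non-zero ground-truth voxels."""
    x, y, z = np.nonzero(truth)
    return volume[x.min():x.max() + 1, y.min():y.max() + 1, z.min():z.max() + 1]

def to_input(slice2d):
    """Resize a slice to 60*60 pixels and normalize intensities to [0, 1]."""
    s = zoom(slice2d, (60.0 / slice2d.shape[0], 60.0 / slice2d.shape[1]))
    return (s - s.min()) / (s.max() - s.min() + 1e-8)

def make_examples(volume, truth, low_grade):
    block = tumor_cuboid(volume, truth)
    cx, cy, cz = (d // 2 for d in block.shape)
    if not low_grade:
        # Class one: the central slice of the block from each of the three views.
        slices = [block[cx, :, :], block[:, cy, :], block[:, :, cz]]
    else:
        # Class two: the two slices offset by 8% from the center of each view...
        ox, oy, oz = (max(1, int(0.08 * d)) for d in block.shape)
        six = [block[cx - ox, :, :], block[cx + ox, :, :],
               block[:, cy - oy, :], block[:, cy + oy, :],
               block[:, :, cz - oz], block[:, :, cz + oz]]
        # ...each rotated by 90, 180 and 270 degrees to enhance the features.
        slices = [np.rot90(s, k) for s in six for k in (1, 2, 3)]
    return [to_input(s) for s in slices]
```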
C. Accuracy Criteria

To measure the performance of our model rigorously, sensitivity and specificity are used as the classification criteria. In a binary classification test, sensitivity is the proportion of true positive data that are predicted positive, while specificity is the proportion of true negative data that are predicted negative. In this experiment, sensitivity is the number of correctly predicted high-grade examples over all high-grade test examples, and specificity is the number of correctly predicted low-grade examples over all low-grade test examples:

Sensitivity = True Positive / (True Positive + False Negative)
            = Number of Predicted High Grade Data / Number of High Grade Data    (1)

Specificity = True Negative / (True Negative + False Positive)
            = Number of Predicted Low Grade Data / Number of Low Grade Data      (2)
In the above equations, true positives are high-grade data that are correctly graded and true negatives are correctly predicted low-grade tumor slices, whereas false positives are low-grade images that are classified wrongly and false negatives are high-grade data that are classified wrongly.
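As a small sketch, Eqs. (1)-(2) translate directly into code; the helper name and the 0/1 label encoding are assumptions, not part of the paper.

```python
import numpy as np

def sensitivity_specificity(y_true, y_pred):
    """Eqs. (1)-(2): label 1 = high grade (positive), 0 = low grade (negative)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_pred == 1) & (y_true == 1))   # high grade, correctly graded
    tn = np.sum((y_pred == 0) & (y_true == 0))   # low grade, correctly graded
    fn = np.sum((y_pred == 0) & (y_true == 1))   # high grade, graded low
    fp = np.sum((y_pred == 1) & (y_true == 0))   # low grade, graded high
    return tp / (tp + fn), tn / (tn + fp)
```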
IV. EXPERIMENT RESULTS

A. Results Obtained from Neural Network

Before the CNN was adopted in the project, the input data were tested on a Neural Network to assess the performance of NN. We selected a 2-layer NN structure and a 3-layer NN structure for this pre-examination. However, the results do not show good accuracy for the NN at various thresholds.
Figure 2. Sensitivity and specificity versus threshold for the 2-layer (left) and 3-layer (right) NN
As shown in the figure above, after sufficient iterations, the 2-layer and 3-layer NN structures give intersection points of 0.55 and 0.5677 respectively, only slightly better than random guessing. Comparing the two graphs, however, threshold values near the intersection point perform better overall on the 3-layer NN; for example, at a threshold of 0.59 we obtain a higher specificity (0.6) with relatively good sensitivity (0.55) on the 3-layer NN. This result suggests that as the learning structure goes deeper, the grading results may improve. We therefore also tested a 4-layer NN, but its intersection value drops to 0.5167, worse than both the 2-layer and 3-layer NN, and its sensitivity and specificity do not outperform the shallower networks at other threshold values in the range.
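The intersection points quoted above come from sweeping the decision threshold across the network's output scores. A sketch of such a sweep, reusing the sensitivity_specificity helper sketched earlier (the step count and variable names are assumptions):

```python
import numpy as np

def threshold_sweep(scores, y_true, steps=101):
    """Sweep the decision threshold and locate the sensitivity/specificity crossing."""
    thresholds = np.linspace(0.0, 1.0, steps)
    sens = np.empty(steps)
    spec = np.empty(steps)
    for i, t in enumerate(thresholds):
        pred = (scores >= t).astype(int)      # 1 = graded high at this threshold
        sens[i], spec[i] = sensitivity_specificity(y_true, pred)
    k = np.argmin(np.abs(sens - spec))        # closest point of the two curves
    return thresholds, sens, spec, thresholds[k], sens[k]
```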
B. Results Obtained from Convolutional Neural Network

One of the important results obtained from the CNN is the set of kernels learned during the unsupervised learning process. The kernels function as noise suppressors as well as feature enhancers for the model, and by visualizing the kernels of the different layers, their properties and the learned features can be seen. In this project, we adopted different CNN structures to find the best performance. Some kernels demonstrate interesting features learned by the algorithm; the following figure illustrates kernels learned by the CNN model in the first layer.
Figure 3. Visualizations of kernels in the first layer
The visualization of the first-layer kernels shown above indicates several features of the tumors. The first row of three kernels represents distinct tumor edges in different directions: these kernels enhance the boundary between tumor cells and normal cells. Coincidentally, edge detection is a commonly used pre-processing step in other classification research based on machine vision. The second row of three kernels displays small and scattered tumor cells; this kind of kernel matches one of the grading criteria called CE-Heterogeneity, in which the shape of the tumor is enhanced so that different tumor shapes, such as nodules or flower shapes, stand out. The last row shows medium-sized tumors. Combining the features of the second and third rows, a basic criterion for differentiating severe tumors from benign ones, namely examining the size of the cancer, matches what we obtain from the kernels.
In the second layer of the CNN, the images are subsampled by a scale of 2, meaning that each 2*2 pixel patch of the convolution output is replaced by the average value of its 4 pixels to form the output of the layer. Subsampling reduces the size of the output image so that the machine learns features at a higher level. For this reason, the kernels of the second layer more closely resemble signal filters such as the Gaussian filter. Some kernels from the second layer of the CNN are shown below.
Figure 4. Visualizations of kernels in the second layer
Figure 5. Evolution of kernel
The kernels are initialized randomly with small values between -0.5*10^-5 and 0.5*10^-5 (see the upper-left image in Figure 5). As training goes on, a distinctive gap forms that separates the two segments visible in the kernel, leading to the final kernel shown in the bottom-right picture of Figure 5. This process is unsupervised: the kernel is learned by the CNN spontaneously.
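The stated initialization corresponds to a uniform draw over [-0.5*10^-5, 0.5*10^-5]; a minimal PyTorch sketch (the framework and helper name are assumptions):

```python
import torch.nn as nn

def init_conv(conv: nn.Conv2d):
    """Uniform kernel initialization in the paper's stated small range."""
    nn.init.uniform_(conv.weight, a=-0.5e-5, b=0.5e-5)
    if conv.bias is not None:
        nn.init.zeros_(conv.bias)
```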
The CNN results below show a maximum improvement of 10% in accuracy with a one-layer structure, while the more complex structures do not necessarily give better results. The sensitivity and specificity versus threshold graphs of the different structures are shown below.
Figure 6.1. Sensitivity and specificity versus threshold using 25 kernels with size 21*21

Figure 6.2. Sensitivity and specificity versus threshold using 16 kernels with size 21*21
Figure 6.3. Sensitivity and specificity versus threshold using a 2-layer CNN structure with 16 kernels of size 21*21 in the first layer and 8 kernels of size 11*11 in the second layer
Figure 6.4. Sensitivity and specificity versus threshold using a 3-layer structure, with 1 kernel of size 21*21 in the first layer, 16 kernels of size 11*11 in the second layer and 12 kernels of size 5*5 in the third layer
Figure 8. Sensitivity and specificity versus threshold using a 3-layer structure, with 16 kernels of size 21*21 in the first layer, 12 kernels of size 11*11 in the second layer and 8 kernels of size 5*5 in the third layer
The CNN with a one-layer structure shows the best performance on both sensitivity and specificity, with an intersection value of 0.6667. The remaining figures show sensitivity intersecting specificity at values of 0.64, 0.5867 and 0.6 respectively, for the one-layer structure with 25 kernels of size 21*21, the one-layer structure with 41 kernels, the two-layer CNN structure with 16 kernels in the first layer and 8 kernels of size 11*11 in the second layer, and the three-layer structure with 1 kernel of size 21*21 in the first layer, 16 kernels of size 11*11 in the second layer and 12 kernels of size 5*5 in the third layer.

V. DISCUSSION ON RESULTS

Our experimental results also show that more layers in the deep learning structure do not necessarily improve the performance of brain tumor grading. We find that the training results at the final convolutional layer, the layer before the classifier, do not show a clear visual distinction between the different types of tumors.
Figure 7. Three training samples obtained at the end of each layer

Figure 7 shows the outputs of three types of tumor passing through the different layers. Although different kernels are applied to the images, the outputs at the final layer are very similar: in the view of the classifier, the three tumors are identical to each other, which leads to the same grading result. In our experiments, the best result comes from a 3-layer CNN structure with certain initialization values, which improves nearly 18% over the results of the baseline Neural Network.

One of the limitations of this study is that the number of training samples, especially of low-grade data, is relatively small. We have 195 samples in total, of which only 25 belong to low-grade tumors. For a deep learning machine, the training sample size is always a key factor: LeNet-5 uses the MNIST database of 60,000 training examples, while in the YouTube cat experiment, frames from 10 million YouTube videos are used as training examples [6]. As suggested by Simard et al. [8], the most important practice for obtaining good results from a learning system is a training set as large as possible. Nevertheless, in our study we show that, with a certain structure of the convolutional learning network, we can obtain much better results on this image data than a conventional Neural Network. In our experiments, we increased the data by using multiple slices of each patient's 3D volume, and we further added rotated slices to the training data, which helped to improve the grading performance.

VI. CONCLUSION

In this paper, we developed grading processes based on a CNN deep learning structure for brain tumor classification. The results show a maximum improvement of 18% in the grading performance of the CNN over the NN, based on sensitivity and specificity. The visualizations of the kernels and of the outputs at different layers show that tumor features can be closely resembled by the learned kernels. However, we also observed that more complex CNN structures may not outperform simple CNN structures.

REFERENCES
[1] B. Menze, A. Jakab, S. Bauer, J. Kalpathy-Cramer, K. Farahani, et al., "The Multimodal Brain Tumor Image Segmentation Benchmark (BRATS)," IEEE Transactions on Medical Imaging, 2014, p. 33.
[2] W. Huang, Y. Yang, et al., "Random Feature Subspace Ensemble Based Extreme Learning Machine for Liver Tumor Detection and Segmentation," International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Aug. 2014.
[3] M. Soltaninejad, et al., "Brain Tumour Grading in Different MRI Protocols using SVM on Statistical Features," Proceedings of the MIUA, Jul. 2014.
[4] E. I. Zacharaki, et al., "MRI-Based Classification of Brain Tumor Type and Grade Using SVM-RFE," IEEE International Symposium on Biomedical Imaging: From Nano to Macro (ISBI '09), Jun.-Jul. 2009, pp. 1035-1038.
[5] B. Sudha, P. Gopikannan, et al., "Classification of Brain Tumor Grades using Neural Network," Proceedings of the World Congress on Engineering 2014, Vol. I, WCE 2014, London, U.K., Jul. 2-4, 2014.
[6] Q. V. Le, et al., "Building High-level Features Using Large Scale Unsupervised Learning," arXiv:1112.6209, Dec. 2011.
[7] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, Nov. 1998, pp. 2278-2324.
[8] P. Y. Simard, D. Steinkraus, and J. C. Platt, "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis," Proceedings of the 7th International Conference on Document Analysis and Recognition, Aug. 2003, pp. 958-963.