Deep Learning for Shot Classification in Gynecologic Surgery Videos

Stefan Petscharnig and Klaus Schöffmann
Alpen-Adria-Universität Klagenfurt, 9020 Klagenfurt, Austria
[email protected], [email protected]

Abstract. In the last decade, advances in endoscopic surgery have resulted in vast amounts of video data that are used for documentation, analysis, and education purposes. In order to find video scenes relevant for these purposes, physicians manually search and annotate hours of endoscopic surgery videos. This process is tedious and time-consuming, which motivates the (semi-)automatic annotation of such surgery videos. In this work, we investigate whether a single-frame model for semantic surgery shot classification is feasible and useful in practice. We approach this problem by further training AlexNet, an already pre-trained CNN architecture, and thereby transfer knowledge gathered from the ImageNet database to the medical use case of shot classification in endoscopic surgery videos. We annotate hours of endoscopic surgery videos for training and testing data. Our results imply that the CNN-based single-frame classification approach is able to provide useful suggestions to medical experts while annotating video scenes, thus improving the annotation process. Future work shall consider the evaluation of more sophisticated classification methods incorporating the temporal video dimension, which is expected to improve on the baseline evaluation done in this work.

Key words: Multimedia Content Analysis, Convolutional Neural Networks, Deep Learning, Medical Shot Classification

1 Introduction

Advances in the field of endoscopic surgery not only enable physicians to perform minimally invasive surgeries, but also allow them to revisit the video material recorded from every surgery performed. These recorded videos can be used for documentation, analysis, and education. In order to provide an efficient video retrieval system supporting the aforementioned tasks, we need to be able to classify the nature of a shot, which may show a surgical action (e.g., coagulation of the ovary) or a diagnostic investigation (e.g., a shot of the uterus). As many surgical actions are bound to the use of certain surgical devices and the organs are discriminative, we consider frame-based shot classification with a convolutional neural network (CNN) sufficient for certain use cases, such as the suggestion of annotations to an expert reviewer in order to simplify the annotation process.


For other use cases, such as the automatic detection of the start of a surgery, we conjecture that temporal information has to be included in order to provide results of practical use. With these considerations in mind, this work addresses the following research goal: evaluate the predictive performance of shot classification for the use case of gynecologic surgery videos using a single-frame CNN model.

In order to cope with this task, we manually annotated 18 hours of specific video material from gynecologic surgeries. We use this material as ground truth for training and testing a deep CNN of an existing architecture with pre-trained initial weights (AlexNet [4]). We evaluate the fine-tuned model's performance for the classification of individual frames in gynecologic surgery videos by means of precision and recall for 14 different semantic content classes. Finally, we present a case study of classifying suturing shots in order to show the practical applicability of our results. This work is novel as, to the best of our knowledge, there is no prior work that performs semantic shot classification within gynecologic surgery videos. Our results indicate that CNN models are of practical worth in the very specific domain of gynecologic laparoscopic surgery videos. Furthermore, we think that the insights gathered from the generated model and the frame-based classification evaluation can serve as a basis for methods which also incorporate the temporal dimension of the videos, such as histogram-based event detection or Recurrent Neural Networks (RNNs).

The remainder of this paper is structured as follows: Section 2 reviews other CNN approaches in medical use cases. Section 3 deals with issues relevant for CNN training and validation, that is, the data and the fine-tuning methodology of the used CNN, whereas Section 4 focuses on the presentation of the results. Finally, Section 5 concludes the paper by summarizing the results and outlining future work.

2 Related Work

For the use case of interstitial lung diseases, [5] provides a simple CNN model containing a single convolutional layer. They achieve per-class precision and recall between 0.8 and 0.9 for classification into five classes (normal, emphysema, ground glass, fibrosis, and micronodules), outperforming SIFT features as well as Restricted Boltzmann Machines. The authors of [2] propose a deep CNN model containing five convolutional layers for the classification of CT images into seven classes of interstitial lung diseases (healthy, ground glass opacity, micronodules, consolidation, reticulation, and honeycombing). Their results imply that, for this use case, their CNN approach outperforms other CNNs as well as state-of-the-art methods using handcrafted features. In [12], a multi-stage deep learning framework is presented, which is used to solve the problem of body-part recognition. In total, they achieve the best performance regarding recall, precision, and F-score compared against logistic regression, SVMs, and CNNs.


The importance of CNNs in medical applications is also apparent from their use in other applications such as nucleus segmentation [11], polyp detection in colonoscopy videos [6], microcalcification detection in digital breast tomosynthesis [8], mitosis detection in breast cancer histology [1], and short-term breast cancer risk prediction [7]. Our work differs from the aforementioned research in that, in contrast to the classification of a state (e.g., healthy or consolidation, type of tissue), we aim at classifying both diagnostic classes and surgical events, i.e., surgical actions. Furthermore, to the best of our knowledge, no efforts have been made regarding the classification of images extracted from laparoscopic surgery videos. The authors of [9] and [10] deal with fine-tuning and transfer-learning effects of CNNs. These works are based on the use cases of lymph node detection, interstitial lung disease classification, polyp detection and image quality assessment in colonoscopy, pulmonary embolism detection in computed tomography images, and intima-media boundary segmentation in ultrasonographic images. Their results imply that CNNs are suitable for computer-aided diagnosis problems and that transfer learning from large-scale annotated natural image datasets is beneficial for performance. We base the selection and further training of an already existing CNN architecture [4] for our use case on these results.

3 Learning Content Classes

As a basis for our ground truth, we use 24 gynecologic surgery videos (approximately 26 hours of raw video data) showing myoma resection and endometriosis treatment. For our dataset, we chose classes corresponding to frequent events in gynecologic surgery, e.g., suturing or diagnosis of the extent of a disease. In collaboration with surgeons from LKH Villach, we identified 14 semantic content classes, either categorizing an action in the surgery (suction & irrigation, suture, dissection (blunt), cutting, cutting (cold), sling, coagulation, injection) or a certain diagnostic aspect, in particular a clearly visible organ (uterus, ovary, oviduct, liver, colon), as well as blood accumulations. These two types of classes (surgical actions and diagnostic shots) are mixed, corresponding to a real-world scenario, as usually both types occur in a single surgery video. Hence, we do not consider it sensible to separate these classes. As the identified content classes describe basic surgical actions and anatomical structures in gynecology, inter-expert disagreement is negligible and non-surgeons are able to identify these classes correctly when instructed by surgeons. Hence, the annotation process itself was executed by ourselves on the shot level. We manually annotated 842 video shots consisting of 798228 video frames, chosen according to the given instructions. Table 1 characterizes the different classes of the annotated dataset in terms of number of shots, number of frames, average duration, standard deviation, and semantic description. Please note the high standard deviation of the shot durations of the surgical action classes. This indicates that the duration of a single action heavily depends on the circumstances of the surgery and the individual patient's anatomy, so that general assumptions based on video sequence length cannot be made.


ID  Class                 Shots  Frames  avg [s]  sd [s]  Description
 1  Suction & Irrigation    190   94260     19.8    28.8  Application of the suction and irrigation tube
 2  Suture                   25  102152    163.4   154.6  Process of suturing
 3  Dissection (blunt)       53   69689     52.5    75.5  Blunt dissection of tissue (e.g. by tearing it apart)
 4  Cutting                  64  111756     69.8    89.2  Thermally dissect tissue (e.g. with mono-polar electrodes)
 5  Cutting (cold)          103  208210     80.8    74.2  Dissect tissue with a sharp instrument (e.g. scissors)
 6  Sling                     8    6634     33.1    24.3  Dissection of large parts of tissue with an electrical sling
 7  Coagulation             176   85932     19.5    17.3  Application of coagulation in order to close a wound
 8  Blood                    52   12909      9.9    12.2  Noticeable amount of blood visible
 9  Uterus                   46   24358     21.1    39.3  Clearly visible uterus
10  Ovary                    57   31607     22.1    26.3  Clearly visible ovary
11  Oviduct                  15    8379     22.3    25.1  Clearly visible oviduct
12  Liver                    31   25483     32.8    60.5  Clearly visible liver
13  Colon                    12    9832     32.7    43.0  Clearly visible colon
14  Injection                10    7027     28.1    10.2  Injection with a needle

Table 1. An overview of the annotated dataset: class ID, class name, number of shots, number of frames, average duration in seconds, standard deviation of duration in seconds, and class description.

We provide example frames for three of the annotated classes: cutting, suture, and sling. Please note the similarity of the chosen frames between different classes (cf. Figure 1 and Figure 2) as well as the dissimilarity within a class (cf. Figure 3 and Figure 4). The classes and their semantics as well as the number of shots, the average shot duration, the standard deviation of the shot duration, and the number of extracted frames are presented in Table 1. In total, the manually annotated ground truth contains 842 shots labeled with the best matching class out of the dataset's classes. All shots together consist of 798228 frames. We derive the best matching class implicitly from camera positioning and the current action, e.g., the action in the center of the image or the organ inspected by the surgeon is the action or object of interest. For the surgical task classes, there is the issue that a shot may also contain frames which could be classified as a diagnostic class. For example, suturing the ovary may contain images showing the ovary without a surgical needle, or images in which the suture is not clearly visible.

Fig. 1. Suture
Fig. 2. Sling
Fig. 3. Cutting
Fig. 4. Cutting

On the one hand, such a frame does not look like it belongs to a suturing shot, but on the other hand it does belong to the suturing shot, as the image was recorded in its context. For the annotation of our dataset, we chose the latter interpretation and annotate such frames with the surgical task by defining the beginning and end of the surgical action. Each frame from the beginning to the end of a shot is labeled with the corresponding shot label. As a consequence, the dataset may also contain blurry frames or frames in which instruments cover large parts of the camera view. We argue that these frames are nonetheless part of the corresponding shot and thus correctly labeled.

For training the CNN, we used the Caffe framework [3] and the CNN architecture of AlexNet [4]. The network architecture contains eight trainable layers: five convolutional and three fully connected layers. We modified the number of neurons in the output layer to match our use case, the classification into 14 content classes. Furthermore, the architecture features max-pooling, local response normalization (LRN), and dropout. For a detailed description of the CNN architecture, please refer to [4]. We split the dataset in half by assigning half the shots of each class to the training set and the other half to the test set. We do not use an 80–20 split in favor of a more diverse test set. Please note that the class distribution of the dataset is highly imbalanced: the three classes dissection (blunt), cutting (cold), and cutting combined constitute approximately 50% of the annotated frames (see Table 1). We use oversampling to balance the training set, duplicating shots of the under-represented classes until the number of training images per class is approximately balanced. This results in a training set size of 861016 images generated out of 429011 unique frames.
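The balancing step can be thought of as a simple duplication loop over the shot list. The following Python sketch is purely illustrative and assumes a hypothetical list of shot records with class and frame-count fields; it is not the implementation used in our experiments.

```python
import random
from collections import defaultdict

def oversample_shots(shots, target_frames_per_class):
    """Duplicate shots of under-represented classes until each class
    contributes roughly `target_frames_per_class` training frames.
    `shots` is a list of dicts with keys 'class_id' and 'n_frames'
    (hypothetical structure, for illustration only)."""
    by_class = defaultdict(list)
    for shot in shots:
        by_class[shot['class_id']].append(shot)

    balanced = []
    for class_id, class_shots in by_class.items():
        pool = list(class_shots)
        frame_count = sum(s['n_frames'] for s in pool)
        # Keep duplicating randomly chosen shots of this class until
        # the per-class frame budget is approximately reached.
        while frame_count < target_frames_per_class:
            extra = random.choice(class_shots)
            pool.append(extra)
            frame_count += extra['n_frames']
        balanced.extend(pool)
    return balanced
```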


The neural network expects image patches of 227×227 pixels. At training time, we resize single frames on the fly to a height of 256 pixels while preserving the aspect ratio. We then perform a central crop to a width of 256 pixels, which results in a 256×256 image. In each epoch, a random 227×227 patch of such a resulting image is extracted and fed to the network as training input. In order to cope with this huge amount of data, we extended the Caffe framework with a pre-fetching data layer that performs the extraction of video frames and the above-described data preparation and augmentation on the fly. We do not arbitrarily shuffle the input images, because video decoding would then become a serious performance bottleneck. As a trade-off, we shuffle the shots at the beginning of each epoch and switch to extracting frames from the next shot at least every 100 frames.

The training was performed on a machine featuring an Intel Core i7-5960X CPU (3.00GHz), 64GB of DDR4 RAM, a Samsung SSD 850 Pro, and an NVIDIA GeForce GTX TITAN X graphics card. This system took approximately an hour of training time per epoch. We fine-tuned the network with a batch size of 600 images for 100 epochs, as the loss function had stabilized at this point. We set the base learning rate of the fully connected layers to 0.1, while all other layers used a learning rate of 0.00001, since the low-level image features are already pre-trained.
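To make the frame preparation concrete, the sketch below resizes a decoded frame to a height of 256 pixels while preserving the aspect ratio, centrally crops the width to 256 pixels, and extracts a random 227×227 training patch. It uses Pillow and NumPy for readability; the preparation described above is actually implemented as a custom pre-fetching data layer inside Caffe, so this should be read as an approximation of that logic rather than the original code.

```python
import random
import numpy as np
from PIL import Image

def prepare_training_patch(frame, out_size=227, resize_size=256):
    """Resize to height 256 (preserving aspect ratio), center-crop the
    width to 256, then extract a random 227x227 patch.
    `frame` is a PIL.Image decoded from the surgery video."""
    # 1. Resize so the height equals `resize_size`, keeping the aspect ratio.
    w, h = frame.size
    new_w = int(round(w * resize_size / h))
    frame = frame.resize((new_w, resize_size), Image.BILINEAR)

    # 2. Central crop of the width to `resize_size`, yielding a 256x256 image.
    left = max(0, (new_w - resize_size) // 2)
    frame = frame.crop((left, 0, left + resize_size, resize_size))

    # 3. Random 227x227 patch as network input (data augmentation).
    max_offset = resize_size - out_size
    x = random.randint(0, max_offset)
    y = random.randint(0, max_offset)
    patch = frame.crop((x, y, x + out_size, y + out_size))
    return np.asarray(patch)
```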

4 Evaluation

We evaluate the performance of the network by means of precision and recall on a per-class basis. Our test set contains 369217 images extracted from gynecologic surgery videos and manually annotated as discussed above. The distribution of the images over the classes within the test set is approximately equal to the distribution in the whole dataset. In particular, the classes suction & irrigation, suture, dissection (blunt), cutting, cutting (cold), and coagulation are represented by approximately 11%, 9%, 10%, 15%, 26%, and 10%, respectively. The n × n confusion matrix depicted in Table 2 summarizes the single-frame prediction results for our n = 14 classes. Columns in the table denote the predicted class referred to by its class ID, while rows indicate the true class. Cell colors illustrate prediction percentages relative to the number of examples for a class. For example, the darkest shade (which is used in row 6 and column 6) indicates that the network predicted the true class 6 (column) for over 80% of the test images of class 6 (row). For the calculation of precision, recall, and f-value of class i, we determine TP_i (the number of true positive predictions for class i), FP_i (the number of false positive predictions for class i), and FN_i (the number of false negative predictions for class i). Evaluating precision and recall in a class-based manner has the advantage that the imbalance of the classes in the test set is taken into account. Over all classes, we achieve an average precision of 0.422, an average recall of 0.430, and an average f-value of 0.41.
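For reference, the following sketch shows how the per-class measures can be derived from a confusion matrix with true classes in the rows and predicted classes in the columns; it also includes the top-3 recall (Recall@3) reported below. It is a straightforward NumPy sketch with illustrative function and variable names, not the evaluation scripts used for the experiments.

```python
import numpy as np

def per_class_metrics(confusion):
    """Per-class precision, recall, and f-value from an n x n confusion
    matrix (rows: true class, columns: predicted class)."""
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)             # TP_i: correctly predicted as class i
    fp = confusion.sum(axis=0) - tp     # FP_i: predicted as i, true class differs
    fn = confusion.sum(axis=1) - tp     # FN_i: true class i, predicted differently
    precision = tp / np.maximum(tp + fp, 1.0)
    recall = tp / np.maximum(tp + fn, 1.0)
    f_value = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    return precision, recall, f_value

def recall_at_k(true_labels, scores, k=3):
    """Per-class fraction of test images whose true class is among the
    k highest-scoring predictions (`scores`: one row of class scores per image)."""
    true_labels = np.asarray(true_labels)
    scores = np.asarray(scores)
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes
    hits = np.array([t in row for t, row in zip(true_labels, top_k)])
    classes = np.arange(scores.shape[1])
    return np.array([hits[true_labels == c].mean() for c in classes])
```

Averaging these per-class vectors yields the class-based mean values reported in this section.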


Table 2. Confusion matrix for the fine-tuned model. Rows denote the true class ID, columns the predicted class ID. Colors indicate prediction accuracy relative to the number of test examples per class. For the semantics of the class IDs, please refer to Table 1.

We also calculate the probability that the true class is among the top three predictions; we refer to this measure as Recall@3, whose class-based average is 0.699. For the per-class results for precision, recall, Recall@3, and f-value, please refer to Table 3. In total, we correctly classified 179289 out of 368400 images, resulting in an overall prediction accuracy of 48.67 percent. For 288200 images the true label was among the top three predictions, resulting in an overall top-three prediction accuracy of 78.23%. These accuracy values are higher than the average per-class recall, as the dataset bias is not taken into account. We do not consider specificity, as we regard this measure as problematic for the performance evaluation of multi-class problems. In the following, we discuss the results of selected individual classes and try to explain the mis-classifications.

Suction & irrigation (1). For the application of the suction and irrigation tool, we obtain a recall of 0.270, which is clearly below the mean recall of 0.430. The high number of false positives, i.e., frames of the (true) classes suturing and coagulation classified as suction & irrigation, can be explained by the fact that suction and irrigation are performed within the same scene in order to clean the tissue.


Class ID     1      2      3      4      5      6      7
Precision  0.441  0.611  0.407  0.732  0.576  0.695  0.295
Recall     0.270  0.672  0.411  0.419  0.661  0.858  0.403
Recall@3   0.720  0.885  0.764  0.639  0.913  0.932  0.877
f-value    0.335  0.640  0.409  0.533  0.616  0.768  0.340

Class ID     8      9     10     11     12     13     14
Precision  0.155  0.447  0.314  0.174  0.634  0.277  0.156
Recall     0.168  0.421  0.509  0.250  0.231  0.625  0.122
Recall@3   0.566  0.664  0.752  0.410  0.382  0.909  0.352
f-value    0.161  0.434  0.389  0.205  0.339  0.384  0.137

Table 3. Performance overview for the fine-tuned model. For the semantics of the class IDs, please refer to Table 1.

The precision of this class is 0.411, which is about average. For a given image of suction & irrigation, the network frequently classified it as coagulation, cutting (cold), ovary, suture, blood, or oviduct. We observe that the network recognized the most frequent contexts and regions of suction and irrigation, which in fact are the ones mentioned above.

Suture (2). For frames extracted from suturing shots, the network achieves a precision of 0.611 and a recall of 0.672, resulting in one of the best performances among the classes. We conclude that this is due to the fact that the surgical needle is used exclusively for suturing and the suture itself is clearly visible in many examples. The frequent mis-classification of suturing as coagulation may be explained by two circumstances: (1) the tool for coagulation is used to position tissue and suture (which, of course, is a wrong prediction), and (2) it may become necessary to perform coagulation during suturing (which is essentially a correct prediction). The classification of suture as ovary is comprehensible for suturing shots of the ovary (contrary to other diagnostic classes, which only feature parts of sutured tissue or multiple candidates for diagnostic classes). Furthermore, an example of a confusion of suturing and sling is given in Figure 1 and Figure 2. This confusion originates in the fact that the suture may look similar to the sling. Another frequent mis-classification is the class suction & irrigation. While suturing, tissue may be positioned such that neither the suture nor the surgical needle is visible in the center of the image. Furthermore, the tool used for positioning contributes to this classification, as it looks similar to the tool used for suction & irrigation.

Dissection (blunt), cutting, and cutting (cold) (3-5). For these three dissection classes combined, the achieved results are above average. The most frequent classifications for examples of these classes are either one of the other two dissection classes or the coagulation class. These results may originate from tool usage, as the tools used in these four classes are similar.

Class ID     1      2      3      4      5      6      7
S1         0.053  0.756  0.000  0.000  0.034  0.097  0.024
S2         0.000  0.415  0.001  0.000  0.347  0.195  0.002

Class ID     8      9     10     11     12     13     14
S1         0.001  0.005  0.021  0.000  0.000  0.008  0.000
S2         0.006  0.000  0.031  0.001  0.001  0.000  0.000

Table 4. Detailed classification results for the suturing shots S1 and S2. Cell values denote the (rounded) fraction of frames classified as the class indicated by the column.

Sling (6). With a recall of 0.858, a precision of 0.695, and an f-value of 0.768, this class yields the best performance. We infer that this originates in the appearance of the electrical sling used for the thermal separation, which is only visible in these shots.

Coagulation (7). For the class coagulation, our network achieved a recall similar to the average recall and a precision slightly below the network's average. In most cases where the network mis-classified coagulation examples, the predicted classes were classes for which there is a high probability that either coagulation is part of the scene or the same tool is used.

Diagnostic shots (8-13). In total, the diagnostic shot classes blood, uterus, ovary, oviduct, liver, and colon performed slightly worse than the average, with a mean recall of 0.367, a mean precision of 0.334, and a mean f-value of 0.319. At first glance this result is surprising, as these classes do not need temporal information and seem to be identifiable from a single frame more easily than surgical actions. A deeper analysis of the results shows that these classes were frequently identified as the surgical actions cutting (cold), suction & irrigation, suturing, and coagulation. This may originate in the fact that the network also identified these organs within surgical action shots, as, for example, suturing the ovary is very frequent among suturing scenes.

Injection (14). With a recall of 0.122, a precision of 0.156, and an f-value of 0.137, this class yields the worst results. In most test cases an injection was classified as either suture or oviduct. We assume that these results can be improved with further training examples.

Suture (2) – shot classification. We use the trained model for a case study of a naive majority-based approach to shot classification. We classify each frame of a shot; these frames then vote for their predicted classes, and the class with the majority of votes determines the shot's class. We select two different shots of suturing the ovary, which were part of neither the training set nor the test set.
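A minimal sketch of this majority-voting scheme is shown below; `classify_frame` stands for a single forward pass of the fine-tuned network and is an assumed placeholder, not part of our actual tooling.

```python
from collections import Counter

def classify_shot(frames, classify_frame):
    """Naive majority vote: classify every frame of a shot and let the
    per-frame predictions vote for the shot label.
    `classify_frame` maps one frame to a predicted class ID (assumed)."""
    votes = Counter(classify_frame(frame) for frame in frames)
    shot_class, n_votes = votes.most_common(1)[0]
    # Fraction of frames supporting the winning class,
    # e.g. 0.756 for shot S1 and 0.415 for shot S2 (cf. Table 4).
    support = n_votes / len(frames)
    return shot_class, support
```

Applied to the two shots discussed below, this procedure returns the suture class in both cases, with the supports listed in Table 4.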


Fig. 5. Frame–wise classification result overview for suturing shot 1 (S1)

The first shot (S1, 5000 frames) is a suturing shot in which the surgical needle and the suture are in the center of the image almost all the time. The second shot (S2, 20000 frames) is chosen to be more challenging, as the suture and the surgical needle are poorly visible over large parts of the shot. For these two example shots S1 and S2, we achieve a recall of 0.756 and 0.415, respectively. For the class-based classification results, please refer to Table 4. Figure 5 and Figure 6 show the detailed classification results for each frame of the shots S1 and S2. For parts of both shots, the network classifies the suturing scene as ovary (10), which is the best 'false' classification possible, as the ovary is sutured in these specific shots. For shot S1, there are two clusters of false classifications for the classes sling (6) and suction & irrigation (1) (cf. Figure 5). Overall, such clusters do not influence the majority decision, which clearly favors the classification as suture (0.756). For shot S2, suture (0.415), sling (0.347), and coagulation (0.195) are the most frequent classification suggestions. As shown in Figure 6, the classification oscillates between these three classes during the whole shot. However, there is still a clear majority for suture. Such oscillations may become problematic when the numbers of classifications for two or more classes are approximately equal. We leave the detailed investigation of such cases to future work.

Fig. 6. Frame-wise classification result overview for suturing shot 2 (S2)

5 Conclusion

In this work, we perform a detailed investigation of the shot classification performance of a deep CNN for gynecologic surgery videos. In particular, we fine-tuned AlexNet [4], an existing CNN architecture with pre-trained initial weights, for this very specific context of gynecologic surgery videos.


We achieve an average precision of 0.42, an average recall of 0.43, an average f-value of 0.41, and an overall accuracy of 48.67 percent. The average recall for the top three predictions (Recall@3) is 69.9%, and the top-3 prediction accuracy is 78.23%. We want to emphasize the novelty of this work, as it is the first semantic content classification in the field of gynecologic surgery, a special area of laparoscopic surgery. Our results show that reasonable content classification is possible in this domain. Furthermore, our results are of practical relevance for use cases such as the suggestion of annotations in order to speed up the manual annotation process. Future work concerns the investigation of whether data augmentation for low-density classes, instead of simple duplication, improves the results significantly and is thus a more suitable technique for imbalanced datasets in the gynecologic domain. Another topic for future work is a deeper analysis of the predefined network, i.e., varying the number of filters, removing layers, and visualizing filters. Moreover, we plan to investigate approaches for event detection based on histograms of class confidences, high-level CNN features (general versus domain-specific models), and recurrent neural network models. We also aim at the recognition of surgical actions from action-specific motion sequences.

6 Acknowledgments

This work was supported by Universität Klagenfurt and Lakeside Labs GmbH, Klagenfurt, Austria, and by funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under grant KWF 20214 u. 3520/26336/38165.


References

1. S. Albarqouni, C. Baur, F. Achilles, V. Belagiannis, S. Demirci, and N. Navab. AggNet: Deep learning from crowds for mitosis detection in breast cancer histology images. IEEE Transactions on Medical Imaging, 35(5):1313–1321, May 2016.
2. M. Anthimopoulos, S. Christodoulidis, L. Ebner, A. Christe, and S. Mougiakakou. Lung pattern classification for interstitial lung diseases using a deep convolutional neural network. IEEE Transactions on Medical Imaging, 35(5):1207–1216, May 2016.
3. Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia, MM '14, pages 675–678, New York, NY, USA, 2014. ACM.
4. A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1106–1114. 2012.
5. Q. Li, W. Cai, X. Wang, Y. Zhou, D. D. Feng, and M. Chen. Medical image classification with convolutional neural network. In 13th International Conference on Control Automation Robotics & Vision (ICARCV), pages 844–848. IEEE, 2014.
6. S. Y. Park and D. Sargent. Colonoscopic polyp detection using convolutional neural networks. In SPIE Medical Imaging, pages 978528–978528. International Society for Optics and Photonics, 2016.
7. Y. Qiu, Y. Wang, S. Yan, M. Tan, S. Cheng, H. Liu, and B. Zheng. An initial investigation on developing a new method to predict short-term breast cancer risk based on deep learning technology. In SPIE Medical Imaging, pages 978521–978521. International Society for Optics and Photonics, 2016.
8. R. K. Samala, H.-P. Chan, L. M. Hadjiiski, K. Cha, and M. A. Helvie. Deep-learning convolution neural network for computer-aided detection of microcalcifications in digital breast tomosynthesis. In SPIE Medical Imaging, pages 97850Y–97850Y. International Society for Optics and Photonics, 2016.
9. H. C. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Transactions on Medical Imaging, 35(5):1285–1298, May 2016.
10. N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, and J. Liang. Convolutional neural networks for medical image analysis: Full training or fine tuning? IEEE Transactions on Medical Imaging, 35(5):1299–1312, May 2016.
11. F. Xing, Y. Xie, and L. Yang. An automatic learning-based framework for robust nucleus segmentation. IEEE Transactions on Medical Imaging, 35(2):550–566, Feb 2016.
12. Z. Yan, Y. Zhan, Z. Peng, S. Liao, Y. Shinagawa, S. Zhang, D. N. Metaxas, and X. S. Zhou. Multi-instance deep learning: Discover discriminative local anatomies for bodypart recognition. IEEE Transactions on Medical Imaging, 35(5):1332–1343, May 2016.