Real-Time Image-based Smoke Detection in Endoscopic Videos

Andreas Leibetseder, Manfred Jürgen Primus, Stefan Petscharnig, Klaus Schoeffmann
Institute of Information Technology, Alpen-Adria University, 9020 Klagenfurt, Austria
{aleibets|mprimus|spetsch|ks}@itec.aau.at

ABSTRACT

The nature of endoscopy as a type of minimally invasive surgery (MIS) requires surgeons to perform complex operations by merely inspecting a live camera feed. Inherently, a successful intervention depends upon ensuring proper working conditions, such as skillful camera handling, adequate lighting and removal of confounding factors, such as fluids or smoke. The latter is an undesirable byproduct of cauterizing tissue and not only constitutes a health hazard for the medical staff and the treated patients but can also considerably obstruct the operating physician’s field of view. Therefore, as a standard procedure the gaseous matter is evacuated by using specialized smoke suction systems that typically are activated manually whenever considered appropriate. We argue that image-based smoke detection can be employed to automate this decision, while also being a useful indicator for relevant scenes in post-procedure analyses. This work represents a continued effort to previously conducted studies utilizing pre-trained convolutional neural networks (CNNs) and threshold-based saturation analysis [25]. Specifically, we explore further methodologies for comparison and provide as well as evaluate a public dataset comprising over 100K smoke/non-smoke images extracted from the Cholec80 dataset [43], which is composed of 80 different cholecystectomy procedures. Having applied deep learning to merely 20K images of a custom dataset, we achieve Receiver Operating Characteristic (ROC) curves enclosing areas of over 0.98 for custom datasets and over 0.77 for the public dataset. Surprisingly, a fixed threshold for saturation-based histogram analysis still yields areas of over 0.78 and 0.75.

CCS CONCEPTS

• Applied computing → Health informatics; • Computing methodologies → Neural networks; Image processing;

KEYWORDS

smoke detection; endoscopic surgery; deep learning; image processing; CNN classification

ACM Reference format:
Andreas Leibetseder, Manfred Jürgen Primus, Stefan Petscharnig, Klaus Schoeffmann. 2017. Real-Time Image-based Smoke Detection in Endoscopic Videos. In Proceedings of ThematicWorkshops’17, October 23–27, 2017, Mountain View, CA, USA, 9 pages. DOI: https://doi.org/10.1145/3126686.3126690

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
ThematicWorkshops’17, October 23–27, 2017, Mountain View, CA, USA
© 2017 ACM. ISBN 978-1-4503-5416-5/17/10. $15.00
DOI: https://doi.org/10.1145/3126686.3126690

1 INTRODUCTION

Surgery often constitutes a last resort for treating patients – with good reason, as inducing physical trauma to a body can be risky and in unfortunate scenarios even entail serious deleterious consequences. Regardless, this kind of practice is an essential component of human health care and, whenever its application is deemed appropriate, it can greatly benefit from constant technological advances: nowadays many medical interventions are carried out via minimally invasive surgery (MIS) employing high-tech devices in order to avoid inflicting more injuries than necessary upon patients, which constitutes a major improvement over traditional open surgery. Endoscopy is the most well-known form of MIS and it involves inserting a small camera – the endoscope – together with procedure-dependent instruments through natural openings or small artificially created incisions into the patient’s body. In this way, trained physicians are able to apply internal treatments by observing their actions on an external monitor displaying the transmitted video signal. Such an approach can be applied to a diverse range of medical procedures and is typically identified by the insertion locality of the endoscope: rhinoscopy, for example, is conducted via the nostrils, arthroscopy utilizes incisions above the joint cavities and in laparoscopy the abdominal wall is pierced through. Aside from partially requiring different special-purpose equipment, all endoscopic procedures share at least one common component: the camera device, which not only acts as the "eye of the surgeon" but is also perfectly suited for recording whole surgeries, simply for archiving purposes or in order to enable post-procedural examination: assessing prior treatment footage can in effect greatly assist the medical staff in creating patient case documentations, educating apprentices and planning further treatments.
In fact, modern hospitals already record endoscopic surgeries by default or are even required to by law, which creates manifold opportunities in the area of multimedia processing: on the one hand, the live video signal can be analyzed for real-time surgery support and, on the other hand, computer-aided techniques applied to existing media archives could considerably alleviate case revisitations. Aiming to contribute to both of these application areas within the scope of this work, we specifically target image classification of a common scenario during endoscopic surgeries: the emergence of smoke. Smoke is an unwanted byproduct of surgical electro- or laser cauterization, i.e. the deliberate burning of tissue in order to prevent or stop a patient from bleeding during surgery. The resulting


Figure 1: Sequential sample frames for surgical smoke development in various intensities: low (1a - 1f), medium (1g - 1l), high (1m - 1r).

gaseous matter is composed of water together with chemical, biological as well as physical particles, all of which are considered to be hazardous for patients and operating medical staff, as has been highlighted in various scientific articles [4, 7, 11, 17, 28, 33, 40]. Additionally, extensive development of smoke can seriously obstruct a physician’s field of view, which due to the dependence on a single camera feed already is somewhat limited in endoscopic interventions. Figure 1 portrays such situations showing three different scenes outlining a low, medium and high development of smoke. Considering all of the above-mentioned undesirable effects, it becomes apparent that surgical smoke should be removed urgently and safely, the latter of which surprisingly still at times is being disregarded: not uncommonly, practitioners simply release the hazardous fumes into the operating room via the stopcock of an endoscopic port [4]. The more considerate standard way of evacuating smoke is utilizing special medical-purpose suction systems, which in current practice typically are activated manually by the medical staff. Unfortunately, a simple evacuation strategy creates yet another problem in endoscopy: pressure loss. Most systems, therefore, are built to compensate for this situation by inducing surgical gas¹ in order to prevent bodily cavities from collapsing. Evidently, manually activating these systems early can be considered wasteful regarding resources, while late activation causes visual impairments to the operating surgeons. As a consequence, automatic evacuation would certainly serve as a solution for this dilemma, yet, current attempts follow the rather naive approach of linking the activation of cauterization instruments to the suction systems [9, 10, 39].
Pursuing such a strategy is very restrictive to hardware and can as well be wasteful, since actually burning tissue does not always correspond to merely heating up equipment. Therefore, we consider real-time image analysis in order to detect emerging smoke a more sophisticated, universally applicable approach.

¹ For instance, in laparoscopy, medical-grade CO2 is typically used [29].

Image-based smoke detection could not only resolve the problem of when to activate evacuation systems, it can also be utilized for post-procedural video analysis. Since cauterization often is part of situations critical to the applied treatment, such scenes are very likely to be of some relevance for surgeons when revisiting patient cases. Hence, automatically indicating segments containing smoke within hours-long video material can greatly alleviate the tedious task of having to manually peruse lengthy surgery recordings. This work is considered a continuation of the work presented in [25], where discovering smoke is treated as a binary classification task: two custom laparoscopic datasets are evaluated using a simple saturation histogram thresholding algorithm and two variants of convolutional neural network (CNN) classification with the GoogLeNet [38] architecture. Our contribution is the publication² as well as evaluation of a third public dataset comprising over 100K images extracted from the Cholec80 [43] database, i.e. a collection of 80 different cholecystectomy³ procedures, a more thorough exploration of the thresholding algorithm and the inclusion of the CNN AlexNet [21] architecture in all conducted experiments. The remainder of the paper is organized into four sections: related work outlined in Section 2, details about applied methodologies in Section 3, evaluations in terms of performance and runtime analyses in Section 4 and a concluding Section 5 summarizing scientific contributions as well as future considerations.

2 RELATED WORK

Computer-aided processing of endoscopic media is still in its infancy [31], yet, great potential can be found in improving a vast amount of tasks involved in various related types of surgeries [26]. While in general real-time surgery analysis aims at enhancing the video stream, filtering information or offering computerized decision support, post-processing approaches are mainly concerned

² Mapping files are available at http://www.itec.aau.at/ftp/datasets/Smoke_cholec80.
³ During a cholecystectomy the gallbladder is surgically removed.


Figure 2: SAN and SPA Classification: saturation histogram extraction and setting thresholds tp/tc for classification.

with quality assessment, data management and multimedia retrieval [31]. Unfortunately, drawing from the tremendous research conducted in non-medical domains [37] is somewhat limited for both application areas, since endoscopic recordings greatly differ from video content traditionally analyzed by the multimedia community: apart from internal organs being very different from non-anatomic objects, body tissue primarily appears in red-yellowish colors, hardly corresponding to any other application area like for example a soccer match that mainly exhibits greenish tones. Accordingly, research on endoscopic classification tasks is fairly scarce. Several studies have been conducted by Häfner et al.: in [14] they classify colonic mucosa using discrete wavelet-transformation on images as an input for a k-nearest neighbors (k-NN) as well as a Bayes classifier, in [15] they propose detecting colon cancer via pit pattern classification (Kudo et al. [23]) and in [16] they improve said classification technique for endoscopic images by developing a novel color texture operator optimized towards runtime performance as well as compactness. Regarding deep-learning-based methodologies, Park et al. [34] determine polyp regions in colonoscopy images with an accuracy of 90% when learning hierarchical features. Finally, Petscharnig et al. [35, 36] continue training AlexNet (Krizhevsky et al. [22]) for distinguishing between 14 different surgical shot types using a large database of gynecology images with an average accuracy of 48.67%. When contemplating the task of visual smoke detection, almost all conducted scientific studies target non-medical domains such as forest fire control [32, 44, 45], achieving classification results via image separation [41], optical flow computation [8, 20] or pattern recognition [12, 13, 42].
As already hinted above, applying such methodologies to the medical domain is very limited, since aside from the distinct color palette difference also the lighting conditions diverge strongly. As for detecting smoke in the context of endoscopy, apart from the non-vision-based examination of smoke evacuation benefits conducted by Takahashi et al. [39] and a Sony Corporation patent outlining an image-based system using motion blur as well as pixel block analysis [6], we could only discover one related scientific article: Loukas et al. [27] evaluate 76 cholecystectomy shots of 26-58 frames, which amounts to 1976-4408 images, via extracting their space-time optical flow together with a few kinematic features and utilize a one-class support vector machine (OCSVM) for classification. They further compare their results with several fire surveillance methods based on wavelet-based image decomposition [5, 13, 24], outperforming them by about 20% (83-86% vs. 63% accuracy).

3 COMPARED METHODOLOGIES

In total, we compare five smoke classification approaches, three of which have been developed in [25], yet not evaluated on the newly introduced dataset (see Section 4.1). The following Section 3.1 outlines the two employed saturation-based thresholding techniques and Section 3.2 describes the three methodologies based on deep learning.

3.1 Saturation Analysis

As our previous study [25] indicated, images containing smoke appear colorless. Hence, the saturation channel of the HSV color space shows a correlation with smoke development. However, it has also been pointed out that in order to accurately measure saturation drops, the channel’s average distribution for non-smoke images must be known beforehand, because it is required for estimating a good classification threshold t_c for the formerly proposed method labeled Saturation Peak Analysis (SPA). Although performing well for empirically identified t_c values, this generally renders the methodology inapplicable as well as incomparable to the CNN-based approaches. Accordingly, for this evaluation we set a fixed threshold calculated from the ones on average performing best for datasets A and B⁴: t_c = AVG(0.20, 0.25, 0.30, 0.40, 0.45, 0.50) = 0.35. Furthermore, we additionally evaluate the naive thresholding approach in its own right, which previously was used just as a fallback classifier in case SPA could not determine any local maximum above peak threshold t_p = 0.35. We simply call the method Saturation Analysis (SAN) and it classifies smoke S in saturation histogram H (pred_S(H)) if it determines that the majority of bin values are below t_c. Its modus operandi compared to SPA is illustrated in Figure 2 and defined via Formulas 1 and 2:

⁴ See Section 4.1 for details about datasets.
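Since both methods hinge on this saturation cue, it helps to recall how the channel is derived. A minimal NumPy sketch of the standard RGB-to-HSV saturation formula follows (in practice OpenCV performs this conversion; the function name is ours):

```python
import numpy as np

def saturation(rgb):
    """HSV saturation per pixel: S = (max - min) / max, and 0 where max == 0.
    Smoke washes colors out, so the channels converge and S drops."""
    rgb = np.asarray(rgb, dtype=float)
    mx = rgb.max(axis=-1)
    mn = rgb.min(axis=-1)
    # np.maximum guards the division; np.where zeroes out black pixels
    return np.where(mx > 0, (mx - mn) / np.maximum(mx, 1e-12), 0.0)
```

A uniformly gray (smoke-like) patch thus yields saturation 0, while a pure red patch yields saturation 1.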

ThematicWorkshops’17, , October 23–27, 2017, Mountain View, CA, USA

pred_S(H) = (1/|H|) · Σ_{i=0, b_i ∈ H ∧ b_i ≤ t_c} b_i,    (1)

pred_NS(H) = 1 − pred_S(H),    (2)

where b_i is the i-th bin value of histogram H (|H| = 256) and pred_NS(H) the prediction for no smoke. SPA is left unchanged, meaning that SAN still acts as its fallback methodology whenever its classification strategy fails. In short, SPA calculates local maxima (peaks) above threshold t_p and considers their proportion according to t_c. It is defined by Formulas 3 and 4:

pred_S(pk(H)) = |{p | p ∈ pk(H) ∧ p ≤ t_c}| / |pk(H)|,    (3)

pred_NS(pk(H)) = 1 − pred_S(pk(H)),    (4)

where function pk(H) ⊂ ℕ0 computes the set of peak positions for t_p = 0.35 and t_c = 0.35 as proposed in [25].
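The two classifiers can be sketched as follows, assuming histogram bin values normalized to [0, 1] and peak positions normalized by the histogram length before comparison against t_c. The normalization details are not fully spelled out above, so treat this as one illustrative reading of Formulas 1-4 rather than the exact reference implementation; all names are ours:

```python
import numpy as np

T_P = 0.35  # peak threshold t_p
T_C = 0.35  # classification threshold t_c

def saturation_histogram(sat_channel, bins=256):
    """Histogram of a saturation channel in [0, 1], bin values scaled to [0, 1]."""
    h, _ = np.histogram(sat_channel, bins=bins, range=(0.0, 1.0))
    return h / h.max() if h.max() > 0 else h.astype(float)

def san_pred_smoke(hist, t_c=T_C):
    """SAN, reading Formula 1 as averaging the bin values not exceeding t_c."""
    b = np.asarray(hist, dtype=float)
    return b[b <= t_c].sum() / b.size

def peaks(hist, t_p=T_P):
    """Positions of local maxima whose normalized value exceeds t_p."""
    b = np.asarray(hist, dtype=float)
    return np.array([i for i in range(1, b.size - 1)
                     if b[i] > b[i - 1] and b[i] > b[i + 1] and b[i] > t_p],
                    dtype=int)

def spa_pred_smoke(hist, t_p=T_P, t_c=T_C):
    """SPA (Formula 3): share of peaks at low saturation positions,
    with SAN as the fallback when no peak clears t_p."""
    pk = peaks(hist, t_p)
    if pk.size == 0:
        return san_pred_smoke(hist, t_c)
    positions = pk / (len(hist) - 1)          # normalize positions to [0, 1]
    return (positions <= t_c).sum() / pk.size
```

Under this reading, a histogram whose only strong peak sits at low saturation yields a SPA smoke confidence of 1.0, while a peak at high saturation yields 0.0.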

3.2 CNN Classification

When evaluating two variants of the 22-layer GoogLeNet [38] architecture in [25], in particular one using standard RGB colors (GLN RGB) and another trained on saturation channel images only (GLN SAT), their slower runtime performances have been pointed out as drawbacks for real-time classification. Therefore, we add a shallower 8-layer AlexNet [21] variant (ALEX RGB) to the evaluations conducted in this work, which for comparability reasons is trained in the same fashion as GLN RGB: 20K images are taken from custom dataset A, exhibit an even distribution, i.e. about half of them show various intensities of smoke while the other half does not, and are applied for training as well as validation in an 80:20 split. The resulting model is acquired after 40 minutes on a machine running Linux Mint 17.3 (64-bit) [1] with the following hardware specs: Intel Core i7-3770K CPU @ 3.50GHz x 4, 16 GiB DDR3 @ 1333 MHz, Nvidia GeForce GTX 980 Ti. For improving the model’s performance, several training attempts have been carried out with varying learning rates⁵ – an optimal setting was found using Adam [19] with an initial learning rate of 0.0001. Since epoch 100 exhibits good performance in terms of accuracy and loss (99.51% vs. 0.06), it has been chosen for all evaluations within this study.
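The balanced 80:20 split described above can be sketched as follows. This is an illustrative stand-in, not the authors' actual data pipeline; all names are ours:

```python
import random

def balanced_split(smoke_paths, non_smoke_paths, train_frac=0.80, seed=42):
    """Shuffle each class separately, then cut each at 80%, so both the
    training and the validation set stay roughly class-balanced."""
    rng = random.Random(seed)
    train, val = [], []
    for label, paths in ((1, list(smoke_paths)), (0, list(non_smoke_paths))):
        rng.shuffle(paths)
        cut = int(len(paths) * train_frac)
        train += [(p, label) for p in paths[:cut]]
        val += [(p, label) for p in paths[cut:]]
    rng.shuffle(train)  # mix the classes for training
    return train, val
```

Splitting per class before merging is what keeps the smoke/non-smoke ratio intact in both partitions.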

4 EXPERIMENTAL RESULTS

This section covers detailed results and comparisons of all five above outlined methodologies. Section 4.1 introduces all three evaluated datasets (DS) with a focus on the newly created DS C. In subsequent Section 4.2, the retrieved classification results on all datasets are analyzed thoroughly, while individually concentrating on the proposed approaches. Afterwards, a runtime performance evaluation of all methodologies is conducted in Section 4.3 and the final Section 4.4 sets them in relation to each other as well as other studies by discussing advantages and drawbacks.

⁵ For model training, the popular deep learning framework Caffe [18] is used.


4.1 Datasets

In [25], two manually extracted datasets A (DS A) and B (DS B) are evaluated using GLN RGB, GLN SAT and SPA. DS A consists of 30K images of laparoscopic scenes, 20K of which, as mentioned, have been used for training the CNN methods. The remaining approximately 10K images served for evaluations and comparisons with DS B, containing close to 4.5K laparoscopy images taken from a different video database than DS A, which manifests in the datasets’ overall differences in color spectrum and image saturation. As an additional means of performance comparison, in this study we provide a new dataset DS C⁶, likewise representing a ground truth for smoke/non-smoke image sequences. It includes manually selected shots comprising at least ten sequential frames, which are extracted from all 80 videos contained in the publicly available Cholec80 dataset [43]. Altogether we annotated 100K frames and in particular obtain between 200-1300 samples of each class in every video, ensuring an unbiased representation of the conducted surgeries. In order to establish comparability to DS A and B, an additional factor has to be taken into account: when regarding the Cholec80 database, that is to say images of DS C, the content area of endoscopic videos predominantly is surrounded by a dark circle of variable shape and position, which in fact is typical for endoscopic media due to the composition of the camera device. While not representing a big issue for surgical treatment, it exerts a big impact on image analysis, as can be observed in Figure 3a, illustrating an image taken from the Cholec80 database together with its saturation histogram.

Figure 3: Saturation histogram comparison for Cholec80, when using the entire image (3a: DS C.1, original) vs. using a rectangular center crop (3b: DS C.2, ROI).

At first glance, the depicted histogram’s bin values seem not to be normalized to the height of Figure 3a. However, when investigating the picture more closely, the lower- and uppermost bins exhibit very high values – noise introduced by the black borders that greatly skews the image for analysis. In order to prevent such distortions, we could entirely remove the dark areas by applying accurate detection mechanisms such as [30], yet, for GoogLeNet as well as AlexNet classification a square pixel dimension input is required, rendering simple region of interest (ROI) extraction

⁶ DS C is obtainable via utilizing the provided mapping files available at http://www.itec.aau.at/ftp/datasets/Smoke_cholec80 for extracting images from the Cholec80 dataset.

sufficient, as is shown in Figure 3b. The green rectangle indicates the extraction area set to 65% of both image width and height⁷, which, although not perfect for every video, removes most of the noise for the whole dataset. In order to explore whether proceeding in such a manner affects results, we run evaluations on original images as well as extracted images and further on refer to the original dataset as DS C.1 and to the extracted ROI images as DS C.2. It is not necessary to apply the same procedure to DS A and B because their source videos have exclusively been recorded using a zoomable endoscopic device, effectively removing the dark circle by magnifying the video up until a point where it is not visible any more. Finally, as already hinted, all test datasets (A, B, C.1, C.2) are evenly distributed, meaning that about 50% of their images exhibit various degrees of smoke (cf. Figure 1) and the other 50% show no visual signs of smoke. No regard is (yet) given to the different smoke intensities, various lighting conditions, blurriness or other effects such as reflections: for instance, an out-of-focus image showing low amounts of smoke is considered an image containing smoke. The given datasets are strictly evaluated via all five previously outlined classification approaches: ALEX RGB, GLN RGB, GLN SAT, SAN and SPA with threshold t_c = 0.35.

⁷ For evaluations, in equivalence to DS A and B, we resize the extracted image to GLN’s and AlexNet’s input resolutions of 256x256 pixels.

Table 1: Evaluation results for Datasets A, B, C.1 and C.2, cc = 0.50.

(a) DS A.

Method     Accuracy  Precision  Sensitivity  Specificity  F1
ALEX RGB   0.914     0.935      0.890        0.938        0.912
GLN RGB    0.942     0.952      0.932        0.953        0.942
GLN SAT    0.870     0.906      0.826        0.914        0.864
SAN 0.35   0.839     0.877      0.789        0.889        0.831
SPA 0.35   0.816     0.886      0.726        0.907        0.798

(b) DS B.

Method     Accuracy  Precision  Sensitivity  Specificity  F1
ALEX RGB   0.671     0.607      0.993        0.340        0.753
GLN RGB    0.779     0.697      0.998        0.555        0.821
GLN SAT    0.914     0.879      0.962        0.864        0.919
SAN 0.35   0.520     0.513      1.000        0.027        0.679
SPA 0.35   0.576     0.544      1.000        0.141        0.705

(c) DS C.1.

Method     Accuracy  Precision  Sensitivity  Specificity  F1
ALEX RGB   0.619     0.751      0.357        0.881        0.484
GLN RGB    0.668     0.796      0.453        0.884        0.577
GLN SAT    0.576     0.720      0.248        0.904        0.368
SAN 0.35   0.658     0.707      0.540        0.776        0.612
SPA 0.35   0.658     0.706      0.539        0.776        0.612

(d) DS C.2.

Method     Accuracy  Precision  Sensitivity  Specificity  F1
ALEX RGB   0.668     0.733      0.528        0.807        0.614
GLN RGB    0.711     0.771      0.600        0.822        0.675
GLN SAT    0.676     0.751      0.528        0.825        0.620
SAN 0.35   0.695     0.776      0.548        0.842        0.642
SPA 0.35   0.697     0.771      0.560        0.834        0.649

(e) Average for A, B and C.2.

Method     Accuracy  Precision  Sensitivity  Specificity  F1
ALEX RGB   0.751     0.758      0.803        0.695        0.760
GLN RGB    0.811     0.807      0.843        0.777        0.812
GLN SAT    0.820     0.845      0.711        0.868        0.801
SAN 0.35   0.685     0.722      0.779        0.586        0.717
SPA 0.35   0.696     0.734      0.762        0.627        0.717
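The ROI extraction for DS C.2 can be sketched as follows. This is illustrative only: the actual evaluations use OpenCV resizing, while this dependency-free sketch substitutes naive nearest-neighbor sampling; all names are ours:

```python
import numpy as np

def center_crop(img, frac=0.65):
    """Keep the central frac x frac region, discarding the dark border area."""
    h, w = img.shape[:2]
    ch, cw = int(h * frac), int(w * frac)
    top, left = (h - ch) // 2, (w - cw) // 2
    return img[top:top + ch, left:left + cw]

def resize_nearest(img, size=256):
    """Naive nearest-neighbor resize to the CNNs' square input size
    (in practice cv2.resize with proper interpolation would be used)."""
    h, w = img.shape[:2]
    rows = np.arange(size) * h // size  # source row per output row
    cols = np.arange(size) * w // size  # source column per output column
    return img[rows][:, cols]
```

Applied to a 1080p frame, the 65% crop yields a 702x1248 region, which is then squeezed to the 256x256 network input.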

4.2 Evaluation

The entirety of results is shown in Table 1, listing specific evaluation metrics for all proposed methodologies and datasets using a particular prediction confidence of cc = 0.50: when for example predicting smoke, this value implies that any classifier needs a 50% or higher confidence for such a classification to count. Such a specific value for cc is chosen to establish comparability for the table listings – a more comprehensive and visual comparison is given in Figure 4 showing corresponding receiver operating characteristic (ROC) curve plots, which contrast the true positive rate or classification accuracy for smoke (sensitivity) against the false positive rate, i.e. the probability of falsely classifying smoke (1 − specificity). The area under the curve (AUC) of a method then describes its overall performance for various cc values: an AUC of 1.0 identifies a perfect classifier, whereas 0.5 (diagonal dashed line) means a method does not perform better than a random classifier. When inspecting Table 1a indicating results for DS A, GLN RGB, as discovered in [25], still exhibits the highest performance in every metric: it correctly classifies 93.2% of all smoke samples (sensitivity), while at the same time, with 95.3% specificity, correctly classifying even more non-smoke samples. The method’s overall achieved accuracy of 94.2% is closely followed by ALEX RGB, performing almost equally well with an accuracy of 91.4% and a sensitivity of
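The AUC over varying cc values can be computed directly from per-image smoke confidences via the rank-statistic equivalence AUC = P(score of a smoke image > score of a non-smoke image). The sketch below is an illustrative stand-in for a library routine such as scikit-learn's roc_auc_score, not the evaluation code used here:

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive
    and negative scores; ties between scores count as 1/2."""
    labels = np.asarray(labels)
    scores = np.asarray(scores, dtype=float)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)
```

A classifier that ranks every smoke image above every non-smoke image scores 1.0; constant scores yield 0.5, the random baseline.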



Figure 4: ROC curve comparison for Datasets A, B, C.1 and C.2.

89.0% and a specificity of 93.8%. In the same fashion, i.e. performing slightly worse than ALEX RGB, GLN SAT’s prediction accuracy drops to 87.0%, with 82.6% sensitivity and 91.4% specificity. Interestingly, the naive SAN seems to perform marginally better than SPA (83.9% vs. 81.6% accuracy) when merely considering DS A’s metrics in Table 1a, yet, looking at the ROC curve plot in Figure 4a comparing the AUCs for DS A, SPA outperforms SAN, albeit only with a margin of about 0.07 (0.9064 vs. 0.8391). The other methodologies’ AUCs exactly match the ranking reflected by Table 1a: GLN RGB (0.9862), ALEX RGB (0.9709) and GLN SAT (0.9415). Table 1b for DS B differs significantly from the former table: GLN SAT outperforms all other approaches with 91.4% accuracy, 96.2% sensitivity and 86.4% specificity. GLN RGB still exhibits a somewhat acceptable accuracy of 77.9%, although showing very imbalanced sensitivity and specificity values (99.8% vs. 55.5%). ALEX RGB performs even worse (67.1% accuracy) and SPA as well as SAN even produce accuracies below 60.0% (57.6% vs. 52.0%). The ROC curve plots for DS B in Figure 4b, however, only partially reflect these observations: as expected, GLN SAT performs best with an AUC of 0.9822, while GLN RGB ranks second with a surprising

AUC of 0.9769 and ALEX RGB still achieves an AUC of 0.8840. The saturation-based methods both exhibit worse values, yet only SAN truly acts as a random classifier with an AUC of 0.5136 – SPA unexpectedly accomplishes an AUC of 0.7862. Evaluations for DS C.1 as well as C.2 are listed in Table 1c and Table 1d. Judging by the obvious performance increase of the C.2 result values over the C.1 ones, which is reflected as well in the corresponding ROC curve plots (Figure 4c and Figure 4d), it becomes apparent that the ROI extraction is beneficial for every approach, in particular GLN SAT, the accuracy of which improves by as much as 10% (57.6% for C.1 vs. 67.6% for C.2). Due to this significance, we further merely consider evaluations for C.2. Apart from uniformly performing about 10-20% worse than on the other datasets, all methodologies additionally seem to be troubled by classifying smoke, as is shown by the low sensitivity of at most 60%, which stands in contrast to the specificity revolving around 83%. Overall, GLN RGB slightly outperforms the other methods, yet all exhibit accuracies between 66.8% and 71.1%, which also is portrayed by the ROC curve plot in Figure 4d, as they are ranked tightly after each other: GLN RGB (0.7774), SPA (0.7550), GLN SAT (0.7496), ALEX RGB (0.7358), SAN (0.6951).

Finally, an overall performance evaluation is characterized in Table 1e, listing the average metrics from Tables 1a (DS A), 1b (DS B) and 1d (DS C.2). GLN RGB and GLN SAT seem to perform equally well, showing accuracies, sensitivities and specificities between 71.1% and 86.8%. The values for ALEX RGB are slightly worse, but still around 69.5% to 80.3%, while SAN and SPA exhibit the worst results, albeit merely slightly lower than ALEX RGB (58.6%-77.9%).

4.3 Runtime Evaluation

Execution times are critical when requiring results in real-time⁸. Therefore, we list the average wall clock timings for all proposed methodologies in Table 2 in order to get a better understanding of their capabilities. Two metrics and their total are considered when working with standard HD images: preparation and classification. We use Python [3] implementations for evaluating the datasets and although the preparation step mainly involves applying OpenCV [2] image processing tasks, such as changing color spectra, resizing frames and extracting saturation histograms, it also includes custom components like SPA’s local maxima detection. Classification for CNNs is conducted on the GPU⁹ in order to achieve the highest possible performance and for the saturation-based approaches simple Python functions suffice.

Table 2: Avg. runtime performance per 1080p image in ms.

Method     Preparation  Classification  Overall
SAN        7.726        0.093           7.818
SPA        12.487       0.006           12.493
GLN SAT    18.847       75.307          94.154
ALEX RGB   45.386       60.952          106.338
GLN RGB    45.132       105.223         150.355

The measurement results are expectedly much worse for CNN-based approaches: clearly, the many-layered GLN RGB performs slowest with a pure classification time of about 105 ms, even though already using a highly parallelized GPU implementation for performance optimization. GLN SAT is slightly faster and achieves a classification time of circa 75 ms. At last, ALEX RGB works best among CNN approaches when merely accounting for classification time. Both saturation-based methods SAN and SPA are considerably faster classifiers, as with 0.09 and 0.006 ms they easily outperform any other method. When additionally considering preparation, overall classification times worsen, since depending on the task at hand, processing 1080p images can be time-consuming: for instance, resizing an RGB image to the 256x256 pixel CNN input format consumes most of the time. However, conducting the same process on a one-channel image certainly is a lot faster, which is why, although ALEX RGB outperforms GLN SAT in terms of classification time, the latter makes up for it by featuring a much shorter preparation time (19 ms vs. 45 ms). As the preparation time of the saturation-based approaches (8-13 ms) can be considered negligible, the overall runtime performance ranking on the employed test computer is as follows: SAN (8 ms), SPA (13 ms), GLN SAT (94 ms), ALEX RGB (106 ms), GLN RGB (150 ms).

⁸ In a 25 fps video, real-time requirements amount to 1000/25 = 40 ms per image.
⁹ See Section 3.2 for full system specifications.
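A per-frame wall-clock measurement against the 40 ms real-time budget can be sketched as follows. This is an illustrative harness with a hypothetical preparation stage, not the timing code behind Table 2:

```python
import time
import numpy as np

REAL_TIME_BUDGET_MS = 1000 / 25  # 25 fps -> 40 ms per frame

def time_per_frame(fn, frame, repeats=10):
    """Average wall-clock milliseconds of fn(frame) over several runs."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn(frame)
    return (time.perf_counter() - start) / repeats * 1000.0

def prepare(frame):
    """Hypothetical preparation stage: saturation-histogram extraction
    (here channel 1 stands in for the HSV saturation channel)."""
    sat = frame[:, :, 1]
    hist, _ = np.histogram(sat, bins=256, range=(0, 256))
    return hist

# Time the stage on a synthetic 1080p frame.
frame = np.random.randint(0, 256, (1080, 1920, 3), dtype=np.uint8)
ms = time_per_frame(prepare, frame)
```

Comparing such a measurement against REAL_TIME_BUDGET_MS is what separates the real-time-capable saturation methods from the CNN pipelines above.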

4.4 Discussion

When contemplating the outcomes for all evaluated datasets, evidently the color image GoogLeNet (GLN RGB) yields the best performance, obtaining ROC AUCs greater than 0.77 on every dataset. Nevertheless, the method also is the computationally most expensive choice and requires nearly 150 ms for a single high definition image classification, which technically renders it unsuitable for real-time classification. In practice, however, as can be observed in Figure 1, smoke development typically does not fluctuate very much across frames, that is to say that skipping predictions for some images very likely does not exert a significant impact on the classifier’s performance in live systems. Alternatively, AlexNet using colored images (ALEX RGB) offers a runtime performance increase of close to 30% with a total of around 106 ms and, although performing quite well for DS A (AUC: 0.9709), its lowest achieved ROC AUC is below that of GoogLeNet fed with saturation images (GLN SAT) – seemingly not by much for DS C.2 (AUCs: 0.7358 vs. 0.7496) but considerably when regarding DS B (AUCs: 0.8840 vs. 0.9822). Additionally, GLN SAT maintains the advantage of only having to process one channel value per pixel, hence, it even is faster in terms of computation speed (∼94 ms). Due to this observation, it can be expected that feeding AlexNet with saturation-only images should result in the best runtime performance among the proposed CNN methodologies, which is thus planned for future investigations. When lastly considering the non-CNN-based approaches, it becomes apparent that SAN performs worst because, although showing a high AUC on DS A (0.8391), it is nevertheless outperformed by all other methodologies on every dataset.
SPA, on the other hand, performs remarkably well with AUCs of at least 0.7555, considering that a fixed threshold of tc = 0.35 is chosen for all evaluations – in contrast to the experiments conducted in [25], where the best threshold for every dataset was determined empirically. Choosing a non-optimal threshold severely manifests itself only in the evaluations for DS B, where performance rapidly decreases with increasing thresholds: raising the optimal tc = 0.30 to tc = 0.35 results in a performance loss of about 20% (AUCs: 0.9770 vs. 0.7862). Nonetheless, both saturation-based approaches are truly capable of real-time classification, exhibiting a maximum runtime of around 13 ms.

As for comparing the outcomes across datasets, C.2 seems to provide images that are considerably harder to classify, which can be attributed to several reasons. First and foremost, the dataset is much larger – 100K images in DS C.2 vs. 10K test images from DS A and 4.5K in DS B – which leaves more room for misclassification. Secondly, all proposed approaches were developed on DS A, i.e. the dataset on which all CNNs were trained and from which the parameters of the non-CNN-based methods were determined, as pointed out in [25]. Accordingly, compared to the results retrieved from evaluating DS A (Table 1a, Figure 4a), all other evaluations generally yield worse results with regard to overall methodology performance. In future studies it would therefore be interesting to base the classifiers on C.2, other datasets or even their combination, and to compare the outcomes to evaluations on different datasets, such as DS A and B.

ThematicWorkshops'17, October 23–27, 2017, Mountain View, CA, USA

Finally, considering the most relevant study by Loukas et al. [27], we can identify deficits similar to those of our DS A evaluations: they likewise develop their classifier on a specific custom dataset comprising even fewer images than DS A (1.9K–4.4K vs. 30K) and achieve high classification accuracies of 86% specificity as well as 83% sensitivity. Hence, it seems imperative that they too test their approach on other datasets – for instance, on C.2 provided with this study. Additionally, since wavelet-based fire surveillance methods have been developed in completely different contexts, as the authors themselves point out, their comparison to these methods – which merely exhibit an AUC of 0.63 – unfortunately seems of arguable significance, as we already noted in Section 2.
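The fixed-threshold saturation analysis can be illustrated as follows. This is a deliberately simplified sketch, not the exact SPA algorithm of [25]: it flags a frame as smoke when the fraction of desaturated pixels exceeds the fixed threshold tc = 0.35; the low-saturation cutoff of 0.3 is an assumed parameter for illustration only.

```python
def saturation(pixel):
    """HSV saturation of one RGB pixel with components in [0, 1]."""
    mx, mn = max(pixel), min(pixel)
    return (mx - mn) / mx if mx > 0 else 0.0

def is_smoke(image, tc=0.35, low_sat=0.3):
    """Flag smoke when the share of desaturated pixels exceeds tc.

    Smoke scatters light and washes out colors, shifting the saturation
    histogram towards low values.
    """
    pixels = [p for row in image for p in row]
    frac = sum(saturation(p) < low_sat for p in pixels) / len(pixels)
    return frac > tc

# Grayish (smoke-like) vs. strongly colored (tissue-like) dummy frames:
smoky = [[(0.80, 0.79, 0.81)] * 4 for _ in range(4)]
clear = [[(1.00, 0.10, 0.10)] * 4 for _ in range(4)]
```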

5 CONCLUSION

Primarily addressing real-time binary classification of smoke in endoscopic videos, we evaluate the performance of several deep-learning-based methods and two saturation-histogram-based thresholding algorithms. Continuing the work presented by Leibetseder et al. [25], in addition to training GoogLeNet [38] using full-color as well as saturation-only images, we explore AlexNet [21] fed with full-color images. Furthermore, besides a more sophisticated approach for analyzing image saturation histograms, we investigate a much simpler, naive algorithm. We also provide a large new public dataset of cholecystectomy sequences from the Cholec80 database [43], part of which shows smoke in various intensities while the other part shows no visual signs of smoke.

Overall, full-color GoogLeNet shows the best classification performance, yet it exhibits the worst runtime performance. Real-time systems could likely compensate for this deficiency by skipping frames. As an alternative, the shallower AlexNet trained with full-color images can be utilized, showing good classification performance but altogether achieving slower computation speeds than GoogLeNet trained on saturation-only images. Nevertheless, both alternatives are not fully real-time capable; hence, it is interesting to observe that fixing the threshold of the more sophisticated saturation-histogram thresholding approach yields competitive performance on almost all datasets. Lastly, the naive thresholding algorithm is outperformed by all other methodologies, yet on occasion it still achieves reasonable results.

Judging by the above observations, we deem it feasible to devise applications for real-time surgical smoke detection in medical-grade systems. Devices incorporating such mechanisms could then be utilized to automatically activate smoke evacuation devices or be employed for procedural post-processing of medical archives. Regarding real-time capabilities, we argue that non-evacuated smoke develops slowly across several frames, which leads us to believe that dropping frames should suffice for approaches incapable of producing predictions in real time. This, however, needs to be explored more thoroughly in future studies.

In future work, we will investigate and devise further detection methodologies, with an emphasis on applicable approaches we may find in the literature. Specifically, an interesting aspect will be attempting to improve CNN-based methods by using the extensive dataset provided with this study. Analyzing variations in image saturation seems to work well for deep learning as well as for the proposed thresholding algorithms. Therefore, we plan on further exploring CNNs trained with saturation images and on examining automatic threshold determination, which could potentially be achieved by histogram shifting. Furthermore, combining thresholding observations with CNN model predictions could also greatly improve the achieved results. In conclusion, we hope to encourage others to take advantage of our public dataset so that the problem at hand can be brought to the attention of a wider audience.
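The suggested combination of thresholding observations with CNN predictions could, for example, take the shape of a simple late fusion; a hypothetical sketch, where the weights and the 0.5 decision cutoff are illustrative assumptions rather than results of this study:

```python
def fused_smoke_score(cnn_prob, low_sat_fraction, w_cnn=0.7, w_sat=0.3):
    """Late fusion of a CNN smoke probability and a saturation cue.

    Both inputs lie in [0, 1]; the weighted sum is again in [0, 1] and can
    be thresholded to obtain a binary smoke/non-smoke decision.
    """
    return w_cnn * cnn_prob + w_sat * low_sat_fraction

def fused_decision(cnn_prob, low_sat_fraction, threshold=0.5):
    """Binary decision from the fused score (assumed cutoff of 0.5)."""
    return fused_smoke_score(cnn_prob, low_sat_fraction) > threshold
```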

ACKNOWLEDGMENTS

We are grateful to our colleague Sabrina Kletz, who helped with creating the datasets and contributed valuable ideas. This work was supported by Universität Klagenfurt and Lakeside Labs GmbH, Klagenfurt, Austria, and by funding from the European Regional Development Fund and the Carinthian Economic Promotion Fund (KWF) under grant KWF 20214 u. 3520/26336/38165.

REFERENCES

[1] 2006. Linux Mint 17.3 "Rosa" - Cinnamon (64-bit). https://linuxmint.com/edition.php?id=204 Accessed: 2017-03-28.
[2] 2017. OpenCV library. http://opencv.org/
[3] 2017. Python programming language. https://www.python.org/
[4] O. S. Al Sahaf, I. Vega-Carrascal, F. O. Cunningham, J. P. McGrath, and F. J. Bloomfield. 2007. Chemical composition of smoke produced by high-frequency electrosurgery. Irish Journal of Medical Science 176, 3 (2007), 229–232. DOI:http://dx.doi.org/10.1007/s11845-007-0068-0
[5] Simone Calderara, Paolo Piccinini, and Rita Cucchiara. 2011. Vision based smoke detection system using image energy and color information. Machine Vision and Applications 22, 4 (jul 2011), 705–719. DOI:http://dx.doi.org/10.1007/s00138-010-0272-1
[6] Chen-Rui Chou and Ming-Chang Liu. 2016. System and Method for Smoke Detection During Anatomical Surgery. https://www.google.com/patents/US20160239967
[7] Seock Hwan Choi, Tae Gyun Kwon, Sung Kwang Chung, and Tae Hwan Kim. 2014. Surgical smoke may be a biohazard to surgeons performing laparoscopic surgery. Surgical Endoscopy and Other Interventional Techniques 28, 8 (2014), 2374–2380. DOI:http://dx.doi.org/10.1007/s00464-014-3472-3
[8] Yu Chunyu, Fang Jun, Wang Jinjun, and Zhang Yongming. 2010. Video Fire Smoke Detection Using Motion and Color Features. Fire Technology 46, 3 (jul 2010), 651–663. DOI:http://dx.doi.org/10.1007/s10694-009-0110-z
[9] Ioan Cosmescu. 1991. Automatic smoke evacuator system for a surgical laser apparatus and method therefor. https://www.google.com/patents/US5199944
[10] Ioan Cosmescu. 2006. Automatic smoke evacuator and insufflation system for surgical procedures. https://www.google.com/patents/US20070249990
[11] M. Dobrogowski, W. Wesołowski, M. Kucharska, A. Sapota, and L. Pomorski. 2014. Chemical composition of surgical smoke formed in the abdominal cavity during laparoscopic cholecystectomy – Assessment of the risk to the patient. International Journal of Occupational Medicine and Environmental Health 27, 2 (jan 2014), 314–325. DOI:http://dx.doi.org/10.2478/s13382-014-0250-3
[12] Ricardo J Ferrari, Hong Zhang, and C Ronald Kube. 2007. Real-time detection of steam in video images. Pattern Recognition 40, 3 (2007), 1148–1159.
[13] J. Gubbi, S. Marusic, and M. Palaniswami. 2009. Smoke detection in video using wavelets and support vector machines. Fire Safety Journal 44, 8 (2009), 1110–1115. DOI:http://dx.doi.org/10.1016/j.firesaf.2009.08.003
[14] M. Häfner, A. Gangl, M. Liedlgruber, A. Uhl, A. Vécsei, and F. Wrba. 2009. Combining Gaussian Markov random fields with the discrete wavelet transform for endoscopic image classification. In DSP 2009: 16th International Conference on Digital Signal Processing, Proceedings. DOI:http://dx.doi.org/10.1109/ICDSP.2009.5201226
[15] M. Hafner, A. Gangl, M. Liedlgruber, A. Uhl, A. Vecsei, and F. Wrba. 2010. Endoscopic Image Classification Using Edge-Based Features. In 2010 20th International Conference on Pattern Recognition. IEEE, 2724–2727. DOI:http://dx.doi.org/10.1109/ICPR.2010.667

[16] M. Häfner, M. Liedlgruber, A. Uhl, A. Vécsei, and F. Wrba. 2012. Color treatment in endoscopic image classification using multi-scale local color vector patterns. Medical Image Analysis 16, 1 (2012), 75–86. DOI:http://dx.doi.org/10.1016/j.media.2011.05.006
[17] C Hensman, D Baty, RG Willis, and A Cuschieri. 1998. Chemical composition of smoke produced by high-frequency electrosurgery in a closed gaseous environment. Surgical Endoscopy (1998). http://www.springerlink.com/index/3PDVCC89D248BJT0.pdf
[18] Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. 2014. Caffe: Convolutional architecture for fast feature embedding. In Proceedings of the 22nd ACM International Conference on Multimedia. ACM, 675–678.
[19] Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[20] I. Kolesov, P. Karasev, A. Tannenbaum, and E. Haber. 2010. Fire and smoke detection in video with optimal mass transport based optical flow and neural networks. In 2010 IEEE International Conference on Image Processing. IEEE, 761–764. DOI:http://dx.doi.org/10.1109/ICIP.2010.5652119
[21] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger (Eds.). Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[22] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet Classification with Deep Convolutional Neural Networks. Curran Associates, Inc., 1097–1105. http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf
[23] S Kudo, S Hirota, T Nakajima, S Hosobe, H Kusaka, T Kobayashi, M Himori, and A Yagyuu. 1994. Colorectal tumours and pit pattern. Journal of Clinical Pathology 47, 10 (oct 1994), 880–885. DOI:http://dx.doi.org/10.1136/JCP.47.10.880
[24] C Y Lee, C T Lin, C T Hong, and M T Su. 2012. Smoke detection using spatial and temporal analyses. International Journal of Innovative Computing, Information and Control 8, 7A (2012), 4749–4770.
[25] Andreas Leibetseder, Manfred Jürgen Primus, Stefan Petscharnig, and Klaus Schoeffmann. 2017. Image-based Smoke Detection in Laparoscopic Videos. In Fourth International Workshop, CARE 2017, Held in Conjunction with MICCAI 2017, Quebec City, Canada, September 14, 2017. Springer.
[26] M. Liedlgruber and A. Uhl. 2009. Endoscopic image processing - an overview. In 2009 Proceedings of 6th International Symposium on Image and Signal Processing and Analysis. IEEE, 707–712. DOI:http://dx.doi.org/10.1109/ISPA.2009.5297635
[27] Constantinos Loukas and Evangelos Georgiou. 2015. Smoke detection in endoscopic surgery videos: a first step towards retrieval of semantic events. The International Journal of Medical Robotics and Computer Assisted Surgery 11, 1 (mar 2015), 80–94. DOI:http://dx.doi.org/10.1002/rcs.1578
[28] Dietmar Mattes, Edah Silajdzic, Monika Mayer, Martin Horn, Daniel Scheidbach, Werner Wackernagel, Gerald Langmann, and Andreas Wedrich. 2010. Surgical smoke management for minimally invasive (micro)endoscopy: An experimental study. Surgical Endoscopy and Other Interventional Techniques 24, 10 (2010), 2492–2501. DOI:http://dx.doi.org/10.1007/s00464-010-0991-4
[29] T Menes and H Spivak. 2000. Laparoscopy: searching for the proper insufflation gas. Surgical Endoscopy 14, 11 (nov 2000), 1050–1056. http://www.ncbi.nlm.nih.gov/pubmed/11116418

[30] Bernd Münzer, Klaus Schoeffmann, and Laszlo Böszörmenyi. 2013. Detection of circular content area in endoscopic videos. In Computer-Based Medical Systems (CBMS), 2013 IEEE 26th International Symposium on. IEEE, 534–536.
[31] Bernd Münzer, Klaus Schoeffmann, and Laszlo Böszörmenyi. 2017. Content-based processing and analysis of endoscopic images and videos: A survey. Multimedia Tools and Applications (2017). DOI:http://dx.doi.org/10.1007/s11042-016-4219-z
[32] JA Ojo and JA Oladosu. 2014. Video-based Smoke Detection Algorithms: A Chronological Survey. Computer Engineering and Intelligent Systems 5, 7 (2014), 38–50.
[33] D Ott. 1993. Smoke production and smoke reduction in endoscopic surgery: preliminary report. Endoscopic Surgery and Allied Technologies 1, 4 (aug 1993), 230–232. http://www.ncbi.nlm.nih.gov/pubmed/8050026
[34] Sun Young Park and Dusty Sargent. 2016. Colonoscopic polyp detection using convolutional neural networks. Georgia D. Tourassi and Samuel G. Armato (Eds.). International Society for Optics and Photonics, 978528. DOI:http://dx.doi.org/10.1117/12.2217148
[35] Stefan Petscharnig and Klaus Schöffmann. 2017. Deep Learning for Shot Classification in Gynecologic Surgery Videos. Vol. 10132. Springer International Publishing, Cham, 702–713. http://link.springer.com/10.1007/978-3-319-51811-4
[36] Stefan Petscharnig and Klaus Schöffmann. 2017. Learning laparoscopic video shot classification for gynecological surgery. Multimedia Tools and Applications (apr 2017). DOI:http://dx.doi.org/10.1007/s11042-017-4699-5
[37] Klaus Schoeffmann, Marco A Hudelist, and Jochen Huber. 2015. Video interaction tools: a survey of recent work. ACM Computing Surveys (CSUR) 48, 1 (2015), 14.
[38] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1–9.
[39] Hidekazu Takahashi, Makoto Yamasaki, Masashi Hirota, Yasuaki Miyazaki, Jeong Ho Moon, Yoshihito Souma, Masaki Mori, Yuichiro Doki, and Kiyokazu Nakajima. 2013. Automatic smoke evacuation in laparoscopic surgery: a simplified method for objective evaluation. Surgical Endoscopy 27, 8 (Aug. 2013), 2980–2987. DOI:http://dx.doi.org/10.1007/s00464-013-2821-y
[40] H. P. Thiébaud, M. G. Knize, P. A. Kuzmicky, D. P. Hsieh, and J. S. Felton. 1995. Airborne mutagens produced by frying beef, pork and a soy-based food. Food and Chemical Toxicology 33, 10 (1995), 821–828. DOI:http://dx.doi.org/10.1016/0278-6915(95)00057-9
[41] Hongda Tian, Wanqing Li, Lei Wang, and Philip Ogunbona. 2012. A novel video-based smoke detection method using image separation. In Proceedings - IEEE International Conference on Multimedia and Expo. 532–537. DOI:http://dx.doi.org/10.1109/ICME.2012.72
[42] B Ugur Toreyin, Yigithan Dedeoglu, and A Enis Cetin. 2006. Contour based smoke detection in video using wavelets. IEEE, 1–5.
[43] Andru P. Twinanda, Sherif Shehata, Didier Mutter, Jacques Marescaux, Michel de Mathelin, and Nicolas Padoy. 2017. EndoNet: A Deep Architecture for Recognition Tasks on Laparoscopic Videos. IEEE Transactions on Medical Imaging 36, 1 (jan 2017), 86–97. DOI:http://dx.doi.org/10.1109/TMI.2016.2593957
[44] Shiqian Wu, Feiniu Yuan, Yong Yang, Zhijun Fang, and Yuming Fang. 2015. Real-time image smoke detection using staircase searching-based dual threshold AdaBoost and dynamic analysis. IET Image Processing 9, 10 (oct 2015), 849–856. DOI:http://dx.doi.org/10.1049/iet-ipr.2014.1032
[45] Feiniu Yuan. 2011. Video-based smoke detection with histogram sequence of LBP and LBPV pyramids. Fire Safety Journal 46, 3 (2011), 132–139. DOI:http://dx.doi.org/10.1016/j.firesaf.2011.01.001