MULTI-CLASS CLASSIFICATION OF PULMONARY ENDOMICROSCOPIC IMAGES

Mohammad Rami Koujan1, Ahsan Akram2, Paul McCool2, Jody Westerfeld3, David Wilson3, Kevin Dhaliwal2, Stephen McLaughlin1, Antonios Perperidis*1,2

1 Institute of Sensors, Signals & Systems, Heriot-Watt University, UK.
2 EPSRC Proteus IRC Hub, MRC Centre for Inflammation Research, University of Edinburgh, UK.
3 Community Health Network, Community South Hospital, Indianapolis, USA.
* E-mail: [email protected]

This work was supported by EPSRC via EP/K03197X/1.

ABSTRACT
Optical endomicroscopy (OEM) is an emerging medical imaging tool capable of providing in-vivo, in-situ optical biopsies. Clinical pulmonary OEM procedures generate data containing a multitude of frames, making their manual analysis a highly subjective and laborious task. It is therefore essential to automatically classify the images into clinically relevant classes to aid reaching a fast and reliable diagnosis. This paper proposes a methodology to automatically classify OEM images of the distal lung. Due to their diagnostic relevance, three classification tasks are targeted: (i) differentiating between alveolar images containing predominantly elastin and those flooded with cells, (ii) separating normal from abnormal elastin frames, and (iii) multi-class classification amongst normal, abnormal and cell frames. Local Binary Patterns are employed along with a Support Vector Machine classifier and, for the multi-class case, a One-Versus-All Error Correcting Output Codes strategy, obtaining accuracies of 92.2%, 95.2% and 90.1% for tasks (i), (ii) and (iii), respectively.

Index Terms— Optical endomicroscopy, distal lung, texture analysis, frame classification.

1. INTRODUCTION

Optical endomicroscopy (OEM) is a fibre-based medical imaging tool with clinical and pre-clinical utility [1] that facilitates the acquisition of in-vivo, in-situ optical biopsies. OEM employs a proximal light source, laser scanning or LED flood illumination, linked to a flexible multi-core coherent fibre bundle, performing microscopic fluorescent imaging at its distal end. The diameter of the packaged fibre (< 500 µm) permits it to pass through the working channel of endoscopes, enabling the real-time imaging of tissues that were previously inaccessible through conventional endoscopy.
Fig. 1: Characteristic examples of OEM images illustrating normal elastin (left), abnormal elastin (middle), and cells (right).
In pulmonology, the auto-fluorescence (at 488 nm) generated through the presence of elastin and collagen has enabled the exploration of the distal pulmonary tract as well as the assessment of the respiratory bronchioles and alveolar gas-exchanging units. However, the nature of OEM data acquisition results in long, continuous frame sequences (potentially > 1000 frames), rendering their manual analysis laborious and highly subjective. Consequently, parsing the video sequences into clinically relevant classes can be highly beneficial to the diagnostic process, (i) reducing the associated human/computational resources, while (ii) enabling more targeted and objective image quantification/interpretation.

Motivated by their high diagnostic relevance, this study groups OEM frames of the alveolar ducts into three commonly encountered classes (Figure 1): (i) normal elastin, (ii) abnormal elastin, which can be an indicator of a range of respiratory diseases including pulmonary nodules [2] and fibrotic tissue [3], and (iii) cell-flooded frames, a potential indicator of amiodarone-related pneumonia [4] or acute cellular rejection following lung transplantation [5].

There have recently been studies attempting to group pulmonary OEM images into binary classes, such as differentiating between informative and uninformative frames [6], or normal and abnormal elastin frames [7, 8]. However, to our knowledge, this is the first study (i) attempting multi-class classification of clinically relevant pulmonary OEM frames, and (ii) utilising a large and varied set of images for training and testing, with no manual selection criteria that could potentially bias the results. The proposed algorithms and associated parameter selection have therefore been optimised for improved generalisation in such a diverse scenario.
2. METHODOLOGY

Frame classification was achieved in three steps, preceded by a video pre-processing stage.

Video pre-processing: Prior to frame classification, each video was (i) filtered with a Gaussian kernel (σ = 0.5) to suppress the effect of noise on the classification process, (ii) contrast enhanced, ensuring a common dynamic range across all datasets, and (iii) cropped, maintaining the largest square region (363 × 363 pixels) within the circular field of view (FOV), as seen in Figure 1. The remaining four segments (each 9% of the circular FOV) were not included in the texture estimation and the subsequent frame classification. This decision was based on the assumption that no clinically relevant information is likely to be imaged exclusively in these edge segments.
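As an illustration of this pre-processing stage, the short Python sketch below applies Gaussian smoothing, a simple percentile-based contrast stretch and a centre square crop. The authors implemented their pipeline in MATLAB; this version, including the contrast-stretching choice and the helper name preprocess_frame, is an assumption-laden re-creation rather than the original code.

```python
# Sketch of the pre-processing stage: Gaussian smoothing, contrast
# normalisation and centre square crop (assumed re-creation, not the
# authors' MATLAB implementation).
import numpy as np
from scipy.ndimage import gaussian_filter


def preprocess_frame(frame, sigma=0.5, crop_size=363):
    """Smooth, contrast-stretch and centre-crop a single 2-D OEM frame."""
    frame = np.asarray(frame, dtype=np.float64)
    # (i) Gaussian filtering (sigma = 0.5) to suppress noise.
    frame = gaussian_filter(frame, sigma=sigma)
    # (ii) Contrast enhancement towards a common dynamic range;
    #      a percentile-based linear stretch is assumed here.
    p_low, p_high = np.percentile(frame, (1, 99))
    frame = np.clip((frame - p_low) / (p_high - p_low + 1e-12), 0.0, 1.0)
    # (iii) Keep the largest square (363 x 363 pixels) inside the circular FOV.
    rows, cols = frame.shape
    r0, c0 = (rows - crop_size) // 2, (cols - crop_size) // 2
    return frame[r0:r0 + crop_size, c0:c0 + crop_size]
```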
Step 1: Local Binary Pattern (LBP) extraction: As described in [9], for each pixel in frame $I_m$, where $m \in [1, M]$ denotes the frame number, a binary vector of length N was constructed by thresholding N equally-spaced neighbours, located on a circle of radius R around the central pixel $p_c$, with respect to that central pixel's value $G_{p_c}$. That is, a local neighbour $p_i$, where $i \in [1, N]$, of value $G_{p_i}$ was assigned 0 if its value was smaller than $G_{p_c}$ and 1 otherwise, making the constructed binary vectors invariant to monotonic image intensity transformations. Consequently, an image with P pixels generates P binary vectors of length N, each computed from a local neighbourhood of radius R. Uniform binary codes were subsequently defined as patterns with at most two bit-wise transitions, either from 0 to 1 or from 1 to 0. For vectors of length N, there can be N(N − 1) + 2 such unique uniform binary code patterns. Investigation indicated that uniform codes constituted more than 90% of all binary codes in the available OEM dataset. As rotation can be introduced during the acquisition of OEM videos, it was essential to employ a rotation-invariant texture descriptor, eliminating the rotation effect from the subsequent classification task. Uniform binary patterns were made rotation invariant by representing each pattern by the number of elements with value 1 in its associated vector. This resulted in N + 1 distinct values, ranging from 0 to N, that can represent any rotationally invariant uniform binary vector of length N. The remaining non-uniform patterns were encoded with the value N + 1. Finally, the encoded values were aggregated in a histogram $X_m$ of N + 2 bins, which characterised the current frame $I_m$. Each histogram was considered as a point in the (N + 2)-dimensional space and passed to the next stage for classification.
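For illustration, scikit-image's rotation-invariant "uniform" LBP implements the same N + 2 bin encoding described above (codes 0 to N counting the 1s of uniform patterns, code N + 1 for all non-uniform patterns). The sketch below is not the authors' MATLAB code; the histogram normalisation and the function name lbp_histogram are our own choices, and the default N = 32, R = 4 corresponds to the best-performing configuration in Table 1.

```python
# Sketch of Step 1: rotation-invariant uniform LBP histogram per frame
# (assumes scikit-image; not the authors' implementation).
import numpy as np
from skimage.feature import local_binary_pattern


def lbp_histogram(frame, n_neighbours=32, radius=4):
    """Return the (N + 2)-bin rotation-invariant uniform LBP histogram X_m."""
    codes = local_binary_pattern(frame, P=n_neighbours, R=radius,
                                 method="uniform")
    # Codes 0..N count the 1s of uniform patterns; code N + 1 collects
    # all non-uniform patterns, matching the encoding described above.
    hist, _ = np.histogram(codes, bins=np.arange(n_neighbours + 3))
    return hist / hist.sum()  # normalised histogram used as the feature vector
```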
Step 2: Support Vector Machine (SVM) classification: An SVM performs a binary classification task by finding an optimal hyperplane in the feature space, maximising the margin between the decision boundary and the nearest training point from each class [10]. The SVM employs a linear model of the form

$$y(X) = W^{T}\phi(X) + b \tag{1}$$

where $X \in \mathbb{R}^{N'}$ represented a single observation in the $N' = N + 2$ dimensional space. For each observation (frame) in the training set of $M'$ data points, the target value was $t_{m'} \in \{-1, 1\}$, for $m' \subseteq m$. $\phi(X)$ embodied a fixed feature-space transformation (kernel), $b$ was the bias, and $W$ was the normal vector to the hyperplane, known as the decision boundary, in the $N'$-dimensional space. Under the margin maximisation objective, the optimisation

$$\arg\min_{W, b} \; \frac{1}{2}\|W\|^{2}, \quad \text{subject to} \quad t_{m'}\left(W^{T}\phi(X_{m'}) + b\right) \geq 1, \quad m' = 1, 2, \ldots, M' \tag{2}$$

was used to estimate the parameters $W$ and $b$ that make the perpendicular distance between the decision boundary and its nearest point as large as possible. Lagrange multipliers $a_{m'} \geq 0$ were applied to each observation such that

$$W = \sum_{m'=1}^{M'} a_{m'} t_{m'} \phi(X_{m'}), \qquad 0 = \sum_{m'=1}^{M'} a_{m'} t_{m'}. \tag{3}$$
Using the kernel representation $k(X_i, X_j) = \phi(X_i)^{T}\phi(X_j)$, along with equation (3), equation (1) can be rewritten as

$$y(X_{m''}) = \sum_{m'=1}^{M'} a_{m'} t_{m'} k(X_{m''}, X_{m'}) + b \tag{4}$$

in order to classify any out-of-sample (test) data point $X_{m''}$, with $m \in m' \cup m''$. While a range of kernels was tested, a Gaussian kernel $k(X_i, X_j) = \exp\left(-\|X_i - X_j\|^{2} / 2\sigma^{2}\right)$ was employed. The Karush-Kuhn-Tucker (KKT) conditions [11] state that $a_{m'} = 0$ for any point that does not lie on the maximum margin plane. To avoid overfitting the dataset, which would result in poor generalisation performance, slack variables $\xi_{m'} \geq 0$ were introduced [12, 13], with $\xi_{m'} = 0$ for data points on or inside the correct margin boundary and $\xi_{m'} = |t_{m'} - y(X_{m'})|$ for all other points. Hence, any point on the decision boundary yields $\xi_{m'} = 1$, while $\xi_{m'} > 1$ corresponds to misclassified points. The new optimisation problem was therefore formulated as

$$\arg\min_{W, \xi_{m'}, b} \; C\sum_{m'=1}^{M'} \xi_{m'} + \frac{1}{2}\|W\|^{2}, \quad \text{subject to} \quad t_{m'}\left(W^{T}\phi(X_{m'}) + b\right) \geq 1 - \xi_{m'}, \quad m' = 1, 2, \ldots, M' \tag{5}$$

where the parameter $C > 0$ (called the box constraint) controlled the trade-off between the margin and the slack-variable penalty, compensating between a hard and a soft margin.
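A minimal sketch of the soft-margin, Gaussian-kernel SVM of equations (1)-(5) follows, assuming scikit-learn's SVC in place of the MATLAB implementation used in the paper. Here C maps to the box constraint of equation (5), and gamma = 1/(2σ²) reproduces the Gaussian kernel defined above; the numerical values are placeholders rather than the tuned hyperparameters.

```python
# Sketch of Step 2: soft-margin SVM with a Gaussian (RBF) kernel.
# Placeholder hyperparameters; the paper tunes sigma and C by cross-validation.
from sklearn.svm import SVC

sigma = 1.0           # width sigma of the Gaussian kernel k(Xi, Xj) above
box_constraint = 1.0  # C > 0, margin / slack trade-off of equation (5)

svm = SVC(kernel="rbf", gamma=1.0 / (2.0 * sigma ** 2), C=box_constraint)
# X_train: rows are (N + 2)-bin LBP histograms, y_train: labels in {-1, +1}
# svm.fit(X_train, y_train)
# y_pred = svm.predict(X_test)
```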
Step 3: Error Correcting Output Codes (ECOC): ECOC strategies [14] were employed to perform the three-class classification task, separating normal-elastin, abnormal-elastin and cell-flooded images. The coding method of choice was One-Versus-All (OVA), as it outperformed alternatives such as One-Versus-One, Ordinal and Ternary-complete coding. A coding matrix $Z \in [-1, 1]^{K \times L}$ was defined, where its $K = 3$ rows corresponded to the number of classes and its $L = 3$ columns to the number of binary classifiers. For each binary learner, OVA considered one class as positive and the rest as negative, with each row representing the binary codeword of the associated class. A loss-weighted function [15] was utilised during the decoding step, assigning an out-of-sample observation $X_{m''}$ to the class that minimises the sum of losses over the binary learners

$$\hat{y}(X_{m''}) = \arg\min_{r} \frac{\sum_{c=1}^{L} |v_{r,c}| \, g\left(v_{r,c}, s_c(X_{m''})\right)}{\sum_{c=1}^{L} |v_{r,c}|} \tag{6}$$

where $v_{r,c}$ was the element of the coding matrix $Z$ at row $r$ and column $c$, $s_c$ was the score of binary learner $c$, and $g = \exp(-v_{r,c}\, s_c)/2$ was the exponential binary loss function.
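For illustration, the sketch below makes the OVA coding matrix and the loss-weighted decoding of equation (6) explicit, using per-class Gaussian-kernel SVM scores as s_c. It is written from the description above rather than from the authors' code, and all function and variable names are our own.

```python
# Sketch of Step 3: One-Versus-All ECOC with loss-weighted decoding (eq. 6).
import numpy as np
from sklearn.svm import SVC


def train_ova_learners(X, y, classes=(0, 1, 2), C=1.0, gamma=0.5):
    """Train one binary SVM per class (one row of the OVA coding matrix Z each)."""
    learners = []
    for k in classes:
        target = np.where(y == k, 1, -1)  # class k positive, the rest negative
        learners.append(SVC(kernel="rbf", C=C, gamma=gamma).fit(X, target))
    return learners


def predict_ova_ecoc(X, learners, classes=(0, 1, 2)):
    """Assign each observation to the class minimising the loss of equation (6)."""
    K = len(classes)
    Z = 2.0 * np.eye(K) - 1.0                    # OVA coding matrix (K x L, L = K)
    scores = np.column_stack([clf.decision_function(X) for clf in learners])
    losses = np.zeros((X.shape[0], K))
    for r in range(K):
        g = np.exp(-Z[r] * scores) / 2.0         # exponential binary loss per learner
        losses[:, r] = (np.abs(Z[r]) * g).sum(axis=1) / np.abs(Z[r]).sum()
    return np.asarray(classes)[np.argmin(losses, axis=1)]
```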
3. DATA ANALYSIS

33 OEM frame sequences (8 Hz) of the distal lung were used during the development and testing of the proposed algorithm. All data were obtained during the routine care of patients undergoing investigation for an indeterminate pulmonary nodule (axial diameter < 30 mm) at the Columbus Lung Institute, Indiana, USA. Of the original database (126 subjects), 43 subjects were rejected due to (i) short sequence duration (< 10 frames), (ii) corrupted data (i.e. not readable, misaligned or out-of-focus fibre), or (iii) lack of distal lung frames. Of the remaining videos, 33 were randomly selected with no other subjective criteria (such as image quality) that could potentially bias the proposed algorithms. All data were acquired by a single expert operator using a 488 nm Cellvizio with a 1.4 mm diameter, 600 µm field-of-view Alveoflex fibre (Mauna Kea Technologies, Paris, France), and were stored in the proprietary .mkt format. The study was approved by the Western Institutional Review Board.

Prior to any analysis, frames containing no clinically relevant information were automatically detected and removed using the approach in [6]. Furthermore, highly correlated frames in nearly static frame sequences were also removed. A clinical investigator, with substantial prior experience in pulmonary OEM imaging, subsequently identified frames that belonged to one of three non-discrete classes (Figure 1), namely (i) normal elastin, (ii) abnormal elastin, and (iii) cells. Abnormal elastin comprised a range of structural abnormalities, including condensed, disrupted and disorganised elastin strands. These images and associated annotations were utilised for training and testing three distinct classification tasks, discriminating amongst frames that contain predominantly (i) cells or elastin strands, (ii) normal or abnormal elastin, and (iii) any of normal elastin, abnormal elastin and cells in a multi-class segregation scenario.

The entire dataset comprised 4800 images (1600 per class). 8-fold cross-validation was employed (600 images per fold, 200 per class), ensuring that no frames from the same video were distributed across more than one fold. This split ensured that potentially correlated frames from a single video were not included in both the training and testing sets, biasing the classification results. Eight was the largest possible number of folds preserving this condition. Finally, the hyperparameters of the SVM classifier (the sigma of the Gaussian kernel and the box constraint C) were tuned via cross-validation on 50% of the entire dataset. MATLAB R2015b, along with the Statistics and Machine Learning Toolbox (MathWorks, Inc., MA, USA), was used for the analysis reported in this paper.
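The video-grouped cross-validation can be expressed, for illustration, with scikit-learn's GroupKFold, which guarantees that frames from one video never appear in both the training and testing folds. GroupKFold does not by itself enforce the exact 200-frames-per-class-per-fold balance reported above, and the array names below are placeholders.

```python
# Sketch of the 8-fold, video-grouped cross-validation split.
from sklearn.model_selection import GroupKFold


def grouped_cv_splits(features, labels, video_ids, n_folds=8):
    """Yield (train, test) index arrays with no video split across folds."""
    splitter = GroupKFold(n_splits=n_folds)
    yield from splitter.split(features, labels, groups=video_ids)
```

Each fold's training split would then be used to fit the OVA ECOC model of Steps 1-3, with the held-out videos used for testing.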
4. RESULTS AND DISCUSSION

A range of texture descriptors (e.g. co-occurrence matrices, wavelet features), classifiers (e.g. KNN, discriminant analysis) and ECOC coding schemes (e.g. One-Versus-One, Ordinal) were tested, with the reported combination of LBPs, SVM and OVA yielding the best classification results. Reporting the results of the remaining techniques is beyond the scope of this study. As shown in Table 1, the proposed methodology obtained frame classification with high accuracy (> 90%) in all binary and multi-class scenarios. A radius of R = 4 provided the most discriminative LBPs in all classification tasks. This can be partially attributed to the fact that OEM images are sparsely and irregularly sampled at the location of each core within the fibre bundle. Preliminary analysis demonstrated core-to-core distances ranging between 3 and 5 pixels, with the remaining pixels approximated via linear interpolation amongst neighbouring cores. Consequently, an LBP radius of 4 pixels is anticipated to reduce the uncertainty introduced through interpolated values, enhancing the discriminative power of the LBPs and the subsequent classification performance. Further analysis is required to quantitatively assess this effect and to develop descriptors that account for such irregular sampling.

Figure 2 provides a confusion matrix for the multi-class classification task, along with representative frame examples for each of the target/output pairs. Due to the continuous nature of the data acquisition, transitions between scenes result in class overlap within a single frame. Misclassifications tend to occur in frames where multiple classes are distributed relatively evenly. In such cases, beyond any algorithmic shortcomings, there is a level of uncertainty in the manual annotations, with decisions likely being influenced by past frames within a sequence. Analysis of the intra- and inter-operator variability, while valuable, is beyond the scope of this study.

Having developed a reliable method for the multi-class classification of pulmonary OEM images, future extensions (requiring substantial ground truth data) include (i) investigating the use of more powerful methodologies, such as Convolutional Neural Networks, for further improvements in classification performance, especially in detecting abnormal elastin and in the multi-class segregation, and (ii) expanding the number of classes towards the full temporal parsing of OEM videos.
Table 1: Classification results (%) obtained with 4 distinct combinations of LBP features, SVM classifier and OVA ECOC. (* Positive class)

| N  | R | N' | Normal Vs Abnormal Elastin* (Sens. / Spec. / Acc.) | Elastin* Vs Cells (Sens. / Spec. / Acc.) | Multi-Class (Acc.) |
|----|---|----|----------------------------------------------------|------------------------------------------|--------------------|
| 8  | 1 | 10 | 89.0 / 91.0 / 90.0                                 | 86.3 / 86.0 / 86.2                       | 71.4               |
| 16 | 2 | 18 | 91.2 / 91.4 / 91.3                                 | 94.3 / 93.5 / 94.0                       | 85.6               |
| 24 | 3 | 26 | 90.0 / 91.6 / 90.8                                 | 95.7 / 90.6 / 94.0                       | 89.3               |
| 32 | 4 | 34 | 89.6 / 94.8 / 92.2                                 | 96.1 / 93.4 / 95.2                       | 90.1               |
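For reference, the per-task sensitivity, specificity and accuracy reported in Table 1 can be computed from binary predictions as sketched below, with the positive class defined as marked (*) in the table; the helper name binary_metrics is ours.

```python
# Sketch: sensitivity, specificity and accuracy from binary predictions,
# assuming labels in {-1, +1} with +1 the positive class.
from sklearn.metrics import confusion_matrix


def binary_metrics(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) as fractions in [0, 1]."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return sensitivity, specificity, accuracy
```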
Fig. 2: Confusion matrix, overlaid on representative frame examples of each relevant Target / Output category, for the LBP parameters obtaining the highest classification accuracy (N = 32, R = 4).
5. CONCLUSIONS

The manual analysis and interpretation of the typically long OEM videos, along with any associated uncertainty, is a well-documented bottleneck in the widespread adoption of this novel imaging modality. The proposed methodology, employing LBP, SVM and OVA ECOC, enables the accurate and robust multi-class classification of clinically relevant pulmonary OEM frames, an imperative step towards the automatic temporal parsing of OEM video sequences. It is anticipated that this methodology will facilitate a more targeted bedside diagnosis and associated treatment in a range of respiratory diseases.

6. REFERENCES

[1] Luc Thiberville, Sophie Moreno-Swirc, Tom Vercauteren, Eric Peltier, Charlotte Cavé, and Geneviève Bourg-Heckly, "In vivo imaging of the bronchial wall microstructure using fibered confocal fluorescence microscopy," American Journal of Respiratory and Critical Care Medicine, vol. 175, no. 1, pp. 22–31, 2007.

[2] Sohan Seth, Ahsan Akram, Paul McCool, Jody Westerfeld, David Wilson, Stephen McLaughlin, Kevin Dhaliwal, and Chris Williams, "Assessing the utility of autofluorescence-based pulmonary optical endomicroscopy to predict the malignant potential of solitary pulmonary nodules in humans," Scientific Reports, vol. 6, no. 31372, 2016.

[3] Richard C Newton, Samuel V Kemp, Guang-Zhong Yang, Daniel S Elson, Ara Darzi, and Pallav L Shah, "Imaging parenchymal lung diseases with confocal endomicroscopy," Respiratory Medicine, vol. 106, no. 1, pp. 127–137, 2012.

[4] Mathieu Salaün, Francis Roussel, Geneviève Bourg-Heckly, Christine Vever-Bizet, Stéphane Dominique, Anne Genevois, Vincent Jounieaux, Gérard Zalcman, Emmanuel Bergot, Jean-Michel Vergnon, et al., "In vivo probe-based confocal laser endomicroscopy in amiodarone-related pneumonia," European Respiratory Journal, vol. 42, no. 6, pp. 1646–1658, 2013.

[5] Jonas Yserbyt, Christophe Dooms, Marc Decramer, and Geert M Verleden, "Acute lung allograft rejection: diagnostic role of probe-based confocal laser endomicroscopy of the respiratory tract," The Journal of Heart and Lung Transplantation, vol. 33, no. 5, pp. 492–498, 2014.
[6] Antonios Perperidis, Ahsan Akram, Yoann Altmann, Paul McCool, Jody Westerfeld, David Wilson, Kevin Dhaliwal, and Stephen McLaughlin, "Automated detection of uninformative frames in pulmonary optical endomicroscopy," IEEE Transactions on Biomedical Engineering, vol. 64, no. 1, pp. 87–98, 2017.

[7] Chesner Désir, Caroline Petitjean, Laurent Heutte, Luc Thiberville, and Mathieu Salaün, "An SVM-based distal lung image classification using texture descriptors," Computerized Medical Imaging and Graphics, vol. 36, pp. 264–270, 2012.

[8] Aurélien Saint-Réquier, Benoît Lelandais, Caroline Petitjean, Chesner Désir, Laurent Heutte, Mathieu Salaün, and Luc Thiberville, "Characterization of endomicroscopic images of the distal lung for computer-aided diagnosis," in Proceedings of the 5th International Conference on Intelligent Computing. Springer, 2009, pp. 994–1003.

[9] Timo Ojala, Matti Pietikainen, and Topi Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, 2002.

[10] Bernhard E Boser, Isabelle M Guyon, and Vladimir N Vapnik, "A training algorithm for optimal margin classifiers," in Proceedings of the Fifth Annual Workshop on Computational Learning Theory. ACM, 1992, pp. 144–152.

[11] Olvi L Mangasarian, Nonlinear Programming, SIAM, 1994.

[12] Kristin P Bennett and Olvi L Mangasarian, "Robust linear programming discrimination of two linearly inseparable sets," Optimization Methods and Software, vol. 1, no. 1, pp. 23–34, 1992.

[13] Corinna Cortes and Vladimir Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273–297, 1995.

[14] Thomas G Dietterich and Ghulum Bakiri, "Solving multiclass learning problems via error-correcting output codes," Journal of Artificial Intelligence Research, vol. 2, pp. 263–286, 1995.

[15] Sergio Escalera, Oriol Pujol, and Petia Radeva, "On the decoding process in ternary error-correcting output codes," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 1, pp. 120–134, 2010.