AUTOMATED IDENTIFICATION OF MICROSTRUCTURES ON ...

AUTOMATED IDENTIFICATION OF MICROSTRUCTURES ON HISTOLOGY SLIDES
Sokol Petushi1, Constantine Katsinis1, Chip Coward2, Fernando Garcia3, Aydin Tozeren2
Drexel University: 1. Department of Electrical and Computer Engineering, 2. School of Biomedical Engineering, 3. Department of Pathology, Graduate Hospital

ABSTRACT
Grading of breast cancer and the subsequent treatment options are largely dependent on the pathological examination of histology slides from the tumor tissue. Tumor grading is currently based on the spatial organization of the tissue, including the distribution of cancer cells, the morphological properties of their nuclei, and the presence or absence of cancer-associated surface receptors these cells express. In this study, we have developed an automated image processing method to detect and identify clinically relevant microscopic structures on histology slides. The tissue components identified by our program are as follows: fat cells, stroma, and three morphologically distinct cell nuclei types used in grading cancer on Haematoxylin and Eosin (H&E) stained slides. The image processing is based on gray-scale segmentation, feature extraction, supervised learning, subsequent training, and clustering. Our automated processing system has an accuracy of 89% ± 0.8 in correctly identifying the three different nuclei types observed in H&E stained histology slides.

1. INTRODUCTION
The purpose of this project was to develop an automated image processing method to identify the microstructures used in grading cancer from two-dimensional images of paraffin-fixed histology slides of the excised tumor tissue [1]. At present, pathologist-based evaluation (grading) of tissue slides is imprecise and not necessarily predictive of clinical outcome [2]. Some patients in the 'better' prognosis category will manifest aggressive disease, and vice versa. Recent research showed that chemo- and hormone therapy is either unneeded or ineffective in a majority of the breast cancer patients receiving these toxic interventions [3]. In this study we focused on H&E stained breast tumor histology slides. This staining dyes DNA-rich cell nuclei blue and collagen-rich stroma pink. In H&E stained slides it is difficult to differentiate microstructures such as cell nuclei with dispersed DNA from the surrounding stroma using gray-scale segmentation. Our automated method differentiates between nuclei with different DNA distribution patterns with 89% accuracy.

Fig. 1. Considered Tissue Structures.

2. APPROACH

2.1 Dataset and Considered Tissue Structures
Our dataset is composed of 24 images from H&E (Haematoxylin and Eosin) stained slides representing 24 different patients (8 slides from each grade: I, II, and III). The slides are scanned using a 40X magnification lens, covering almost all of the tissue. Using a random generator function, the initial dataset is split into two sets, the training set and the testing set, such that each contains an equal number of grade I, II, and III slide images. Each slide image is composed of 400 sub-images, where the size of each sub-image is 460 × 620 pixels and is defined by the camera's field of view (FOV). Because of the large size of the slide images (approximately 30 MB each), we primarily use the sub-images as independent inputs to our system. In this study we look to identify the following five major observable tissue structures characteristic of a given histological breast cancer slide (Figure 1):
- Type_1 nuclei: the nuclei of inflammatory cells and lymphocytes;
- Type_2 nuclei: nuclei of cells of epithelial origin having nearly uniform chromatin distribution, significantly larger than the nuclei of lymphocytes;
- Type_3 nuclei: nuclei of cancer cells with non-uniform chromatin distribution, usually large and weakly stained;
- Stroma: collagen-based support structure;
- Fat-like: areas representing water, carbohydrate, lipid, or gas.

In addition, vectors of different sizes are drawn at varying angles in the stroma and fat-like regions to measure pixel intensities along them and assess anisotropy in intensity. Figure 3 shows the results for these two structures. The characteristic grayscale intensity and size of each structure are used to generate the labeled dataset in 2.2.3.
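The anisotropy measurement can be sketched as follows. This is a minimal illustration only: the ray-sampling scheme and the max-minus-min spread statistic are assumptions, since the exact formulation is not given above.

```python
import math

def ray_profile(img, cy, cx, angle, length):
    """Sample pixel intensities along a ray starting at (cy, cx)."""
    dy, dx = math.sin(angle), math.cos(angle)
    vals = []
    for r in range(length):
        y, x = int(round(cy + r * dy)), int(round(cx + r * dx))
        if 0 <= y < len(img) and 0 <= x < len(img[0]):
            vals.append(img[y][x])
    return vals

def anisotropy(img, cy, cx, length, n_angles=8):
    """Spread of per-direction mean intensities: near zero for an
    isotropic region (fat-like), larger for oriented stroma fibres."""
    means = [sum(p) / len(p)
             for p in (ray_profile(img, cy, cx, 2 * math.pi * k / n_angles, length)
                       for k in range(n_angles))]
    return max(means) - min(means)

# A uniform (fat-like) patch shows no directional preference:
flat = [[100] * 9 for _ in range(9)]
aniso_flat = anisotropy(flat, 4, 4, 4)        # 0.0
# Horizontal fibres give a clear directional intensity spread:
striped = [[200 if r % 2 else 0] * 9 for r in range(9)]
aniso_striped = anisotropy(striped, 4, 4, 4)  # > 0
```

The contrast between the two toy patches mirrors the stroma-versus-fat distinction reported in Figure 3.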

Fig. 2. Mean Structure Models.

2.2 Proposed Approach

2.2.1 Prior Structure Learning
The first step in our analysis involves off-line learning of the five structures described in sub-section 2.1. This off-line learning process is done once, separately from the automated structure identification system. The output of this step is used to generate the labeled structure training and testing datasets, which are used for training and testing the proposed structure identification system. To start, we randomly select a number of sub-images representing grade I, II, and III slides, and use them to sub-sample representatives of all five structures. For this purpose a square window is centered on a reasonable number of type 1, type 2, and type 3 nuclei, and also placed in regions of stroma and fat-like structures.

Fig. 3. Intensity Mean and STD for Stroma & Fat-Like Structures.

The size of this window was empirically chosen so that it can sample the type 2 and type 3 nuclei. The intensity of the pixels within the window is recorded and used to estimate the mean structure models (Figure 2). Using these mean structure models, three initial features (intensity mean, intensity STD, and shape area) are extracted for each nucleus type, while for the stroma and fat-like structures only the first two features (intensity mean and STD) are extracted.

Fig. 4. Designed GUI used for training and testing the system.

2.2.2 Segmentation
In general, segmentation is the automated stage of an image processing system that identifies the regions of interest and separates them from the background. The goal in histology image segmentation is to separate regions representing cell nuclei from the stroma and fat-like regions. In [4], kernel-based dynamic clustering and a genetic algorithm are used to segment the cell nuclei. In [5], the nuclei are identified using a receptive field filter to enhance negatively and positively stained cell nuclei and a squashing function to define each pixel's class. In [6], gray-scale thinning, morphological, and split/merge methods are used to segment cell nuclei. Due to the complex nature of the structures within histology images, we also make use of a hybrid method to segment the regions of interest that most likely represent type 1, 2, and 3 cell nuclei. This hybrid method is a combination of optimal adaptive thresholding with local morphological opening and closing operations. The optimal adaptive thresholding is based on histogram partitioning to automatically determine the optimal thresholds [7, 8]. The algorithm searches all potential thresholds to find the one that maximizes the between-class variance and minimizes the within-class variance. Given a sub-image as input, this algorithm is used to adaptively compute an optimal threshold. In the next step, the thresholded sub-image undergoes a sequence of binary morphological operations. Morphological opening and closing [9, 10], in combination with prior knowledge of the microstructures, are successively used to eliminate spikes, fill small gaps, and separate slightly connected blobs.
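The exhaustive threshold search of [7] can be sketched in a few lines. This pure-Python version is a minimal illustration assuming 8-bit grayscale input, not the system's actual implementation:

```python
def otsu_threshold(pixels, levels=256):
    """Exhaustive threshold search over the histogram: return the
    threshold t that maximizes the between-class variance (and thereby
    minimizes the within-class variance). Background is p <= t."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    total_sum = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg = sum_bg = 0          # background weight and intensity sum so far
    for t in range(levels):
        w_bg += hist[t]
        if w_bg == 0:
            continue
        w_fg = total - w_bg
        if w_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (total_sum - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t

# A strongly bimodal patch: dark nucleus pixels against a pale background.
pixels = [10] * 50 + [200] * 50
t = otsu_threshold(pixels)     # 10: everything above t is foreground
```

On a real sub-image the same search runs over the full 256-bin histogram; the resulting binary mask is then passed to the morphological stage.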

This task showed us the discriminative power of our features, allowing us to reduce the number of extracted features from 10 to 7 without decreasing the accuracy of the classification algorithms used. The three top-ranked features are shown in Figure 5. In the next step we experimented with different classification methods to find the one that best fit our training dataset. We found that the binary (decision) tree gives the lowest error in our case. For binary trees, at each node a split is made using a test of the form: xi ∈ A, for some computable subset A ⊂ Xi, 1 ≤ i ≤ n.

Fig. 5. The Three Most Discriminant Features Ranked by the Binary Tree Feature Estimator.

Next, a custom morphological operator that we designed is used, again in combination with prior knowledge of the microstructures, to fill larger gaps and connect separate parts belonging to the same object. This step drastically improves the segmentation results for the detection of type_3 structures. The segmentation output is a binary image whose foreground shows regions belonging to type 1, 2, and 3 microstructures, along with false detections introduced by the stroma.
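The custom gap-filling operator is not specified in detail above; a standard binary closing (dilation followed by erosion), which fills gaps up to roughly the size of the structuring element, gives the flavor of the step:

```python
def dilate(img, se):
    """Binary dilation of a 0/1 image by a structuring element `se`,
    given as (dy, dx) offsets; out-of-bounds pixels count as background."""
    h, w = len(img), len(img[0])
    return [[1 if any(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                      for dy, dx in se) else 0
             for x in range(w)] for y in range(h)]

def erode(img, se):
    """Binary erosion; a pixel survives only if the whole structuring
    element fits inside the foreground."""
    h, w = len(img), len(img[0])
    return [[1 if all(0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                      for dy, dx in se) else 0
             for x in range(w)] for y in range(h)]

def close_gaps(img, se):
    """Binary closing (dilation then erosion): fills gaps up to roughly
    the size of `se` and reconnects parts of the same object."""
    return erode(dilate(img, se), se)

# A 1-pixel gap in a horizontal run is bridged by a 3-pixel-wide element:
se = [(0, -1), (0, 0), (0, 1)]
row = [[0, 1, 1, 0, 1, 1, 0]]
closed = close_gaps(row, se)   # [[0, 1, 1, 1, 1, 1, 0]]
```

A shape-aware operator driven by the microstructure priors would replace the fixed structuring element, but the dilate-then-erode structure is the same.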

2.2.3 Binary Tree Training and Testing
To automatically identify the five considered tissue microstructures, the segmentation outputs have to be clustered as type 1, 2, 3, or stroma. In order to experiment with different clustering algorithms, we generated a labeled dataset using the graphical user interface shown in Figure 4. Throughout this step the training-set images are used. To generate the labeled dataset, we use prior knowledge about the size and intensity of type 1, 2, and 3 cell nuclei to pre-cluster the segmentation results, labeling the detected blobs accordingly. As a result of this automated step we generate sub-images with the detected blobs colored red, green, and blue, indicating type 1, 2, and 3 cell nuclei respectively. Next, the GUI was used to ground-truth the pre-clustering results by showing on screen the location of each candidate in the original image and allowing us to decide whether the pre-clustering class was correct. At the same time, a feature vector of ten variables was extracted for each candidate and associated with its class. In the end, a labeled dataset of 4792 feature vectors was generated. We randomly split this set into two subsets: a training set of 2500 feature vectors and a testing set of 2292 feature vectors. We then used a package of clustering algorithms [11] to examine the separability of the extracted features.
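A rule-based pre-clusterer of this kind can be sketched as below; the cut-off values are illustrative placeholders, not the thresholds actually used:

```python
def precluster(blob, area_small=80.0, intensity_dark=100.0):
    """Provisional class from prior size/intensity knowledge:
    type 1 - small, dark lymphocyte-like nuclei;
    type 2 - larger nuclei with uniformly dark chromatin;
    type 3 - large, weakly stained (pale) nuclei.
    `area_small` and `intensity_dark` are hypothetical cut-offs."""
    if blob["area"] < area_small:
        return "type1"                      # colored red in the sub-images
    if blob["intensity_mean"] < intensity_dark:
        return "type2"                      # colored green
    return "type3"                          # colored blue

labels = [precluster(b) for b in (
    {"area": 50, "intensity_mean": 60},     # small and dark
    {"area": 200, "intensity_mean": 80},    # large and dark
    {"area": 400, "intensity_mean": 150},   # large and pale
)]                                          # ['type1', 'type2', 'type3']
```

Ground-truthing through the GUI then corrects whatever this coarse rule gets wrong, yielding the labeled feature vectors.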

Fig. 6. Binary Tree Used in this Approach.

Thus, for a binary tree, at each node the associated set T ⊂ X is split into two subsets TL = {x ∈ T: xi ∈ A} and TR = {x ∈ T: xi ∉ A}. In line with the convention above, we turn left when the test is affirmed and right when it is negated. We trained and tested our binary tree with normalized (simple, PCA, LDA) and non-normalized datasets. After a number of experiments with our data, we limited the binary tree to 50 nodes and 51 leaves and used simple normalization of the data. Figure 7 shows the decision regions using two of the features, while Table 1 shows results from testing the binary tree with the 2292 feature vectors of the testing set.
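The node test above translates directly into code; `Interval` here is a hypothetical helper standing in for a computable subset A of a feature's range:

```python
class Interval:
    """A computable subset A of a feature's range: the half-open [lo, hi)."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __contains__(self, v):
        return self.lo <= v < self.hi

def split_node(T, i, A):
    """Split the set T at one node: a sample goes left (TL) when the
    test x_i in A is affirmed, right (TR) when it is negated."""
    TL = [x for x in T if x[i] in A]
    TR = [x for x in T if x[i] not in A]
    return TL, TR

# Toy feature vectors, e.g. (normalized intensity, area):
T = [(0.2, 5), (0.8, 9), (0.4, 2)]
TL, TR = split_node(T, 0, Interval(0.0, 0.5))
# TL = [(0.2, 5), (0.4, 2)], TR = [(0.8, 9)]
```

A full tree repeats this split recursively, choosing i and A at each node to best separate the classes, until the leaf limit is reached.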

Table 1. Classification Statistical Results.

Fig. 8. Original Grayscale Image Showing Type 2 and 3 Concentration Areas.

Fig. 7. Decision Regions/Scatter Plot of Max. Intensity Value (y-axis) vs. STD (x-axis).

2.2.4 Structure Detection and Identification System
Combining the previously described steps, a structure detection and identification system was designed. The execution order of its automated processing is briefly as follows. We use the hybrid segmentation method to segment the original grayscale-converted sub-images and extract the regions that mostly represent type 1 and 2 nuclei or tumor cells. In the grayscale-converted sub-images, we detect the fat-like regions and represent them as the average stroma. We segment the result to detect mostly type 3 nuclei and combine the detected blobs with those from the first segmentation. We extract the 7-feature vector for each candidate blob and cluster the blobs using the pre-trained binary tree. We estimate the distribution of type 2 and 3 cell nuclei and find their concentration areas. We post-process the clustering results using the estimated distribution information. Finally, we represent the identified structures in different colors and highlight the type 2 and 3 cell nuclei concentration areas.

3. CONCLUSIONS AND FUTURE RESEARCH
In this paper, we identified clinically relevant microstructures on histology images using gray-scale segmentation, feature extraction, supervised learning, subsequent training, and clustering. Application of our method to a grade III histology slide is illustrated in Figures 8 and 9. Our method identified the different tissue structures with 89% ± 0.8 accuracy, as opposed to an accuracy of 54% ± 2.7 in the absence of machine learning. Future research will be directed at improving the accuracy even further through the use of different training sets and learning schemes.

Fig. 9. Identified Tissue Structures: T1, T2, T3, Stroma, Fat-Like.

4. REFERENCES
[1] Robbins, Pathologic Basis of Disease, 5th edition, Saunders.
[2] Schnitt SJ, Connolly JL, Tavassoli FA, Fechner RE, Kempson RL, Gelman RS, Page DL (1996). Inter-observer reproducibility in the diagnosis of ductal proliferative lesions using standardized criteria. Am J Surg Pathol, 16:1133-1143.
[3] Golub TR, Slonim DK, Tamayo P, et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286:531-537.
[4] Faguo Y., Tianzi J., Cell Image Segmentation with Kernel-Based Dynamic Clustering and an Ellipsoidal Cell Shape Model, Journal of Biomedical Informatics, 34, pp. 67-73, 2001.
[5] F. Schnorrenberg, C. Pattichis, K. Kyriacou, C. Schizas, Computer-Aided Detection of Breast Cancer Nuclei, IEEE Trans. on Information Tech. in Biomedicine, vol. 1, no. 2, 1997.
[6] A. Nedzved, I. Pitas, "Morphological Segmentation of Histology Cell Images", IEEE ICPR'00, vol. 1, p. 1500, 2000.
[7] N. Otsu, "A threshold selection method from gray-level histograms", IEEE Transactions on Systems, Man, and Cybernetics, vol. SMC-9, no. 1, pp. 62-66, 1979.
[8] P.S. Liao, T.S. Chen, P.C. Chung, "A Fast Algorithm for Multilevel Thresholding", Journal of Information Science and Engineering, 17, pp. 713-727, 2001.
[9] John C. Russ, The Image Processing Handbook, 2nd Edition, pp. 433-437, CRC Press.
[10] M. Sonka, J.M. Fitzpatrick, Handbook of Medical Imaging, vol. 2, pp. 182-200, SPIE Press.
[11] LNKnet Package, MIT Lincoln Laboratory.