RESEARCH ARTICLE
Semi-Supervised Feature Transformation for Tissue Image Classification
Kenji Watanabe1*, Takumi Kobayashi1, Toshikazu Wada2
1 Department of Information Technology and Human Factors, National Institute of Advanced Industrial Science and Technology (AIST), Tsukuba, Ibaraki, Japan
2 Department of Computer & Communication Science, Wakayama University, Wakayama, Wakayama, Japan
* [email protected]
OPEN ACCESS
Citation: Watanabe K, Kobayashi T, Wada T (2016) Semi-Supervised Feature Transformation for Tissue Image Classification. PLoS ONE 11(12): e0166413. doi:10.1371/journal.pone.0166413
Editor: Dalin Tang, Worcester Polytechnic Institute, UNITED STATES
Received: April 11, 2016
Accepted: October 29, 2016
Published: December 2, 2016
Copyright: © 2016 Watanabe et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability Statement: All machine learning repository files are available from UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/). For tissue images, the files are available from IICBU Biological Image Repository (https://ome.irp.nia.nih.gov/iicbu2008/). For feature extraction, the software of GIST extraction is available online (http://lear.inrialpes.fr/software).
Funding: This work was supported by JSPS KAKENHI Grant Number 26330194 (https://www.jsps.go.jp/english/e-grants/index.html). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Abstract
Various systems have been proposed to support biological image analysis, with the intent of decreasing false annotations and reducing the heavy burden on biologists. These systems generally comprise a feature extraction method and a classification method. Task-oriented methods for feature extraction leverage the characteristic images of each problem, and they are very effective at improving the classification accuracy. However, it is difficult in practice to use such feature extraction methods across a variety of tasks, because few biologists specialize in Computer Vision and/or Pattern Recognition to the degree needed to design task-oriented methods. Thus, to improve the usability of these supporting systems, it would be useful to develop a method that can automatically transform general-purpose image features into a form that is effective for the task of interest. In this paper, we propose a semi-supervised feature transformation method, which is formulated as a natural coupling of principal component analysis (PCA) and linear discriminant analysis (LDA) in the framework of graph-embedding. Compared with other feature transformation methods, our method showed favorable classification performance in biological image analysis.
Introduction
In biological image analysis, biologists manually identify and/or classify images captured via a microscope. However, the data usually comprise a large number of images, so the analysis imposes a heavy burden on biologists and increases the risk of false annotations. Therefore, to improve both efficiency and accuracy, there is great demand for systems that support biologists with image annotation. Recently, many such systems have been proposed [1–5], and some of them are currently being used in biological and medical research. These supporting systems, which analyze biological images, are generally built from a feature extraction method and a classification method. In those systems, task-oriented feature extraction methods, such as the shift-and-rotation-invariant feature extraction method for classifying biological particles [4], are very effective [1–4] at improving the classification accuracy. However, the improvement is limited when the method is applied to an unexpected task (such as when a feature extraction method
PLOS ONE | DOI:10.1371/journal.pone.0166413 December 2, 2016
Semi-Supervised FT for Tissue Image Classification
Competing Interests: The authors have declared that no competing interests exist.
for intracellular particles is applied to an image classification task for tissues) [6], and knowledge of Computer Vision and/or Pattern Recognition is necessary in order to successfully apply the various feature extraction methods. Unfortunately, few of the primary users of these systems, the research biologists, specialize in Computer Vision and/or Pattern Recognition.
In recent years, deep learning methods such as convolutional neural networks (CNN) have produced promising performance in many image classification tasks [7, 8]. Training those CNN-based methods requires large-scale datasets as well as specialized knowledge about the CNN architectures, which, however, are generally not available in the field of biological classification. On the other hand, CNN feature extractors "pretrained" on large-scale data from a different domain, e.g., ImageNet [9], have been shown to be transferable, effectively improving, e.g., medical image classification [10]. In that case, it would be further useful to apply a (semi-) supervised feature transformation method that can automatically adapt the general features to various types of tasks, thereby making these methods available to biologists who lack specialized knowledge of feature extraction methods.
Here, we simply define the feature transformation as the linear mapping y = A^T x, in which the transformation matrix A is obtained by solving an optimization problem. We can apply this feature transformation to obtain classifiable features y from features x of various characteristics by using A, without knowing how x is constructed. Therefore, we can regard a multivariate analysis as a feature transformation.
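As a minimal sketch of the mapping y = A^T x defined above, the following Python (NumPy) snippet applies a transformation matrix to a set of feature vectors stored as columns. The sizes d, m, n and the random matrix A are hypothetical placeholders chosen for illustration; in the setting of this paper, A would instead be obtained by solving an optimization problem.

```python
import numpy as np

def transform(A, X):
    """Apply y = A^T x to every column x of X.

    A: (d, m) transformation matrix, X: (d, n) input features.
    Returns Y of shape (m, n).
    """
    return A.T @ X

# Hypothetical sizes: d input dimensions, m output dimensions, n samples.
d, m, n = 512, 10, 40
rng = np.random.default_rng(0)
X = rng.normal(size=(d, n))   # columns are feature vectors x_i
A = rng.normal(size=(d, m))   # placeholder only; normally learned from data
Y = transform(A, X)           # columns are transformed features y_i
```

Note that the transformation only needs the numerical values of x, not any knowledge of how those features were extracted.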
When we apply the feature transformation to the extracted features in the classification of biological datasets, the feature transformation method should be applicable to ill-posed problems without specialized knowledge, because a biological dataset is generally small compared to the dimensionality of the input vector, as shown in [11]. In this case, a multivariate analysis method can easily deal with the ill-posed problem by solving a dual formulation.
Principal component analysis (PCA) is a simple unsupervised feature transformation, and it is widely used for applications requiring dimensionality reduction and/or feature extraction [12]. It is essentially the same as the Karhunen-Loève transformation [13], and it is formulated as the problem of estimating the orthogonal transformation coefficients from a given set of input data by maximizing the variance of the transformed data. Some studies have shown that when the size of the training dataset is small, PCA can outperform LDA, and in addition, PCA is less sensitive to differences in the categories [14]. However, in general, (semi-) supervised feature transformations perform better than PCA.
Fisher's linear discriminant analysis (LDA) [15] is a well-known method for extracting the features that maximize the discrimination. LDA is formulated as the problem of estimating the transformation coefficients for labeled input data such that the ratio of the between-class variance to the within-class variance is maximized. When the label information is available, e.g., in classification tasks, LDA performs better than PCA [16]. However, especially in the biological field, it is difficult to prepare many training samples with reliable class labels. When the number of labeled samples is less than the number of dimensions, the covariance matrix of the classes may not be accurately estimated. In this case, the generalization performance for the testing samples cannot be guaranteed.
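The remark on dual formulations can be illustrated for PCA. The sketch below is not the authors' implementation; it assumes samples are stored as columns of a d x n matrix with n much smaller than d, and the function name pca_dual is hypothetical. Instead of the d x d covariance matrix, it decomposes the n x n Gram matrix of the centered data and maps the eigenvectors back to the input space.

```python
import numpy as np

def pca_dual(X, m):
    """PCA via the dual (Gram matrix) formulation, for d >> n.

    X: (d, n) data with samples as columns; m: number of components
    (assumed m <= n - 1 so the selected eigenvalues are positive).
    Returns A (d, m) with orthonormal columns and the variances
    of the data along each column of A.
    """
    n = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)   # center the samples
    G = Xc.T @ Xc                            # n x n Gram matrix
    evals, evecs = np.linalg.eigh(G)         # eigenvalues in ascending order
    order = np.argsort(evals)[::-1][:m]      # indices of the m largest
    evals, evecs = evals[order], evecs[:, order]
    # If G v = lam v, then (Xc Xc^T)(Xc v) = lam (Xc v) and ||Xc v||^2 = lam,
    # so dividing by sqrt(lam) yields unit-norm primal eigenvectors.
    A = (Xc @ evecs) / np.sqrt(evals)
    return A, evals / n

# Hypothetical small-sample, high-dimensional data.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 15))
A, var = pca_dual(X, 3)
Y = A.T @ (X - X.mean(axis=1, keepdims=True))  # transformed features
```

With 15 samples in 200 dimensions, the eigenproblem solved here is 15 x 15 rather than 200 x 200, which is the practical benefit of the dual form in the small-sample regime described above.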
In order to overcome this problem, various feature transformation methods have been proposed; these include semi-supervised discriminant analysis (SDA) [17] and the heuristic fusion algorithm [18]. For biological data such as tissue images, the given class labels are often unreliable, because the objects to be measured inherently contain some physical and biological uncertainty. Moreover, some labels might be incorrectly assigned by human intuition. Reliable labels, however, would be available for a small portion of the training samples. In such cases, semi-supervised learning can be effectively applied to transform the features extracted from the biological data.
SDA is a natural extension of LDA in a graph-embedding framework [19]. The graph-embedding framework can be considered a general expression of multivariate analysis methods, such as PCA and LDA, in a graph structure. The regularization term in SDA is based on locality preserving projections (LPP) [20, 21] and is introduced to deal with the unlabeled training samples. Thus, SDA efficiently exploits both labeled and unlabeled data; the labeled data are used to maximize the discriminating power, while the unlabeled data are used to maximize the locality preserving power. When SDA is applied to actual data, especially biological microscopic images, it is difficult to determine the optimal similarity measure for the regularization term, because this depends on the characteristics of the sample.
In this paper, we propose semi-supervised component analysis (SCA), a method for transforming features in order to improve the classification accuracy and the usability of image analysis in biological fields. Our method is formulated in the framework of semi-supervised learning, directly incorporating PCA and LDA via a graph-embedding expression; a discriminant criterion is added to the PCA when there are labeled training samples. This differs from the fusion algorithm [18], which heuristically and individually mixes the coefficients estimated by LDA and PCA, and the direct formulation ensures that our proposed method performs at least as well as either PCA or LDA. In addition, unlike SDA, our method does not require a priori knowledge of similarity. Furthermore, we also present a kernel-based method (similar to those used in [19–21]) to deal with ill-posed problems. A preliminary version of the proposed SCA has been published [22]. In the present paper, we propose a refined version and discuss its formulation. In addition, we introduce a scaling parameter to the definition of the SCA in order to improve the cooperation between PCA and LDA.
Methods
In this section, we briefly review PCA and LDA expressed by the graph-embedding framework, and we then present SCA.
Principal component analysis
PCA is a linear transformation method that is widely used to estimate the orthogonal bases so as to maximize the variance of projected data. Suppose X = [x1, ..., xn] ∈