
IEEE TRANSACTIONS ON INFORMATION TECHNOLOGY IN BIOMEDICINE, VOL. 10, NO. 4, OCTOBER 2006

Novel Round-Robin Tabu Search Algorithm for Prostate Cancer Classification and Diagnosis Using Multispectral Imagery Muhammad Atif Tahir, Member, IEEE, and Ahmed Bouridane, Senior Member, IEEE

Abstract—Quantitative cell imagery in cancer pathology has progressed greatly in the last 25 years. The application areas are mainly those in which the diagnosis is still critically reliant upon the analysis of biopsy samples, which remains the only conclusive method for making an accurate diagnosis of the disease. Biopsies are usually analyzed by a trained pathologist who, by analyzing the biopsies under a microscope, assesses the normality or malignancy of the samples submitted. Different grades of malignancy correspond to different structural patterns as well as to apparent textures. In the case of prostate cancer, four major groups have to be recognized: stroma, benign prostatic hyperplasia, prostatic intraepithelial neoplasia, and prostatic carcinoma. Recently, multispectral imagery has been used to solve this multiclass problem. Unlike conventional RGB color space, multispectral images allow the acquisition of a large number of spectral bands within the visible spectrum, resulting in a large feature vector size. For such a high dimensionality, pattern recognition techniques suffer from the well-known “curse-of-dimensionality” problem. This paper proposes a novel round-robin tabu search (RR-TS) algorithm to address the curse-of-dimensionality for this multiclass problem. The experiments have been carried out on a number of prostate cancer textured multispectral images, and the results obtained have been assessed and compared with previously reported works. The system achieved 98%–100% classification accuracy when testing on two datasets. It outperformed principal component/linear discriminant classifier (PCA-LDA), tabu search/nearest neighbor classifier (TS-1NN), and bagging/boosting with decision tree (C4.5) classifier.

Index Terms—Feature selection, multispectral images, nearest neighbor (1NN) classifier, prostate cancer diagnosis, round-robin (RR) classification, tabu search (TS).

I. INTRODUCTION

Over the last decade, prostate cancer has surpassed lung cancer as the most commonly diagnosed cancer in the male population, with approximately 22 900 new cases diagnosed every year in the U.K. alone [1]. There are a number of methods of diagnosis, including a prostate-specific antigen (PSA) blood test. If a PSA-positive result is obtained, the urologist will often advise a needle biopsy of the prostate, in which a small sample of tissue is taken for analysis [2]. A pathologist will analyze the textures and structures present in the samples to make a diagnosis. Different grades of malignancy correspond to different structural patterns as well as to

Manuscript received January 24, 2005; revised April 28, 2005, January 31, 2006, and April 2, 2006. M. A. Tahir is with the Faculty of Computing, Engineering, and Mathematical Sciences, University of the West of England, Bristol BS16 1QY, U.K. (e-mail: [email protected]). A. Bouridane is with the School of Electronics, Electrical and Computer Science, Queen’s University, Belfast BT7 1NN, U.K. (e-mail: a.bouridane@ qub.ac.uk). Digital Object Identifier 10.1109/TITB.2006.879596

Fig. 1. Images showing representative samples of the four classes. (a) Stroma. (b) BPH. (c) PIN. (d) PCa.

apparent textures. In the case of the prostate gland, four major groups have to be recognized as follows [2]: 1) stroma: STR (normal muscular tissue); 2) benign prostatic hyperplasia: BPH (a benign condition); 3) prostatic intraepithelial neoplasia: PIN (a precursor state for cancer); 4) prostatic carcinoma: PCa (abnormal tissue development corresponding to cancer). Fig. 1 shows samples of the four classes. In the last two decades, the development in machine vision and intelligent image processing systems combined with advancements in computer hardware has made possible the analysis of histopathological images. By enabling quantitative measurements, machine vision provides valuable assistance to pathologists, and can contribute to reducing diagnosis error cases, thereby avoiding consequent legal and financial issues. In addition to providing objective measurements, computer vision techniques can also reduce the tedious aspect of human image interpretation, and therefore improve the accuracy of the diagnosis. Numerous investigations have been carried out using different approaches such as morphology, texture analysis, and others for the classification of prostatic samples [3]–[8]. However, all these studies have been performed using a color space that is limited either to gray-level images, or to the standard RGB channels. In both cases, the color sampling process results

1089-7771/$20.00 © 2006 IEEE


in the loss of a considerable amount of spectral information, which may be extremely valuable in the classification process. The recent development of technologies such as high-throughput liquid crystal tunable filters (LCTFs) has introduced multispectral imaging to pathology, enabling a complete high-resolution optical spectrum to be generated at every pixel of a microscope image. Such an approach represents a completely novel way of analyzing pathological tissues. A few pioneering investigations have been carried out, such as [9], where the authors used a large set of multispectral texture features for the detection of cervical cancer. In [10], spectral morphometric characteristics were used on specimens of breast carcinoma cells stained with haematoxylin and eosin (H&E); the analysis showed a correlation between specific patterns of spectra and different groups of breast carcinoma cells. Larsh et al. [11] suggested that multispectral imaging can improve the analysis of pathological scenes by capturing patterns that are transparent to both the human eye and standard RGB imaging. Recently, Roula and coworkers have described a novel approach in which additional spectral data is used for the classification of prostate needle biopsies [12], [13]. The aim of their approach is to help pathologists reduce the diagnosis error rate. Instead of conventional grey-scale or RGB color images, spectral bands have been used. Results show that multispectral image classification using supervised linear discriminant analysis (LDA) outperforms both RGB and grey-level-based classification. Although an overall classification accuracy of 94% was achieved in their research, a principal component analysis (PCA) technique was used to reduce the high dimensionality of the feature vector. PCA has an obvious drawback: because each principal component is a linear combination of all the original variables, the new variables may not have a clear physical meaning.
A classification applied to PCA-reduced features may not be optimal, since the training data may contain undesirable artefacts due to illumination, occlusion, or errors from the underlying data-generation method. It is desirable not only to achieve dimensionality reduction but also to take the problems mentioned earlier into account in order to further improve the classification accuracy. The major problem arising in using multispectral data is the high-dimensional feature vector (size > 100). The number of training samples used to design the classifier is small relative to the number of features. For such a high-dimensionality problem, pattern recognition techniques suffer from the well-known curse-of-dimensionality problem [14]. In previous papers [15], [16], we addressed the high input dimensionality problem by selecting the best subset of features using intermediate-memory tabu search (TS), followed by classification using a nearest neighbor (1NN) classifier [17]. Though this approach yielded results superior to previously reported methods [12], [13], the classification accuracy can be further improved by decomposing this multiclass problem into a number of simpler two-class problems. In this case, each subproblem can be regarded separately and solved using a suitable binary classifier. The outputs of this collection of classifiers can then be combined to produce the overall result for the original multiclass problem. In this paper, we propose a novel round-robin (RR) classification algorithm


using a tabu search/nearest neighbor (TS/1NN) classifier to improve the classification accuracy. Round-robin classification is a technique that is suitable for use in multiclass problems. The technique consists of dividing the multiclass problem into an appropriate number of simpler binary classification problems [18]. Each binary classifier is implemented as a TS/1NN classifier, and the final outcome is computed using a simple voting technique. A key characteristic of this approach is that, in a binary class, the classifier tries to find features that distinguish only that class. Thus, different features are selected for each binary classifier, resulting in an overall increase in classification accuracy. In contrast, in a multiclass problem, the classifier tries to find those features that distinguish all classes at once.
The remainder of this paper is organized as follows. Section II discusses the curse-of-dimensionality problem that arises from using multispectral data. Section III describes RR classification, followed by the proposed RR TS/1NN classifier in Section IV. Section V discusses sample preparation and image acquisition. In Section VI, experiments are described and results presented. Finally, Section VII concludes the paper.

II. CURSE-OF-DIMENSIONALITY PROBLEM

As discussed in Section I, the major problem arising from multispectral data is related to the feature vector size. Typically, with 16 bands and 8 features in each band, the feature vector size is 128 [12]. For such a high-dimensionality problem, pattern recognition techniques suffer from the well-known curse-of-dimensionality problem: keeping the number of training samples limited and increasing the number of features will eventually result in badly performing classifiers [14], [19]. One way to overcome this problem is to reduce the dimensionality of the feature space.
While a precise relationship between the number of training samples and the number of features is hard to establish, a combination of theoretical and empirical studies has suggested the following rule of thumb regarding the ratio of sample size to dimensionality: the number of training samples per class should be at least five times the number of features used [20]. For example, for a feature vector of dimension 20, at least 100 training samples per class are needed to design a satisfactory classifier. PCA (a well-known unsupervised feature extraction method) has been used by Roula et al. on the large resulting feature vectors to reduce their dimensionality to a manageable size. The classification tests were carried out using supervised LDA [21], and a classification accuracy of 94% was achieved in their experiments. Another way to reduce the dimensionality of the feature space is to use feature selection methods. The term feature selection refers to the selection of the best subset of the input feature set. Such methods, used in the design of pattern classifiers, have three goals: 1) to reduce the cost of extracting the features; 2) to improve the classification accuracy; and 3) to improve the reliability of the performance estimate, since a reduced feature set requires fewer training samples in the training procedure of a pattern classifier [14], [22]. Feature selection produces savings in measuring the features (since some of the features


are discarded), and the selected features retain their original physical interpretation [14]. This feature selection problem can be viewed as a multiobjective optimization problem, since it involves minimizing the size of the feature subset and maximizing the classification accuracy. Mathematically, the feature selection problem can be formulated as follows. Suppose X is the original feature vector with cardinality n, and X̄ is the new feature vector with cardinality n̄, where X̄ ⊆ X. Let J(X̄) be the selection criterion function for the new feature vector X̄. The goal is to optimize J(·). This feature selection problem is NP-hard [23], [24]. Therefore, the optimal solution can only be guaranteed by performing an exhaustive search of the solution space [17]. However, exhaustive search is feasible only for small n. A number of algorithms have been proposed for feature selection to obtain near-optimal solutions [14], [22], [25]–[29]. The choice of an algorithm for selecting the features from an initial set depends on n. The feature selection problem is said to be of small scale, medium scale, or large scale according to whether n belongs to the interval [0, 19], [20, 49], or [50, ∞), respectively [22], [27]. Sequential forward selection (SFS) [30] is the simplest greedy sequential search algorithm and has been used for land mine detection using multispectral images [31]. Other sequential algorithms, such as sequential forward floating search (SFFS) and sequential backward floating search (SBFS), are more efficient than SFS and usually find fairly good solutions for small- and medium-scale problems [26]. However, these algorithms suffer from the deficiency of converging to locally optimal solutions for large-scale problems where n > 100 [22], [27]. Recent iterative heuristics such as tabu search and genetic algorithms have proved to be effective in tackling this category of problems, which are characterized by an exponential and noisy search space with numerous local optima [27], [28], [32], [33].
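As an illustration of why exhaustive search quickly becomes infeasible at this scale, the interval-based problem categories and the size of the subset search space can be sketched as follows (a minimal sketch; the function names are ours, not the paper's):

```python
def problem_scale(n):
    """Scale of a feature selection problem by initial dimensionality n,
    per the intervals [0, 19], [20, 49], and [50, inf) [22], [27]."""
    if n <= 19:
        return "small"
    if n <= 49:
        return "medium"
    return "large"

def subset_count(n):
    """Number of nonempty feature subsets an exhaustive search must score."""
    return 2 ** n - 1
```

For the 128-dimensional vector of dataset 1, `problem_scale(128)` is large-scale and the exhaustive space exceeds 10³⁸ subsets, which is why iterative heuristics such as TS are used instead.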
In previous papers [15], [16], we addressed the high input dimensionality problem by selecting the best subset of features using an intermediate-memory TS, with classification performed by a 1NN classifier. In that work, the classifier treats all classes in a single multiclass process. In this paper, we propose another scheme in which the multiclass problem is solved using RR classification, where the classification problem is decomposed into a number of binary problems. The key point is that it is then possible to design simpler and more efficient binary classifiers, as will be demonstrated in Section III.

III. ROUND-ROBIN CLASSIFICATION

Fürnkranz [18] defines RR classification as follows: “The round-robin or pairwise class binarization transforms a c-class problem into c(c − 1)/2 two-class problems ⟨i, j⟩, one for each pair of classes ⟨i, j⟩, i = 1, . . . , c − 1, j = i + 1, . . . , c. The binary classifier for problem ⟨i, j⟩ is trained with examples of classes i and j, whereas examples of classes k ≠ i, j are ignored for this problem.” Fig. 2 illustrates a multiclass (four-class) learning problem, where one classifier (a TS/1NN classifier in this study) separates all classes. Fig. 3 shows round-robin learning with c(c − 1)/2 classifiers. For a four-class problem, the round-robin trains six

Fig. 2. Multiclass learning. p: PIN. c: PCa. b: BPH. s: STR.

classifiers, one for each pair of classes. Each classifier is trained using a feature selection algorithm based on the intermediate-memory TS/1NN classifier proposed in [16]. A simple voting technique [18] is then used to combine the predictions of the pairwise classifiers, thereby computing the final result. In the case of a tie, a distance metric (the squared Euclidean distance) is used for the final prediction. Fig. 4 illustrates the simple voting scheme. When classifying an unknown new sample, each classifier (1NN in this case) determines to which of its two classes the sample is more likely to belong.

IV. PROPOSED IMPLEMENTATION OF ROUND-ROBIN TABU SEARCH USING INTERMEDIATE MEMORY FOR FEATURE SELECTION

A. Overview of Tabu Search

TS was introduced by Glover [34], [35] as a general iterative metaheuristic for solving combinatorial optimization problems. TS is conceptually simple and elegant. It is a form of local neighborhood search. TS starts from an initial solution and then examines feasible neighboring solutions. It moves from a solution to its best admissible neighbor, even if this causes the objective function to deteriorate. To avoid cycling, solutions that were recently explored are declared forbidden, or tabu, for a number of iterations. The tabu status of a solution is overridden when certain criteria (aspiration criteria) are satisfied. Intensification and diversification strategies are sometimes used to improve the search. In the first case, the search is accentuated in promising regions of the feasible domain; in the second, an attempt is made to consider solutions in a broad area of the search space. The TS algorithm is given in Fig. 5.

B. Fuzzy Objective Function

In this paper, we present an RR-TS algorithm in which the quality of a solution is characterized by a fuzzy logic rule expressed in linguistic variables of the problem domain. Fuzzy set theory has recently been applied in many areas of science and engineering.
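The pairwise binarization of Section III and the simple voting scheme above can be sketched as follows (a sketch under our own naming; the paper breaks ties among top-voted classes with the squared Euclidean distance):

```python
from collections import Counter
from itertools import combinations

def pairwise_problems(classes):
    """All c(c - 1)/2 two-class problems of the round-robin binarization [18]."""
    return list(combinations(classes, 2))

def pairwise_training_set(samples, i, j):
    """Training examples for problem (i, j); classes k != i, j are ignored."""
    return [(x, y) for (x, y) in samples if y in (i, j)]

def simple_vote(predictions):
    """Combine pairwise predictions by simple voting; returns the classes
    sharing the top vote count (a tie is then broken by a distance metric)."""
    counts = Counter(predictions)
    top = max(counts.values())
    return sorted(c for c, v in counts.items() if v == top)
```

For the four-class problem, `pairwise_problems(["STR", "BPH", "PIN", "PCa"])` yields the six classifiers of Fig. 3.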
In most practical situations, one is faced with several concurrent objectives. Classic approaches usually deal with this difficulty by computing a single utility function as a weighted sum of the individual objectives, where more important objectives are assigned higher weights [15]. Balancing different objectives by weight functions is, at best, controversial. Fuzzy logic is a convenient vehicle for trading off different


Fig. 3. Round-robin learning. p: PIN. c: PCa. b: BPH. s: STR.

Fig. 4. Simple voting scheme.


Fig. 6. Membership function for fuzzy subset X, where, in this application, X is the number of features F , the number of incorrect predictions P , or the classification error rate E.

Fig. 5. Flowchart of a short-term TS.

objectives. It allows the mapping of values of different criteria into linguistic values that characterize the designer's level of satisfaction with the numerical values of the objectives, and it operates over the interval [0,1] defined by the membership functions for each objective. Three linguistic variables are defined to correspond to the three component objective functions: number of features f1, number of incorrect predictions f2, and average classification error rate f3. One linguistic value is defined for each component of the objective function. These linguistic values characterize the degree of satisfaction of the designer with the values of

objectives fi(x), i = {1, 2, 3}. These degrees of satisfaction are described by the membership functions µi(x) on fuzzy sets of the linguistic values, where µ(x) is the membership value for solution x in the fuzzy set. The membership functions for the minimum number of features, the minimum number of incorrect predictions, and the low classification error rate are easy to build. They are assumed to be nonincreasing functions because the smaller the number of features f1(x), the number of incorrect predictions f2(x), and the classification error rate f3(x), the higher is the degree of satisfaction µ1(x), µ2(x), and µ3(x) of the expert system (see Fig. 6). The fuzzy subset of a good solution is defined by the following fuzzy logic rule: “IF a solution has a small number of features AND a small number of incorrect predictions AND a low classification error rate THEN it is a good solution.” According to the and/or-like ordered-weighted-averaging logic [36], [37], the above rule corresponds to the following:

µ(x) = γ × min_i µi(x) + (1 − γ) × (1/3) Σ_{i=1…3} µi(x)    (1)


where γ is a constant in the range [0,1]. The shape of the membership function µ(x) is shown in Fig. 6. Membership of data in a fuzzy set is defined using values in the range [0,1]. The membership values for the number of features F, the number of incorrect predictions P, and the classification error rate E are computed using the following:

µ1(x) = 1,                          if F ≤ FMin
        (FMax − F)/(FMax − FMin),   if FMin ≤ F ≤ FMax    (2)
        0,                          if FMax ≤ F

µ2(x) = 1,                          if P ≤ PMin
        (PMax − P)/(PMax − PMin),   if PMin ≤ P ≤ PMax    (3)
        0,                          if PMax ≤ P

µ3(x) = 1,                          if E ≤ EMin
        (EMax − E)/(EMax − EMin),   if EMin ≤ E ≤ EMax    (4)
        0,                          if EMax ≤ E.
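A minimal sketch of the membership functions (2)–(4) and their combination via the ordered-weighted-averaging rule (1); the variable names are ours:

```python
def membership(value, v_min, v_max):
    """Nonincreasing membership of (2)-(4): 1 at or below v_min, 0 at or
    above v_max, and linear in between."""
    if value <= v_min:
        return 1.0
    if value >= v_max:
        return 0.0
    return (v_max - value) / (v_max - v_min)

def fuzzy_quality(mu_1, mu_2, mu_3, gamma=0.5):
    """Eq. (1): gamma weights the min (AND-like) term against the mean
    (OR-like) term of the three satisfaction degrees."""
    mus = (mu_1, mu_2, mu_3)
    return gamma * min(mus) + (1.0 - gamma) * sum(mus) / 3.0
```

With F between FMin = 1 and FMax = 128, for example, `membership(64.5, 1, 128)` is 0.5: halving the feature count halves the dissatisfaction.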

Fig. 7. Example showing intensification steps for tabu search; the count shown is the number of occurrences of each feature in the best solutions.

The maximum number of features (FMax) is the size of the feature vector, and the minimum number of features (FMin) is 1. The maximum number of incorrect predictions (PMax) and the maximum classification error rate (EMax) are determined by applying a 1NN classifier [38] to the initial solution. The minimum number of incorrect predictions (PMin) is 0, while the minimum classification error rate (EMin) is 0%. Neighbors are determined using the squared Euclidean distance defined as

D(x, y) = Σ_{i=1…m} (xi − yi)²    (5)

where x and y are two input vectors and m is the number of features.

C. Initial Solution

The feature selection vector is represented by a 0/1 bit string, where 0 indicates that the feature is not included in the solution and 1 indicates that it is. All features are included in the initial solution.

D. Neighborhood Solutions

Neighbors are generated by randomly adding or deleting a feature from the feature vector of size n. For example, if 11001 is the current feature vector, then the possible neighbors with a candidate list size of three might be 10001, 11101, and 01001. Among the neighbors, the one with the best cost [i.e., the solution that results in the minimum value of (1)] is selected and considered as the new current solution for the next iteration.

E. Tabu Moves

A tabu list is maintained to avoid returning to previously visited solutions. Using this approach, if a feature (move) is added or deleted at iteration i, then adding or deleting the same feature (move) for the T subsequent iterations (the tabu list size) is tabu.
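The neighborhood move described in Section IV-D (randomly adding or deleting one feature) can be sketched as follows (a sketch; the candidate-list handling is our simplification):

```python
import random

def generate_neighbors(solution, candidate_list_size, rng):
    """Generate neighbors of a 0/1 feature vector by flipping one randomly
    chosen bit per neighbor (adding or deleting that feature)."""
    positions = rng.sample(range(len(solution)), candidate_list_size)
    neighbors = []
    for i in positions:
        neighbor = solution[:]
        neighbor[i] = 1 - neighbor[i]  # delete if present, add if absent
        neighbors.append(neighbor)
    return neighbors
```

Starting from 11001 with a candidate list of three, each generated neighbor differs from the current solution in exactly one bit, e.g., 10001 or 11101.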




F. Aspiration Criterion

The aspiration criterion is a mechanism used to override the tabu status of moves: it temporarily overrides the tabu status if the move is sufficiently good. In our approach, if a feature is added or deleted at iteration i and this move results in the best cost over all previous iterations, then this feature is allowed to be added or deleted even if it is in the tabu list.

G. Termination Rule

The most commonly used stopping criteria in TS are: 1) after a fixed number of iterations; 2) after some number of iterations during which there has been no improvement in the objective function value; and 3) when the objective function reaches a prespecified value. In our algorithm, the termination condition is implemented using the fixed-number-of-iterations criterion.

H. Intensification

For intensification, the search is concentrated in the promising regions of the feasible domain. Intensification is based on some intermediate-term memory. Since the solution space is extremely large (with an initial feature vector of size n > 100), the search in the promising regions is intensified by removing poor features from the search space. The following steps are proposed for this purpose.
1) STEP 1: Store the M best solutions in intermediate memory for T1 iterations.
2) STEP 2: Remove features that are not included in the best M solutions at least N times.
3) STEP 3: Re-run the tabu search with the reduced set of features for another T2 iterations.
4) STEP 4: Repeat steps 1)–3) until there is no further improvement in the objective function.
The values of M and N are determined empirically. As an example, assume that the M = 5 best solutions shown in Fig. 7 are found by TS during T1 iterations. Feature f1


Fig. 8. Images showing different subbands of the multispectral image of type stroma.

Fig. 9. Images showing different subbands of the multispectral image of type PCa.

is always used, while feature f5 is never used in good solutions. For N = 2, the reduced feature set comprises only f1, f2, f3, f6, and f8. The search space is thus reduced from 2⁸ to 2⁵. TS will then search for near-optimal solutions in a reduced search space, avoiding visits to nonpromising regions.
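Step 2 of the intensification can be sketched as follows (a sketch with hypothetical best solutions; the indices and names are ours, not those of Fig. 7):

```python
def surviving_features(best_solutions, n_times):
    """Keep only the features that occur at least n_times across the M best
    solutions (each solution is a 0/1 list); all others are removed from
    the search space (STEP 2 of the intensification)."""
    length = len(best_solutions[0])
    counts = [sum(sol[i] for sol in best_solutions) for i in range(length)]
    return [i for i, c in enumerate(counts) if c >= n_times]

# Hypothetical M = 3 best solutions over 5 features; with N = 2 only the
# features used in at least two of them survive.
best = [[1, 1, 0, 0, 0],
        [1, 0, 1, 0, 0],
        [1, 1, 0, 0, 1]]
```

Here `surviving_features(best, 2)` keeps only the first two features, shrinking the search space from 2⁵ to 2² subsets.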


V. SAMPLE PREPARATION, IMAGE ACQUISITION, AND DATASETS DESCRIPTION

Methods for data collection have been described in [12] and [13] and are reviewed briefly here. Entire tissue samples were taken from prostate glands. Sections 5 µm thick were extracted and stained using the widely used H&E stains. These samples were routinely assessed by two experienced pathologists and graded histologically as showing STR, BPH, PIN, or PCa. From these whole sections, subimages were captured using a classical microscope and a CCD camera. An LCTF (VARISPEC™) was inserted in the optical path between the light source and the chilled CCD camera. The LCTF has a bandwidth accuracy of 5 nm. The wavelength is controllable through the visible spectrum (from 400 to 720 nm). This allowed the capture of multispectral images of the tissue samples at different spectral frequencies. In order to show the impact of multispectral imaging, experiments were carried out by Roula et al. [12], [13] for varying numbers of bands. It has been shown that the classification accuracy increases with the number of spectral bands. Figs. 8–11 show thumbnails of eight bands of multispectral images of type stroma, PCa, BPH, and PIN, respectively.

A. Datasets Description

The RR TS/1NN classifier has been tested on two datasets reported in [12], [13], and [16]. In order to offset any bias due to the different ranges of values of the original features, the input feature values are normalized over the range [1,11] using (6) [39]. Normalizing the data is important to ensure that the distance measure allocates equal weight to each variable; without normalization, the variable with the largest scale will dominate the measure

x̃i,j = [(xi,j − min_{k=1,…,n} xk,j) / (max_{k=1,…,n} xk,j − min_{k=1,…,n} xk,j)] × 10 + 1    (6)

where xi,j is the jth feature of the ith pattern, x̃i,j is the corresponding normalized feature, and n is the total number of patterns. The first dataset consists of textured multispectral images taken at 16 spectral channels (from 500 to 650 nm) [12]. Five hundred and ninety-two different samples (multispectral images) of size 128 × 128 have been used to carry out the analysis.
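The column-wise normalization of (6) can be sketched as follows (pure Python; the guard for a constant column is our addition, since (6) is undefined when max equals min):

```python
def normalize_features(patterns):
    """Scale each feature (column) of a list of pattern vectors onto [1, 11]
    as in eq. (6): (x - min) / (max - min) * 10 + 1."""
    n, m = len(patterns), len(patterns[0])
    out = [row[:] for row in patterns]
    for j in range(m):
        column = [patterns[i][j] for i in range(n)]
        lo, hi = min(column), max(column)
        for i in range(n):
            # A constant column carries no information; map it to 1.
            out[i][j] = (patterns[i][j] - lo) / (hi - lo) * 10 + 1 if hi > lo else 1.0
    return out
```

The column minimum maps to 1 and the maximum to 11, so every feature contributes on the same scale to the squared Euclidean distance of the 1NN classifier.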


Fig. 10. Images showing different subbands of the multispectral image of type BPH.

Fig. 11. Images showing different subbands of the multispectral image of type PIN.

The samples were examined at low power (×40 objective magnification) by two highly experienced independent pathologists and labeled into four classes: 165 cases of stroma, 106 cases of BPH, 144 cases of PIN, and 177 cases of PCa. The size of the feature vector is 128 [16 bands × 8 features (1 statistical + 2 structural + 5 Haralick)]. The second dataset is derived from prostatic nuclei extracted from prostate tissue [13]. The nuclei are imaged under high power (×100 objective magnification) and taken at 33 spectral channels (from 400 to 720 nm). Two hundred and thirty different images of size 256 × 256 have been used to carry out the analysis. The samples are labeled into three classes: 63 cases of BPH, 79 cases of PIN, and 88 cases of PCa. The size of the feature vector is 266 [33 bands × 8 features (3 statistical + 5 Haralick) + 2 morphology features]. The choice of features has been discussed in [12] and [13]. The following sections briefly review the features.
1) Dataset 1: The following features are used in dataset 1 [12].
Statistical feature: A statistical feature (variance) is calculated for each band in a multispectral image and added to the feature vector.

Haralick features: The five Haralick features [40] (dissimilarity, contrast, angular second moment, entropy, and correlation) are calculated for each band in a multispectral image and added to the feature vector.
Structural features: It has been shown in [12] that statistical and Haralick features are not enough to capture the complexity of the patterns in prostatic neoplasia. BPH and PCa present more complex structures, as both contain glandular areas in addition to nuclei clusters. Accurate classification requires the quantification of these differences. Quantification first requires segmenting the glandular and nuclear areas, using the fact that the glandular areas are lighter than the surrounding tissue, while the nuclear clusters are darker. Two structural features (one counting the pixels classified as glandular area, the other counting the pixels classified as nuclear area) are computed for each band in a multispectral image and added to the feature vector.
2) Dataset 2: The following features are used in dataset 2 [13].
Statistical features: Three statistical features (mean, standard deviation, and geometric moment) are calculated for each band in a multispectral image and added to the feature vector.
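A sketch of how the per-band statistical features might be assembled (pure Python; the paper uses variance for dataset 1 and mean and standard deviation among the dataset 2 features — the function names are ours):

```python
def band_mean(band):
    """Mean intensity of one spectral band (flat list of pixel values)."""
    return sum(band) / len(band)

def band_variance(band):
    """Population variance of one spectral band; the statistical feature
    added per band for dataset 1 [12]."""
    mu = band_mean(band)
    return sum((p - mu) ** 2 for p in band) / len(band)

def statistical_features(bands):
    """One variance per band: e.g., the 16 statistical entries of the
    128-D feature vector of dataset 1."""
    return [band_variance(b) for b in bands]
```

The Haralick features would be computed per band in the same fashion, each band contributing its own entries to the final feature vector.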


TABLE I. CLASSIFICATION ERROR BY PCA/LDA REPORTED IN [12].
TABLE II. CLASSIFICATION ERROR BY MULTICLASS LEARNING USING TS/1NN REPORTED IN [16].
TABLE III. CLASSIFICATION ERROR BY PROPOSED ROUND-ROBIN LEARNING USING TS/1NN.
TABLE IV. CLASSIFICATION ERROR REPORTED IN [13].
TABLE V. CLASSIFICATION ERROR BY USING FEATURE SELECTION THROUGH TABU SEARCH.
TABLE VI. CLASSIFICATION ERROR BY THE PROPOSED ROUND-ROBIN LEARNING USING TS/1NN.
TABLE VII. NUMBER OF FEATURES USED BY DIFFERENT CLASSIFIERS.
TABLE VIII. NUMBER OF FEATURES USED BY DIFFERENT CLASSIFIERS.
Haralick features: Five Haralick features [40] (dissimilarity, contrast, angular second moment, entropy, and correlation) are calculated for each band in a multispectral image and added to the feature vector.
Morphology features: Two morphology features (nuclei area and nuclei round factor) [5] are calculated and added to the feature vector.

VI. EXPERIMENTS AND DISCUSSION

The leave-one-out method is used for cross validation [14]. For leave-one-out cross validation, a classifier is designed using (s − 1) samples and evaluated on the one remaining sample; this is repeated s times, with different training sets of size (s − 1). Table I shows the classification error for the first dataset as reported in [12], where data reduction was performed using PCA and classification was performed using supervised linear discriminant analysis. Table II shows the minimum classification error obtained by the multiclass method using a TS/1NN classifier, while Table III depicts the classification error obtained by the proposed round-robin learning using a TS/1NN classifier. From these results, it can be observed that an RR-based classification

yields better results than the multiclass approach. The overall classification error has been reduced to 1.23% from the 5.71% and 2.90% reported in [12] and [16], respectively. A key characteristic of the proposed round-robin approach is that different features are captured and used for each binary classifier in the four-class problem, thus producing an overall increase in classification accuracy. In contrast, in a multiclass problem, the classifier tries to find those features that distinguish all four classes at once. Furthermore, the inherent curse-of-dimensionality problem that arises in multispectral data is also resolved by the RR TS/1NN classifiers, since each classifier is trained to compute and use only those features that distinguish its own binary class. Table IV shows the classification error for the second dataset reported in [13], where data reduction is performed using PCA and classification using LDA. Table V shows the minimum classification error obtained by the multiclass technique using a TS/1NN classifier, while Table VI depicts the classification



Fig. 12. Images showing typical errors made by the multiclass TS/1NN classifier; all are correctly classified by the proposed classifier. (a) Misclassified as PIN by the multiclass classifier. (b) Misclassified as BPH by the multiclass classifier. (c) Misclassified as PIN by the multiclass classifier. (d) Misclassified as BPH by the multiclass classifier.
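The round-robin scheme behind these results trains one binary classifier per class pair, each restricted to its own feature subset, and combines the binary decisions by voting. The following is a minimal sketch in Python; the data, class labels, and per-pair feature subsets are hypothetical stand-ins for the TS-selected features, not the authors' actual configuration:

```python
from collections import Counter
from itertools import combinations
import math

def nn1_predict(train, x):
    """1NN: return the label of the training sample closest to x (Euclidean)."""
    _, label = min(train, key=lambda s: math.dist(s[0], x))
    return label

def round_robin_predict(samples, feature_sets, x):
    """One-vs-one (round-robin) prediction: train one binary 1NN classifier per
    class pair, each on its own feature subset, then take a majority vote."""
    classes = sorted({y for _, y in samples})
    votes = []
    for a, b in combinations(classes, 2):
        feats = feature_sets[(a, b)]  # features selected for this class pair
        pair_train = [([s[i] for i in feats], y) for s, y in samples if y in (a, b)]
        votes.append(nn1_predict(pair_train, [x[i] for i in feats]))
    winner, _ = Counter(votes).most_common(1)[0]
    return winner
```

In the paper's four-class setting (STR, BPH, PIN, PCa) this yields six binary TS/1NN classifiers; the sketch is kept generic over any number of classes.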

Fig. 13. Objective function versus number of iterations for BPH versus stroma before intensification.

error obtained by the proposed round-robin learning using a TS/1NN classifier. From these tables, it can be observed that the classification accuracy has increased in all cases. The overall classification error has been reduced to 0% from the 5.1% and 0.91% reported in [13] and [16], respectively. Tables VII and VIII show the number of features used by the various data reduction techniques for dataset1 and dataset2, respectively. Different numbers of features are used by the various binary classifiers, producing an overall increase in classification accuracy. Fc represents those features that are common to two or more binary classifiers. Although the total number of features has increased in our proposed round-robin technique, the number of features used by each binary classifier is never greater than that used in any multiclass method with either PCA/LDA or TS/1NN. Consequently, multispectral data is better utilized by the round-robin technique since the

use of more features means more information is captured and used in the classification process. Furthermore, simple binary classes are also useful for analyzing features and are extremely helpful for pathologists in distinguishing patterns such as BPH, PIN, STR, and PCa. Fig. 12 shows typical errors made by the multiclass TS/1NN classifier. These samples are correctly classified by the proposed RR-TS/1NN classifier because different features are now used for each binary classifier, resulting in increased classification accuracy.

A. Quality of Solutions Produced by Tabu Search

Fig. 13 shows the value of an objective function versus the number of iterations when searching the solution space using TS for the binary classifier BPH versus stroma. The objective


Fig. 14. Objective function versus number of iterations for BPH versus stroma after intensification.

TABLE IX TABU RUN-TIME PARAMETERS

functions here are the fuzzy membership, the number of incorrect predictions, the classification error rate, and the number of features. All figures show how well focused the TS is on the good solution space. From the graphs, it can also be seen that the TS rapidly converges to the feasible/infeasible region border for all of these objectives. Fig. 14 depicts the value of the objective functions versus the number of iterations after reducing the size of the feature set using the intensification technique discussed in Section IV. From these graphs, it can be seen that the search for the best solutions is now limited to the good solution space: the membership function lies in the range 0.1–0.15 for most of the iterations, whereas it lay in the range 0.15–0.3 without intensification. Similarly, the number of features lies in the range 10–20 for most of the iterations, compared with 20–40 without intensification.
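The search loop described above can be sketched as a generic tabu search over feature-subset bit vectors. This is an illustrative reconstruction, not the authors' implementation: `evaluate` is a hypothetical stand-in for their composite objective (classification error, feature count, and fuzzy membership), and the intensification stage is omitted:

```python
import random

def tabu_feature_search(n_features, evaluate, n_iters=50, tabu_tenure=5, seed=0):
    """Minimal tabu search over feature subsets (bit vectors).
    `evaluate(subset)` returns a cost to minimize. A move flips one feature
    in or out; a recently flipped feature is tabu for `tabu_tenure` iterations
    unless flipping it beats the best cost found so far (aspiration)."""
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    best, best_cost = current[:], evaluate(current)
    tabu = {}  # feature index -> iteration until which flipping it is tabu
    for it in range(n_iters):
        candidates = []
        for j in range(n_features):
            neigh = current[:]
            neigh[j] = not neigh[j]
            cost = evaluate(neigh)
            if tabu.get(j, -1) > it and cost >= best_cost:
                continue  # tabu and not aspirational: skip this move
            candidates.append((cost, j, neigh))
        if not candidates:
            continue  # all moves tabu this iteration; tenures will expire
        cost, j, current = min(candidates)  # best admissible neighbor
        tabu[j] = it + tabu_tenure
        if cost < best_cost:
            best, best_cost = current[:], cost
    return best, best_cost
```

With a cost that rewards a small, accurate subset (e.g., an error term plus a per-feature penalty), the search settles on the compact subsets reported in Tables VII and VIII.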

B. Run-Time Parameters for Tabu Search Table IX shows the tabu run-time parameters chosen after experimentation with different values. The values of

TABLE X COMPUTATION TIME COMPARISON

M and N, mentioned in Section IV-H, are 100 and 10, respectively.

C. Computation Time

The training of the RR classifiers using TS/1NN is an offline procedure that finds the best subset of features while keeping the classification error rate low for each binary classifier. Once the TS has found the best subset of features, a 1NN classifier is used to determine the class of a new sample (a multispectral image), providing an online diagnosis decision for the pathologist. Table X shows the computation times for the multiclass and binary classes when determining the class of a new sample. The execution times using RR classifiers are higher than those for the multiclass classifier: the cost of measuring features increases because more features are required in the round-robin approach. Thus, the classification accuracy using the round-robin technique has been improved for multispectral imagery at a cost of 1.54 and 1.68 times the execution time for dataset1 and dataset2, respectively.
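The online 1NN stage, together with the leave-one-out protocol of Section VI (train on s − 1 samples, test on the held-out one, repeat s times), can be sketched as follows. This is a hypothetical minimal illustration, not the authors' code:

```python
import math

def nn1_predict(train, x):
    """1NN: return the label of the training sample nearest to x (Euclidean)."""
    _, label = min(train, key=lambda s: math.dist(s[0], x))
    return label

def loo_error(samples):
    """Leave-one-out error rate: classify each sample with a 1NN trained on
    the remaining s - 1 samples; return the fraction misclassified."""
    errors = 0
    for i, (x, y) in enumerate(samples):
        train = samples[:i] + samples[i + 1:]
        if nn1_predict(train, x) != y:
            errors += 1
    return errors / len(samples)
```

The online cost of `nn1_predict` grows with the number of features measured per sample, which is why the round-robin classifiers, using more features in total, run 1.54–1.68 times slower than the multiclass classifier.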



TABLE XI CLASSIFICATION ACCURACY (%) USING VARIOUS ENSEMBLE TECHNIQUES
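For reference, the bagging scheme compared in Table XI (bootstrap replicates of the training set, combined by simple majority voting) can be sketched as follows; `base_fit` is a hypothetical stand-in for any base learner such as C4.5 or 1NN, and all names and data are illustrative:

```python
import random
from collections import Counter

def bagging_fit(samples, base_fit, n_models=11, seed=0):
    """Bagging: train base classifiers on bootstrap replicates of the
    training set, each drawn with replacement [43]."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        replicate = [rng.choice(samples) for _ in samples]
        models.append(base_fit(replicate))  # base_fit returns a predictor
    return models

def bagging_predict(models, x):
    """Combine the base classifiers' outputs by simple majority vote."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]
```

Boosting (AdaBoost.M1) differs in that each replicate is reweighted according to the errors of the previous classifiers and the final vote is weighted; that variant is omitted here for brevity.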

VII. RELATED WORK

Bagging [41] and boosting [42] are well-known ensemble design techniques for improving the prediction of classifier systems. In bagging, the training set is sampled to produce random independent bootstrap replicates [43]; a classifier is constructed on each replicate, and the results are combined by simple majority voting to obtain the final decision. In boosting (AdaBoost.M1), the classifiers are constructed on weighted versions of the training set, which depend on previous classification results, and weighted voting is normally used to obtain the final decision. An empirical comparison between bagging and boosting is provided by Bauer and Kohavi [44]. Table XI shows the comparison between the proposed RR-TS/1NN and both bagging and AdaBoost. The decision tree (C4.5) [45] and NN classifiers are used as base classifiers for bagging and boosting. C4.5 was chosen because bagging and boosting improve the prediction of a classifier system when a decision tree is used as the base classifier, whereas they are unable to improve classification accuracy when an NN classifier is used as the base classifier [46]. This is clearly seen in Table XI, where the classification accuracy is degraded with AdaBoost and only minor improvements are achieved with bagging, while accuracy is improved by bagging and boosting when C4.5 is the base classifier. Furthermore, it is clear from the table that the proposed round-robin ensemble technique using TS/1NN outperforms both the bagging and boosting ensemble design techniques.

VIII. CONCLUSION

In this paper, a novel RR TS/1NN algorithm with intermediate-term memory has been proposed for the classification of prostate needle biopsies using multispectral imagery. Results indicate a significant increase in classification accuracy. A key characteristic of the proposed round-robin approach is that different features are used for each binary classifier from the multispectral images, producing an overall increase in classification accuracy. In contrast, in a multiclass problem, the classifier tries to find only those features that distinguish all classes at once. The algorithm is generic and can be used for the diagnosis of other diseases such as lung and breast cancer. Furthermore, the proposed tabu search progressively zooms toward a better solution subspace as time elapses, a desirable characteristic of approximation iterative heuristics.

ACKNOWLEDGMENT

The authors would like to thank the Department of Pathology at Queen's University for providing the data samples and Dr. M. A. Roula from Glamorgan University, U.K., for useful discussions on the subject.

REFERENCES

[1] Cancer Research. CancerStats—Incidence—U.K. Annual Report of Cancer Research U.K., London, U.K., 2002. [2] J. N. Eble and D. G. Bostwick, Urologic Surgical Pathology. St. Louis, MO: Mosby, 1996. [3] P. H. Bartels et al., “Nuclear chromatin texture in prostatic lesions: I PIN and adenocarcinoma,” Anal. Quant. Cytol. Histol., vol. 20, no. 15, pp. 389–396, 1998. [4] T. D. Clark, F. B. Askin, and C. R. Bagnell, “Nuclear roundness factor: A quantitative approach to grading in prostate carcinoma, reliability of needle biopsy tissue, and the effect of tumor stage on usefulness,” Prostate, vol. 10, no. 3, pp. 199–206, 1987. [5] R. Christen et al., “Chromatin texture features in hematoxylin and eosinstained prostate tissue,” Anal. Quant. Cytol. Histol., vol. 15, no. 6, pp. 383– 388, 1993. [6] J. L. Mohler et al., “Nuclear morphometry in automatic biopsy and radical prostatectomy specimens of prostatic carcinoma. A comparison,” Anal. Quant. Cytol. Histol., vol. 16, no. 6, pp. 415–420, 1994. [7] C. Minimo et al., “Importance of different nuclear morphologic patterns in grading prostatic adenocarcinoma. An expanded model for computer graphic filters,” Anal. Quant. Cytol. Histol., vol. 16, no. 5, pp. 307–314, 1994. [8] D. E. Pitts, S. B. Premkumar, A. G. Houston, R. J. Badaian, and P. Troncosa, “Texture analysis of digitized prostate pathologic cross-sections,” in Proc. SPIE Med. Imaging Image Process, 1993, pp. 456–470. [9] Y. Liu, T. Zahoa, and J. Zhang, “Learning multispectral texture features for cervical cancer detection,” in Proc. IEEE Int. Symp. Biomed. Imaging, Washington, DC, 2002, pp. 169–172. [10] I. Barshack, J. Kopolovic, Z. Malik, and C. Rothmann, “Spectral morphometric characterization of breast carcinoma cells,” Brit. J. Cancer, vol. 79, no. 9–10, pp. 1613–1619, 1999. [11] P. Larsh, L. Cheriboga, H. Yee, and M. Diem, “Infrared spectroscopy of humans cells and tissue: Detection of disease,” Technol. Cancer Res. Treat., vol. 1, no. 1, pp. 1–7, 2002. [12] M. A. 
Roula, J. Diamond, A. Bouridane, P. Miller, and A. Amira, “A multispectral computer vision system for automatic grading of prostatic neoplasia,” in Proc. IEEE Int. Symp. Biomed. Imaging, 2002, pp. 193–196. [13] M. A. Roula, A. Bouridane, and P. Miller, “A quadratic classifier based on multispectral texture features for prostate cancer diagnosis,” in Proc. 7th Int. Symp. Signal Process. Appl., Paris, France, 2003, pp. 37–40. [14] A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: A review,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 1, pp. 4–37, Jan. 2000. [15] M. A. Tahir et al., “Feature selection using tabu search for improving the classification rate of prostate needle biopsies,” in Proc. IEEE Int. Conf. Pattern Recog., Cambridge, U.K., 2004, pp. 335–338. [16] M. A. Tahir et al., “A novel prostate cancer classification technique using intermediate memory tabu search,” EURASIP J. Appl. Signal Process., Adv. Intell. Vis. Syst. Methods Appl., vol. 14, pp. 2241–2249, 2005. [17] T. M. Cover and J. M. Van Campenhout, “On the possible orderings in the measurement selection problem,” IEEE Trans. Syst., Man, Cybern., vol. SMC-7, no. 9, pp. 657–661, Sep. 1977. [18] J. Fürnkranz, “Round robin classification,” J. Mach. Learn. Res., vol. 2, pp. 721–747, 2002. [19] L. O. Jimenez and D. A. Landgrebe, “Supervised classification in high dimensional space: Geometrical, statistical, and asymptotical properties of multivariate data,” IEEE Trans. Syst., Man, Cybern. C, Appl. Rev., vol. 28, no. 1, pp. 39–54, Feb. 1998. [20] K. Fukunaga, Introduction to Statistical Pattern Recognition, 2nd ed. New York: Academic, 1990. [21] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. Hoboken, NJ: Wiley-Interscience, 2001. [22] M. Kudo and J. Sklansky, “Comparison of algorithms that select features for pattern classifiers,” Pattern Recognit., vol. 33, pp. 25–41, 2000. [23] E. Amaldi and V. 
Kann, “On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems,” Theor. Comput. Sci., vol. 209, pp. 237–260, 1998. [24] S. Davies and S. Russell, “NP-completeness of searches for smallest possible feature sets,” in Proc. AAAI Fall Symp. Relevance, 1994, pp. 37–39.


[25] A. K. Jain and D. Zongker, “Feature selection: Evaluation, application, and small sample performance,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 19, no. 2, pp. 153–158, Feb. 1997. [26] P. Pudil, J. Novovicova, and J. Kittler, “Floating search methods in feature selection,” Pattern Recognit. Lett., vol. 15, pp. 1119–1125, 1994. [27] H. Zhang and G. Sun, “Feature selection using tabu search method,” Pattern Recognit., vol. 35, pp. 701–711, 2002. [28] W. Siedlecki and J. Sklansky, “A note on genetic algorithms for large-scale feature selection,” Pattern Recognit. Lett., vol. 10, no. 11, pp. 335–347, 1989. [29] S. B. Serpico and L. Bruzzone, “A new search algorithm for feature selection in hyperspectral remote sensing images,” IEEE Trans. Geosci. Remote Sens., vol. 39, no. 7, pp. 1360–1367, Jul. 2001. [30] A. W. Whitney, “A direct method of nonparametric measurement selection,” IEEE Trans. Comput., vol. C-20, no. 9, pp. 1100–1103, Sep. 1971. [31] G. A. Clark et al., “Multispectral image feature selection for land mine detection,” IEEE Trans. Geosci. Remote Sens., vol. 38, no. 1, pp. 304–311, Jan. 2000. [32] S. M. Sait and H. Youssef, Iterative Computer Algorithms With Applications in Engineering: Solving Combinatorial Optimization Problems. Los Alamitos, CA: Wiley-IEEE Comput. Soc. Press, 2000. [33] S. Yu, S. D. Backer, and P. Scheunders, “Genetic feature selection combined with composite fuzzy nearest neighbor classifiers for hyperspectral satellite imagery,” Pattern Recognit. Lett., vol. 23, pp. 183–190, 2002. [34] F. Glover, “Tabu search I,” ORSA J. Comput., vol. 1, no. 3, pp. 190–206, 1989. [35] F. Glover, “Tabu search II,” ORSA J. Comput., vol. 2, no. 1, pp. 4–32, 1990. [36] H. J. Zimmerman, Fuzzy Set Theory and Its Applications, 3rd ed. Norwell, MA: Kluwer, 1996. [37] S. A. Khan, S. M. Sait, and H. Youssef, “Topology design of switched enterprise networks using fuzzy simulated evolution algorithm,” Eng. Appl. Artif. Intell., pp. 327–340, 2002. [38] T. M. 
Cover and P. E. Hart, “Nearest neighbor pattern classification,” IEEE Trans. Inf. Theory, vol. IT-13, no. 1, pp. 21–27, Jan. 1967. [39] M. L. Raymer et al., “Dimensionality reduction using genetic algorithms,” IEEE Trans. Evol. Comput., vol. 4, no. 2, pp. 164–171, Jul. 2000. [40] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features for image classification,” IEEE Trans. Syst., Man, Cybern., vol. SMC-3, no. 6, pp. 610–621, Nov. 1973. [41] L. Breiman, “Bagging predictors,” Mach. Learn., vol. 24, pp. 123–140, 1996. [42] Y. Freund and R. Schapire, “Experiments with a new boosting algorithm,” Proc. 13th Int. Conf. Mach. Learn., San Mateo, CA: Morgan Kaufmann, 1996, pp. 148–156. [43] B. Efron and R. Tibshirani, An Introduction to the Bootstrap. New York: Chapman & Hall, 1993.


[44] E. Bauer and R. Kohavi, “An empirical comparison of voting classification algorithms: Bagging, boosting, and variants,” Mach. Learn., vol. 36, no. 1, pp. 105–139, 1999. [45] R. Quinlan, C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann, 1993. [46] B. Yongguang, N. Ishii, and X. Du, “Combining multiple k-nearest neighbor classifiers using different distance functions,” in Lecture Notes in Computer Science (LNCS 3177), Proc. 5th Int. Conf. Intell. Data Eng. Autom. Learn. (IDEAL 2004), Exeter, U.K.

Muhammad Atif Tahir (S’03–M’06) received the B.E. degree from NED University of Engineering and Technology, Karachi, Pakistan, and the M.Sc. degree from King Fahd University, Dhahran, Saudi Arabia, both in computer engineering. He received the Ph.D. degree in computer science from Queen’s University, Belfast, U.K., in 2006. He is currently a Research Associate at the University of the West of England, Bristol, U.K. His current research interests include machine learning, custom computing using FPGAs, image/signal processing, pattern recognition, and QoS routing and optimization heuristics.

Ahmed Bouridane (M’98–SM’06) received the “Ingéniorat d’État” degree in electronics from the National Polytechnic School of Algiers (ENP), Algiers, Algeria, and the M.Phil. degree in VLSI design for signal processing from the University of Newcastle upon Tyne, U.K. He received the Ph.D. degree in computer vision from the University of Nottingham, Nottingham, U.K., in 1992. He held several positions in R&D before joining Queen’s University, Belfast, U.K., where he is currently a Reader in Computer Science. His research interests include high-performance image/signal processing, image/video watermarking, custom computing using FPGAs, computer vision, and high-performance architectures for image/signal processing.