Pattern Anal Applic (2008) 11:247–257 DOI 10.1007/s10044-007-0100-z
THEORETICAL ADVANCES
On kernel difference-weighted k-nearest neighbor classification

Wangmeng Zuo · David Zhang · Kuanquan Wang
Received: 15 December 2006 / Accepted: 8 November 2007 / Published online: 26 January 2008
© Springer-Verlag London Limited 2008
Abstract  The nearest neighbor (NN) rule is one of the simplest and most important methods in pattern recognition. In this paper, we propose a kernel difference-weighted k-nearest neighbor (KDF-WKNN) method for pattern classification. The proposed method defines the weighted KNN rule as a constrained optimization problem, and we then propose an efficient solution to compute the weights of the different nearest neighbors. Unlike the traditional distance-weighted KNN, which assigns different weights to the nearest neighbors according to their distance to the unclassified sample, difference-weighted KNN weighs the nearest neighbors by using the correlation of the differences between the unclassified sample and its nearest neighbors. To take the nonlinear structure information into account, we further extend difference-weighted KNN to its kernel version, KDF-WKNN. Our experimental results indicate that KDF-WKNN is much better than the original KNN and the distance-weighted KNN methods, and is comparable to or better than several state-of-the-art methods in terms of classification accuracy.

Keywords  Nearest neighbor · Distance-weighted KNN · Pattern classification · Classifier · Kernel methods
W. Zuo (corresponding author) · K. Wang
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China

D. Zhang
Biometrics Research Centre, Department of Computing, Hong Kong Polytechnic University, Kowloon, Hong Kong
1 Introduction

The nearest neighbor (NN) rule is one of the oldest, simplest and most important methods for regression, pattern classification, and case-based reasoning [11]. Nowadays NN methods are widely exploited in various artificial intelligence tasks, such as pattern recognition, data mining, posterior probability estimation, similarity-based query, computer vision, and bioinformatics. Various aspects of NN development have been investigated, from algorithmic innovations and computational concerns to theoretical analysis and visualization [6, 9, 29].

Assume that we are given a set of training data X = {x_1, …, x_m}, x_i ∈ R^N, with the corresponding labels y = {y_1, …, y_m}, y_i ∈ {ω_1, ω_2, …, ω_C}. Given an unknown sample x, the k-nearest neighbor (KNN) rule first identifies the k nearest neighbors of x among the m training samples, counts the number k_i of neighbors that belong to class ω_i, and then assigns x to the class ω_j with the maximum number k_j of neighbors.

To improve the computational or classification performance of the original KNN, a variety of prototype condensing, distance measure, and weighted NN techniques have recently been proposed. Prototype condensing techniques are used to efficiently reduce a large training set to a small, representative prototype set while maintaining the classification performance, by instance filtering, instance abstraction, or their combination [1, 17–19]. Fast nearest neighbor classification has been investigated by efficiently exploiting metric structure and hardware resources [4, 15]. Distance measure techniques aim to learn a dissimilarity measure or metric with better class discriminative capability. Weighted NN aims to weigh the discriminative capability of different features (weighted metric) or of different nearest neighbors (sample-weighted KNN).
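To make the plain KNN rule described above concrete, the following is a minimal sketch in Python/NumPy; the function and variable names are ours, not the paper's:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=5):
    """Plain KNN rule: majority vote among the k nearest training samples."""
    # distances from x to every training sample (Euclidean)
    dists = np.linalg.norm(X_train - x, axis=1)
    # indices of the k nearest neighbors
    nn_idx = np.argsort(dists)[:k]
    # assign x to the class with the largest neighbor count k_j
    votes = Counter(y_train[i] for i in nn_idx)
    return votes.most_common(1)[0][0]

# toy usage
X = np.array([[0., 0.], [0., 1.], [1., 0.], [5., 5.], [5., 6.], [6., 5.]])
y = np.array([0, 0, 0, 1, 1, 1])
print(knn_classify(np.array([0.5, 0.5]), X, y, k=3))  # -> 0
```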
Here we focus on the development of weighted NN to achieve better classification performance. We roughly group the current weighted NN strategies into two major categories: weighted metrics for NN classification and sample-weighted KNN.
1.1 Weighted metric for nearest neighbor classification

Generally, a weighted metric defines the distance between an unclassified sample x and a training sample x_i as d(x, x_i) = (x − x_i)^T Σ_i (x − x_i). If all Σ_i = Σ, the metric is a global metric; otherwise it is a local metric. Short and Fukunaga proposed a local adaptive distance metric for the two-class problem [31]. Their method first obtains the k nearest neighbors {x_1^NN, …, x_k^NN} using the Euclidean distance. Let {y_1^NN, …, y_k^NN}, y_i^NN ∈ {ω_1, ω_2}, be the class labels of the nearest neighbors. The local metric first defines M_1 = Σ_{y_i^NN = ω_1} (x_i^NN − x)/k_1 and M_0 = Σ_i (x_i^NN − x)/k, where k_1 is the number of nearest neighbors in class ω_1, and then calculates the local weighted distance d(x, x_i^NN) = |(M_1 − M_0)^T (x − x_i^NN)|. Subsequently, Fukunaga and Flick presented an optimal global metric Σ for NN classification and developed a nonparametric estimator [12]. Later, Hastie and Tibshirani proposed a discriminant adaptive nearest neighbor (DANN) metric which estimates Σ as a product of weighted local within- and between-class sum-of-squares matrices [13, 14]. Recently, Domeniconi and Peng further developed DANN and presented an adaptive metric nearest-neighbor (ADAMENN) algorithm and an adaptive quasiconformal kernel (AQK) nearest neighbor classifier [8, 25].

Other strategies have been investigated to learn a weighted metric for nearest neighbor classification. Ricci et al. and Paredes et al. proposed to learn a local distance measure from the viewpoint of data compression [21, 24]. One main advantage of this approach is that it can simultaneously learn the weighted metric and reduce the number of prototypes [23, 26]. Most recently, Paredes and Vidal presented a method to learn local weighted metrics by minimizing the leave-one-out (LOO) classification error [24].
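As a rough illustration of the Short–Fukunaga local distance summarized above, the following sketch computes d(x, x_i^NN) = |(M_1 − M_0)^T (x − x_i^NN)| for each neighbor; the naming and the guard for an empty class are our own simplifications, not the original authors' code:

```python
import numpy as np

def local_weighted_distances(x, nn, nn_labels, omega1):
    """Short-Fukunaga style local distances for a two-class problem.

    x         : (N,) unclassified sample
    nn        : (k, N) its k nearest neighbors (found with Euclidean distance)
    nn_labels : (k,) class labels of the neighbors
    omega1    : label of the first class
    """
    diffs = nn - x                              # rows are x_i^NN - x
    in_omega1 = (nn_labels == omega1)
    k1 = max(int(in_omega1.sum()), 1)           # guard against an empty class
    M1 = diffs[in_omega1].sum(axis=0) / k1      # class-conditional mean difference
    M0 = diffs.sum(axis=0) / len(nn)            # overall mean difference
    return np.abs((M1 - M0) @ (x - nn).T)       # |(M1 - M0)^T (x - x_i^NN)|
```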
1.2 Sample weighting for KNN classification

The KNN classifier assigns to an unclassified sample the class label which has the maximum number of nearest neighbors. This rule, however, assumes locally constant class conditional probabilities and neglects the fact that nearest neighbors closer to the unclassified sample should contribute more to the classification. To weigh the samples nearby more heavily than those farther away, Dudani proposed a distance-weighted KNN method (DS-WKNN) [10]. Let {x_1^NN, …, x_k^NN} denote the k nearest neighbors of an unclassified sample x, arranged in increasing order of the distance d(x, x_i^NN) between x_i^NN and x. Distance-weighted KNN assigns the ith nearest neighbor a weight w_i as a function of d(x, x_i^NN),

w_i = (d(x, x_k^NN) − d(x, x_i^NN)) / (d(x, x_k^NN) − d(x, x_1^NN)).

In the same paper, Dudani also discussed several alternative weighting functions.
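A minimal sketch of this weighting function follows (our own naming; the convention of returning equal weights when all k distances coincide is the usual one for Dudani's rule and is an assumption here):

```python
import numpy as np

def dudani_weights(dists):
    """Distance-weighted KNN weights from sorted neighbor distances.

    dists : (k,) distances d(x, x_i^NN), sorted in increasing order.
    """
    d1, dk = dists[0], dists[-1]
    if dk == d1:                      # all neighbors equally far: equal weights
        return np.ones_like(dists)
    return (dk - dists) / (dk - d1)   # w_i = (d_k - d_i) / (d_k - d_1)
```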
Later, Keller et al. extended fuzzy set concepts to the KNN classifier and proposed a fuzzy KNN method which can be regarded as an alternative weighted KNN rule [16]. Theoretical analysis of DS-WKNN was pioneered by Bailey and Jain, whose study indicated that, given an infinite set of training samples, the asymptotic error rate of unweighted KNN is always lower than that of any weighted KNN [3, 21]. This conclusion, however, does not apply when the number of training samples is finite [20]. A number of experimental results have also shown that weighted KNN can achieve a lower error rate than unweighted KNN in the finite-training-sample case [20, 36].

In this paper, we propose a novel sample weighting method for KNN classification. Unlike DS-WKNN, the approach we propose is a difference-weighted KNN method (DF-WKNN). Difference-weighted KNN first obtains the k nearest neighbors {x_1^NN, …, x_k^NN} of an unclassified sample x, and then calculates the differences between the nearest neighbors and x, {x_1^NN − x, …, x_k^NN − x}. Based on these differences, we compute the weight of each nearest neighbor by solving a constrained least-squares optimization problem. We can then predict the class of the sample x using the difference-weighted KNN rule. To make use of the nonlinear structure information for classification, we further extend DF-WKNN to its kernel version, kernel difference-weighted KNN (KDF-WKNN).

The remaining part of the paper is organized as follows. Section 2 first presents a definition of the objective function for assigning weights in KNN classification, and then solves a constrained least-squares optimization problem for difference-weighted KNN classification. In Sect. 3, we discuss the extension of the proposed method to its kernel version and its computational complexity. In Sect. 4, experiments are carried out to evaluate the performance of the KDF-WKNN method. Finally, Sect. 5 concludes this paper.
2 Difference-weighted KNN rule

In this section, we propose a difference-weighted KNN rule for pattern classification. First, we formulate the difference-weighted KNN rule as a constrained least-squares optimization problem, and then propose our solution to this problem for determining the weight of each nearest neighbor.
2.1 Problem formulation

Let {(x_1, y_1), …, (x_m, y_m)} be a training set, where x_i is the ith training sample and y_i is the corresponding class label. Given an unclassified sample x, the Euclidean distance metric is used to obtain the first k nearest neighbors {x_1^NN, …, x_k^NN} and their corresponding class labels {y_1^NN, …, y_k^NN}, y_i^NN ∈ {ω_1, …, ω_C}. A KNN classifier is a sample-weighted KNN classifier if it consists of the following two steps: (1) assign each nearest neighbor x_i^NN a weight w_i using a weighting algorithm; (2) assign the sample x the class label ω_{j_max} using the rule

ω_{j_max} = arg max_{ω_j} Σ_{i: y_i^NN = ω_j} w_i.   (1)
So far, several methods have been proposed for assigning the weights. For example, Dudani proposed a distance-weighted KNN method which assigns the ith nearest neighbor a weight w_i as a function of its distance d(x, x_i^NN). This approach, however, is intuitively derived and neglects the influence of the correlation between different neighbors. Most recently, Atiya proposed a maximum likelihood framework to estimate the weights of the nearest neighbors [2]. In this section, we define the weight assignment as a constrained optimization problem of reconstructing a sample from its neighborhood.

Problem. The weight vector of the nearest neighbors, w = [w_1, …, w_k]^T, is defined as the solution of the constrained optimal reconstruction of x using X = [x_1^NN, …, x_k^NN],

w = arg min_w (1/2) ||x − w^T X||²,  s.t. Σ_i w_i = 1.   (2)
2.2 Solution of the constrained least-squares optimization problem

In the constrained least-squares optimization problem defined in Sect. 2.1, the objective is a quadratic function and the constraint is linear; the problem is thus a quadratic programming (QP) problem. We use the Lagrangian multiplier method to seek a simple and efficient solution to this QP problem. Let D = [x − x_1^NN, …, x − x_k^NN]^T. The optimization problem of Eq. (2) can be rewritten as
w = arg min_w (1/2) w^T D D^T w,  s.t. Σ_i w_i = 1.   (3)
From Eq. (3), the optimal weights w depend only on the differences between x and its k nearest neighbors, {x − x_1^NN, …, x − x_k^NN}. Thus, we name our rule difference-weighted KNN (DF-WKNN), to distinguish it from Dudani's distance-weighted KNN (DS-WKNN). The Lagrangian function of the constrained optimization problem is

L(w, λ) = (1/2) w^T D D^T w − λ (w^T 1_k − 1),   (4)
where 1_k is a k × 1 vector with each element equal to 1. Let G = D D^T. Setting ∇_w L(w, λ) = 0 and ∇_λ L(w, λ) = 0, we obtain the following linear equations:

G w − λ 1_k = 0,   (5)

w^T 1_k − 1 = 0.   (6)

We can further rewrite these linear equations in block form as

[ G      −1_k ] [ w ]   [ 0_k ]
[ 1_k^T    0  ] [ λ ] = [  1  ].   (7)
If the matrix G is invertible, the matrix A = [G, −1_k; 1_k^T, 0] in Eq. (7) is also invertible, so we can compute the Lagrangian multiplier λ and the weights w by solving this linear system of equations. In fact, we can use a more efficient and numerically stable approach to compute the weights w. This approach first solves the system of linear equations

G w_0 = 1_k,   (8)

and then rescales the weights using

w = w_0 / (w_0^T 1_k).   (9)

It is simple to see that this approach yields the same weight vector w as the Lagrangian multiplier method. The inverse of the matrix G does not exist when G is singular (e.g., when the number of nearest neighbors k is larger than the sample dimension N). To avoid this case, we adopt a regularization method, adding a small multiple of the identity matrix,

G = G + η tr(G)/k · I,   (10)
where tr(G) denotes the trace of the matrix G, and η = 10^0 ∼ 10^−3 is the regularization parameter. Figure 1 illustrates an example of DF-WKNN and DS-WKNN in assigning the weights w. In Fig. 1, the distances of x_1A, x_1B, x_2A, and x_2B to the sample x are the same.
Fig. 1 Assignment of weights to the nearest neighbors using (a) the distance-weighted KNN rule and (b) the difference-weighted KNN rule (in the example, DS-WKNN assigns each of the four neighbors the weight 0.25, whereas DF-WKNN assigns 0.31 to x_2A and x_2B and 0.19 to x_1A and x_1B)
The DS-WKNN rule will assign the same weight to each neighbor, and thus the sample x cannot be correctly classified. Different from DS-WKNN, DF-WKNN can utilize the correlation of the sample x with x_2A and x_2B, and assigns higher weights to x_2A and x_2B, so the sample x can be correctly classified using the DF-WKNN rule.

We briefly summarize the main steps of DF-WKNN. Given an unclassified sample x, DF-WKNN first obtains its k nearest neighbors {x_1^NN, …, x_k^NN} and their corresponding class labels {y_1^NN, …, y_k^NN}, and then calculates the differences between x and its k nearest neighbors, D = [x − x_1^NN, …, x − x_k^NN]^T. Finally, the weights w of the k nearest neighbors are determined by solving the system of linear equations [D D^T + η tr(D D^T)/k · I] w = 1_k.
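The main steps above can be sketched as follows; this is a simplified illustration under our own naming, not the authors' implementation:

```python
import numpy as np

def dfwknn_classify(x, X_train, y_train, k=10, eta=0.1):
    """Difference-weighted KNN (DF-WKNN) as summarized above."""
    # (a) distances to all training samples, (b) k nearest neighbors
    dists = np.linalg.norm(X_train - x, axis=1)
    nn_idx = np.argsort(dists)[:k]
    nn, nn_labels = X_train[nn_idx], y_train[nn_idx]

    # (c) weights from the differences: solve [D D^T + eta*tr(D D^T)/k * I] w0 = 1_k
    D = x - nn                                   # k x N matrix of differences
    G = D @ D.T
    G_reg = G + eta * np.trace(G) / k * np.eye(k)
    w0 = np.linalg.solve(G_reg, np.ones(k))
    w = w0 / w0.sum()                            # rescale so that sum(w) = 1

    # (d) weighted vote: class with the largest accumulated weight
    classes = np.unique(nn_labels)
    scores = [w[nn_labels == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]
```

The regularization term plays the role of Eq. (10); with eta set to 0 the sketch reduces to the unregularized solution of Eqs. (8)–(9).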
2.3 Discussion

DF-WKNN assigns the weights according to the differences between x and its k nearest neighbors, and thus has some notable properties. In this subsection, we further discuss the characteristics of DF-WKNN from three aspects: optimal reconstruction under Gaussian noise, correlation, and the relation with local linear embedding.
2.3.1 Optimal reconstruction with Gaussian noise

In Sect. 2.1, we follow Eq. (2) to compute the weights. If we assume that the real output differs from the value x by additive Gaussian noise ε ~ N(0, η tr(G)/k), we can then compute the optimal weights by

w = arg min_w (1/2) ||x − w^T X||² + (η tr(G)/k) ||w||²,  s.t. Σ_i w_i = 1.   (11)

Let D = [x − x_1^NN, …, x − x_k^NN]^T and G = D D^T. The optimization problem of Eq. (11) can be rewritten as

w = arg min_w (1/2) w^T (G + η tr(G)/k · I) w,  s.t. Σ_i w_i = 1.   (12)

It is easy to see that the regularization method is actually the optimal reconstruction of the test sample x under Gaussian noise.

2.3.2 Correlation

The correlations between different nearest neighbors would be useful in classification. DS-WKNN, however, sets the weight of a neighbor only according to its distance, and thus the neighbors are essentially treated as independent. Different from DS-WKNN, the matrix G in DF-WKNN is defined by G = D D^T. Let D_0 = [x − x_1^NN − m, …, x − x_k^NN − m]^T; the scatter matrix of the difference vectors is then defined by

S_t = Σ_{i=1}^{k} (x − x_i^NN − m)(x − x_i^NN − m)^T = D_0^T D_0,   (13)

where m = (1/k) Σ_{i=1}^{k} (x − x_i^NN) is the mean of the difference vectors. In matrix theory, the scatter matrix is a basic tool for analyzing the correlation between variables. We can then relate D to S_t by

S_t = D^T D − k m m^T.   (14)

We can see that the matrix G = D D^T is closely related to the scatter matrix S_t. Thus the DF-WKNN rule can effectively utilize the correlations between different nearest neighbors for classification.

2.3.3 Relation with local linear embedding

Local linear embedding (LLE), one of the representative nonlinear dimensionality reduction methods, also uses a sample weighting approach to compute the weights that best reconstruct the data points [27, 28]. However, the aim of LLE is to map high-dimensional data into a neighborhood-preserving low-dimensional space, and it is thus less directly related to pattern classification. Different from LLE, DF-WKNN is designed for sample-weighted KNN classification: once the weights of the nearest neighbors are computed, DF-WKNN assigns the sample to the class ω_j with the maximum accumulated weight, while LLE further computes a low-dimensional embedding of the high-dimensional data based on the sample weights by optimizing a global embedding cost function.

3 Kernel difference-weighted KNN

In this section, we extend DF-WKNN to its kernel version, kernel difference-weighted KNN (KDF-WKNN), and compare the computational complexities of KNN, DS-WKNN, DF-WKNN, and KDF-WKNN.
3.1 Extension to kernel DF-WKNN classification

Using the kernel trick, we extend DF-WKNN to its nonlinear version, kernel DF-WKNN (KDF-WKNN). DF-WKNN uses the QP method to assign weights to the nearest neighbors, which cannot utilize the nonlinear structure information for classification. Extension to a kernel method provides a strategy to circumvent this restriction [22, 34]. This extension includes two steps: extension to the kernel distance and extension to the kernel Gram matrix G.

The Euclidean distance measure can be extended to the corresponding kernel distance measure. Given two samples x and x′, we define a kernel function k(x, x′) = (Φ(x)·Φ(x′)). Using the kernel function, the data x are implicitly mapped into a higher-dimensional or infinite-dimensional feature space F: x → Φ(x), and the inner product in the feature space can be easily computed using the kernel function k(x, x′) = (Φ(x)·Φ(x′)). Two popular kernel functions are the radial basis function (RBF) kernel k(x, x′) = exp(−||x − x′||²/2σ²) and the polynomial kernel k(x, x′) = (1 + x·x′)^d. The kernel distance in the feature space is then defined as

d(x, x′) = ||Φ(x) − Φ(x′)||² = k(x, x) − 2k(x, x′) + k(x′, x′).   (15)

The matrix G can also be extended to its kernel version by constructing the kernel Gram matrix G^k. In the original data space, the element g_ij of the matrix G is defined as

g_ij = ((x − x_i^NN)·(x − x_j^NN)),   (16)

where x_i^NN is the ith nearest neighbor of the unclassified sample x. Analogously, we define the element g_ij^k of the kernel Gram matrix G^k as

g_ij^k = ((Φ(x) − Φ(x_i^NN))·(Φ(x) − Φ(x_j^NN))).   (17)

Using the kernel trick, g_ij^k can be calculated explicitly:

g_ij^k = k(x, x) − k(x, x_i^NN) − k(x, x_j^NN) + k(x_i^NN, x_j^NN).   (18)

We can further derive a more compact expression of the kernel matrix G^k,

G^k = K + 1_{k×k} k(x, x) − 1_k k_c^T − k_c 1_k^T,   (19)

where K is a k × k Gram matrix with elements K_ij = k(x_i^NN, x_j^NN), 1_{k×k} is a k × k matrix with each element equal to 1, 1_k is a k × 1 vector with each element equal to 1, and k_c is a k × 1 vector whose ith element is k(x, x_i^NN). After obtaining the kernel matrix G^k, we assign the weights to the nearest neighbors by solving the linear system of equations [G^k + η tr(G^k)/k · I] w = 1_k.

3.2 Computational complexity

In this subsection, we discuss the computational complexity of four KNN-based methods: KNN, DS-WKNN, DF-WKNN, and KDF-WKNN. Since KNN-based methods are considered lazy learning approaches, we need not compare their complexity in the training stage; in the following we only consider the classification stage. Let m denote the number of training samples, d the feature dimension, and k the number of nearest neighbors, with k ≪ m.

(1) KNN. In the classification stage, the KNN rule (a) computes the distances of the test sample to all the training samples, (b) identifies the k nearest neighbors, and (c) counts the number of neighbors k_i that belong to class ω_i and assigns x to the class ω_j with the maximum k_j. Step (a) requires O(md) multiplications and step (b) requires O(mk) comparisons. The cost of step (c) can be omitted because it only requires O(k) sum and comparison operations. The computational complexity of KNN is therefore O(md) in terms of multiplications and O(mk) in terms of comparisons.

(2) DS-WKNN. In the classification stage, the DS-WKNN rule (a) computes the distances of the test sample to all the training samples, (b) identifies the k nearest neighbors, (c) assigns a weight w_i to the ith nearest neighbor x_i^NN, and (d) calculates the sum of the weights W_j of the neighbors that belong to class ω_j and assigns x to the class ω_j with the maximum W_j. Step (a) requires O(md) multiplications, step (b) O(mk) comparisons, step (c) O(k) divisions, and step (d) only O(k) sum and comparison operations. Since usually k ≪ m, the cost of steps (c) and (d) can be omitted, and the computational complexity of DS-WKNN is O(md) in terms of multiplications and O(mk) in terms of comparisons.

(3) DF-WKNN. Similar to DS-WKNN, DF-WKNN also has four steps in the classification stage: (a) compute the distances of the test sample to all the training samples, (b) identify the k nearest neighbors, (c) assign a weight w_j to the jth nearest neighbor, and (d) calculate the sum of the weights W_j of the neighbors that belong to class ω_j and assign x to the class ω_j with the maximum W_j. Step (a) requires O(md) multiplications, step (b) O(mk) comparisons, step (c) O(k²d + k³) multiplications, and step (d) only O(k) sum and comparison operations. Different from DS-WKNN, the cost of step (c) cannot be omitted, so the computational complexity of DF-WKNN is O(md + k²d + k³) in terms of multiplications and O(mk) in terms of comparisons.

(4) KDF-WKNN. The computational complexity of KDF-WKNN is also O(md + k²d + k³) in terms of multiplications and O(mk) in terms of comparisons.
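To make the kernel extension of Sect. 3.1 concrete, the following sketch builds the kernel matrix G^k of Eq. (19) with an RBF kernel and solves for the weights; the function names and the explicit loops are ours, shown only as an illustration, not the authors' implementation:

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    """k(a, b) = exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

def kdfwknn_weights(x, nn, sigma=1.0, eta=0.1):
    """Weights of the k nearest neighbors nn (k x N array) for the sample x."""
    k = nn.shape[0]
    kxx = rbf_kernel(x, x, sigma)                                # k(x, x)
    kc = np.array([rbf_kernel(x, nn[i], sigma) for i in range(k)])
    K = np.array([[rbf_kernel(nn[i], nn[j], sigma) for j in range(k)]
                  for i in range(k)])
    # Eq. (19): G^k = K + 1_{kxk} k(x,x) - 1_k kc^T - kc 1_k^T
    Gk = (K + kxx * np.ones((k, k))
          - np.outer(np.ones(k), kc) - np.outer(kc, np.ones(k)))
    Gk_reg = Gk + eta * np.trace(Gk) / k * np.eye(k)             # regularization as in Eq. (10)
    w0 = np.linalg.solve(Gk_reg, np.ones(k))
    return w0 / w0.sum()                                         # rescale so that sum(w) = 1
```

The resulting weights can then be used in the same weighted vote as in the DF-WKNN sketch of Sect. 2.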
4 Experimental results and discussion

In this section, we systematically evaluate the classification performance of the DF-WKNN and kernel DF-WKNN methods. First, we use 30 data sets from the UCI Machine Learning Repository to investigate the performance of DF-WKNN and KDF-WKNN in pattern classification. Second, we compare the classification performance of the proposed KDF-WKNN method with those of several state-of-the-art methods, such as KNN, DS-WKNN, the support vector machine (SVM), and the reduced multivariate polynomial model (RM).
4.1 The experimental setup

The proposed DF-WKNN and KDF-WKNN methods are tested on 30 benchmark data sets from the UCI Machine Learning Repository [5] and the StatLog collection. Table 1 summarizes the number of numeric and categorical features, the number of classes C, and the total number of instances m for each of the 30 data sets. These data sets include 14 two-class problems, 6 three-class problems and 10 multi-class problems, and cover a wide range of applications such as medical diagnosis, image analysis, and character recognition. We further describe the experimental setup used to test the classification accuracy of DF-WKNN and KDF-WKNN as follows:

1. Each data set is randomly split into 10 folds, and 10-fold cross-validation is used to determine the classifier parameters and the classification accuracy. To reduce bias in evaluating the performance, we report the average and standard deviation of the classification accuracy over 10 runs of 10-fold cross-validation.

2. Normalization. For all the data sets, each input feature is normalized to values within [0, 1].

3. Distance measure. In the previous sections, we only discussed the Euclidean distance between numeric features. Some data sets, however, contain categorical variables. In these cases, each categorical variable is converted into a vector of 0/1 variables: if a categorical variable x takes l values {c_1, c_2, …, c_l}, it is replaced by an (l−1)-dimensional vector [x^(1), x^(2), …, x^(l−1)] such that x^(i) = 1 if x = c_i and x^(i) = 0 otherwise, for i = 1, …, l−1. If x = c_l, all components of the vector are zero. (A sketch of this encoding is given after this list.)

4. Performance evaluation. To compare the performance of multiple classifiers, it is usual to select a number of data sets and measure an individual performance score (e.g., the classification rate) on each. Based on these individual scores, various approaches can be used to evaluate the overall performance of a classifier. One straightforward way is to calculate the average classification rate (ACR) over all data sets; this approach, however, lacks statistical evidence and is seldom used on its own. In our experiments, we use the Wilcoxon signed-ranks test to compare the statistical difference between two classifiers, and the Friedman test with the corresponding post hoc tests to compare multiple classifiers. Both the Wilcoxon and Friedman tests are nonparametric tests which do not assume a particular distribution or homogeneity of variance and are suitable for comparing classifiers over multiple data sets [7].
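For step 3 above, a minimal sketch of the (l−1)-dimensional 0/1 encoding follows (a hypothetical helper of our own, shown only to illustrate the convention):

```python
def encode_categorical(value, categories):
    """Map a categorical value to an (l-1)-dimensional 0/1 vector.

    The first l-1 categories get a one-hot position; the last category
    maps to the all-zero vector, as described in the setup above.
    """
    l = len(categories)
    vec = [0] * (l - 1)
    idx = categories.index(value)
    if idx < l - 1:
        vec[idx] = 1
    return vec

# example with categories {c1, c2, c3}
print(encode_categorical("c2", ["c1", "c2", "c3"]))   # -> [0, 1]
print(encode_categorical("c3", ["c1", "c2", "c3"]))   # -> [0, 0]
```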
Before applying DF-WKNN to a classification task, we should always determine two hyper parameter, the number of the nearest neighbors k, and the regularization parameter g. Besides these two parameters, KDF-WKNN will introduce another hyper parameter, parameter of the kernel function (e.g., Gaussian kernel r). In our experiments, the optimal values of these hyper parameters are determined using 10-fold cross-validation. Since it is very difficult to
Table 1 Summary of data sets and their characteristics

Data set        Instances   Classes   Categorical features   Numeric features
Abalone         4,177       3         1                       7
Bupa liver      345         2         0                       6
Cmc             1,473       3         7                       2
Dermatology     366         6         1                       33
Ecoli           336         8         0                       7
Glass           214         6         0                       9
Haberman        306         2         0                       3
Heart           270         2         6                       7
Ionosphere      351         2         0                       34
Image           2,310       7         0                       19
Iris            150         3         0                       4
Letter          20,000      16        0                       26
Optdigit        5,620       10        0                       64
Page block      5,473       5         0                       10
Pendigit        10,992      10        0                       16
Pima            768         2         0                       8
Shuttle land    279         2         6                       0
Spam            4,601       2         0                       57
Spect           267         2         22                      0
Spectf          267         2         0                       44
Statlog DNA     3,186       3         180                     0
Tae             151         6         1                       4
Tic-tac-toe     958         2         9                       0
Thyroid         7,200       3         15                      6
Vehicle         846         4         0                       18
Vote            435         2         16                      0
Wdbc            569         2         0                       30
Wpbc            194         2         1                       32
Wine            178         3         0                       13
Zoo             101         2         15                      1
Table 2 Settings of hyper parameters and classification rates (%) obtained using DF-WKNN and KDF-WKNN

Data set        DF-WKNN [k, η]   DF-WKNN accuracy   KDF-WKNN [k, σ, η]   KDF-WKNN accuracy
Abalone         [201, 0.1]       65.94 ± 0.17       [201, 6, 0.1]        65.94 ± 0.08
Bupa liver      [151, 0.1]       73.91 ± 0.73       [151, 2, 0.1]        73.68 ± 1.01
Cmc             [51, 0.1]        47.98 ± 0.85       [51, 4, 0.1]         47.93 ± 0.75
Dermatology     [31, 0.1]        95.64 ± 0.57       [31, 1, 0.1]         97.51 ± 0.32
Ecoli           [101, 0.1]       87.11 ± 0.62       [101, 0.5, 0.1]      87.47 ± 0.53
Glass           [25, 0.1]        70.51 ± 1.46       [25, 3, 0.1]         71.21 ± 2.23
Haberman        [81, 0.1]        75.56 ± 0.65       [81, 8, 0.1]         75.56 ± 0.68
Heart           [201, 0.1]       83.70 ± 0.69       [201, 16, 0.1]       83.74 ± 1.00
Ionosphere      [51, 0.1]        92.71 ± 0.70       [51, 10, 0.1]        92.56 ± 0.57
Image           [25, 0.1]        97.32 ± 0.13       [25, 1, 0.1]         97.53 ± 0.10
Iris            [47, 0.1]        97.93 ± 0.58       [47, 4, 1]           97.93 ± 0.49
Letter          [5, 0.1]         96.63 ± 0.11       [5, 1, 0.1]          96.68 ± 0.11
Optdigit        [17, 0.1]        99.20 ± 0.04       [17, 3, 0.1]         99.24 ± 0.04
Page block      [65, 0.1]        96.90 ± 0.07       [65, 10, 0.01]       97.10 ± 0.07
Pendigit        [81, 0.1]        99.68 ± 0.01       [81, 1, 0.1]         99.71 ± 0.02
Pima            [251, 1]         77.29 ± 0.30       [251, 20, 1]         77.25 ± 0.33
Shuttle land    [31, 0.1]        96.26 ± 0.66       [31, 4, 0.1]         96.37 ± 0.72
Spam            [41, 0.1]        92.20 ± 0.17       [41, 0.25, 0.1]      93.16 ± 0.14
Spect           [65, 0.1]        82.85 ± 1.01       [65, 8, 1]           84.12 ± 0.43
Spectf          [31, 0.1]        88.28 ± 1.28       [31, 12, 0.1]        88.31 ± 1.29
Statlog DNA     [71, 0.1]        90.52 ± 0.28       [71, 6, 0.1]         93.00 ± 0.16
Tae             [1, 0.1]         64.83 ± 2.72       [1, 1, 0.1]          64.83 ± 2.72
Tic-tac-toe     [3, 0.1]         100.0 ± 0.00       [3, 1, 0.1]          100.0 ± 0.00
Thyroid         [41, 0.1]        95.38 ± 0.06       [41, 6, 0.001]       95.38 ± 0.06
Vehicle         [35, 0.1]        82.21 ± 0.80       [35, 12, 0.1]        82.23 ± 0.87
Vote            [101, 0.1]       96.57 ± 0.20       [101, 8, 0.1]        96.44 ± 0.24
Wdbc            [25, 1]          97.42 ± 0.22       [25, 1, 1]           97.68 ± 0.19
Wpbc            [101, 1]         81.13 ± 0.91       [101, 8, 1]          81.13 ± 0.60
Wine            [91, 0.1]        98.93 ± 0.17       [91, 2, 0.1]         99.38 ± 0.39
Zoo             [9, 0.1]         96.93 ± 0.31       [9, 3, 0.1]          96.93 ± 0.31
Average                          87.38                                   87.67

Bold number denotes the maximum average classification rate of each row
Since it is very difficult to determine these three parameters at the same time, we adopt a stepwise selection strategy. We pre-fix η = 0.1 and σ = 20 and search for the optimal number of nearest neighbors k. After determining k, we pre-fix η = 0.1, fix k to its optimal value, and search for the optimal σ. After determining k and σ, we fix them to their optimal values and study the effect of the regularization parameter η.

Table 2 shows the optimal values of the hyperparameters, the classification rates, and the standard deviations of DF-WKNN and KDF-WKNN. KDF-WKNN's average classification rate over all data sets is 87.67%, which is a little higher than that of DF-WKNN, 87.38%.
We further count the number of data sets for which KDF-WKNN performs better than DF-WKNN, 17 (win record), the number of data sets for which the two methods have the same classification rate, 8 (draw record), and the number of data sets for which DF-WKNN performs better than KDF-WKNN, 5 (loss record). For most data sets, KDF-WKNN achieves classification rates better than or competitive with those of DF-WKNN.

Using the average classification rate on each of the test data sets, we compare the classification performance of KDF-WKNN with that of the other KNN classifiers, KNN and DS-WKNN. Table 4 shows the average classification accuracy of KNN, DS-WKNN, and KDF-WKNN on each of the test data sets, and the overall average classification accuracy of each method over all data sets. The average classification rate of KDF-WKNN over all data sets is 87.67%, which is higher than those of KNN at 85.19% and DS-WKNN at 85.34%.
Table 3 Error rates (%) of KDF-WKNN and CPW-NN

Data set   KDF-WKNN   CPW-NN
Balance    8.93       17.60
Cancer     3.36       3.53
DNA        7.84       4.21
Glass      30.04      27.48
Heart      17.00      19.82
Letter     4.08       4.20
Liver      27.15      36.95
Vehicle    18.55      28.09
Vote       3.48       5.26
Wine       0.69       1.24
Average    12.11      14.84

Bold number denotes the maximum average classification rate of each row
The comparison of classifiers based on the overall average classification rates, however, is not statistically convincing. We therefore further carry out a Wilcoxon signed-ranks test [30] to compare the performance of two classifiers. Let D_i be the difference between the classification rates of the two classifiers on the ith data set. We rank the D_i according to their absolute values, rank(D_i); if several difference values are equal, we assign the average rank to all of them. We then define R⁺ as the sum of ranks for the data sets on which the first method outperforms the second,

R⁺ = Σ_{D_i > 0} rank(D_i) + (1/2) Σ_{D_i = 0} rank(D_i).   (20)
Similarly, we define R⁻ as the sum of ranks for the data sets on which the second method outperforms the first. Using the Wilcoxon signed-ranks test, we compare the classification performance of KDF-WKNN and KNN. We first obtain the rank of the absolute value of the difference on each data set, and then calculate the sum of ranks where KDF-WKNN outperforms KNN, R⁺ = 463.5, and the sum of ranks where KNN outperforms KDF-WKNN, R⁻ = 1.5; the z statistic of the Wilcoxon signed-ranks test is z = −4.75. Since z < −3.18, the KDF-WKNN method is significantly superior to the KNN method in terms of classification rate at α = 0.001. Analogously, the experimental results also consistently show that KDF-WKNN is significantly superior to DS-WKNN in terms of classification rate at α = 0.001.

To objectively evaluate KDF-WKNN, we also compare its classification rate with that of another nearest neighbor method, the class and prototype weighted NN classification approach (CPW-NN) [24]. To make the comparison easier, we take the experimental results from Ref. [24]. Using the hyperparameters presented in Table 2, we evaluate the classification rate of KDF-WKNN following the experimental setup described in [24]. For the small data sets, classification results are obtained by averaging 100 runs of 5-fold cross-validation. For the larger data sets (Letter, Dna), we adopt the standard training/test partition specified in the UCI/STATLOG repository.

Table 3 shows the error rates of KDF-WKNN and the class and prototype weighted NN on ten data sets. KDF-WKNN's average error rate over all data sets is 12.11%, which is lower than that of CPW-NN, 14.84%. We count the number of data sets for which KDF-WKNN performs better than CPW-NN, 8 (win record), and the number of data sets for which CPW-NN performs better than KDF-WKNN, 2 (loss record). For most data sets, KDF-WKNN is able to achieve classification rates better than or competitive with those of CPW-NN.

4.3 Comparisons with the state-of-the-art classifiers

In this section, we evaluate KDF-WKNN by comparing it with a number of state-of-the-art classifiers, namely KNN, DS-WKNN, the support vector machine (SVM), and the reduced multivariate polynomial model (RM):

(1) SVM is a recently developed nonlinear classification approach and has achieved great success in many application tasks [34]. We use the OSU-SVM toolbox (http://svm.sourceforge.net/docs/3.00/api/) with the polynomial kernel function (Poly-SVM) as a benchmark method. To determine the hyper-parameters of Poly-SVM, we followed Ref. [33] and used an exhaustive search method, where d is found from the integers between 1 and 10, c is found from the set {0.01, 0.1, 1}, and C is found from {1, 10, 100}.

(2) SVM with the RBF kernel function (RBF-SVM) is also used as a benchmark method to evaluate the classification performance. We use an exhaustive search method, where σ is found from {0.001, 0.01, 0.1, 1, 10, 100} and C is found from {0.1, 1, 10, 100, 1,000}.

(3) The RM model is a recently developed classification approach which first transforms the original data into a reduced polynomial feature space and then uses the ridge regression method to fit the reduced multivariate polynomial model. Toh et al. have reported that RM performs well in classification tasks which involve few features and many training data [32]. In training RM, we tune two hyper-parameters, the order of the reduced polynomial model and the regularization parameter, to achieve the optimal classification accuracy. To determine the hyper-parameters of RM, we followed Ref. [33] and adopted an exhaustive search
method, where the order r is found from the integers between 1 and 10 and b is found from the set {10^−6, 10^−5, 10^−4, 10^−3, 10^−2, 10^−1, 1}.

In our experiments, we first compare the overall average classification rate and the average rank of each method over all data sets, and then use the Friedman test to statistically evaluate the differences between multiple classifiers. Table 4 lists the classification rates and standard deviations of KDF-WKNN and the other classifiers. The overall average classification rate of KDF-WKNN is 87.67%, which is higher than the classification rates of RBF-SVM (86.59%), Poly-SVM (87.53%), and RM (85.53%). The average rank of KDF-WKNN (2.10) is also lower than the ranks of RBF-SVM (3.43), Poly-SVM (2.73), and RM (3.67).
Would KDF-WKNN achieve performance comparable to or better than the state-of-the-art approaches? We use the Friedman test to examine this hypothesis. The Friedman test first ranks the jth of the k classifiers on the ith of the N_d data sets, r_i^j, then computes the average rank of each classifier over all the data sets, and finally calculates the Friedman statistic. If the Friedman test rejects the null hypothesis, we further carry out a post hoc Nemenyi test and a Bonferroni–Dunn test to reveal the differences in the classification rates of the classifiers [35]. Using the Friedman test, we evaluate the performance of KDF-WKNN by comparing it with the other five classifiers. The Friedman statistic of the average ranks is 41.33, which is higher than the critical value F_{0.05}(5, 29×5) = 2.22. Thus there is a significant difference between these six classifiers.
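As an illustration of how such comparisons can be computed in practice, the following sketch applies SciPy's Wilcoxon signed-ranks and Friedman tests to made-up per-data-set accuracy vectors (the numbers are hypothetical, not the paper's results):

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare

# hypothetical per-data-set classification rates for three classifiers
acc_kdf = np.array([0.88, 0.92, 0.75, 0.97, 0.81, 0.66])
acc_knn = np.array([0.85, 0.90, 0.72, 0.96, 0.78, 0.64])
acc_svm = np.array([0.86, 0.93, 0.74, 0.96, 0.80, 0.63])

# Wilcoxon signed-ranks test for a pair of classifiers
stat, p = wilcoxon(acc_kdf, acc_knn)
print("Wilcoxon statistic %.2f, p-value %.3f" % (stat, p))

# Friedman test over multiple classifiers (one accuracy per data set each)
chi2, p = friedmanchisquare(acc_kdf, acc_knn, acc_svm)
print("Friedman chi-square %.2f, p-value %.3f" % (chi2, p))
```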
Table 4 Comparisons of average classification rates (%) obtained using different methods on the 30 data sets

Data set       KDF-WKNN        KNN             DS-WKNN         RBF-SVM         Poly-SVM        RM
Abalone        65.94 ± 0.08    64.09 ± 0.32    64.43 ± 0.40    66.40 ± 0.22    66.43 ± 0.18    66.46 ± 0.12
Bupa liver     73.68 ± 1.01    63.48 ± 1.42    64.64 ± 1.09    63.85 ± 1.15    72.96 ± 0.95    72.58 ± 0.77
Cmc            47.93 ± 0.75    45.42 ± 0.68    46.24 ± 0.99    50.00 ± 0.60    48.42 ± 0.79    54.25 ± 0.47
Dermatology    97.51 ± 0.32    96.90 ± 0.24    96.42 ± 0.28    92.89 ± 0.94    97.60 ± 0.42    97.14 ± 0.39
Ecoli          87.47 ± 0.53    87.32 ± 0.49    87.29 ± 0.50    87.25 ± 0.52    87.08 ± 0.46    87.61 ± 0.64
Glass          71.21 ± 2.23    69.39 ± 1.60    66.07 ± 0.85    69.75 ± 1.01    71.31 ± 1.00    62.66 ± 1.77
Haberman       75.56 ± 0.68    72.88 ± 0.72    74.74 ± 0.65    69.93 ± 0.95    73.80 ± 0.58    75.35 ± 0.63
Heart          83.74 ± 1.00    81.22 ± 0.35    79.67 ± 0.78    77.59 ± 1.20    84.00 ± 0.88    83.86 ± 0.91
Ionosphere     92.56 ± 0.57    86.89 ± 0.53    86.72 ± 0.52    95.24 ± 0.45    90.93 ± 0.53    88.54 ± 0.82
Image          97.53 ± 0.10    97.10 ± 0.13    97.16 ± 0.14    97.27 ± 0.14    97.17 ± 0.12    94.11 ± 0.20
Iris           97.93 ± 0.49    95.93 ± 0.46    95.60 ± 0.46    96.20 ± 1.13    97.00 ± 0.72    96.83 ± 0.44
Letter         96.68 ± 0.11    96.18 ± 0.46    96.32 ± 0.06    97.81 ± 0.07    95.69 ± 0.09    74.14 ± 0.05
Optdigit       99.24 ± 0.04    98.82 ± 0.05    98.89 ± 0.06    99.37 ± 0.04    99.09 ± 0.08    95.37 ± 0.08
Page block     97.10 ± 0.07    96.04 ± 0.10    96.12 ± 0.11    91.63 ± 0.03    96.13 ± 0.05    95.49 ± 0.06
Pendigit       99.71 ± 0.02    99.39 ± 0.02    99.44 ± 0.02    99.61 ± 0.03    99.46 ± 0.03    95.68 ± 0.05
Pima           77.25 ± 0.33    74.08 ± 0.39    75.33 ± 0.39    77.09 ± 0.35    77.38 ± 0.31    77.47 ± 0.37
Shuttle land   96.37 ± 0.72    94.57 ± 0.88    94.75 ± 0.76    99.13 ± 0.50    98.46 ± 1.00    95.98 ± 0.34
Spam           93.16 ± 0.14    90.92 ± 0.18    90.92 ± 0.18    90.90 ± 0.15    93.81 ± 0.14    92.85 ± 0.16
Spect          84.12 ± 0.43    81.87 ± 0.86    81.80 ± 0.59    83.01 ± 0.77    81.41 ± 0.93    84.58 ± 0.76
Spectf         88.31 ± 1.29    83.75 ± 1.23    84.41 ± 1.21    90.52 ± 1.51    88.51 ± 1.02    77.18 ± 1.28
Statlog DNA    93.00 ± 0.16    88.00 ± 0.20    88.93 ± 0.13    96.14 ± 0.07    95.95 ± 0.09    95.09 ± 0.10
Tae            64.83 ± 2.72    64.83 ± 2.72    64.83 ± 2.72    62.78 ± 1.76    63.55 ± 2.19    56.82 ± 2.31
Tic-tac-toe    100.0 ± 0.00    100.0 ± 0.00    100.0 ± 0.00    99.83 ± 0.19    99.78 ± 0.04    98.33 ± 0.05
Thyroid        95.38 ± 0.06    93.88 ± 0.07    93.99 ± 0.08    95.86 ± 0.06    97.42 ± 0.07    94.32 ± 0.05
Vehicle        82.23 ± 0.87    71.38 ± 0.49    71.54 ± 0.49    81.79 ± 0.61    84.69 ± 0.57    83.10 ± 0.46
Vote           96.44 ± 0.24    93.54 ± 0.38    93.82 ± 0.46    95.57 ± 0.37    95.79 ± 0.48    95.43 ± 0.07
Wdbc           97.68 ± 0.19    97.01 ± 0.26    97.01 ± 0.32    97.75 ± 0.16    97.82 ± 0.20    96.79 ± 0.40
Wpbc           81.13 ± 0.60    78.04 ± 1.03    79.07 ± 1.09    79.78 ± 1.53    79.09 ± 0.98    82.78 ± 0.83
Wine           99.38 ± 0.39    96.24 ± 0.75    97.58 ± 0.59    96.00 ± 0.71    98.49 ± 0.77    98.88 ± 0.30
Zoo            96.93 ± 0.31    96.73 ± 0.66    96.73 ± 0.66    96.71 ± 0.68    96.55 ± 0.38    96.25 ± 1.48
ACR            87.67           85.19           85.34           86.59           87.53           85.53
Average rank   2.10            4.75            4.32            3.43            2.73            3.67

Bold number denotes the maximum average classification rate of each row
Fig. 2 Comparisons of five classifiers using (a) the Nemenyi test and (b) the Bonferroni–Dunn test
We further use the Nemenyi test and the Bonferroni–Dunn test to reveal the performance differences. Figure 2a shows the results of the Nemenyi test with α = 0.05. The Nemenyi test indicates that, at α = 0.05, the performance of KDF-WKNN is statistically better than those of KNN, DS-WKNN, and RM. Figure 2b shows the results of the Bonferroni–Dunn test with α = 0.05. The Bonferroni–Dunn test indicates that, at α = 0.05, the performance of KDF-WKNN is statistically better than those of KNN, DS-WKNN, RBF-SVM, and RM. From the results of the Nemenyi test and the Bonferroni–Dunn test, we can conclude that KDF-WKNN is statistically superior to KNN, DS-WKNN, and RM, but there is no consistent evidence of a statistically significant performance difference between KDF-WKNN and SVM.

5 Conclusion

In this paper, we proposed a kernel difference-weighted KNN method for pattern classification. Given an unclassified sample x, KDF-WKNN uses the differences between x and its neighborhood to weigh the influence of each neighbor, and then uses the weighted KNN rule to classify x. The proposed difference-weighted KNN method has some significant advantages. First, compared with distance-weighted KNN, DF-WKNN has a distinct geometric explanation as an optimal constrained reconstruction problem. Second, DF-WKNN can easily be extended to its nonlinear version, kernel DF-WKNN. Third, experimental results show that, in terms of classification performance, KDF-WKNN is better than KNN and distance-weighted KNN, and is comparable to or better than several state-of-the-art methods, such as Poly-SVM, RBF-SVM, and the reduced multivariate polynomial model.

Acknowledgments  The work is supported in part by the UGC/CRC fund from the HKSAR Government, the central fund from the Hong Kong Polytechnic University, the National Natural Science Foundation of China (NSFC) under contract No. 60332010, and the 863 Project under contracts No. 2006AA01Z308 and No. 2006AA01Z193. Finally, the authors would like to thank the anonymous reviewers for their constructive suggestions.

References
1. Aha DW, Kibler D, Albert MK (1991) Instance-based learning algorithms. Mach Learn 6:37–66
2. Atiya AF (2005) Estimating the posterior probabilities using the K-nearest neighbor rule. Neural Comput 17:731–740
3. Bailey T, Jain AK (1978) A note on distance-weighted k-nearest neighbor rules. IEEE Trans Syst Man Cybern 8:311–313
4. Beygelzimer A, Kakade S, Langford J (2006) Cover trees for nearest neighbor. In: International conference on machine learning
5. Blake CL, Merz CJ (1998) UCI repository of machine learning databases. Department of Information and Computer Sciences, University of California, Irvine. http://www.ics.uci.edu/mlearn/MLRepository.html
6. Dasarathy BV (1991) Nearest neighbor (NN) norms: NN pattern classification techniques. IEEE Computer Society Press, Los Alamitos
7. Demšar J (2006) Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 7:1–30
8. Domeniconi C, Peng J, Gunopulos D (2002) Locally adaptive metric nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 24:1281–1285
9. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd edn. Wiley, London
10. Dudani SA (1976) The distance-weighted k-nearest-neighbor rule. IEEE Trans Syst Man Cybern 6:325–327
11. Fix E, Hodges JL (1951) Discriminatory analysis: nonparametric discrimination: consistency properties. USAF School of Aviation Medicine, Project 21-49-004, Report No. 4:261–279
12. Fukunaga K, Flick TE (1984) An optimal global nearest neighbor metric. IEEE Trans Pattern Anal Mach Intell 6:314–318
13. Hastie T, Tibshirani R (1996) Discriminant adaptive nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 18:607–616
14. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
15. Herrero JR, Navarro JJ (2007) Exploiting computer resources for fast nearest neighbor classification. Pattern Anal Appl 10:265–275
16. Keller JM, Gray MR, Givens JA Jr (1985) A fuzzy k-nearest neighbor algorithm. IEEE Trans Syst Man Cybern 15:580–585
17. Kuncheva LI, Bezdek JC (1998) Nearest prototype classification: clustering, genetic algorithms, or random search. IEEE Trans Syst Man Cybern C 28:160–164
18. Kuncheva LI, Bezdek JC (1999) Presupervised and postsupervised prototype classifier design. IEEE Trans Neural Netw 10:1142–1152
19. Lam W, Keung CK, Liu D (2002) Discovering useful concept prototypes for classification based on filtering and abstraction. IEEE Trans Pattern Anal Mach Intell 24:1075–1090
20. Macleod JES, Luk A, Titterington DM (1987) A re-examination of the distance-weighted k-nearest neighbor classification rule. IEEE Trans Syst Man Cybern 17:689–696
21. Morin RL, Raeside DE (1981) A reappraisal of distance-weighted k-nearest neighbor classification for pattern recognition with missing data. IEEE Trans Syst Man Cybern 11:241–243
22. Müller KR, Mika S, Rätsch G, Tsuda K, Schölkopf B (2001) An introduction to kernel-based learning algorithms. IEEE Trans Neural Netw 12:181–202
23. Paredes R, Vidal E (2006) Learning prototypes and distances: a prototype reduction technique based on nearest neighbor error minimization. Pattern Recognit 39:180–188
24. Paredes R, Vidal E (2006) Learning weighted metrics to minimize nearest-neighbor classification error. IEEE Trans Pattern Anal Mach Intell 28:1100–1110
25. Peng J, Heisterkamp DR, Dai H (2004) Adaptive quasiconformal kernel nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 26:656–661
26. Ricci F, Avesani P (1999) Data compression and local metrics for nearest neighbor classification. IEEE Trans Pattern Anal Mach Intell 21:380–384
27. Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Science 290:2323–2326
28. Saul LK, Roweis ST (2003) Think globally, fit locally: unsupervised learning of low dimensional manifolds. J Mach Learn Res 4:119–155
29. Shakhnarovich G, Darrell T, Indyk P (2006) Nearest-neighbor methods in learning and vision: theory and practice. MIT Press, Cambridge
30. Sheskin DJ (2004) Handbook of parametric and nonparametric statistical procedures, 3rd edn. Chapman & Hall/CRC, Boca Raton
31. Short RD, Fukunaga K (1984) The optimal distance measure for nearest neighbor classification. IEEE Trans Inf Theory 27:622–627
32. Toh KA, Tran QL, Srinivasan D (2004) Benchmarking a reduced multivariate polynomial pattern classifier. IEEE Trans Pattern Anal Mach Intell 26(6):740–755
33. Tran QL, Toh KA, Srinivasan D, Wong KL, Low SQ (2005) An empirical comparison of nine pattern classifiers. IEEE Trans Syst Man Cybern B 35:1079–1091
34. Vapnik VN (1998) Statistical learning theory. Wiley, New York
35. Zar JH (1999) Biostatistical analysis, 4th edn. Prentice Hall, Upper Saddle River
36. Zavrel J (1997) An empirical re-examination of weighted voting for K-NN. In: Daelemans W, Flach P, van den Bosch A (eds) Proceedings of the 7th Belgian-Dutch Conference on Machine Learning, Tilburg, pp 139–148
Author Biographies

Wangmeng Zuo is currently a lecturer in the School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China. He received the PhD degree in Computer Science and Technology from the Harbin Institute of Technology in 2007. From July to December 2004, from November 2005 to August 2006, and from July 2007 to January 2008, he worked as a research assistant in the Department of Computing, Hong Kong Polytechnic University. He is a reviewer for several international journals, including IEEE Transactions on Information Forensics and Security, Neurocomputing, International Journal of Adaptive Control and Signal Processing, and International Journal of Image and Graphics. He is the author of 15 scientific papers in pattern recognition and computer vision. His current research interests include pattern recognition, computer vision, computational intelligence, and their applications in biometrics, medical diagnosis, and bioinformatics.
David Zhang graduated in Computer Science from Peking University. He received his MSc in Computer Science in 1982 and his PhD in 1985 from the Harbin Institute of Technology (HIT). From 1986 to 1988, he was a Postdoctoral Fellow at Tsinghua University and then an Associate Professor at the Academia Sinica, Beijing. In 1994, he received his second PhD in Electrical and Computer Engineering from the University of Waterloo. Currently, he is a Chair Professor at the Hong Kong Polytechnic University, where he is the Founding Director of the Biometrics Technology Centre (UGC/CRC) supported by the Hong Kong SAR Government. He also serves as Adjunct Professor at Tsinghua University, Shanghai Jiao Tong University, Beihang University, Harbin Institute of Technology, and the University of Waterloo. He is the Founder and Editor-in-Chief of the International Journal of Image and Graphics (IJIG); Book Editor of the Springer International Series on Biometrics (KISB); Organizer of the International Conference on Biometrics Authentication (ICBA); Associate Editor of more than ten international journals including IEEE Trans on SMC-A/SMC-C/Pattern Recognition; Technical Committee Chair of IEEE CIS; and the author of more than 10 books and 160 journal papers. Professor Zhang is a Croucher Senior Research Fellow, Distinguished Speaker of the IEEE Computer Society, and a Fellow of the International Association of Pattern Recognition (IAPR).

Kuanquan Wang received his BE and ME degrees in Computer Science from Harbin Institute of Technology (HIT), Harbin, China, and his PhD degree in Computer Science from Chongqing University, Chongqing, China, in 1985, 1988, and 2001, respectively. From 1988 to 1998, he worked in the Department of Computer Science, Southwest Normal University, Chongqing, China, as a Tutor, Lecturer and Associate Professor. Since 1998, he has been working in the Biocomputing Research Centre, HIT, as a Professor and an Associate Director. From 2000 to 2001 he was a visiting scholar at the Hong Kong Polytechnic University supported by Hong Kong Croucher Funding, and from 2003 to 2004 he was a research fellow at the same university. So far, he has published over 70 papers. His research interests include biometrics, image processing and pattern recognition. He is a senior member of the IEEE and an editorial board member of the International Journal of Image and Graphics. He is also a reviewer for IEEE Trans. SMC and Pattern Recognition.