Generalized kernel framework for unsupervised spectral methods of dimensionality reduction

Diego H. Peluffo-Ordóñez
Universidad Cooperativa de Colombia – Pasto
Torobajo, Calle 18 No. 47 - 150, Pasto, Colombia
Email: [email protected]

John Aldo Lee
Université Catholique de Louvain
Molecular Imaging, Radiotherapy and Oncology – IREC
Avenue Hippocrate 55, B-1200 Bruxelles, Belgium
Email: [email protected]

Michel Verleysen
Université Catholique de Louvain
Machine Learning Group – ICTEAM
Place du Levant 3, B-1348 Louvain-la-Neuve, Belgium
Email: [email protected]
Abstract—This work introduces a generalized kernel perspective for spectral dimensionality reduction approaches. Firstly, an elegant matrix view of kernel principal component analysis (PCA) is described. We show the relationship between kernel PCA and conventional PCA using a parametric distance. Secondly, we introduce a weighted kernel PCA framework derived from least-squares support vector machines (LS-SVM). This approach starts with a latent variable model that allows writing a relaxed LS-SVM problem, which is then addressed by a primal-dual formulation. As a result, we provide kernel alternatives to spectral methods for dimensionality reduction such as multidimensional scaling, locally linear embedding, and Laplacian eigenmaps, as well as a versatile framework to explain weighted PCA approaches. Experimentally, we show that the incorporation of an SVM model improves the performance of kernel PCA.
I. INTRODUCTION

Dimensionality reduction (DR) aims at extracting lower-dimensional, relevant information from high-dimensional data, in order to either improve the performance of a pattern recognition system or allow for intelligible data visualization. Among the classical DR approaches we find the well-known principal component analysis (PCA) and classical multidimensional scaling (CMDS), which are respectively based on variance and distance preservation criteria [1]. More recently, DR methods have focused on criteria aimed at preserving the data topology. Such a topology is often represented by a pairwise similarity matrix. From a graph-theory point of view, data may then be represented by a non-directed, weighted graph, in which the nodes represent the data points and the similarity matrix is the weight (or affinity) matrix holding the pairwise edge weights. The pioneering methods incorporating similarities are Laplacian eigenmaps [2] and locally linear embedding [3], which are spectral approaches. Also, since normalized similarities can be seen as probabilities, methods based on divergences have emerged, such as stochastic neighbour embedding (SNE) [4] and its variants and improvements [5]. This work focuses on spectral approaches, which can be naturally represented by kernels. Spectral techniques have been successfully applied to several dimensionality reduction tasks, namely relevance analysis [6], [7] and feature extraction [8], [9], among others.

In mathematical terms, the goal of dimensionality reduction is to embed a high-dimensional data matrix $Y \in \mathbb{R}^{D \times N}$ into a low-dimensional, latent data matrix $X \in \mathbb{R}^{d \times N}$, with $d < D$. The observed and latent data matrices are formed by $N$ observations, denoted respectively by $y_i \in \mathbb{R}^D$ and $x_i \in \mathbb{R}^d$, with $i \in \{1, \ldots, N\}$. From another point of view, the observed data matrix consists of $D$ variables $y^{(l)} \in \mathbb{R}^N$ with $l \in \{1, \ldots, D\}$, while the latent data matrix consists of $d$ variables $x^{(\ell)} \in \mathbb{R}^N$ with $\ell \in \{1, \ldots, d\}$.

Herein, we introduce an elegant generalized kernel perspective for spectral dimensionality reduction approaches. We start by outlining a matrix view of kernel principal component analysis (PCA). Then, we introduce a weighted kernel PCA framework derived from least-squares support vector machines (LS-SVM). This approach involves a latent variable model that allows reaching a relaxed LS-SVM problem. In this work, such a problem is addressed by a primal-dual formulation, similarly as done in [10]. The developed kernel approaches for dimensionality reduction are tested on well-known data sets (an artificial spherical shell, the COIL-20 image bank [11], and the MNIST image bank [12]). To test the representation ability of our kernel approaches, we consider the standard implementations of classical multidimensional scaling (CMDS) [1], locally linear embedding (LLE) [3], and graph Laplacian eigenmaps (LE) [2], as well as their kernel approximations [13]. The DR performance is quantified by a scaled version of the average agreement rate between $K$-ary neighborhoods, as described in [5]. As a result, we provide kernel alternatives to spectral methods for dimensionality reduction such as MDS, LLE, and LE, as well as a versatile framework to explain weighted PCA approaches. Experimentally, we show that the incorporation of an SVM model improves the performance of kernel PCA. Mathematically, we also show the relationship between kernel PCA and conventional PCA using a parametric distance [7].

The outline of this paper is as follows: the matrix representation of kernel PCA is presented in Section II. Section III explains the proposed weighted kernel PCA. Experimental results are shown in Section IV. Finally, Section V draws the conclusions and final remarks.
II. KERNEL PCA

In particular, PCA relies mainly on linear projections attempting to preserve the variance. When the data matrix is centered (that is, has zero mean along its rows), such variance preservation can be seen as the preservation of a Euclidean inner product (dot product). Suppose that there exists an unknown high-dimensional representation space $\Phi \in \mathbb{R}^{D_h \times N}$, with $D_h \gg D$, in which calculating the inner product should improve the representation and visualization of the resulting embedded data in contrast to that obtained directly from the observed data. Hence the need for a kernel representation arises, so that the dot product can be calculated in the unknown high-dimensional space. Let $\phi(\cdot)$ be a function that maps data from the original dimension to a higher one, such that
$$\phi(\cdot): \mathbb{R}^D \to \mathbb{R}^{D_h}, \quad y_i \mapsto \phi(y_i).$$

Therefore, the $i$-th column vector of matrix $\Phi$ is given by $\Phi_i = \phi(y_i)$. By Mercer's condition (the kernel trick), a kernel function $k(\cdot,\cdot)$ allows for estimating the dot product $\phi(y_i)^\top \phi(y_j) = k(y_i, y_j)$. Arranging all the possible dot products in a matrix $K = [k_{ij}]$, we get a kernel matrix

$$K = \Phi^\top \Phi, \tag{1}$$

where $k_{ij} = k(y_i, y_j)$. The formulation of kernel PCA is done under the assumption that $\Phi$ is centered. This condition can be ensured by algebraically modifying the calculation of the dot product, as explained further below. To project the data, a linear combination with a $d$-dimensional basis is used. Such a basis can be arranged in an orthonormal rotation matrix $W \in \mathbb{R}^{D_h \times d}$, such that $W = [w^{(1)}, \ldots, w^{(d)}]$ and $W^\top W = I_d$, where $w^{(\ell)} \in \mathbb{R}^{D_h}$ and $I_d$ is the $d$-dimensional identity matrix. Then, the projected data matrix $X \in \mathbb{R}^{d \times N}$ can be calculated as

$$X = W^\top \Phi. \tag{2}$$

Data embedding: Generally, the projection is performed onto a lower-dimensional space, which means that data are projected with a low-rank representation of the rotation matrix ($d < D$). Nonetheless, data can be fully projected by using a whole basis, setting $d = D$. Furthermore, from equation (2), a lower-rank data matrix $\widehat{\Phi} \in \mathbb{R}^{D_h \times N}$ can be obtained when $d < D$ by

$$\widehat{\Phi} = W X. \tag{3}$$

Then, we can also write $\widehat{\Phi} = W W^\top \Phi$. The variance criterion can be expressed as $\mathbb{E}_{\Phi}\{\|\Phi_i - W W^\top \Phi_i\|_2^2\}$, where $\|\cdot\|_2$ and $\mathbb{E}_{\Phi}\{\cdot\}$ denote the Euclidean norm and the expected value operator with respect to $\Phi$, respectively. Considering $\mathbb{E}_{\Phi}$ as the simple average, the mean-square-error-based objective function can be written as

$$\frac{1}{N}\sum_{i=1}^{N} \|\Phi_i - W W^\top \Phi_i\|_2^2 = \frac{1}{N}\|\Phi - \widehat{\Phi}\|_F^2, \tag{4}$$

where $\|\cdot\|_F$ stands for the Frobenius norm. In the following, we explain how to solve the optimization problem and calculate the embedded space.

Theorem II.1 (Optimal low-rank representation). A feasible optimal solution of the problem

$$\min_{W} \|\Phi - \widehat{\Phi}\|_F^2 \quad \text{s.t.} \quad W^\top W = I_d, \; d < D, \; X = W^\top \Phi, \tag{5}$$

is selecting $W$ and $X$ as the eigenvectors associated with the $d$ largest eigenvalues of $\Phi\Phi^\top$ and of the kernel matrix $K = \Phi^\top\Phi$, respectively.

Proof: The objective function can be expanded as

$$\|\Phi - \widehat{\Phi}\|_F^2 = \operatorname{tr}(\Phi^\top\Phi) - 2\operatorname{tr}(\widehat{\Phi}^\top\Phi) + \operatorname{tr}(\widehat{\Phi}^\top\widehat{\Phi}).$$

Since the term $\operatorname{tr}(\Phi^\top\Phi) = \|\Phi\|_F^2$ is constant and $\operatorname{tr}(\widehat{\Phi}^\top\Phi) = \operatorname{tr}(\Phi^\top\widehat{\Phi})$, the following duality takes place:

$$\|\Phi\|_F^2 = \operatorname{tr}(\widehat{\Phi}^\top\Phi) + \|\Phi - \widehat{\Phi}\|_F^2,$$

so that minimizing $\|\Phi - \widehat{\Phi}\|_F^2$ amounts to maximizing its complement $\operatorname{tr}(\widehat{\Phi}^\top\Phi)$. In addition, recalling equation (3), we have that

$$\operatorname{tr}(\widehat{\Phi}^\top\Phi) = \operatorname{tr}(\Phi^\top W W^\top \Phi) = \operatorname{tr}(W^\top \Phi\Phi^\top W),$$

and, thus, the new optimization problem is

$$\max_{W} \operatorname{tr}(W^\top \Phi\Phi^\top W) \quad \text{s.t.} \quad W^\top W = I_d. \tag{6}$$

To solve the previous problem, we can write a Lagrangian of the form $\mathcal{L}(W\,|\,\Phi) = \operatorname{tr}(W^\top \Phi\Phi^\top W) - \operatorname{tr}\big(\Lambda(W^\top W - I_d)\big)$, where $\Lambda = \operatorname{Diag}(\lambda_1, \ldots, \lambda_{D_h})$ holds the Lagrange multipliers. By solving the first-order condition on the Lagrangian, we get the following dual problem:

$$\Phi\Phi^\top W = W\Lambda \;\Rightarrow\; W^\top \Phi\Phi^\top W = \Lambda. \tag{7}$$

Therefore, a feasible solution is obtained when $\Lambda$ and $W$ are the eigenvalue and eigenvector matrices, respectively. Furthermore, since this is a maximization problem, the eigenvectors associated with the $d$ largest eigenvalues must be selected. Similarly, pre-multiplying equation (7) by $\Phi^\top$, we get

$$\Phi^\top\Phi\Phi^\top W = \Phi^\top W\Lambda \;\Rightarrow\; K X^\top = X^\top \Lambda, \tag{8}$$

and therefore the embedded space $X$ can be calculated from the eigenvectors of the kernel matrix $K$.
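As an illustration of equation (8), the sketch below (plain NumPy, not the authors' implementation) computes the $d$-dimensional embedding as the leading eigenvectors of a given Mercer kernel matrix; the kernel is assumed to be already centered (see the centering step discussed later in this section). The function name `kernel_pca_embedding` and the toy linear-kernel example are illustrative only.

```python
import numpy as np

def kernel_pca_embedding(K, d=2):
    """Embed data via kernel PCA: the rows of X are the top-d eigenvectors of K (eq. (8))."""
    # K is assumed to be a symmetric, centered N x N kernel (Mercer) matrix.
    eigvals, eigvecs = np.linalg.eigh(K)      # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:d]       # indices of the d largest eigenvalues
    X = eigvecs[:, idx].T                     # X in R^{d x N}; X^T holds the selected eigenvectors
    return X, eigvals[idx]

# Toy usage with a linear kernel on random, centered data (hypothetical example).
Y = np.random.randn(5, 100)                   # D = 5 variables, N = 100 observations
Y = Y - Y.mean(axis=1, keepdims=True)
K = Y.T @ Y                                   # linear kernel, K = Phi^T Phi with Phi = Y
X, lam = kernel_pca_embedding(K, d=2)
print(X.shape, lam)
```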
Links to generalized WPCA [7]: Similarly to the matrix $\widehat{\Phi}$, define now a low-rank data matrix with respect to the original observed space, $\widehat{Y} \in \mathbb{R}^{D \times N}$, such that $\widehat{Y} = V X$, where $V \in \mathbb{R}^{D \times d}$ is the rotation matrix. Then, the latent data matrix is of the form $X = V^\top Y$. As well, define $K$ as any positive semi-definite matrix. By using the M-inner product (a Mahalanobis-like distance) as an error measure, we can pose the following problem:

$$\min_{V} \|Y - \widehat{Y}\|_K^2 \quad \text{s.t.} \quad V^\top V = I_d, \; d < D, \; X = V^\top Y, \tag{9}$$

where $\|Y - \widehat{Y}\|_K^2 = \operatorname{tr}\big((Y - \widehat{Y})^\top K (Y - \widehat{Y})\big)$. The previous problem has a dual of the form

$$\max_{V} \operatorname{tr}(V^\top Y K Y^\top V) \quad \text{s.t.} \quad V^\top V = I_d, \tag{10}$$

whose solution can again be calculated using eigenvectors. As in the proof of Theorem II.1, a solution to the previous problem is given by selecting $V$ as the $d$ eigenvectors corresponding to the largest eigenvalues of $Y K Y^\top$. This approach is widely described in [7]. The link between this approach and kernel PCA becomes clear when taking advantage of the spectral properties of the quadratic forms. Since the embedded data matrix is reconstructed by $X = V^\top Y$, we can calculate it directly as the eigenvectors of the kernel matrix $K$. Within this generalized framework, many weighted PCA (WPCA) approaches can be easily understood. By setting $K = \operatorname{Diag}(\delta)$ with $\delta \succ 0$, any WPCA approach with a sample-wise weighted covariance matrix can be formulated [14]. In particular, conventional PCA arises when $K = I_N$.
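The generalized WPCA of equation (10) can be computed directly from the $d$ leading eigenvectors of $Y K Y^\top$. A minimal sketch follows, assuming centered data; the helper name `generalized_wpca` is ours, and setting $K = \operatorname{Diag}(\delta)$ or $K = I_N$ reproduces the sample-weighted and conventional PCA cases mentioned above.

```python
import numpy as np

def generalized_wpca(Y, K, d=2):
    """Generalized weighted PCA (eq. (10)): V = top-d eigenvectors of Y K Y^T, X = V^T Y."""
    C = Y @ K @ Y.T                           # D x D weighted second-moment matrix
    eigvals, eigvecs = np.linalg.eigh(C)
    idx = np.argsort(eigvals)[::-1][:d]
    V = eigvecs[:, idx]                       # D x d rotation matrix
    return V.T @ Y                            # latent data X in R^{d x N}

# Sample-weighted PCA: K = Diag(delta); conventional PCA: K = I_N.
Y = np.random.randn(10, 200)
Y = Y - Y.mean(axis=1, keepdims=True)
delta = np.random.rand(200) + 0.1             # positive sample weights
X_wpca = generalized_wpca(Y, np.diag(delta), d=2)
X_pca = generalized_wpca(Y, np.eye(200), d=2)
```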
Furthermore, notice that the dual maximization cost function of kernel PCA stated in equation (6) can be expressed as an energy term (a covariance as well, since data are centered beforehand) regarding the embedded data: $(1/N)\operatorname{tr}(W^\top \Phi \Phi^\top W) = (1/N)\operatorname{tr}(X X^\top)$. Similarly, the optimization problem in (10) can be understood, in terms of a projected data matrix $Z$, as a covariance maximization. Let $\Omega \in \mathbb{R}^{N \times N}$ be a lower triangular matrix such that $K = \Omega^\top \Omega$; such a matrix can be calculated by the incomplete Cholesky decomposition [15]. Define $Z = V^\top Y \Omega^\top$ and its covariance as

$$Z Z^\top = V^\top Y K Y^\top V.$$

Centering: Since kernel PCA is derived under the assumption that the matrix $\Phi$ has zero mean, centering becomes necessary. To satisfy this condition, we can normalize the kernel function by

$$k(y_i, y_j) \leftarrow k(y_i, y_j) - \mathbb{E}_{y_i}\{k(y_i, y_j)\} - \mathbb{E}_{y_j}\{k(y_i, y_j)\} + \mathbb{E}_{y_i}\{\mathbb{E}_{y_j}\{k(y_i, y_j)\}\}.$$

This can be done directly over the kernel matrix with

$$K \leftarrow K - \frac{1}{N} K 1_N 1_N^\top - \frac{1}{N} 1_N 1_N^\top K + \frac{1}{N^2} 1_N 1_N^\top K 1_N 1_N^\top = \Big(I_N - \frac{1}{N} 1_N 1_N^\top\Big) K \Big(I_N - \frac{1}{N} 1_N 1_N^\top\Big), \tag{11}$$

where $1_N$ is the $N$-dimensional all-ones vector.
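Equation (11) is the usual double-centering operation; a minimal sketch, assuming only NumPy, is given below.

```python
import numpy as np

def center_kernel(K):
    """Double-center a kernel matrix as in eq. (11): K <- (I - 1/N 11^T) K (I - 1/N 11^T)."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N       # centering matrix I_N - (1/N) 1_N 1_N^T
    return H @ K @ H

# After centering, the kernel matrix has (approximately) zero row and column means.
K = np.random.rand(50, 50)
K = (K + K.T) / 2                             # make the toy kernel symmetric
Kc = center_kernel(K)
print(np.abs(Kc.mean(axis=0)).max())          # ~0 up to numerical precision
```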
III. WEIGHTED KERNEL PCA

For the following statements, let us assume a whole latent variable model of the form $x^{(\ell)} = {w^{(\ell)}}^\top \Phi + b_\ell 1_N^\top$, which in matrix terms can be expressed as

$$X = W^\top \Phi + b \otimes 1_N^\top, \tag{12}$$

where $b_\ell$ is a bias term, $b = [b_1, \ldots, b_d]$, and $\otimes$ denotes the Kronecker product. As explained in the previous section, kernel PCA can be written as a maximization problem regarding an energy term of $\Phi$.

SVM model: Here, we introduce a weighted version by incorporating a weighting matrix $\Delta = \operatorname{Diag}(\delta_1, \ldots, \delta_N)$, so that the problem formulation becomes

$$\max_{X, W, b} \; \frac{1}{N}\operatorname{tr}(X \Delta X^\top) \quad \text{s.t.} \quad W^\top W = I_d, \;\; X = W^\top \Phi + b \otimes 1_N^\top. \tag{13}$$

The previous problem can be relaxed as follows:

$$\max_{X, W, b} \; \frac{1}{2N}\operatorname{tr}(X \Delta X^\top \Gamma) - \frac{1}{2}\operatorname{tr}(W^\top W) \quad \text{s.t.} \quad X = W^\top \Phi + b \otimes 1_N^\top, \tag{14}$$

where $\Gamma = \operatorname{Diag}(\gamma_1, \ldots, \gamma_d)$ is a diagonal matrix holding the regularization parameters. Notice that this primal formulation can be seen as a least-squares SVM.

Dual problem: To solve the weighted kernel PCA problem, we form the Lagrangian of the problem stated in equation (14), as follows:

$$\mathcal{L}(X, W, b, A) = \frac{1}{2N}\operatorname{tr}(X \Delta X^\top \Gamma) - \frac{1}{2}\operatorname{tr}(W^\top W) - \operatorname{tr}\big(A^\top (X - W^\top \Phi - b \otimes 1_N^\top)\big), \tag{15}$$

where the matrix $A \in \mathbb{R}^{N \times n_e}$ holds the Lagrange multiplier vectors, that is, $A = [\alpha^{(1)}, \ldots, \alpha^{(n_e)}]$, with $\alpha^{(l)} \in \mathbb{R}^N$ the $l$-th vector of Lagrange multipliers. Solving the Karush-Kuhn-Tucker (KKT) conditions on (15), we get:

$$\frac{\partial \mathcal{L}}{\partial X} = 0 \Rightarrow X = N \Delta^{-1} A \Gamma^{-1}, \qquad
\frac{\partial \mathcal{L}}{\partial W} = 0 \Rightarrow W = \Phi A, \qquad
\frac{\partial \mathcal{L}}{\partial A} = 0 \Rightarrow X = W^\top \Phi + b \otimes 1_N^\top, \qquad
\frac{\partial \mathcal{L}}{\partial b} = 0 \Rightarrow b^\top 1_N = 0.$$

Therefore, by applying the Lagrange multipliers and eliminating the primal variables from the initial problem (13), the following eigenvector-based dual solution is obtained:

$$A \Lambda = A \Delta \big(I_N + (1_N \otimes b^\top)(K\Lambda)^{-1}\big) K, \tag{16}$$

where $\Lambda = \operatorname{Diag}(\lambda) \in \mathbb{R}^{N \times N}$ and $\lambda \in \mathbb{R}^N$ is the vector of eigenvalues, with $\lambda_l = N/\gamma_l$, $\lambda_l \in \mathbb{R}_+$. Again, $K \in \mathbb{R}^{N \times N}$ is a given kernel matrix satisfying Mercer's theorem, such that $\Phi^\top \Phi = K$. In order to pose a quadratic dual formulation satisfying the condition $b^\top 1_N = 0$ by centering the vector $b$ (i.e., giving it zero mean), the bias term takes the form $b_l = -\frac{1}{1_N^\top \Delta 1_N} 1_N^\top \Delta K \alpha^{(l)}$. Therefore, the solution of problem (14) is reduced to the following eigenvector-related problem:

$$A \Lambda = \Delta H K A, \tag{17}$$

where the matrix $H \in \mathbb{R}^{N \times N}$ is the centering matrix defined as

$$H = I_N - \frac{1}{1_N^\top \Delta 1_N} 1_N 1_N^\top \Delta.$$

Imposing a linear independence constraint on the Lagrange multiplier vectors, $A$ can be chosen as an orthonormal matrix. In consequence, a feasible solution is to estimate $A$ and $\Lambda$ from the spectral decomposition of $\Delta H K$, as its eigenvector and (diagonal) eigenvalue matrices, respectively. Finally, the embedded data can be calculated as

$$X = A^\top \Omega + b \otimes 1_N^\top. \tag{18}$$

The objective function of the proposed weighted kernel PCA can be seen as a squared weighted Frobenius norm (an M-inner product, or a weighted covariance, as well), since

$$\frac{1}{N}\operatorname{tr}(X \Delta X^\top) = \|X\|^2_{(1/N)\Delta}.$$

Indeed, if we define a weighted embedded data matrix $\widetilde{X} \in \mathbb{R}^{d \times N}$ as $\widetilde{X} = X \operatorname{Diag}(\delta_1^{1/2}, \ldots, \delta_N^{1/2})$, the cost function can be seen as an energy term as well.
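A possible numerical sketch of the weighted kernel PCA solution is given below: $A$ and $\Lambda$ are estimated from the eigendecomposition of $\Delta H K$ (equation (17)) and the bias terms follow the expression for $b_l$ above. For simplicity, the embedding is formed here as $X = A^\top K + b \otimes 1_N^\top$, which follows from the KKT conditions $W = \Phi A$ and $X = W^\top \Phi + b \otimes 1_N^\top$, rather than through the Cholesky factor $\Omega$ of equation (18). This is an illustrative sketch, not the authors' implementation; the function and variable names are ours.

```python
import numpy as np

def weighted_kernel_pca(K, delta, d=2):
    """Sketch of weighted kernel PCA: eigendecomposition of Delta*H*K (eq. (17)),
    bias terms b_l, and embedding X = A^T K + b 1^T (from the KKT conditions)."""
    N = K.shape[0]
    Delta = np.diag(delta)
    ones = np.ones((N, 1))
    s = (ones.T @ Delta @ ones).item()
    H = np.eye(N) - (ones @ ones.T @ Delta) / s   # weighted centering matrix
    M = Delta @ H @ K                             # generally non-symmetric
    eigvals, eigvecs = np.linalg.eig(M)
    idx = np.argsort(-eigvals.real)[:d]
    A = eigvecs[:, idx].real                      # N x d matrix of multiplier vectors alpha^(l)
    # bias terms: b_l = -(1/(1^T Delta 1)) 1^T Delta K alpha^(l)
    b = -(ones.T @ Delta @ K @ A).ravel() / s
    X = A.T @ K + np.outer(b, np.ones(N))         # d x N embedding
    return X, A, eigvals.real[idx], b

# Toy example using a similarity matrix as the kernel and the weighting Delta = D^{-1}
# mentioned later in the paper (as in Laplacian eigenmaps).
S = np.random.rand(100, 100); S = (S + S.T) / 2
deg = S.sum(axis=1)
X, A, lam, b = weighted_kernel_pca(S, 1.0 / deg, d=2)
```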
IV. EXPERIMENTAL RESULTS
The studied kernel approaches for dimensionality reduction are tested on well-known databases, and the quality of the resulting embedded data is quantified by an unsupervised measure.

Databases: Experiments are carried out on three conventional data sets. The first data set is an artificial spherical shell ($N = 1500$ data points, $D = 3$). The second data set is the COIL-20 image bank [11], which contains gray-level images of 20 different objects, each seen from 72 poses/angles ($N = 1440$ data points, $D = 128^2$). The third data set is a randomly selected subset of the MNIST image bank [12] of gray-level images of handwritten digits ($N = 1500$ data points, 150 instances of each of the 10 digits, $D = 24^2$). Figure 1 depicts examples of the considered data sets.
[Figure 1. The three considered data sets: (a) COIL-20, (b) spherical shell, (c) MNIST. To carry out the DR procedure, the images from the COIL-20 and MNIST data sets are vectorized.]

Compared methods, kernels and parameters: We consider three spectral approaches, namely classical multidimensional scaling (CMDS) [1], locally linear embedding (LLE) [3], and graph Laplacian eigenmaps (LE) [2]. They are all run in their standard implementations. In addition, in order to evaluate our framework, their kernel approximations are also considered. The CMDS kernel is the double-centered matrix of squared distances $D \in \mathbb{R}^{N \times N}$:

$$K_{\mathrm{CMDS}} = -\frac{1}{2}\Big(I_N - \frac{1}{N} 1_N 1_N^\top\Big) D \Big(I_N - \frac{1}{N} 1_N 1_N^\top\Big), \tag{19}$$

where the $ij$ entry of $D$ is given by $d_{ij} = \|y_i - y_j\|_2^2$. A kernel for LLE can be approximated from a quadratic form in terms of the matrix $W$ holding the linear coefficients that sum to one and optimally reconstruct the observed data. Defining a matrix $M \in \mathbb{R}^{N \times N}$ as $M = (I_N - W)(I_N - W^\top)$ and $\lambda_{\max}$ as the largest eigenvalue of $M$, the kernel matrix for LLE is of the form

$$K_{\mathrm{LLE}} = \lambda_{\max} I_N - M. \tag{20}$$

Since kernel PCA is a maximization of the high-dimensional covariance represented by a kernel, LE can be represented by the pseudo-inverse of the graph Laplacian $L$:

$$K_{\mathrm{LE}} = L^{\dagger}, \tag{21}$$
where $L = D - S$, $S$ is a similarity matrix, and $D = \operatorname{Diag}(S 1_N)$ is the degree matrix. All the aforementioned kernels are described in detail in [13]. The similarity matrices are formed in such a way that the relative bandwidth parameter is estimated by keeping the entropy over the neighbor distribution at roughly $\log K$, where $K$ is the given number of neighbors, as explained in [16]. For all methods, the input data are embedded into a 2-dimensional space, i.e., $d = 2$. The number of neighbors is set to $K = 30$ for all considered data sets. Another parameter of interest is the weighting matrix $\Delta$ of WKPCA. In this paper, we chose $\Delta = D^{-1}$, which performs the normalization of the Nyström method [17], as done in Laplacian eigenmaps.
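For reference, the three kernel matrices (19)-(21) can be assembled as follows. This is an illustrative NumPy sketch, not the authors' code; the LLE reconstruction-weight matrix $W$ is assumed to be computed elsewhere and passed in.

```python
import numpy as np

def kernel_cmds(Y):
    """CMDS kernel (eq. (19)): double-centered matrix of squared Euclidean distances."""
    N = Y.shape[1]
    sq = (Y * Y).sum(axis=0)
    Dsq = sq[:, None] + sq[None, :] - 2.0 * (Y.T @ Y)   # d_ij = ||y_i - y_j||^2
    H = np.eye(N) - np.ones((N, N)) / N
    return -0.5 * H @ Dsq @ H

def kernel_lle(W):
    """LLE kernel (eq. (20)) from a given N x N matrix W of reconstruction weights."""
    N = W.shape[0]
    M = (np.eye(N) - W) @ (np.eye(N) - W.T)
    lam_max = np.linalg.eigvalsh(M).max()
    return lam_max * np.eye(N) - M

def kernel_le(S):
    """LE kernel (eq. (21)): pseudo-inverse of the graph Laplacian L = D - S."""
    L = np.diag(S.sum(axis=1)) - S
    return np.linalg.pinv(L)
```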
Performance measure: To quantify the performance of the studied methods, we use the scaled version of the average agreement rate between $K$-ary neighborhoods in the high- and low-dimensional spaces, $R_{NX}(K)$, introduced in [5]. This measure is widely accepted as a suitable unsupervised criterion [18]–[21]. Let $\nu_i^K$ and $n_i^K$ denote the $K$-ary neighborhoods of the vectors $y_i$ and $x_i$, respectively. The average agreement rate can be written as

$$Q_{NX}(K) = \frac{1}{KN}\sum_{i=1}^{N} |\nu_i^K \cap n_i^K|,$$

which varies between 0 (empty intersection) and 1 (perfect agreement). Knowing that random coordinates in $X$ lead on average to $Q_{NX}(K) \approx K/(N-1)$ [18], the useful range of $Q_{NX}(K)$ depends on $K$. Therefore, in order to fairly compare or combine values of $Q_{NX}(K)$ for different neighborhood sizes, the criterion can be rescaled to

$$R_{NX}(K) = \frac{(N-1)\,Q_{NX}(K) - K}{N - 1 - K}, \quad \text{for } 1 \le K \le N-2.$$

This modified criterion indicates the improvement over a random embedding and has the same useful range, between 0 and 1, for all $K$. In the experiments, the whole curve $R_{NX}(K)$ is shown, with a logarithmic scale for $K$. This choice is justified by the fact that the size $K$ and the radius $R$ of small neighborhoods in a $P$-dimensional space are (locally) related by $K \propto R^P$. A logarithmic axis also reflects that errors in large neighborhoods are proportionally less important than in small ones. Eventually, a scalar score is obtained by computing the area under the $R_{NX}(K)$ curve in the log plot, given by

$$\mathrm{AUC}_{\ln K}\big(R_{NX}(K)\big) = \frac{\sum_{K=1}^{N-2} R_{NX}(K)/K}{\sum_{K=1}^{N-2} 1/K}. \tag{22}$$
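A possible way to compute $Q_{NX}(K)$, $R_{NX}(K)$ and the AUC of equation (22) is sketched below. It uses the fact that a point $j$ belongs to both $K$-ary neighborhoods of $i$ exactly when the larger of its two ranks is at most $K$. Function and variable names are ours, not from the paper.

```python
import numpy as np

def rnx_auc(Y_hd, X_ld):
    """Compute Q_NX(K), R_NX(K) for K = 1..N-2 and the AUC of eq. (22).
    Y_hd: D x N high-dimensional data; X_ld: d x N embedded data."""
    N = Y_hd.shape[1]

    def neighbor_ranks(Z):
        # rank[i, j] = position of j among the neighbors of i (1 = nearest), i itself excluded
        sq = (Z * Z).sum(axis=0)
        Dm = sq[:, None] + sq[None, :] - 2.0 * (Z.T @ Z)
        np.fill_diagonal(Dm, np.inf)
        order = np.argsort(Dm, axis=1)
        rank = np.empty((N, N), dtype=int)
        rank[np.arange(N)[:, None], order] = np.arange(1, N + 1)[None, :]
        return rank

    r_hd, r_ld = neighbor_ranks(Y_hd), neighbor_ranks(X_ld)
    # j lies in both K-ary neighborhoods of i iff max(r_hd[i,j], r_ld[i,j]) <= K
    m = np.maximum(r_hd, r_ld)
    hist = np.bincount(m.ravel(), minlength=N + 1)
    cum = np.cumsum(hist)                     # cum[K] = #{(i,j): max rank <= K}
    Ks = np.arange(1, N - 1)                  # K = 1 .. N-2
    Q = cum[Ks] / (Ks * N)
    R = ((N - 1) * Q - Ks) / (N - 1 - Ks)
    auc = np.sum(R / Ks) / np.sum(1.0 / Ks)
    return Ks, Q, R, auc
```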
The AUC assesses the dimensionality reduction quality at all scales, with the most appropriate weights.

The proposed kernel PCA and weighted kernel PCA are compared with the standard spectral methods by using the previously defined kernels. Kernel PCA using $K_{\mathrm{CMDS}}$ is termed K-CMDS, and likewise K-LLE and K-LE when using $K_{\mathrm{LLE}}$ and $K_{\mathrm{LE}}$, respectively. Accordingly, the methods are named WK-CMDS, WK-LLE and WK-LE when using weighted kernel PCA. Figures 2, 3, and 4 depict the results for the three employed data sets in terms of the quality measure. In addition, to graphically compare the considered methods, scatter plots of the embedded data are shown.

As can be appreciated, kernel PCA reaches roughly the same embedding space and $R_{NX}(K)$ curve as the standard implementations of CMDS, LLE and LE for all data sets. The scatter plots appear rotated; this effect is caused by the configuration of the eigenvectors, which are ordered according to the largest eigenvalues. The plots were drawn so that the principal component lies along the x-axis; apart from this, the scatter plots are approximately the same. The same holds for weighted kernel PCA. For LLE, slight differences can be appreciated, which can be attributed to the eigenvalue calculation involved in constructing the kernel matrix of equation (20). Indeed, the matrix $M$ may be highly sparse depending on the local structure complexity of the observed data, and the selected number of neighbors is a crucial aspect influencing this sparsity. In particular, for the spherical shell, the LLE kernel alternative yields exactly the same embedding. This is due to the clear and smooth graph structure of the manifold: points are close to each other (mainly in terms of geodesic distance) and follow a well-determined geometric rule.

On the other hand, the proposed weighted kernel PCA approach not only properly represents the spectral approaches but may also improve their performance. The use of a whole latent variable model makes kernel PCA able to detect underlying clusters in the observed data, keeping similar points (points belonging to the same cluster) within a common orthant [22]. This effect can be observed for all considered data sets. In particular, for COIL-20, our approach yields a clear benefit for the preservation of local structure (smaller values of $K$). As seen in Figure 3, the AUC of $R_{NX}(K)$ for smaller neighborhoods is significantly increased. Although this comes at the expense of slightly decreasing the ability to preserve larger-neighborhood relationships (global structure), it remains advantageous since the overall AUC is improved. In addition, local structure preservation generally produces better embedding spaces [5], [23]. Interestingly, even though this property is mainly associated with divergence-based methods (e.g., stochastic neighbor embedding) [5], [24], our weighted model for kernel PCA, being spectral, attempts to preserve local structure as well.

Additionally, the matrix $\Delta$ weights the samples according to the length of a weighted graph path in which the data points represent the nodes and the kernel matrix holds the pairwise weights, as in LE. Then, the unknown high-dimensional space $\Phi$ is normalized in such a way that the observed data lie within a unit hyper-sphere, and the calculation of the eigenvectors becomes insensitive to the length.

V. CONCLUSION
This work presented a generalized kernel-based framework for spectral dimensionality reduction approaches. The proposed approach incorporates a latent variable model into a relaxed least-squares SVM problem. By doing so, we obtain a kernel version of weighted PCA when solving the problem via a primal-dual formulation. We also provided a formal matrix representation of kernel PCA. The generalized formulation allows explaining other approaches such as multidimensional scaling, locally linear embedding and Laplacian eigenmaps, as well as a versatile framework to explain different weighted PCA approaches. Experimentally, we showed that the incorporation of an SVM model improves the performance of kernel PCA. As future work, more kernel properties will be explored to design kernels that best approximate spectral methods. In addition, we will study other alternatives for the weighting matrix to enhance the performance of weighted kernel PCA.

ACKNOWLEDGMENT

J.A. Lee is a Research Associate with the FRS-FNRS (Belgian National Scientific Research Fund). This work has been partially funded by FRS-FNRS project 7.0175.13 DRedVis.

REFERENCES

[1] I. Borg, Modern Multidimensional Scaling: Theory and Applications. Springer, 2005.
[2] M. Belkin and P. Niyogi, "Laplacian eigenmaps for dimensionality reduction and data representation," Neural Computation, vol. 15, no. 6, pp. 1373–1396, 2003.
[3] S. T. Roweis and L. K. Saul, "Nonlinear dimensionality reduction by locally linear embedding," Science, vol. 290, no. 5500, pp. 2323–2326, 2000.
[4] G. E. Hinton and S. T. Roweis, "Stochastic neighbor embedding," in Advances in Neural Information Processing Systems, 2002, pp. 833–840.
[5] J. A. Lee, E. Renard, G. Bernard, P. Dupont, and M. Verleysen, "Type 1 and 2 mixtures of Kullback-Leibler divergences as cost functions in dimensionality reduction based on similarity preservation," Neurocomputing, 2013.
[6] L. Wolf and S. Bileschi, "Combining variable selection with dimensionality reduction," in IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, June 2005, pp. 801–806.
[7] D. H. Peluffo, J. A. Lee, M. Verleysen, J. L. Rodríguez-Sotelo, and G. Castellanos-Domínguez, "Unsupervised relevance analysis for feature extraction and selection: A distance-based approach for feature relevance," in International Conference on Pattern Recognition, Applications and Methods (ICPRAM 2014).
[8] L. Wolf and A. Shashua, "Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach," Journal of Machine Learning Research, vol. 6, pp. 1855–1887, 2005.
[9] "Unsupervised feature relevance analysis applied to improve ECG heartbeat clustering," Computer Methods and Programs in Biomedicine, 2012.
[10] C. Alzate and J. A. K. Suykens, "Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 335–347, 2010.
[11] S. A. Nene, S. K. Nayar, and H. Murase, "Columbia Object Image Library (COIL-20)," Dept. Comput. Sci., Columbia Univ., New York, Tech. Rep., 1996. [Online]. Available: http://www.cs.columbia.edu/CAVE/coil-20.html
[12] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
[13] J. Ham, D. D. Lee, S. Mika, and B. Schölkopf, "A kernel view of the dimensionality reduction of manifolds," in Proceedings of the Twenty-First International Conference on Machine Learning. ACM, 2004, p. 47.
[14] Z. Xu, "On the second-order statistics of the weighted sample covariance matrix," IEEE Transactions on Signal Processing, vol. 51, no. 2, pp. 527–534, 2003.
[15] C. Alzate and J. A. Suykens, "Sparse kernel models for spectral clustering using the incomplete Cholesky decomposition," in IEEE International Joint Conference on Neural Networks (IJCNN 2008), 2008, pp. 3556–3563.
[16] J. Cook, I. Sutskever, A. Mnih, and G. E. Hinton, "Visualizing similarity data with a mixture of maps," in International Conference on Artificial Intelligence and Statistics, 2007, pp. 67–74.
[17] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral grouping using the Nyström method," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 2, pp. 214–225, 2004.
[18] L. Chen and A. Buja, "Local multidimensional scaling for nonlinear dimension reduction, graph drawing, and proximity analysis," Journal of the American Statistical Association, vol. 104, no. 485, pp. 209–219, 2009.
[19] J. Venna, J. Peltonen, K. Nybo, H. Aidos, and S. Kaski, "Information retrieval perspective to nonlinear dimensionality reduction for data visualization," Journal of Machine Learning Research, vol. 11, pp. 451–490, 2010.
[20] J. A. Lee and M. Verleysen, "Quality assessment of dimensionality reduction: Rank-based criteria," Neurocomputing, vol. 72, no. 7, pp. 1431–1443, 2009.
[21] S. France and D. Carroll, "Development of an agreement metric based upon the Rand index for the evaluation of dimensionality reduction techniques, with applications to mapping customer data," in Machine Learning and Data Mining in Pattern Recognition. Springer, 2007, pp. 499–517.
[22] C. Alzate and J. A. K. Suykens, "Multiway spectral clustering with out-of-sample extensions through weighted kernel PCA," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 2, pp. 335–347, Feb. 2010.
[23] D. H. Peluffo-Ordóñez, J. A. Lee, and M. Verleysen, "Short review of dimensionality reduction methods based on stochastic neighbour embedding," in Advances in Self-Organizing Maps and Learning Vector Quantization. Springer, 2014, pp. 65–74.
[24] D. H. Peluffo-Ordóñez, J. A. Lee, and M. Verleysen, "Recent methods for dimensionality reduction: A brief comparative analysis," in European Symposium on Artificial Neural Networks (ESANN), 2014.

[Figure 2. Results for the spherical shell. Performance is shown in terms of the quality measure RNX(K): the curves and their AUC are depicted, together with scatter plots of the resulting embedded data for CMDS, K-CMDS, WK-CMDS, LLE, K-LLE, WK-LLE, LE, K-LE and WK-LE. A vertical line highlights the selected number of neighbors, here set to K = 50. AUC values: 48.9 (CMDS, K-CMDS, WK-CMDS); 47.9 (LLE, K-LLE, WK-LLE); 53.2 (LE, K-LE, WK-LE).]
[Figure 3. Results for COIL-20. RNX(K) curves, their AUC, and scatter plots of the embedded data for CMDS, K-CMDS, WK-CMDS, LLE, K-LLE, WK-LLE, LE, K-LE and WK-LE. AUC values: CMDS 39.8, K-CMDS 38.8, WK-CMDS 44.9; LLE 28.7, K-LLE 32.6, WK-LLE 32.6; LE 35.9, K-LE 35.9, WK-LE 36.9.]
[Figure 4. Results for MNIST. RNX(K) curves, their AUC, and scatter plots of the embedded data for CMDS, K-CMDS, WK-CMDS, LLE, K-LLE, WK-LLE, LE, K-LE and WK-LE. AUC values: CMDS 17.4, K-CMDS 17.2, WK-CMDS 18.2; LLE 6.9, K-LLE 6.9, WK-LLE 6.9; LE 20.2, K-LE 20.2, WK-LE 20.9.]