Distance Preserving Dimension Reduction for Manifold Learning

Hyunsoo Kim, Haesun Park, and Hongyuan Zha
College of Computing, Georgia Institute of Technology
266 Ferst Drive, Atlanta, GA 30332, USA
{hskim,hpark,zha}@cc.gatech.edu

Abstract

Manifold learning is an effective methodology for extracting nonlinear structures from high-dimensional data, with many applications in image analysis, computer vision, text data analysis, and bioinformatics. The focus of this paper is on developing algorithms for reducing the computational complexity of manifold learning algorithms; in particular, we consider the case when the number of features is much larger than the number of data points. To handle the large number of features, we propose a preprocessing method, distance preserving dimension reduction (DPDR). It produces t-dimensional representations of the high-dimensional data, where t is the rank of the original dataset. It exactly preserves the Euclidean L2-norm distances as well as the cosine similarity measures between data points in the original space. With the original data projected into the t-dimensional space, manifold learning algorithms can be executed to obtain lower-dimensional parameterizations at a substantially reduced computational cost. Our experimental results illustrate that DPDR significantly reduces the computing time of manifold learning algorithms and produces low-dimensional parameterizations as accurate as those obtained from the original datasets.

1 Introduction

Dimension reduction is imperative for efficiently manipulating massive quantities of high-dimensional data. We consider the following general dimension reduction problem. Given a high-dimensional dataset $A = (a_1, \ldots, a_n)$, where $a_i \in \mathbb{R}^m$ for $1 \le i \le n$, how can we compute lower $d$-dimensional representations $z_i \in \mathbb{R}^d$ with $d \ll m$ that preserve certain structures of the original dataset? To this end, we briefly review several dimension reduction methods.

Principal component analysis (PCA) and multidimensional scaling (MDS) are classic methods for linear dimensionality reduction. For these methods, the data points $a_i$ are centered at the origin, i.e. $A_c e = 0$, where $A_c$ is the centered data matrix, $e$ is the $n$-dimensional column vector whose elements are all ones, and $0$ is the $m$-dimensional zero column vector.

PCA can be computed from the singular value decomposition (SVD) [7] of $A_c = U_c \Sigma_c V_c^T$. The eigendecomposition of $A_c^T A_c$, i.e. $A_c^T A_c = V_c \Sigma_c^2 V_c^T$, also generates the principal components, i.e. the column vectors of $V_c \in \mathbb{R}^{n \times n}$, which are the right singular vectors $v_1, \ldots, v_n$ of $A_c$. The $d$-dimensional representation can be obtained from the first $d$ right singular vectors $v_1, \ldots, v_d$, i.e. $z_i(j) = v_j(i)$ for $1 \le i \le n$ and $1 \le j \le d$. MDS tries to preserve the inner products between data points as much as possible. The solution is obtained from the symmetric eigendecomposition of the Gram matrix $A_c^T A_c$. The low-dimensional representations are given by $z_i(j) = \sqrt{\lambda_j}\, v_j(i)$ for $1 \le i \le n$ and $1 \le j \le d$, where $\lambda_j$ and $v_j$ are the $j$-th eigenvalue and eigenvector, respectively.

There are several recently proposed nonlinear dimensionality reduction (NLDR) algorithms, including Isomap [13], locally linear embedding (LLE) [11, 12], Hessian LLE [5], Laplacian eigenmaps [3], and local tangent space alignment (LTSA) [15]. In Isomap, nearby data points are mapped to nearby low-dimensional representations, while faraway data points are mapped to faraway low-dimensional representations. In particular, Isomap builds a connectivity graph from the k-nearest neighbors, assigns edge weights by the Euclidean distances between nearest neighbors, computes pairwise distances $\Delta_{ij}$ between all nodes $(i, j)$ along shortest paths through the graph, which approximate the geodesic distances on the manifold, and then performs MDS with the distance matrix $\Delta$ (a rough sketch of this pipeline is given below). LLE builds a similar graph, assigns weights to the edges of the graph by finding the optimal local convex/linear combinations of the k-nearest neighbors that represent each original data point, and obtains low-dimensional representations that preserve these local convex/linear combinations. Laplacian eigenmaps starts with the graph, assigns weights to the edges of the graph by $W_{ij} = \exp(-\|a_i - a_j\|_2^2 / \sigma^2)$, where $\sigma$ is a scale parameter, and obtains low-dimensional representations such that nearby data points are mapped to nearby low-dimensional representations. Isomap needs to assume that the embedded manifold is sampled on a convex region.
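The Isomap pipeline just described can be sketched in a few lines. The following is an illustrative NumPy/SciPy version, not the authors' MATLAB implementation; the function and parameter names are hypothetical and connectivity checks are omitted.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_sketch(X, k=8, d=2):
    """Rough Isomap sketch: k-NN graph -> graph shortest paths -> classical MDS.

    X : (n, m) array, one data point per row.
    """
    # 1. k-nearest-neighbor graph weighted by Euclidean distances.
    G = kneighbors_graph(X, n_neighbors=k, mode="distance")
    # 2. Shortest-path distances approximate geodesic distances on the manifold.
    Delta = shortest_path(G, method="D", directed=False)
    # 3. Classical MDS on the squared geodesic distance matrix.
    n = Delta.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ (Delta ** 2) @ J              # double-centered Gram matrix
    lam, V = np.linalg.eigh(B)                   # ascending eigenvalues
    lam, V = lam[::-1][:d], V[:, ::-1][:, :d]    # top-d eigenpairs
    return V * np.sqrt(np.maximum(lam, 0))       # z_i(j) = sqrt(lambda_j) v_j(i)
```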


Donoho and Grimes [5] proved that the convexity condition required by the previous methods can be relaxed by minimizing the Hessian of a function $f$ that maps the manifold to $\mathbb{R}$. LTSA uses the k-nearest neighbors to form a graph and approximates a local tangent space for each neighborhood.

NLDR algorithms have found applications in a variety of areas such as image recognition, speech recognition, and text data mining. One impediment to their wider adoption is their high computational cost. In this paper, we consider the case where there are more features than data points. Specifically, we introduce a preprocessing approach, distance preserving dimension reduction (DPDR), for NLDR. DPDR produces t-dimensional representations, where $t = \mathrm{rank}(A)$, which exactly preserve the Euclidean L2-norm distances as well as the cosine similarity measures between data points in the original m-dimensional space. DPDR projects the original dataset into a potentially much smaller space without any loss of pairwise Euclidean distance information. NLDR algorithms can then be applied to the t-dimensional representations to obtain lower-dimensional parameterizations.

The rest of this paper is organized as follows. In Section 2, we introduce DPDR. Section 3 presents experimental results illustrating the properties of the proposed DPDR method. A summary is given in Section 4.

2 Distance Preserving Dimension Reduction

Let us deal with $n$ data points whose dimension is $m$ with $m \gg n$. The QR decomposition of a centroid matrix can preserve the order of distances and the order of cosine similarities [10]. DPDR can be designed from the QR decomposition, the UTV decomposition [7], or the SVD. In this paper, we focus on DPDR based on the SVD (DPDR/SVD) for the sake of simple presentation.

The thin SVD of $A$ is
\[
A = U_1 \Sigma_1 V_1^T,
\]
where $U_1 \in \mathbb{R}^{m \times n}$ has only $n$ basis vectors, $\Sigma_1 \in \mathbb{R}^{n \times n}$ is a diagonal matrix, and $V_1 \in \mathbb{R}^{n \times n}$ is an orthogonal matrix. Although the L2-norm is invariant under orthogonal transformations, this does not hold for $U_1^T$ in general, since $U_1 U_1^T \ne I$ although $U_1^T U_1 = I$. Nevertheless, if we apply dimension reduction by $U_1^T \in \mathbb{R}^{n \times m}$, Euclidean L2-norm distances and cosine similarities are preserved in the reduced $n$-dimensional space, according to Theorems 2.1 and 2.2. The $n$-dimensional representation that preserves distances is $Y_1 = U_1^T A = \Sigma_1 V_1^T$. By Theorem 2.1, we can compute the Euclidean distance in the full-dimensional space between any two vectors $a_i$ and $a_j$ from $Y_1$ as $\|a_i - a_j\|_2 = \|U_1^T(a_i - a_j)\|_2 = \|y_i - y_j\|_2$, where $y_j$ is the $j$-th column of $Y_1 \in \mathbb{R}^{n \times n}$. The computing cost of the distance in the reduced $n$-dimensional space is much less than that in the full $m$-dimensional space when $m \gg n$.

The factors $\Sigma_1$ and $V_1$ can be computed efficiently, without computing $U_1$, from the symmetric eigendecomposition of $A^T A$, since $A^T A = V_1 \Sigma_1 U_1^T U_1 \Sigma_1 V_1^T = V_1 \Sigma_1^2 V_1^T$. This can be computed by the Golub-Reinsch SVD [6] of $A^T A$, which requires $mn^2 + 12n^3$ flops [7]. The R-SVD using R-bidiagonalization [4] requires $2mn^2 + 11n^3$ flops, while the Golub-Reinsch SVD needs $4mn^2 + 8n^3$ flops, when we need only $\Sigma$ and $V$ of $A = U\Sigma V^T \in \mathbb{R}^{m \times n}$ with $m \gg n$. However, since we are dealing with the SVD of $A^T A \in \mathbb{R}^{n \times n}$, the flop count of the R-SVD, i.e. $13n^3$, is more expensive than that of the Golub-Reinsch SVD, which is $12n^3$.

When $t = \mathrm{rank}(A)$, we obtain
\[
Y_1 = \begin{pmatrix} Y_t \\ 0_{(n-t) \times n} \end{pmatrix},
\]
where $Y_t \in \mathbb{R}^{t \times n}$ contains the $t$-dimensional representations of the $n$ data points and $0_{(n-t) \times n}$ is a zero matrix of size $(n-t) \times n$. Let $y_j$ and $\tilde{y}_j$ denote the $j$-th column of $Y_1$ and $Y_t$, respectively. Then, according to Theorem 2.1, L2-norm distances are preserved in the $t$-dimensional space since
\[
\|a_i - a_j\|_2 = \|U_1^T(a_i - a_j)\|_2 = \|y_i - y_j\|_2 = \|\tilde{y}_i - \tilde{y}_j\|_2.
\]
In addition, according to Theorem 2.2, cosine similarities are preserved in the $t$-dimensional space owing to
\[
\cos(a_i, a_j) = \frac{(U_1^T a_i)^T (U_1^T a_j)}{\|U_1^T a_i\|_2\, \|U_1^T a_j\|_2}
= \frac{y_i^T y_j}{\|y_i\|_2\, \|y_j\|_2}
= \frac{\tilde{y}_i^T \tilde{y}_j}{\|\tilde{y}_i\|_2\, \|\tilde{y}_j\|_2}.
\]
We refer to this method as DPDR/SVD.
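As a concrete illustration, a minimal NumPy sketch of DPDR/SVD is given below. It is not the authors' MATLAB code; the function and variable names are hypothetical, and it assumes the data matrix A is stored with features in rows and data points in columns, as in the paper.

```python
import numpy as np

def dpdr_svd(A, tol=1e-12):
    """DPDR/SVD sketch: return Y_t = Sigma_t V_t^T, the t-dimensional
    distance-preserving representation, where t = rank(A).

    A : (m, n) array with m >> n, one data point per column.
    """
    # Work with the small n x n matrix A^T A instead of the m x n matrix A,
    # which is cheaper when m >> n (cf. the flop counts discussed above).
    G = A.T @ A                              # A^T A = V_1 Sigma_1^2 V_1^T
    lam, V = np.linalg.eigh(G)               # ascending eigenvalues
    lam, V = lam[::-1], V[:, ::-1]           # descending order
    sigma = np.sqrt(np.maximum(lam, 0.0))    # singular values of A
    t = int(np.sum(sigma > tol * sigma[0]))  # numerical rank t = rank(A)
    Y_t = (V[:, :t] * sigma[:t]).T           # Sigma_t V_t^T, shape (t, n)
    return Y_t

# Pairwise distances and cosine similarities computed from the columns of Y_t
# match those computed from the columns of A (Theorems 2.1 and 2.2), e.g.
#   np.allclose(np.linalg.norm(A[:, 0] - A[:, 1]),
#               np.linalg.norm(Y_t[:, 0] - Y_t[:, 1]))
```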

Theorems 2.1 and 2.2 are stated for the SVD. Similar theorems for DPDR based on the QR decomposition or the UTV decomposition can also be constructed.

THEOREM 2.1. [Distance preserving in L2 norm] The distance $D(a_i, a_j)$ in the L2 norm in the full-dimensional space between two vectors $a_i$ and $a_j$ is completely preserved in the reduced space obtained from the thin SVD. That is, $D(a_i, a_j) = D(\hat{a}_i, \hat{a}_j)$, where $\hat{a}_i = U_1^T a_i$, $\hat{a}_j = U_1^T a_j$, and the thin SVD of $A$ is $U_1 \Sigma_1 V_1^T$.

Proof. Premultiplying $U^T = (U_1\; U_2)^T$ to $A = U \Sigma V^T$ gives $U_1^T A = Y_1 = \Sigma_1 V_1^T$ and $U_2^T A = 0$. Since
\[
\|a_i - a_j\|_2^2 = \|U^T(a_i - a_j)\|_2^2 = \|U_1^T(a_i - a_j)\|_2^2 + \|U_2^T(a_i - a_j)\|_2^2
\]
and $U_2^T a_i = U_2^T a_j = 0$,
\[
\|a_i - a_j\|_2^2 = \|U_1^T(a_i - a_j)\|_2^2.
\]

THEOREM 2.2. [Similarity preserving in cosine measure] The similarity $S(a_i, a_j)$ in the cosine measure in the full-dimensional space between any two vectors $a_i$ and $a_j$ is completely preserved in the reduced space obtained from the thin SVD. That is, $S(a_i, a_j) = S(\hat{a}_i, \hat{a}_j)$, where $\hat{a}_i = U_1^T a_i$, $\hat{a}_j = U_1^T a_j$, and the thin SVD of $A$ is $U_1 \Sigma_1 V_1^T$.

Proof. Let $\cos(a_i, a_j)$ denote the cosine between two vectors $a_i$ and $a_j$. Since $U_2^T a_j = 0$, $\|U^T a_j\|_2 = \|(U_1\; U_2)^T a_j\|_2 = \|U_1^T a_j\|_2$. Then,
\[
\cos(a_i, a_j) = \cos(U^T a_i, U^T a_j)
= \frac{(U^T a_i)^T (U^T a_j)}{\|U^T a_i\|_2\, \|U^T a_j\|_2}
= \frac{a_i^T U_1 U_1^T a_j}{\|U_1^T a_i\|_2\, \|U_1^T a_j\|_2}
= \cos(U_1^T a_i, U_1^T a_j).
\]

3 Experimental Results and Discussion

All algorithms were implemented and executed in MATLAB 6.5 [9]. We used the MATLAB codes of the manifold learning MATLAB toolbox (available from http://www.math.umn.edu/~wittman/mani/). In the toolbox, except for PCA, all MATLAB codes for the nonlinear dimensionality reduction algorithms were written by the original authors of the methods. All our experiments were performed on a P3 600MHz machine with 512MB memory.

3.1 Quality Measures DPDR aims to find lower t-dimensional representations, where $t \ll m$, that preserve the distances of the original m-dimensional space. We defined a distance preserving measure
\[
(3.1)\qquad E_{DPDR} = \frac{1}{(n^2 - n)/2} \sum_{i,\, j>i} \left( D_A(i,j) - D_{DPDR}(i,j) \right)^2,
\]
where $D_A$ denotes the matrix of squared Euclidean distances $D_A(i,j) = \|a_i - a_j\|_2^2$, for $1 \le i, j \le n$, in the original m-dimensional space, and $D_{DPDR}$ is the matrix of squared Euclidean distances $D_{DPDR}(i,j) = \|\tilde{y}_i - \tilde{y}_j\|_2^2$ of the t-dimensional representations obtained from DPDR. A lower distance preserving measure $E_{DPDR}$ means a lower loss of distance information. The second distance preserving measure is
\[
(3.2)\qquad E_{DPDR+DR} = \frac{1}{(n^2 - n)/2}\, \xi_{DR}, \qquad
\xi_{DR} = \sum_{i,\, j>i} \left( D_{DR}(i,j) - D_{DPDR+DR}(i,j) \right)^2,
\]
where $D_{DR}$ denotes the matrix of squared Euclidean distances $D_{DR}(i,j) = \|z_i - z_j\|_2^2$ of the lower d-dimensional representations $(z_1, z_2, \ldots, z_n)$ obtained from a dimensionality reduction algorithm, and $D_{DPDR+DR}$ is the matrix of squared Euclidean distances $D_{DPDR+DR}(i,j) = \|\check{z}_i - \check{z}_j\|_2^2$ of the lower d-dimensional representations $(\check{z}_1, \check{z}_2, \ldots, \check{z}_n)$ obtained from a dimensionality reduction algorithm after applying DPDR. A lower value of $E_{DPDR+DR}$ means that $D_{DPDR+DR}$ is more similar to $D_{DR}$.
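For clarity, the distance preserving measure of equation (3.1) can be computed as in the short NumPy sketch below; the helper name is hypothetical, and the squared pairwise distances follow the definition of $D_A$ above.

```python
import numpy as np
from scipy.spatial.distance import pdist

def dpdr_error(A, Y_t):
    """E_DPDR of Eq. (3.1): mean squared difference between the squared pairwise
    Euclidean distances in the original space (columns of A) and in the DPDR
    space (columns of Y_t).  Both condensed vectors have (n^2 - n)/2 entries."""
    D_A    = pdist(A.T,   metric="sqeuclidean")   # D_A(i, j), j > i
    D_dpdr = pdist(Y_t.T, metric="sqeuclidean")   # D_DPDR(i, j), j > i
    return np.mean((D_A - D_dpdr) ** 2)
```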
3.2 Datasets Description The first dataset is the cropped UMist faces dataset [8]. It consists of 575 images of 20 people, each of size 112 × 92 in 256 shades of gray, manually cropped by Daniel Graham at UMist. Each subject covers a range of poses from profile to frontal views, and the subjects cover a range of race, sex, and appearance. We built a data matrix $A_{UMist}$ of size $(112 \times 92) \times 149$ using the first five persons.

The second dataset is the ORL database of faces [1], containing ten different images of each of 40 distinct subjects. For some subjects, the images were taken at different times, with varying lighting, facial expressions (open/closed eyes, smiling/not smiling), and facial details (glasses/no glasses). All the images were taken against a dark homogeneous background. The faces are in an upright, frontal position, with tolerance for some side movement. The files are in PGM format, and the size of each image is 112 × 92 pixels, with 256 grey levels per pixel. We built a data matrix $A_{ORL}$ of size $(112 \times 92) \times 400$.

The third dataset is the MEDLINE term-document matrix $A_{MEDLINE}$ of size $15{,}018 \times 250$, in the form of MATLAB sparse arrays generated by the Text to Matrix Generator (TMG) [14]. TMG applies common filtering techniques (e.g. removal of common words, removal of words that are too infrequent or too frequent, removal of words that are too short or too long, etc.) to reduce the size of the term dictionary. Stemming was not applied. The $m \times n$ term-by-document matrix $A_{MEDLINE} = [a_{ij}]$ was built using the following weighting scheme. The elements of $A_{MEDLINE}$ are assigned as $a_{ij} = l_{ij} \cdot g_i$, where $l_{ij}$ is the local weight for the $i$-th term in the $j$-th document and $g_i$ is the global weight for the $i$-th term. The local weight $l_{ij}$ and the global weight $g_i$ are computed as
\[
l_{ij} = f_{ij}, \qquad g_i = \log_2\!\left( n \Big/ \sum_{j=1}^{n} \delta(f_{ij}) \right),
\]
where $f_{ij}$ is the frequency of the $i$-th term in the $j$-th document, $g_i$ is the inverse document frequency, i.e. the logarithm of the ratio between the total number of documents and the number of documents containing the term, $\delta(v)$ is a delta function ($\delta(v) = 0$ if $v = 0$ and $\delta(v) = 1$ otherwise), and $n$ is the number of documents in the collection. The document vectors are normalized so that $\|a_i\|_2 = 1$. The MEDLINE dataset contains 50 documents for each of five classes (heart attack, colon cancer, glycemic, oral cancer, tooth decay).
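The tf-idf-style weighting above can be reproduced in a few lines. The sketch below is illustrative only (the paper used TMG in MATLAB); the function name is hypothetical and it starts from a raw term-frequency matrix F with terms in rows and documents in columns.

```python
import numpy as np

def tmg_style_weights(F):
    """Build a_ij = l_ij * g_i with l_ij = f_ij and
    g_i = log2(n / #{documents containing term i}), then normalize each
    document (column) to unit L2 norm, as described above.

    F : (m, n) array of raw term frequencies f_ij."""
    n = F.shape[1]
    doc_freq = np.count_nonzero(F, axis=1)             # sum_j delta(f_ij)
    g = np.log2(n / np.maximum(doc_freq, 1))           # global (idf) weights
    A = F * g[:, None]                                 # a_ij = f_ij * g_i
    norms = np.linalg.norm(A, axis=0)
    return A / np.maximum(norms, np.finfo(float).eps)  # ||a_j||_2 = 1
```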


Table 1: Computing times (in seconds) of dimensionality reduction algorithms with/without DPDR/SVD on the cropped UMist faces dataset ($A_{UMist} \in \mathbb{R}^{10304 \times 149}$, $\mathrm{rank}(A_{UMist}) = 139$). We applied the dimensionality reduction algorithms to obtain two-dimensional representations. Except for PCA, all MATLAB codes for the nonlinear dimensionality reduction algorithms were written by the original authors of the methods. We obtained the distance preserving representations from DPDR/SVD in 1.582 seconds, with $E_{DPDR} = 4.4876 \times 10^{-13}$.

Algorithms     k   Original        with DPDR (1.582 sec.)   E_{DPDR+DR}
PCA            -   11.377          0.371                    1.2570 × 10^{-30}
Isomap         8   4.076           1.492                    3.4488 × 10^{-10}
LLE            8   > 10 min.       > 10 min.                n/a
Hessian LLE    8   out of memory   1.852                    n/a
Laplacian      8   2.754           0.341                    8.8541 × 10^{-34}
LTSA           8   7.520           1.031                    4.7410 × 10^{-10}

Table 2: Computing times (in seconds) of dimensionality reduction algorithms with/without DPDR/SVD on the MEDLINE dataset ($A_{MEDLINE} \in \mathbb{R}^{15018 \times 250}$, $\mathrm{rank}(A_{MEDLINE}) = 247$) for various k values. We applied the dimensionality reduction algorithms to obtain two-dimensional representations. We obtained the distance preserving representations from DPDR/SVD in 2.814 seconds, with $E_{DPDR} = 4.5393 \times 10^{-30}$.

Algorithms   k    Original   with DPDR (2.814 sec.)   E_{DPDR+DR}
PCA          -    34.590     2.915                    5.0635 × 10^{-32}
Isomap       8    12.718     6.149                    5.9241 × 10^{-19}
LTSA         8    21.360     4.997                    4.8351 × 10^{-8}
Isomap       10   12.668     6.740                    6.3032 × 10^{-18}
LTSA         10   25.497     5.147                    4.5415 × 10^{-10}
Isomap       12   12.287     6.319                    3.8736 × 10^{-19}
LTSA         12   29.943     5.698                    3.5124 × 10^{-10}

3.3 Experimental Results Table 1 shows the computing times (in seconds) of dimensionality reduction algorithms with/without DPDR/SVD on the cropped UMist faces dataset. We applied the dimensionality reduction algorithms to obtain two-dimensional representations. We obtained the distance preserving representations from DPDR/SVD in 1.582 seconds. The average of the distance differences between $D_A$ and $D_{DPDR}$ for the UMist faces dataset was $E_{DPDR} = 4.4876 \times 10^{-13}$, which shows that the distances between data points in the original 10,304-dimensional space are preserved in the lower 139-dimensional space obtained from DPDR/SVD. LLE did not produce results within 10 minutes. Hessian LLE without DPDR did not generate two-dimensional representations because it ran out of memory. However, when we used Hessian LLE after applying DPDR (DPDR+Hessian LLE), the dimension of the input matrix was substantially reduced and there was no memory problem any more. This is a critical benefit of DPDR for Hessian LLE. The values of $E_{DPDR+DR}$ for the dimensionality reduction algorithms we tested were very small, which means that the results with and without DPDR were almost the same. Thus, it is possible to reduce the computational complexity of dimensionality reduction algorithms by using the t-dimensional representations obtained from DPDR. For example, LTSA took 7.520 seconds, while LTSA with DPDR (DPDR+LTSA) took only 2.613 seconds (1.582 seconds for DPDR and 1.031 seconds for LTSA). More than 60% of the computing time was saved.

For the ORL faces dataset of size $10{,}304 \times 400$, the value $E_{DPDR} = 5.3238 \times 10^{-13}$ was again very small.
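The setup just described amounts to running an off-the-shelf NLDR method either on A directly or on the DPDR output Y_t. A hedged sketch using scikit-learn's manifold module (the paper itself used the authors' MATLAB implementations) might look as follows; dpdr_svd refers to the illustrative sketch given in Section 2.

```python
import time
from sklearn.manifold import Isomap

def embed_with_and_without_dpdr(A, k=8, d=2):
    """Compare an NLDR run on the original data with a run on the DPDR output.
    A : (m, n) array, one data point per column; scikit-learn expects one
    point per row, hence the transposes."""
    t0 = time.time()
    Z_direct = Isomap(n_neighbors=k, n_components=d).fit_transform(A.T)
    t_direct = time.time() - t0

    t0 = time.time()
    Y_t = dpdr_svd(A)                       # t-dimensional DPDR representation
    Z_dpdr = Isomap(n_neighbors=k, n_components=d).fit_transform(Y_t.T)
    t_dpdr = time.time() - t0

    return (Z_direct, t_direct), (Z_dpdr, t_dpdr)
```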

Copyright © by SIAM. Unauthorized reproduction of this article is prohibited

Figure 1: Two-dimensional representations obtained from Isomap with/without DPDR/SVD on the ORL faces dataset ($A_{ORL} \in \mathbb{R}^{10304 \times 400}$) with k = 8. (a) Isomap. (b) DPDR + Isomap.

For PCA, the distances between data points in the two-dimensional space obtained directly from PCA were almost the same as those in the two-dimensional space obtained from PCA applied to the 400-dimensional representations produced by DPDR, i.e. $D_{PCA} \approx D_{DPDR+PCA}$. The two-dimensional representations obtained from Isomap with/without DPDR using k = 8 are illustrated in Figure 1. Using Isomap after applying DPDR, we obtained the same two-dimensional representations as Isomap without DPDR. This is attributed to the fact that DPDR generates t-dimensional representations that preserve the distances between data points in the m-dimensional space.

Table 2 shows the computing times and distance preserving measures for various k values on the MEDLINE dataset. The MEDLINE dataset has the largest number of features among all datasets we tested; specifically, it is a large sparse matrix. We obtained the distance preserving representations from DPDR/SVD in 2.814 seconds with $E_{DPDR} = 4.5393 \times 10^{-30}$. The computing times of the dimensionality reduction algorithms were significantly reduced. For example, LTSA with k = 12 took 29.943 seconds on the MEDLINE dataset, while LTSA with DPDR (DPDR+LTSA) took only 8.512 seconds (2.814 seconds for DPDR and 5.698 seconds for LTSA). More than 70% of the computing time was saved.

The number of nearest neighbors k determines the lower dimensional representations we obtain, and different k values may produce different results. One can think of a method for choosing k [2]. If some domain knowledge is available, we can use it to choose a proper k value by measuring the quality of the lower dimensional representations for various k values. In this case, we need to compute lower dimensional representations from an NLDR algorithm for each candidate k, which is a time-consuming procedure. However, with DPDR we compute the t-dimensional representation only once and reuse it when repeatedly executing an NLDR algorithm for the various k values. Thus, DPDR can save a lot of computing time when an appropriate k value must be chosen using domain knowledge. For example, suppose that we need to choose a k value among 8, 10, and 12 for data visualization. Then we need to compute two-dimensional representations from LTSA for three k values. Table 2 shows the effectiveness of DPDR when we wish to search for an optimal k: LTSA alone took 76.8 seconds (21.360 seconds for k = 8, 25.497 seconds for k = 10, and 29.943 seconds for k = 12), whereas LTSA with DPDR took only 18.656 seconds (2.814 seconds for DPDR, 4.997 seconds for LTSA with k = 8, 5.147 seconds for LTSA with k = 10, and 5.698 seconds for LTSA with k = 12). Again, more than 70% of the computing time was saved.
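A minimal sketch of this reuse pattern, under the same assumptions as the earlier snippets (Python/scikit-learn rather than the authors' MATLAB code, with dpdr_svd as sketched in Section 2):

```python
from sklearn.manifold import LocallyLinearEmbedding

# Compute the distance preserving representation once ...
Y_t = dpdr_svd(A)                # A: (m, n) data matrix with m >> n

# ... then sweep the neighborhood size k, reusing Y_t for every run.
embeddings = {}
for k in (8, 10, 12):
    ltsa = LocallyLinearEmbedding(n_neighbors=k, n_components=2, method="ltsa")
    embeddings[k] = ltsa.fit_transform(Y_t.T)   # one point per row
```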

4 Conclusion

We have developed a preprocessing method, distance preserving dimension reduction, for substantially reducing the computational cost of manifold learning algorithms. Using face recognition datasets, a text mining dataset, and a gene expression dataset, we confirmed that DPDR generates lower dimensional representations that preserve the distances between data points in the original high-dimensional space. DPDR substantially reduces the computational costs and memory requirements of many dimensionality reduction algorithms, including PCA, Isomap, and LTSA, while at the same time giving low-dimensional representations/parameterizations as accurate as those based on the original datasets. DPDR can be widely applied to many data analysis problems where the number of data points is much smaller than the number of features.


Acknowledgments

We would like to thank J. B. Tenenbaum for his helpful comments. The work of the first two authors is supported in part by the National Science Foundation Grants ACI-0305543 and CCF-0621889. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] AT&T Laboratories Cambridge. The ORL Database of Faces, 1994.
[2] M. Balasubramanian, E. L. Schwartz, J. B. Tenenbaum, V. de Silva, and J. C. Langford. The Isomap algorithm and topological stability. Science, 295:7a, 2002.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373–1396, 2003.
[4] T. F. Chan. An improved algorithm for computing the singular value decomposition. ACM Trans. Math. Soft., 8:72–83, 1982.
[5] D. L. Donoho and C. E. Grimes. Hessian eigenmaps: locally linear embedding techniques for high-dimensional data. Proc. Natl. Acad. Sci. USA, 100:5591–5596, 2003.
[6] G. H. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Numer. Math., 14:403–420, 1970.
[7] G. H. Golub and C. F. van Loan. Matrix Computations, third edition. Johns Hopkins University Press, Baltimore, 1996.
[8] D. B. Graham and N. M. Allinson. Characterizing virtual eigensignatures for general purpose face recognition. In H. Wechsler, P. J. Phillips, V. Bruce, F. Fogelman-Soulie, and T. S. Huang, editors, Face Recognition: From Theory to Applications, NATO ASI Series F: Computer and Systems Sciences, volume 163, pages 446–456, 1998.
[9] MATLAB. User's Guide. The MathWorks, Inc., Natick, MA, 1992.
[10] H. Park, M. Jeon, and J. B. Rosen. Lower dimensional representation of text data based on centroids and least squares. BIT Numerical Mathematics, 42(2):1–22, 2003.
[11] S. T. Roweis and L. K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[12] L. K. Saul and S. T. Roweis. Think globally, fit locally: Unsupervised learning of low dimensional manifolds. Journal of Machine Learning Research, 4:119–155, 2003.
[13] J. B. Tenenbaum, V. de Silva, and J. C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[14] D. Zeimpekis and E. Gallopoulos. Design of a MATLAB toolbox for term-document matrix generation. In I. S. Dhillon, J. Kogan, and J. Ghosh, editors, Proc. Workshop on Clustering High Dimensional Data and its Applications at the 5th SIAM Int'l Conf. on Data Mining (SDM'05), pages 38–48, Newport Beach, CA, April 2005.
[15] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimension reduction via tangent space alignment. SIAM Journal on Scientific Computing, 26(1):313–338, 2004.
