LOCAL LINEAR SMOOTHING FOR NONLINEAR MANIFOLD LEARNING

ZHENYUE ZHANG∗ AND HONGYUAN ZHA†

∗ Department of Mathematics, Zhejiang University, Yuquan Campus, Hangzhou, 310027, P. R. China. [email protected]. The work of this author was done while visiting Penn State University and was supported in part by the Special Funds for Major State Basic Research Projects (project G19990328), the Foundation for University Key Teacher by the Ministry of Education, China, and NSF grant CCR-9901986.
† Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802. [email protected]. The work of this author was supported in part by NSF grant CCR-9901986.

Abstract. In this paper, we develop methods for outlier removal and noise reduction based on weighted local linear smoothing for a set of noisy points sampled from a nonlinear manifold. The methods can be used by manifold learning methods such as Isomap, LLE and LTSA as a preprocessing procedure so as to obtain a more accurate reconstruction of the underlying nonlinear manifolds. Weighted principal component analysis is used as a building block of our methods, and we develop an iterative weight selection scheme that leads to robust local linear fitting. We also develop an efficient and effective bias-reduction method to deal with the "trim the peak and fill the valley" phenomenon in local linear smoothing. Several illustrative examples are presented to show that nonlinear manifold learning methods combined with weighted local linear smoothing give a more accurate reconstruction of the underlying nonlinear manifolds.
1. Introduction. In many real-world applications with high-dimensional data, the components of the data points tend to be correlated with each other, and in many cases the data points lie close to a low-dimensional nonlinear manifold. Discovering the structure of the manifold from a set of data points sampled from the manifold, possibly with noise, represents a very challenging unsupervised learning problem [2, 3, 6, 7, 10, 11, 14]. Furthermore, the discovered low-dimensional structures can be used for classification, clustering, and data visualization. Traditional dimension reduction techniques such as principal component analysis (PCA), factor analysis and multidimensional scaling usually work well when the underlying manifold is a linear (affine) subspace in the input space [5]. They cannot, in general, discover nonlinear structures embedded in the set of data points. Recently, there has been much renewed interest in developing efficient algorithms for constructing nonlinear low-dimensional manifolds from sample data points in high-dimensional spaces, emphasizing simple algorithmic implementation and avoiding optimization problems prone to local minima [11, 14]. There are two main approaches: 1) the one exemplified by [3, 14], where pairwise geodesic distances of the data points with respect to the underlying manifold are estimated, and classical multidimensional scaling is used to project the data points into a low-dimensional space that best preserves the geodesic distances; 2) the other follows the long tradition starting with self-organizing maps (SOM) [7], principal curves/surfaces [4] and topology-preserving networks [9, 2]. The key idea is that the information about the global structure of a nonlinear manifold can be obtained from a careful analysis of the interactions of the overlapping local structures. In particular, the local linear embedding (LLE) method constructs a local geometric structure that is invariant to translations and orthogonal transformations in a neighborhood of each data point and seeks to project the data points into a low-dimensional space that best preserves those local geometries [11, 12]. In [15], we have also developed an improved local algorithm, LTSA, based on using the tangent space in the neighborhood of a data point to represent the local geometry,
and then aligning those local tangent spaces to construct the global coordinate system for the nonlinear manifold.

As is well known, PCA is quite sensitive to outliers (this will also be shown in an example in section 2). Both Isomap and the local methods such as LLE and LTSA have similar sensitivity problems with regard to outliers. The focus of this paper is the development of an outlier removal and noise reduction method that can be used by Isomap, LLE and LTSA as a preprocessing procedure so as to obtain a more accurate reconstruction of the underlying nonlinear manifold from a set of noisy sample points (one of the building blocks of the method, weighted PCA, can be considered as a robust version of PCA that is less sensitive to outliers). More importantly, outlier removal and noise reduction tend to reduce the chances of having short-circuit nearest-neighbor connections and therefore help to better preserve the topological structure of the nonlinear manifold in a neighborhood graph constructed from a finite set of sample points. We use the basic idea of weighted local linear smoothing for outlier removal and noise reduction, and the techniques we use are similar in spirit to local polynomial smoothing employed in nonparametric regression [1, 8]. However, since in our context we do not have response variables, local smoothing needs to employ techniques other than least squares fitting. Moreover, since we are interested in modeling high-dimensional data, in local smoothing we try to avoid more expensive operations such as approximating Hessian matrices when we carry out bias reduction. Furthermore, we apply local smoothing in an iterative fashion.

The rest of the paper is organized as follows: in section 2 we develop a weighted version of PCA which will be used as the building block for local linear smoothing. In section 3, we discuss an iterative method for selecting the weights used in weighted PCA, and we also give some informal arguments as to why the method can have the effect of outlier removal. In section 4, we present the weighted local linear smoothing method and illustrate its performance using some simple examples. In section 5, we discuss how to deal with the "trim the peak and fill the valley" phenomenon in local linear smoothing; we do so in a way that avoids using higher-order polynomials. In section 6, we present some examples to show that local smoothing can improve accuracy in manifold learning. The last section concludes the paper and points out some topics for further research.

Notation. Throughout the paper, we use $\|\cdot\|_2$ to denote the Euclidean norm of a vector, and $\|\cdot\|_F$ to denote the Frobenius norm of a matrix, i.e., $\|A\|_F^2=\sum_{i,j}a_{ij}^2$ for $A=[a_{ij}]$. In some of the numerical examples, we will use MATLAB scripts to show how the data points are generated.

2. Weighted PCA. We assume that $F$ is a $d$-dimensional manifold in an $m$-dimensional space with unknown generating function $f(\tau)$, $\tau\in\mathbb{R}^d$, and we are given a data set consisting of $N$ $m$-dimensional vectors $X=[x_1,\dots,x_N]$, $x_i\in\mathbb{R}^m$, generated from the following model,
\[
x_i = f(\tau_i)+\epsilon_i, \qquad i=1,\dots,N,
\]
where $\tau_i\in\mathbb{R}^d$ with $d<m$, and the $\epsilon_i$'s represent noise. In the local smoothing method we will discuss, the nonlinear manifold $F$ is locally approximated by an affine subspace. Specifically, for a sample point $x_i$, let
\[
X_i=[x_{i_1},\dots,x_{i_k}]
\]
be a matrix consisting of its $k$ nearest neighbors, including $x_i$ itself, say in terms of the Euclidean distance.

Fig. 1. First principal component direction (solid line): (left) without weighting and (right) with weighting.

We can fit a $d$-dimensional affine subspace to the columns of $X_i$ by applying PCA to $X_i$ [5, Section 14.5]. However, PCA is quite sensitive to outliers, as is illustrated in the left panel of Figure 1. There, we have a set of 2D points sampled from a line segment, to which we also add a few outliers. It is clearly seen that the computed first principal component direction is almost orthogonal to the direction of the straight line segment. Since one of the goals of local smoothing is to reduce the effects of outliers, we need to develop a more robust version of PCA. Our basic idea is to incorporate weighting to obtain a weighted version of PCA.

Now consider a set of $k$ points $x_1,\dots,x_k$ in $\mathbb{R}^m$ that we want to fit using an affine subspace parameterized as $c+Ut$, where $U\in\mathbb{R}^{m\times d}$ has columns forming an orthonormal basis of the affine subspace, $c\in\mathbb{R}^m$ gives the displacement of the affine subspace, and $t\in\mathbb{R}^d$ represents the local coordinates of a point in the affine subspace. To this end, we formulate the following weighted least squares problem
\[
(2.1)\qquad
\min_{c,\;\{t_i\},\;U^TU=I_d}\;\sum_{i=1}^{k} w_i\,\|x_i-(c+Ut_i)\|_2^2,
\]
where $w_1,\dots,w_k$ is a given set of weights, the choice of which will be discussed in the next section. We denote
\[
X=[x_1,\dots,x_k],\qquad T=[t_1,\dots,t_k],
\]
and form the weight vector and a diagonal matrix, respectively, as
\[
w=[w_1,\dots,w_k]^T,\qquad D=\mathrm{diag}(\sqrt{w_1},\dots,\sqrt{w_k}).
\]
The following theorem gives the optimal solution of the weighted least squares problem (2.1).

Theorem 2.1. Let $\bar{x}_w=\sum_i w_ix_i/\sum_i w_i$ be the weighted mean of $x_1,\dots,x_k$, and let $u_1,\dots,u_d$ be the left singular vectors of $(X-\bar{x}_we^T)D$ corresponding to its $d$ largest singular values, where $e$ is the $k$-dimensional column vector of all ones. Then an optimal solution of problem (2.1) is given by
\[
c=\bar{x}_w,\qquad U=[u_1,\dots,u_d],\qquad t_i=U^T(x_i-\bar{x}_w).
\]
Proof. Let $E=(X-(ce^T+UT))D$ be the weighted reconstruction error matrix. Denoting by $v$ the normalized vector of $\sqrt{w}=[\sqrt{w_1},\dots,\sqrt{w_k}]^T$, i.e., $v=\sqrt{w}/\|\sqrt{w}\|_2$, we rewrite the error matrix $E$ as
\[
E = Evv^T + E(I-vv^T).
\]
Because $Dvv^T=we^TD/(e^Tw)$, we have
\[
Evv^T=(X-(ce^T+UT))Dvv^T=\big((\bar{x}_w-c)e^T-UTwe^T/(e^Tw)\big)D
\]
and
\[
E(I-vv^T)=(X-(ce^T+UT))D-Evv^T=\big(X-(\bar{x}_we^T+U\tilde{T})\big)D
\]
with $\tilde{T}=T(I-we^T/(e^Tw))$. Clearly, $\tilde{E}=E(I-vv^T)$ is the reconstruction error matrix corresponding to the feasible solution $(\bar{x}_w,\tilde{T})$ of (2.1). Since $\|\tilde{E}\|_F\le\|E\|_F$ and $\tilde{E}\sqrt{w}=0$, we conclude that an optimal solution of (2.1) must have an error matrix $E$ satisfying $E\sqrt{w}=0$. With this condition, we have
\[
Xw=XD\sqrt{w}=(ce^T+UT)D\sqrt{w}=(ce^T+UT)w.
\]
It follows that
\[
c=\bar{x}_w-U\alpha,\qquad \alpha=Tw/(e^Tw),
\]
and
\[
X-(ce^T+UT)=X-\big(\bar{x}_we^T+U(T-\alpha e^T)\big).
\]
With an abuse of notation, denoting $T-\alpha e^T$ by $T$, the optimization problem (2.1) reduces to
\[
\min_{\{t_i\},\;U^TU=I_d}\;\big\|(X-\bar{x}_we^T)D-UTD\big\|_F.
\]
It is well known that the optimal solution is given by the SVD of the matrix $(X-\bar{x}_we^T)D$,
\[
U=U_d,\qquad TD=U_d^T(X-\bar{x}_we^T)D.
\]
It follows that $T=U_d^T(X-\bar{x}_we^T)$, completing the proof.
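The computation in Theorem 2.1 is straightforward to implement. The following MATLAB sketch is our own illustration (the function name wpca and its interface are not part of the paper); it fits the weighted affine subspace via the SVD of $(X-\bar{x}_we^T)D$.

function [c, U, T] = wpca(X, w, d)
% Weighted PCA fit of a d-dimensional affine subspace, cf. Theorem 2.1.
% X: m-by-k matrix of points, w: k-by-1 nonnegative weights, d: target dimension.
  w  = w(:) / sum(w);                 % normalize the weights
  c  = X * w;                         % weighted mean x_bar_w
  Xc = X - c * ones(1, numel(w));     % X - x_bar_w * e^T
  D  = diag(sqrt(w));                 % D = diag(sqrt(w_1), ..., sqrt(w_k))
  [U, ~, ~] = svd(Xc * D, 'econ');    % left singular vectors of (X - x_bar_w e^T) D
  U  = U(:, 1:d);                     % d dominant left singular vectors
  T  = U' * Xc;                       % local coordinates t_i = U^T (x_i - x_bar_w)
end

The right panel of Figure 1 can be reproduced in this spirit by calling such a routine with the weights produced by the iterative selection procedure of the next section.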
3. Selecting Weights. In this section, we consider how to select the weights to be used in the weighted PCA developed in the previous section. Since the objective of introducing the weights is to reduce the influence of the outliers as much as possible when fitting an affine subspace to the data set, ideally the weights should be chosen such that $w_i$ is small if $x_i$ is considered an outlier. Specifically, we can let the weight $w_i$ be inversely proportional to the distance between $x_i$ and an ideal center $x_*$, where $x_*$ is defined as the mean of the subset of sample points from which the outliers have been removed (recall that we are dealing with a subset of sample points that forms the $k$ nearest neighbors of a given sample point). The ideal center $x_*$ is not known, but it can be approximated using
an iterative algorithm, as we now show. For concreteness, we consider weights based on the isotropic Gaussian density function, defined as
\[
(3.2)\qquad w_i = c_0\exp\big(-\gamma\|x_i-\bar{x}\|_2^2\big),
\]
where $\gamma>0$ is a constant, $\bar{x}$ is an approximation of $x_*$, and $c_0$ is the normalization constant such that $\sum_i w_i=1$.
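As a small illustration (the helper name gauss_weights and its interface are our own, not from the paper), the weights (3.2) for a given center can be computed in MATLAB as:

function w = gauss_weights(X, xbar, gamma)
% Normalized isotropic Gaussian weights, cf. (3.2):
%   w_i = c_0 * exp(-gamma * ||x_i - xbar||_2^2), with sum_i w_i = 1.
  d2 = sum((X - xbar * ones(1, size(X, 2))).^2, 1);   % squared distances to xbar
  w  = exp(-gamma * d2(:));
  w  = w / sum(w);                                    % the normalization factor c_0
end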
Other types of weights discussed in [8] can also be used. Initially, we choose $\bar{x}$ to be the mean of all the $k$ sample points $\{x_i\}$ as an approximation to the ideal center $x_*$. The existence of outliers can render the mean $\bar{x}$ quite far away from $x_*$. We then update $\bar{x}$ by the weighted mean
\[
\bar{x}_w=\sum_i w_ix_i,
\]
using the current set of weights given by (3.2); a new set of weights can then be computed by replacing $\bar{x}$ with $\bar{x}_w$ in (3.2), and the above process can be carried out for several iterations.

We now present an informal analysis as to why the iterative weight selection process can be effective at downweighting the outliers. Consider the simple case where the data set has several outliers $x_1,\dots,x_\ell$ that are far away from the remaining sample points $x_{\ell+1},\dots,x_{\ell+k}$, which are relatively close to each other; we also assume $\ell\ll k$. Denote by $\bar{x}_0$ the mean of all $\ell+k$ sample points and by $\bar{x}$ the mean of the sample points $x_{\ell+1},\dots,x_{\ell+k}$. It is easy to see that
\[
\bar{x}_0=\frac{1}{\ell+k}\Big(\sum_{i=1}^{\ell}x_i+k\bar{x}\Big)=\bar{x}+\delta
\qquad\text{with}\qquad
\delta=\frac{1}{\ell+k}\sum_{i=1}^{\ell}(x_i-\bar{x}).
\]
Because $\|x_i-\bar{x}\|\gg\|x_j-\bar{x}\|\approx 0$ for $i\le\ell$ and $j>\ell$, it follows that $\|\delta\|\gg\|x_j-\bar{x}\|$ for $j=\ell+1,\dots,\ell+k$, and hence
\[
\|x_j-\bar{x}_0\|_2^2=\|x_j-\bar{x}\|_2^2+\|\delta\|_2^2-2\delta^T(x_j-\bar{x})\approx\|\delta\|_2^2.
\]
We therefore have, for $j=\ell+1,\dots,\ell+k$,
\[
w_j=c_0\exp\big(-\gamma\|x_j-\bar{x}_0\|_2^2\big)\approx c_0\exp\big(-\gamma\|\delta\|_2^2\big).
\]
On the other hand, since $\|x_i-\bar{x}\|_2\gg\|\delta\|_2$ for $i\le\ell$, we have
\[
w_i=c_0\exp\big(-\gamma\|x_i-\bar{x}_0\|_2^2\big)\approx 0.
\]
Recalling that the weights are normalized such that $\sum_i w_i=1$, we conclude that
\[
w_i\approx 0,\quad i=1,\dots,\ell,
\qquad\qquad
w_i\approx\frac{1}{k},\quad i=\ell+1,\dots,\ell+k.
\]
Fig. 2. Weight iteration with different γ = 1, 3, 5, 7, 9.
Therefore $\bar{x}_w\approx\bar{x}$, and the updated weights $w_i$ have the effect of downweighting the outliers.

In summary, we define the following iterative weight selection procedure for a given set of sample points $x_1,\dots,x_k$: choose the initial vector $\bar{x}_w^{(0)}$ as the mean of the $k$ vectors $x_1,\dots,x_k$, and iterate until convergence:
1. Compute the current weights,
\[
w_i^{(j)}=\exp\big(-\gamma\|x_i-\bar{x}_w^{(j-1)}\|_2^2\big),
\]
normalized, as in (3.2), so that $\sum_{i=1}^{k}w_i^{(j)}=1$.
2. Compute a new weighted center,
\[
\bar{x}_w^{(j)}=\sum_{i=1}^{k}w_i^{(j)}x_i.
\]
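In MATLAB terms, a minimal sketch of this procedure (reusing the illustrative gauss_weights helper above; the function name and the fixed iteration count are our own choices) is:

function [w, xbar] = select_weights(X, gamma, niter)
% Iterative weight selection for a neighborhood X (m-by-k matrix of points).
  xbar = mean(X, 2);                       % initial center: plain mean of the k points
  for j = 1:niter
    w    = gauss_weights(X, xbar, gamma);  % step 1: current normalized weights
    xbar = X * w;                          % step 2: new weighted center
  end
end

In the experiments reported below, only one or two such iterations are used.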
We have not carried out a formal analysis of the above iterative process. Based on several simulations, the sequence $\{\bar{x}_w^{(j)}\}$ in general seems to converge fairly fast. In Figure 2, we plot the errors $\|w^{(j)}-c\|_2$ for several choices of $\gamma$ for the data set shown in Figure 1, where $c$, with $e^Tc=1$, is the vector with zero components corresponding to the outliers and constant components elsewhere. In the right panel of Figure 1, we also show the computed weighted principal component direction using the above iterative weight selection procedure. The direction of the straight line is recovered very well despite the presence of the outliers.

4. Local smoothing. With the above preparation, we are now ready to present our main local linear smoothing algorithm. We assume as before that a given set of $N$ sample points is generated from
\[
x_i=f(\tau_i)+\epsilon_i,\qquad i=1,\dots,N,
\]
where, owing to the existence of outliers, some of the errors $\epsilon_i$ can be quite large. Local smoothing projects each $x_i$ onto an approximate tangent space at $x_i$ constructed using the weighted PCA. The approach consists of, for each iteration, the following steps.

For each $x_i$, $i=1,\dots,N$:
Step 1. Determine the $k$ nearest neighbors of $x_i$ including $x_i$ itself, say $x_{i_1},\dots,x_{i_k}$.
Step 2. Compute the weight vector $w_i$ using the iterative weight selection procedure discussed in the previous section. The weight vector is normalized such that $\sum_j w_{i,j}=1$.
Step 3. Compute an approximation of the tangent space at $x_i$ by a weighted PCA,
\[
\sum_j w_{i,j}\big\|x_{i_j}-(\bar{x}^w_i+U_i\theta^{(i)}_j)\big\|_2^2
=\min_{c,\,U,\,\theta_j}\;\sum_j w_{i,j}\big\|x_{i_j}-(c+U\theta_j)\big\|_2^2.
\]
Step 4. Project $x_i$ onto the approximate tangent space to obtain
\[
\hat{x}_i=\bar{x}^w_i+U_i\theta_i,\qquad \theta_i=U_i^T(x_i-\bar{x}^w_i).
\]

Fig. 3. Quadratic curve data set: (left) the original data set, (middle) reconstruction with one local smoothing iteration, (right) reconstruction with two iterations of local smoothing.
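A sketch of one full sweep of this procedure in MATLAB is given below; it is our own illustration, combining the hypothetical wpca and select_weights helpers above, with neighbor search done by brute-force distances, and it uses the Jacobi-type updating order discussed next.

function Xnew = smoothing_sweep(X, k, d, gamma, niter)
% One Jacobi-type sweep of weighted local linear smoothing.
% X: m-by-N data matrix, k: neighborhood size, d: manifold dimension,
% gamma, niter: parameters of the iterative weight selection.
  N    = size(X, 2);
  Xnew = X;
  sq   = sum(X.^2, 1);
  D2   = sq' * ones(1, N) + ones(N, 1) * sq - 2 * (X' * X);  % pairwise squared distances
  for i = 1:N
    [~, idx] = sort(D2(:, i));                  % Step 1: k nearest neighbors (incl. x_i)
    Xi = X(:, idx(1:k));
    wi = select_weights(Xi, gamma, niter);      % Step 2: neighborhood weights
    [c, U, ~] = wpca(Xi, wi, d);                % Step 3: weighted PCA tangent space
    Xnew(:, i) = c + U * (U' * (X(:, i) - c));  % Step 4: project x_i
  end
end

Replacing X(:, i) and X(:, idx(1:k)) by the corresponding columns of Xnew inside the loop would give the Gauss-Seidel-type variant.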
There are two strategies for ordering the projection/updating of the sample points in each iteration: one is Jacobi-type and the other Gauss-Seidel-type. In Jacobi-type updating, the update of $\hat{x}_i$ always uses the neighboring $x_i$'s from the previous iteration, even if some of its neighbors have already been updated in the current iteration. In Gauss-Seidel-type updating, the update of $\hat{x}_i$ makes use of the most recently available neighbors. In general, Gauss-Seidel-type updating tends to converge faster, while the Jacobi-type method clearly has better parallelism.

Example 1. As an illustrative example of our local smoothing method, we first sample 100 points on the quadratic curve $f(\tau)=\tau^2$ with uniformly randomly chosen $\tau_i\in(0,1)$. Then we add 20 outliers in the square $[-2,2]\times[-2,2]$; these outliers are chosen to lie away from the curve $f$. See the left panel of Figure 3 for the total of 120 points, with the 20 outliers marked by red circles. We use the Jacobi-type iteration for local linear smoothing with neighborhood size $k=30$ and the Gaussian constant $\gamma=20$ in the weight computation (cf. Equation (3.2)). After only two steps of local smoothing, the noise is reduced and all outliers are moved close to the curve. We plot the local smoothing results for the first two steps in the middle and right panels of Figure 3.
5. Bias reduction. Local smoothing based on local linear fitting does suffer from the well-known "trim the peak and fill the valley" phenomenon in regions of the manifold where the curvature is large [8], as is illustrated in Figure 4.

Fig. 4. Local linear fit of the tangent line with large bias.

When we increase the number of iterations of local linear smoothing, the net effect is that the set of projected sample points tends to shrink, and eventually converges to a single point. One remedy is to use higher-order polynomials instead of linear ones [8, 13], but this can be very expensive for high-dimensional data because we need to approximate quantities such as the Hessian matrix of $f(\tau)$. Clearly, if we know an approximation $\delta_i$ of the difference $f(\tau_i)-\hat{x}_i$ for the updated sample point $\hat{x}_i$, then $\hat{x}_i$ can be improved by adding $\delta_i$ to it, reducing the bias in $\hat{x}_i$:
\[
\hat{x}_i=\bar{x}^w_i+U_iU_i^T(x_i-\bar{x}^w_i)+\delta_i.
\]
Following the approach in [13], we propose a simple method for bias correction. Specifically, we use as an approximation of $f(\tau_i)-\hat{x}_i$ the vector $\delta_i=\bar{x}^w_i-\bar{\bar{x}}^w_i$, where $\bar{\bar{x}}^w_i$ denotes the weighted mean of the $\bar{x}^w_{i_j}$ corresponding to the neighborhood of $x_i$,
\[
\bar{\bar{x}}^w_i=\sum_j w_{i,j}\,\bar{x}^w_{i_j}.
\]
This way, the bias-corrected approximation becomes
\[
(5.3)\qquad \hat{x}_i=2\bar{x}^w_i-\bar{\bar{x}}^w_i+U_i\theta_i,\qquad \theta_i=U_i^T(x_i-\bar{x}^w_i).
\]
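As an illustration of (5.3) in the same MATLAB sketch style as before (the helper name and the convention of passing the neighbors' precomputed weighted means are our own assumptions), Step 4 would be modified as follows; the refined correction introduced next follows the same pattern with a second level of local fits.

function xi_new = bias_corrected_update(xi, Xi, wi, Xbars, d)
% Bias-corrected projection (5.3) of a sample point xi.
%   Xi    : m-by-k matrix of the k nearest neighbors of xi,
%   wi    : k-by-1 normalized weights for this neighborhood,
%   Xbars : m-by-k matrix whose j-th column is the weighted mean already
%           computed for the j-th neighbor (an illustrative convention).
  [c, U, ~] = wpca(Xi, wi, d);     % weighted mean x_bar^w_i and tangent basis U_i
  cc     = Xbars * wi(:);          % "double" mean: sum_j w_{i,j} x_bar^w_{i_j}
  theta  = U' * (xi - c);          % theta_i = U_i^T (xi - x_bar^w_i)
  xi_new = 2*c - cc + U * theta;   % update (5.3)
end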
For some applications, this approach works well. However, it is also possible to overshoot the correction. To overcome this shortcoming, one may introduce a scaling constant $\eta_i$ and take $\delta_i=\eta_i(\bar{x}^w_i-\bar{\bar{x}}^w_i)$. Unfortunately, it is not easy to choose a suitable scaling constant $\eta_i$. We suggest instead the following bias-correcting estimate,
\[
\delta_i=\big(\bar{x}^w_i+U_iU_i^T(x_i-\bar{x}^w_i)\big)-\big(\hat{\bar{x}}^w_i+\hat{U}_i\hat{U}_i^T(\bar{x}^w_i-\hat{\bar{x}}^w_i)\big),
\]
where $\hat{\bar{x}}^w_i$ and $\hat{U}_i$ are the weighted mean and the orthonormal basis matrix of the approximate tangent space at $\bar{x}^w_i$ computed using the data set $\{\bar{x}^w_i\}$. This gives the following updating formula,
\[
(5.4)\qquad \hat{x}_i=2\big(\bar{x}^w_i+U_iU_i^T(x_i-\bar{x}^w_i)\big)-\big(\hat{\bar{x}}^w_i+\hat{U}_i\hat{U}_i^T(\bar{x}^w_i-\hat{\bar{x}}^w_i)\big).
\]
To show the effectiveness of the above updating formula, we consider two data sets sampled with noise from a unit circle and a parabola, respectively:

t = rand(1,n)*(2*pi);
X = [cos(t); sin(t)]+0.1*randn(2,n);
Fig. 5. Shifting inward (left) and compensations by (5.3) (middle) and (5.4) (right) for the circle data (top) and the parabola data (bottom).
and

t = linspace(-1,1,n);
X = [t; t.^2]+0.1*randn(2,n);

For both data sets, we set $n=200$, the neighborhood size $k=20$, and the Gaussian constant $\gamma=20$. The weights are simply computed by one iteration of the iterative weight selection method. In Figure 5, we plot the numerical results obtained by 20 iterative smoothing steps. The left column shows the shrinking phenomenon of the local linear smoothing approach without compensation techniques. The simple updating formula defined by (5.3) works well for the circle data but results in overshooting for the parabola data, as illustrated in the middle column of Figure 5. However, for both data sets, the updating formula (5.4) is very effective at bias reduction without much shrinking or overshooting; we plot the corresponding smoothing results in the right column of Figure 5.

6. Application to nonlinear manifold learning. In this section we present two examples to show that nonlinear manifold learning methods combined with weighted local linear smoothing give a more accurate reconstruction of the underlying nonlinear manifolds. As an illustration we use the Local Tangent Space Alignment (LTSA) method discussed in detail in [15].

Example 1. The test data points, representing noisy samples from a spiral curve, are constructed as

t = linspace(0,4*pi,m);
X = [t.*cos(t); t.*sin(t)].*(1 + eta*randn(2,m));

with $m=500$ and $\eta=0.2$. We first apply the local linear smoothing method with bias reduction (5.3) to obtain a smoothed data set $\hat{X}$. The weights are computed using two iterations of the iterative weight selection process in section 3 with $\gamma=0.2$ and neighborhood size $k=30$. The smoothed data are plotted in the top row of Figure 6. Then we apply LTSA with neighborhood size $k_{\mathrm{LTSA}}=20$ to recover the unknown lower-dimensional coordinates $\{t_i\}$. In the bottom row of Figure 6, we plot the
Fig. 6. Spiral data.
Fig. 7. Reconstructed spiral curves.
computed one-dimensional coordinates $\{\bar{t}_i\}$ against the true $\{t_i\}$ for each smoothing step. Ideally, the pairs $(\bar{t}_i,t_i)$ should form a straight line with either a $\pi/4$ or $-\pi/4$ slope. In Figure 7, we plot the corresponding reconstructed spiral curves obtained using the lowess nonparametric smoothing method. The improvements in the computed coordinates $\{\bar{t}_i\}$ and in the reconstructed curves are quite significant as more local linear smoothing iterations are carried out.

Example 2. We use the S-curve data set with 2000 points [11], which is generated as follows.

m = 2000;
t = (1.5*pi)*rand(2,m/2);
s = rand(1,m);
X = [cos(t(1,:)) -cos(t(2,:)); s; 1+sin(t(1,:)) ...
    -1-sin(t(2,:))].*(1+0.075*rand(3,m));

For weight selection, we choose $\gamma=0.2$, and two iterations of the iterative weight selection method are used with neighborhood size $k=20$. For LTSA, we also use neighborhood size $k_{\mathrm{LTSA}}=20$. In Figure 8, we plot the original 3D data points and three sets of recovered 2D coordinates: the second panel uses the original data points, and the results for the smoothed data points after one and two local linear smoothing iterations are plotted in the
Fig. 8. S-curve data.
third and fourth panels, respectively. The improvement is quite dramatic.

7. Concluding remarks. The focus of this paper is outlier removal and noise reduction for a set of sample points using local linear smoothing. The methods we developed can be used by manifold learning methods such as Isomap, LLE and LTSA as a preprocessing procedure to obtain a more accurate reconstruction of the underlying nonlinear manifolds. Several issues deserve further investigation: 1) a more formal analysis of the convergence behavior of the iterative weight selection method; 2) an analysis of the bias-variance properties of the proposed local linear smoothing method, which can potentially lead to a method for selecting $k$, the number of nearest neighbors, which can be considered as a smoothing parameter of the local linear smoothing method.

REFERENCES

[1] C. Atkeson, A. Moore and S. Schaal. Locally weighted learning. AI Review, 11:11–73, 1997.
[2] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: the generative topographic mapping. Neural Computation, 10:215–234, 1998.
[3] D. Donoho and C. Grimes. When does ISOMAP recover the natural parametrization of families of articulated images? Technical Report 2002-27, Department of Statistics, Stanford University, 2002.
[4] T. Hastie and W. Stuetzle. Principal curves. J. Am. Statistical Assoc., 84:502–516, 1988.
[5] T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning. Springer, New York, 2001.
[6] G. Hinton and S. Roweis. Stochastic neighbor embedding. To appear in Advances in Neural Information Processing Systems, 15, MIT Press, 2003.
[7] T. Kohonen. Self-organizing Maps. Springer-Verlag, 3rd edition, 2000.
[8] C. Loader. Local Regression and Likelihood. Springer, New York, 1999.
[9] T. Martinetz and K. Schulten. Topology representing networks. Neural Networks, 7:507–523, 1994.
[10] J. O. Ramsay and B. W. Silverman. Applied Functional Data Analysis. Springer, 2002.
[11] S. Roweis and L. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290:2323–2326, 2000.
[12] L. Saul and S. Roweis. Think globally, fit locally: unsupervised learning of nonlinear manifolds. Technical Report MS CIS-02-18, Univ. Pennsylvania, 2002.
[13] T. Schreiber and M. Richter. Nonlinear projective filtering in a data stream. Int. J. Bifurcat. Chaos, 9:2039, 1999.
[14] J. Tenenbaum, V. de Silva and J. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.
[15] Z. Zhang and H. Zha. Principal manifolds and nonlinear dimension reduction via local tangent space alignment. Technical Report CSE-02-019, Department of Computer Science & Engineering, Pennsylvania State University, 2002.