IET Signal Processing Research Article
Incremental algorithm for finding principal curves
ISSN 1751-9675 Received on 5th September 2014 Revised on 18th April 2015 Accepted on 11th May 2015 doi: 10.1049/iet-spr.2014.0347 www.ietdl.org
Youness Aliyari Ghassabeh 1 ✉, Frank Rudzicz 1,2
1 Toronto Rehabilitation Institute – UHN, 550 University Avenue, Toronto, Ontario, Canada
2 Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
✉ E-mail: [email protected]
Abstract: Principal curves are a non-linear generalisation of principal components. They are smooth curves that pass through the middle of a data set to provide a new representation of those data that makes tasks such as visualisation and dimensionality reduction easier and more accurate. The subspace constrained mean shift (SCMS) algorithm is a recently proposed technique to find principal curves. The algorithm assumes that the complete data set is available in advance and that new data points cannot be added to the data set during the process; it finds the points on the principal curve using the complete data set. In this paper, the authors investigate the situation where the entire data set is not available in advance and the data are instead sampled sequentially. They propose an incremental version of the SCMS algorithm that trains using a sequence of observations. Simulation results show the effectiveness of the proposed algorithm in finding a principal curve from a stream of observations.
1 Introduction
Principal curves can be interpreted as a non-linear generalisation of principal component analysis [1]. They are informally defined as curves that pass through the middle of a data set to provide a non-linear summary of those data [2]. By mapping high-dimensional observations onto a low-dimensional manifold embedded in the high-dimensional space, they provide a new representation of the input data that makes tasks such as visualisation and dimensionality reduction easier and more accurate. The first formal definition of principal curves was provided by Hastie and Stuetzle [3] (hereafter HSPC), who define a principal curve as a smooth, self-consistent, and parameterised curve that does not intersect itself and has finite length inside any open ball. The self-consistency property means that for every point selected on the principal curve, the average of the collection of data points that project onto that point coincides with the selected point on the curve [3]. Since Hastie and Stuetzle's groundbreaking work, many different definitions and algorithms have been proposed to estimate principal curves (e.g. [4–8] among others). In fact, one difficulty with principal curves is that there are several different notions of them in the literature. The existence of the HSPC for special cases, such as the uniform distribution on rectangles and annuli, was investigated in [2]. The authors in [9] extended the algorithm proposed by Hastie and Stuetzle to closed curves. Tibshirani proposed a semi-parametric model that estimates a principal curve via the expectation maximisation algorithm [4]. Unfortunately, it is not known whether the HSPC exists for a large class of distribution functions in general. To address this problem, the authors in [7] defined principal curves as minimisers of the expected squared distance over a class of curves. The authors in [10] proposed a probabilistic approach using a cubic spline over a mixture of Gaussians to find a principal curve. Recently, an interesting new definition of principal curves has been proposed by Ozertem and Erdogmus [11]. According to this definition, given a smooth (at least twice continuously differentiable) probability density function (pdf) $f$ on $\mathbb{R}^D$, a principal curve is the collection of all points where the gradient of $f$ is an eigenvector of the Hessian of $f$ and the eigenvalues corresponding to the remaining eigenvectors are negative. To estimate
principal curves based on the new definition, the authors in [11] proposed the so-called subspace constrained mean shift (SCMS) algorithm. It is a generalisation of the well-known mean shift (MS) algorithm [12, 13], which iteratively tries to find modes of an estimated pdf in a local subspace. The SCMS algorithm has been successfully used for applications, such as non-linear independent component analysis [11], time-varying multiple-input–multiple-output channel equalisation [11], quantisation of noisy sources [14], and non-linear dimensionality reduction in the presence of noise [15]. For the SCMS algorithm, it is assumed that the entire data set is given in advance and that the algorithm runs on the whole data set. However, this is not a realistic assumption – in many real applications, the entire data set is not available beforehand [16, 17]. Thus, it is desirable to train the algorithm incrementally and to investigate the effect of new observations on the outputs. In this paper, we try to find a relation between the output of the SCMS algorithm and the new observation such that we can update the output points by simply considering the effect of the new observations. In other words, we are trying to develop an incremental version of the SCMS algorithm that can start from a small number of samples and train itself gradually by observing the new data. In the following section, we briefly review the SCMS algorithm. The proposed incremental version of the SCMS algorithm is given in Section 3. Section 4 is devoted to the simulation results, where we test the performance of the proposed algorithm to find a principal curve incrementally. Discussion and concluding remarks are given in Sections 5 and 6, respectively.
2 SCMS algorithm
The SCMS algorithm was proposed to find points that satisfy the definition of principal curves given by Ozertem and Erdogmus [11]. The SCMS algorithm starts from one of the data points and iteratively moves it to a point on the principal curve. The SCMS algorithm continues updating the output point until the norm of the difference between two consecutive outputs becomes less than some predefined threshold $\epsilon$. Let $x_i \in \mathbb{R}^D$, $i = 1, \ldots, n$, be a sequence of $n$ independent and identically distributed (iid) random variables. The kernel density
estimate $\hat{f}$ at an arbitrary point $x$ using a kernel $K(x)$ and bandwidth $h$ is given by [18]

$$\hat{f}_{h,K}(x) = \frac{1}{nh^{D}} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right). \qquad (1)$$
A special class of kernels, called radially symmetric kernels, has been widely used for pdf estimation. Radially symmetric kernels are defined by $K(x) = c_{k,D}\, k\big(\|x\|^2\big)$, where $c_{k,D}$ is a normalisation factor that causes $K(x)$ to integrate to one and $k : [0, \infty) \to [0, \infty)$ is called the profile of the kernel. The profile of a kernel is assumed to be a non-negative, convex, non-increasing, and piecewise continuous function that satisfies $\int_0^{\infty} k(x)\,\mathrm{d}x < \infty$. For example, for the widely used Gaussian kernel the profile function is given by $k_N(x) = \exp(-x/2)$, $x \geq 0$. [Based on this profile function, the Gaussian kernel is defined by $K_N(x) = (2\pi)^{-D/2} \exp\big(-\|x\|^2/2\big)$, $x \in \mathbb{R}^D$.] By using the profile $k$ instead of the kernel $K$, the estimated pdf takes the following well-known form [18]

$$\hat{f}_{h,k}(x) = \frac{c_{k,D}}{nh^{D}} \sum_{i=1}^{n} k\left(\left\|\frac{x - x_i}{h}\right\|^2\right). \qquad (2)$$
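As an illustration only, the estimate in (2) with the Gaussian profile can be evaluated in a few lines of code; the function and variable names below are our own and are not part of the original formulation.

```python
import numpy as np

def gaussian_profile(t):
    # Profile of the Gaussian kernel: k_N(t) = exp(-t/2), t >= 0.
    return np.exp(-0.5 * t)

def kde(x, X, h):
    """Kernel density estimate (2) at the point x, given samples X (n x D)
    and bandwidth h, using the Gaussian profile."""
    n, D = X.shape
    c = (2.0 * np.pi) ** (-D / 2.0)            # normalisation factor c_{k,D} of the Gaussian kernel
    t = np.sum(((x - X) / h) ** 2, axis=1)     # ||(x - x_i)/h||^2 for every sample
    return c / (n * h ** D) * np.sum(gaussian_profile(t))
```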
Assuming that $k$ is differentiable with derivative $k'$, taking the gradient of (2) yields

$$\nabla \hat{f}_{h,k}(x) = \frac{2c_{k,D}}{nh^{D+2}} \left[ \sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right) \right] \left[ \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x - x_i}{h}\right\|^2\right)} - x \right], \qquad (3)$$

where $g(x) = -k'(x)$. The first term in the above equation is proportional to the density estimate at $x$ using the kernel $G(x) = c_{g,D}\, g\big(\|x\|^2\big)$, where $c_{g,D}$ is a normalisation factor that causes $G(x)$ to integrate to one. The second term in (3) is called the mean shift vector $m_{h,g}(x)$ [19]. The modes of the estimated density function are located at the zeroes of the gradient function, i.e. $\nabla \hat{f}(x^*) = 0$. Equating (3) to zero reveals that $m_{h,g}(x^*) = 0$. Therefore, the modes of the estimated pdf satisfy

$$x^* = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{x^* - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{x^* - x_i}{h}\right\|^2\right)}. \qquad (4)$$

The MS algorithm is a non-parametric iterative technique that estimates the modes of the estimated pdf [19]. To achieve this, the MS algorithm initialises the mode estimate sequence to be a datum of the observed data. The mode estimate $y_j$ in the $j$th iteration is updated as [For simplicity, we drop the subscripts $h$, $g$ and write $m(y)$ instead of $m_{h,g}(y)$.]

$$y_{j+1} = y_j + m(y_j) = \frac{\sum_{i=1}^{n} x_i\, g\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)}{\sum_{i=1}^{n} g\left(\left\|\frac{y_j - x_i}{h}\right\|^2\right)}. \qquad (5)$$

The MS algorithm iterates this step until the norm of the difference between two consecutive mode estimates becomes less than some predefined threshold. Although the MS algorithm has been used in different applications, a rigorous proof of the convergence of the algorithm in high-dimensional spaces with widely used kernels (e.g. the Gaussian kernel) has not been given [20–22]. The SCMS algorithm generalises the idea behind the MS algorithm by finding the modes of an estimated pdf in a constrained subspace. The constrained subspace is spanned by the $D - 1$ eigenvectors corresponding to the $D - 1$ smallest eigenvalues of the Hessian matrix $H$ [23]. [The authors in [11] used the eigenvectors of the so-called local inverse covariance matrix to construct the constrained subspace, but the authors in [23] showed that the eigenvectors of the Hessian matrix can also be used to construct the constrained subspace.] Similar to the MS algorithm, the SCMS algorithm starts from one of the data points and, in each iteration, finds the projection of the MS vector onto the constrained subspace. In effect, the algorithm seeks the modes of the estimated pdf within the constrained subspace spanned by $D - 1$ eigenvectors of $H$. The algorithm stops when the norm of the difference between two consecutive outputs becomes negligible or when the projection of the gradient estimate onto the constrained subspace becomes less than some predefined threshold. The outline of the SCMS algorithm is given in Algorithm 1 (see Fig. 1) [23].
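For concreteness, the following is a minimal sketch of one SCMS update and of the resulting iteration, written with the Gaussian profile introduced above. The function names and array conventions are ours, constant positive factors that do not affect the eigenvectors of the Hessian are dropped, and only the first of the two stopping criteria above is implemented.

```python
import numpy as np

def scms_step(y, X, h):
    """One subspace constrained mean shift update of the point y,
    given data X (n x D) and bandwidth h (sketch with the Gaussian kernel)."""
    n, D = X.shape
    diff = X - y                                   # x_i - y, shape (n, D)
    t = np.sum((diff / h) ** 2, axis=1)            # ||(x_i - y)/h||^2
    w = np.exp(-0.5 * t)                           # proportional to g_N(t)
    w_sum = np.sum(w)

    # Mean shift vector m(y) of (5): weighted mean of the samples minus y.
    m = (w @ X) / w_sum - y

    # Hessian estimate of the pdf at y, up to a positive constant factor
    # that does not change its eigenvectors (cf. (8) in Section 3).
    outer = (diff[:, :, None] * diff[:, None, :]) / h ** 2   # (x_i - y)(x_i - y)^T / h^2
    H = np.einsum('i,ijk->jk', w, outer) - w_sum * np.eye(D)

    # Constrained subspace: eigenvectors with the D - 1 smallest eigenvalues.
    eigvals, eigvecs = np.linalg.eigh(H)           # eigenvalues in ascending order
    V = eigvecs[:, :D - 1]

    # Project the mean shift step onto the constrained subspace.
    return y + V @ (V.T @ m)

def scms(y0, X, h, eps=0.01, max_iter=500):
    """Iterate scms_step from y0 until the update is smaller than eps."""
    y = y0.copy()
    for _ in range(max_iter):
        y_new = scms_step(y, X, h)
        if np.linalg.norm(y_new - y) < eps:
            return y_new
        y = y_new
    return y
```

In line with the description above, running scms from every data point in turn yields the set of output points on the principal curve.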
3 Incremental SCMS algorithm
In the standard SCMS algorithm, it is assumed that the entire data set is given in advance and that new observations cannot be added to the data set during the process. However, when the SCMS algorithm is used over data sets in real-world applications, we may confront situations where a complete set of observations is not available in advance. For example, in applications such as mobile robotics, data are presented as a stream and the algorithm is required to update its outputs as soon as a new observation is available [24]. Consider the situation where the SCMS algorithm has converged, and then a new observation becomes available. Running the algorithm on the augmented data set will change the location of the output points. Thus, the effect of the new observations on the output needs to be studied. In other words, we first need to find the output corresponding to the newly observed datum and then we need to update the locations of the previous outputs. Running the SCMS algorithm on the entire data set is time consuming and increases the complexity, which prevents the algorithm from quickly responding to new incoming data. To take into account the effect of the new sample, we propose to run the SCMS algorithm on part of the data set. Specifically, we propose to apply the SCMS algorithm to the input points that are associated with the output points close to the new incoming sample and to leave the rest of the outputs unchanged. Two points are considered close to each other if they are neighbours, i.e. if they are within some fixed radius $r$ of each other, or if each is one of the other's $\kappa$ nearest neighbours. In other words, when a new observation is available, it has an insignificant effect on the output points that are far from it and plays an important role in updating the output points that are close to it. To justify this, we consider the SCMS algorithm with the popular Gaussian kernel $K_N(x) = (2\pi)^{-D/2} \exp\big(-\|x\|^2/2\big)$. The profile function $k_N$ for the Gaussian kernel is given by $k_N(x) = \exp(-x/2)$, $x \geq 0$. Let $x_i \in \mathbb{R}^D$, $i = 1, \ldots, n$, denote the iid input samples. The pdf estimate, the gradient vector, and the Hessian matrix using the Gaussian kernel $K_N$ and the bandwidth $h$ at an arbitrary point $x$ are given by
$$\hat{f}(x) = \frac{1}{n(2\pi)^{D/2} h^{D}} \sum_{i=1}^{n} k_N\left(\left\|\frac{x - x_i}{h}\right\|^2\right), \qquad (6)$$

$$\nabla \hat{f}(x) = \frac{1}{n(2\pi)^{D/2} h^{D+2}} \sum_{i=1}^{n} (x_i - x)\, k_N\left(\left\|\frac{x - x_i}{h}\right\|^2\right), \qquad (7)$$

$$H(x) = \frac{1}{n(2\pi)^{D/2} h^{D+2}} \sum_{i=1}^{n} \left( -I + \frac{1}{h^2} (x_i - x)(x_i - x)^{\mathrm{T}} \right) k_N\left(\left\|\frac{x - x_i}{h}\right\|^2\right). \qquad (8)$$
Fig. 1 Algorithm 1: SCMS algorithm

When a new sample $x_{n+1}$ is observed, it follows from the definition of the MS vector and Algorithm 1 that the mode estimate $y_{j+1}$ in the constrained subspace will have the following form
$$\begin{aligned} y_{j+1} &= V_j V_j^{\mathrm{T}} m(y_j) + y_j \\ &= V_j V_j^{\mathrm{T}} \left( \sum_{i=1}^{n+1} x_i\, w_i(y_j) - y_j \right) + y_j \\ &= \left( I - V_j V_j^{\mathrm{T}} \right) y_j + V_j V_j^{\mathrm{T}} \sum_{i=1}^{n+1} x_i\, w_i(y_j) \\ &= \left( I - V_j V_j^{\mathrm{T}} \right) y_j + V_j V_j^{\mathrm{T}} \left( \sum_{i=1}^{n} x_i\, w_i(y_j) + x_{n+1}\, w_{n+1}(y_j) \right), \end{aligned} \qquad (9)$$

where $I$ is the identity matrix, $w_i(y_j) = g_N\big(\|(x_i - y_j)/h\|^2\big) \big/ \sum_{i=1}^{n+1} g_N\big(\|(x_i - y_j)/h\|^2\big)$ for $i = 1, \ldots, n+1$, and $g_N(x) = \tfrac{1}{2}\exp(-x/2) = \tfrac{1}{2} k_N(x)$. Since the function $g_N(x)$ is decreasing, it can be observed from (9) that the weight $w_{n+1}(y_j)$ will be negligible
if $\|x_{n+1} - y_j\|$ is large, i.e. if the new sample $x_{n+1}$ is far from the output point $y_j$. From (8), and since $k_N$ is a decreasing function, the same argument holds for the Hessian matrix $H(y_j)$: if the norm of the difference between the new incoming sample $x_{n+1}$ and the output point $y_j$ is large, then the new point minimally affects the computation of the Hessian matrix. As a result, the eigenvectors of the Hessian matrix that are used to construct the projection matrix $V_j$ do not change significantly owing to the new sample $x_{n+1}$. Therefore, we only need to study the effect of the new sample $x_{n+1}$ on the output points close to it and leave the rest of the outputs unchanged. For each new observation, we find the nearest neighbours (either the points within some fixed radius $r$, or the $\kappa$ nearest neighbours) among the output points and then run the SCMS algorithm on the input points associated with these outputs. The proposed incremental SCMS algorithm is summarised in Algorithm 2 (see Fig. 2).
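Since the listing of Fig. 2 is not reproduced here, the following sketch illustrates the intended update step under our own naming. It reuses the scms routine sketched in Section 2 and assumes that Y[i] stores the SCMS output obtained from the input X[i].

```python
import numpy as np

def incremental_update(x_new, X, Y, h, k=25, eps=0.01):
    """Sketch of the incremental update: X holds the inputs observed so far,
    Y[i] is the SCMS output associated with X[i], and x_new is the new sample.
    Only the outputs closest to x_new are recomputed; the rest stay unchanged."""
    # k nearest neighbours of the new sample among the existing output points.
    dists = np.linalg.norm(Y - x_new, axis=1)
    neighbours = np.argsort(dists)[:k]

    # Augment the data set with the new observation.
    X = np.vstack([X, x_new])
    Y = np.vstack([Y, x_new])            # placeholder output for the new sample, refined below

    # Re-run SCMS only for the new sample and for the inputs associated with
    # the neighbouring outputs, using the augmented data set.
    to_update = np.append(neighbours, len(X) - 1)
    for idx in to_update:
        Y[idx] = scms(X[idx], X, h, eps)  # scms as sketched in Section 2
    return X, Y
```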
4 Simulation results
We test the effectiveness of the proposed incremental SCMS algorithm for estimating a principal curve on a noisy circle and a
two-dimensional (2D) noisy spiral. The observations have an additive form $x + \epsilon$, where $x$ is a point selected uniformly on the underlying circle or spiral and $\epsilon$ is additive Gaussian noise with independent components having zero mean.
Fig. 2 Algorithm 2: Incremental SCMS algorithm
The stopping threshold for all simulations is empirically set to 0.01. In all experiments, the number of nearest neighbours $\kappa$ for incoming data is empirically set to 25. In other words, when an observation arrives, the SCMS algorithm is run for the new sample and the 25 input samples corresponding to the 25 nearest neighbours in the output set. For the first simulation, 500 samples are uniformly selected from a circle of radius 5. The input samples are corrupted by adding Gaussian noise with independent components having zero mean and variance 0.7. The initial sample size is 30 and, as soon as a new observation is available, the SCMS algorithm is run only on the 25 input samples corresponding to the 25 closest outputs. The observations are made sequentially and each time a new datum is available, the closest outputs are updated and the rest of the outputs remain unchanged. The outputs of the proposed incremental algorithm at certain times are shown in Fig. 3.
The black stars are the current input data, the grey circles represent the outputs of the algorithm, and the white square is the newly observed datum. [We assume that at each iteration just one new observation is available; therefore, we only have one white point in each figure. In other words, the white point in each figure is the sample most recently added to the data set.] It can be observed that as the number of observed data increases, the output points move onto or near the generative circle. In the second experiment, 600 points are uniformly selected from a 2D spiral, and zero-mean Gaussian noise with variance 0.9 is added to each sample. The size of the initial data set is 30 and the new samples are given to the proposed incremental SCMS algorithm sequentially. Fig. 4 shows the performance of the proposed incremental SCMS algorithm in finding the underlying principal curve (a 2D spiral in this case) at certain times.
Fig. 3 Performance of the proposed incremental SCMS algorithm for finding a circle. The new incoming datum in each iteration is shown with the white square. The previous input data are shown with black stars and the outputs of the algorithm are shown with grey circles
Fig. 4 Performance of the proposed incremental SCMS algorithm for finding a 2D spiral. The new incoming datum in each iteration is shown with the white square. The previous input data are shown with black stars and the outputs of the algorithm are shown with grey circles
As before, the original noisy data set is shown by black stars, the grey circles are the outputs of the proposed algorithm, and the white square represents the new incoming sample. It is clear from Fig. 4 that the output points gradually trace out the underlying 2D spiral, although the proposed algorithm only updates 25 samples [the input samples associated with the 25 nearest neighbours of the incoming sample in the output set] from the input set in each step. Figs. 5 and 6 compare the performance of the proposed incremental algorithm and the SCMS algorithm in estimating a principal curve. In both figures, the observed noisy data are shown by black stars and the grey circles represent the outputs of the two algorithms. It can be observed from Figs. 5 and 6 that the outputs of the proposed incremental algorithm are very close to those of the original SCMS algorithm. Note that the SCMS
algorithm has access to the whole data set in advance, and the output points are computed at once, but the proposed algorithm observes data points sequentially.
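For reference, the first experiment can be reproduced roughly along the following lines. This is our own sketch (the paper provides no code), reusing the scms and incremental_update routines sketched earlier; the radius 5, noise variance 0.7, bandwidth 1.6, $\kappa = 25$, and stopping threshold 0.01 are taken from the text and from the caption of Fig. 5.

```python
import numpy as np

rng = np.random.default_rng(0)

# 500 noisy samples from a circle of radius 5 (noise variance 0.7 per component).
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
data = 5.0 * np.column_stack([np.cos(theta), np.sin(theta)])
data += rng.normal(0.0, np.sqrt(0.7), size=data.shape)

h, k, eps = 1.6, 25, 0.01

# Initial batch of 30 samples: run the standard SCMS from every point.
X = data[:30]
Y = np.array([scms(x, X, h, eps) for x in X])

# Remaining samples arrive one at a time; only nearby outputs are updated.
for x_new in data[30:]:
    X, Y = incremental_update(x_new, X, Y, h, k, eps)
```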
5 Discussion
Running the SCMS algorithm on part of the input samples does not change the asymptotic complexity, which is still $O(n^3)$ (owing to the eigendecomposition step). However, in real-world applications where we have a finite number of data samples, running the SCMS algorithm on a fraction of the whole set can improve the efficiency of the updating process.
Fig. 5 SCMS algorithm and the proposed incremental version for finding a circle. The left figure shows the output of the SCMS algorithm and the right figure shows the output of the proposed algorithm after 498 iterations. The sample size is 500, the bandwidth h is set to 1.6 and the stopping criterion is 0.01. The grey points in both figures represent the outputs, and the black stars are the observed data
Fig. 6 SCMS algorithm and the proposed incremental version for finding a 2D spiral. The left figure shows the output of the SCMS algorithm and the right figure shows the output of the proposed algorithm after 599 iterations. The sample size is 600, the bandwidth h is set to 1.7 and the stopping criterion is 0.01. The grey points in both figures represent the outputs, and the black stars are the observed data
For example, if for each new incoming sample we simply consider the $n/10$ nearest neighbours, then the computational cost can be decreased by approximately 90%. We used the k-nearest-neighbour (KNN) technique to find the $\kappa$ nearest neighbours in Fig. 2. Alternatively, we can define a sphere centred at the new sample $x_{n+i}$ with a fixed radius $r$ and consider all output points inside that sphere as the neighbours of the new sample. Setting $\kappa$ for KNN or finding the appropriate radius $r$ is critical. Assigning a large value to $\kappa$ or $r$ increases the computational complexity, whereas small values may ignore output points that should be updated. Finding the optimal value of $\kappa$ or $r$ depends on the specific data and the distribution of classes, if applicable. In this paper, we empirically set the value of $\kappa$ to 25 (approximately the square root of the data size). Note that, if there are $p > 1$ observations available at once, the proposed incremental SCMS algorithm can be run in parallel on the input data corresponding to the closest output points of the $p$ new samples.
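If the fixed-radius variant is preferred, the k-nearest-neighbour query in the earlier sketch can be swapped for a function of the following form; the radius r is left to the user and is our own illustrative parameter.

```python
import numpy as np

def radius_neighbours(x_new, Y, r):
    # Indices of output points within distance r of the new sample; a drop-in
    # replacement for the k-nearest-neighbour query in incremental_update.
    return np.where(np.linalg.norm(Y - x_new, axis=1) <= r)[0]
```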
6 Conclusion
The SCMS algorithm is a recently proposed technique to find principal curves or surfaces. The algorithm uses the complete data
set to estimate the points on a principal curve/surface, and new data points cannot be added to the data set during the process. The availability of the complete data set is not a realistic assumption in many real-world applications, and in many situations (e.g. mobile robotics) the data points are observed sequentially. In this paper, we proposed an incremental version of the SCMS algorithm that partially updates the output points in response to each new observation. The proposed incremental SCMS algorithm starts from a small data set and gradually updates the output points on the principal curve as new observations become available. We first theoretically justified the proposed algorithm and then showed through simulations its effectiveness in estimating the underlying principal curve incrementally.
7 References
1 Jolliffe, I.T.: 'Principal component analysis' (Springer-Verlag, 2002)
2 Duchamp, T., Stuetzle, W.: 'The geometry of principal curves in the plane' (Department of Statistics, University of Washington, No. 250, 1993)
3 Hastie, T., Stuetzle, W.: 'Principal curves', J. Am. Stat. Assoc., 1989, 84, pp. 502–516
4 Tibshirani, R.: 'Principal curves revisited', Stat. Comput., 1992, 2, (4), pp. 183–190
5 Delicado, P.: 'Another look at principal curves and surfaces', J. Multivariate Anal., 2001, 77, (1), pp. 84–116
6 Sandilya, S., Kulkarni, S.R.: 'Principal curves with bounded turn', IEEE Trans. Inf. Theory, 2000, 48, (10), pp. 2789–2793
7 Kégl, B., Krzyzak, A., Linder, T., Zeger, K.: 'Learning and design of principal curves', IEEE Trans. Pattern Anal. Mach. Intell., 2000, 22, (3), pp. 281–297
8 Biau, G., Fischer, A.: 'Parameter selection for principal curves', IEEE Trans. Inf. Theory, 2012, 58, (3), pp. 1924–1939
9 Banfield, J.D., Raftery, A.E.: 'Ice floe identification in satellite images using mathematical morphology and clustering about principal curves', J. Am. Stat. Assoc., 1992, 87, (417), pp. 7–16
10 Chang, K., Grosh, J.: 'A unified model for probabilistic principal surfaces', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (1), pp. 59–74
11 Ozertem, U., Erdogmus, D.: 'Locally defined principal curves and surfaces', J. Mach. Learn. Res., 2011, 12, (4), pp. 1249–1286
12 Fukunaga, K., Hostetler, L.D.: 'Estimation of the gradient of a density function, with applications in pattern recognition', IEEE Trans. Inf. Theory, 1975, 21, (1), pp. 32–40
13 Cheng, Y.: 'Mean shift, mode seeking and clustering', IEEE Trans. Pattern Anal. Mach. Intell., 1995, 17, (8), pp. 790–799
14 Aliyari Ghassabeh, Y., Linder, T., Takahara, G.: 'On noisy source vector quantization via a subspace constrained mean shift algorithm'. Proc. 26th Biennial Symp. on Communications, Kingston, Canada, May 2012, pp. 107–110
15 Aliyari Ghassabeh, Y., Linder, T., Takahara, G.: 'On the convergence and applications of mean shift type algorithms'. Proc. 25th IEEE Canadian Conf. on Electrical & Computer Engineering (CCECE), Montreal, Canada, May 2012, pp. 1–5
16 Aliyari Ghassabeh, Y., Abrishami Moghaddam, H.: 'Adaptive linear discriminant analysis for online feature extraction', Mach. Vis. Appl., 2013, 24, (4), pp. 777–794
17 Aliyari Ghassabeh, Y., Rudzicz, F., Abrishami Moghaddam, H.: 'Fast incremental LDA feature extraction', Pattern Recognit., 2015, 48, (6), pp. 1999–2012
18 Silverman, B.W.: 'Density estimation for statistics and data analysis' (Chapman and Hall, 1986)
19 Comaniciu, D., Meer, P.: 'Mean shift: a robust approach toward feature space analysis', IEEE Trans. Pattern Anal. Mach. Intell., 2002, 24, (5), pp. 603–619
20 Aliyari Ghassabeh, Y.: 'On the convergence of the mean shift algorithm in the one-dimensional space', Pattern Recognit. Lett., 2013, 34, (12), pp. 1423–1427
21 Aliyari Ghassabeh, Y.: 'A sufficient condition for the convergence of the mean shift algorithm with Gaussian kernel', J. Multivariate Anal., 2015, 135, pp. 1–10
22 Aliyari Ghassabeh, Y.: 'Asymptotic stability of equilibrium points of mean shift algorithm', Mach. Learn., 2015, 98, (3), pp. 359–368
23 Aliyari Ghassabeh, Y., Linder, T., Takahara, G.: 'On some convergence properties of the subspace constrained mean shift', Pattern Recognit., 2013, 46, (11), pp. 3140–3147
24 Ozawa, S., Toh, S.L., Abe, S., Pang, S., Kasabov, N.: 'Incremental learning of feature space and classifier for face recognition', Neural Netw., 2005, 18, (5–6), pp. 575–584