Rate of Convergence for Smoothing Splines with Unknown Data Association

Matthew Thorpe and Adam M. Johansen

arXiv:1506.02676v1 [math.ST] 8 Jun 2015

University of Warwick, Coventry, CV4 7AL, United Kingdom

Abstract

The problem of estimating multiple trajectories from unlabeled data comprises two coupled problems. The first is a data association problem: how to map data points onto individual trajectories. The second is, given a solution to the data association problem, to estimate those trajectories. We construct estimators as a solution to a variational problem which uses smoothing splines within a k-means-like framework and show that, as the number of data points increases, these estimators are stable. More precisely, we show that the estimators converge weakly in an appropriate Sobolev space with probability one. Furthermore, we show that the estimators converge in probability with rate $\frac{1}{\sqrt{n}}$ in the $L^2$ norm (strongly).

1 Introduction

Given observations from multiple moving targets we face two (coupled) problems. The first is associating observations to targets: the data association problem. The second is estimating the trajectory of each target given the appropriate set of observations. When there is one target the data association problem is trivial. However, when the number of targets is greater than one (even when the number of targets is known) the set of data association hypotheses grows combinatorially with the number of data points. Very quickly it becomes infeasible to check every possibility; hence the value of fast approximate solutions. In this paper we interpret the problem of estimating multiple trajectories with unknown data association (see Figure 1) in such a way that the k-means method may be applied to find a solution. As in [34], this is a non-standard application of the k-means method in which we generalize the notion of a 'cluster center' in order to partition finite dimensional data using infinite dimensional cluster centers. In this paper the cluster centers are trajectories in some function space and the data are space-time observations.

Let $\Theta \subset (H^s)^k$ where $H^s$ is the Sobolev space of degree $s$. We are given a data set $\{(t_i, y_i)\}_{i=1}^n \subset [0,1]\times\mathbb{R}^d$ and assume the observation model

$$y_i = \mu^\dagger_{\varphi(i)}(t_i) + \epsilon_i \tag{1}$$

where $\epsilon_i \overset{\mathrm{iid}}{\sim} \phi_0$ and $t_i \overset{\mathrm{iid}}{\sim} \phi_T$ for densities $\phi_0$ and $\phi_T$ on $\mathbb{R}^d$ and $[0,1]$ respectively. We assume that the index of the cluster responsible for any given observation is an independent random variable with a categorical distribution of parameter vector $p = (p_1,\dots,p_k)$, writing $\varphi(i) \sim \mathrm{Cat}(p)$ to mean $\mathbb{P}(\varphi(i) = j) = p_j$. These assumptions allow us to write the density of $y$ given $t$, which we denote by $\phi_Y(y|t)$, as

$$\phi_Y(y|t) = \sum_{j=1}^k p_j\,\phi_0\big(\mu^\dagger_j(t) - y\big).$$
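To make the observation model (1) concrete, the following is a minimal simulation sketch; it is not taken from the paper, the three trajectories, the uniform $\phi_T$ and the Gaussian $\phi_0$ are illustrative choices only, and only NumPy is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative true trajectories mu_1^dag, ..., mu_k^dag on [0, 1], with d = 1.
true_trajectories = [
    lambda t: np.sin(2 * np.pi * t),         # mu_1^dag
    lambda t: 2.0 + 0.5 * t,                 # mu_2^dag
    lambda t: -2.0 + np.cos(2 * np.pi * t),  # mu_3^dag
]
k = len(true_trajectories)
p = np.full(k, 1.0 / k)  # categorical parameter vector for the data association

def sample(n, noise_sd=0.3):
    """Draw (t_i, y_i), i = 1..n, from y_i = mu^dag_{phi(i)}(t_i) + eps_i."""
    t = rng.uniform(0.0, 1.0, size=n)        # t_i ~ phi_T (uniform here)
    phi = rng.choice(k, size=n, p=p)         # phi(i) ~ Cat(p), never observed
    eps = rng.normal(0.0, noise_sd, size=n)  # eps_i ~ phi_0 (Gaussian here)
    y = np.array([true_trajectories[j](ti) for j, ti in zip(phi, t)]) + eps
    return t, y, phi                         # phi returned only to inspect the hidden association

t, y, phi = sample(200)
```

Only the pairs $(t_i, y_i)$ are available to the estimator; the labels $\varphi(i)$ are hidden, which is exactly the data association problem.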

Figure 1 (axes: $t$ horizontal, $y$ vertical): Unlabeled data is generated from three targets and, using minimizers of (2), we can find a partitioning of the data set and non-parametrically estimate each trajectory using the k-means algorithm.

We can summarize the data generating process as follows. A cluster is selected at random: $\mathbb{P}(\varphi = j) = p_j$; the time and observation error are drawn independently from their respective distributions, $t \sim \phi_T$ and $\epsilon \sim \phi_0$; and we observe $(t, y = \mu^\dagger_\varphi(t) + \epsilon)$. The aim is to estimate $\mu^\dagger = (\mu^\dagger_1,\dots,\mu^\dagger_k) \in \Theta$. In particular the data association $\varphi : \{1,2,\dots,n\} \to \{1,2,\dots,k\}$ is unknown.

With a single trajectory ($k = 1$) the problem is precisely the spline smoothing problem, see for example [38]. For $k > 1$ trajectories there is an additional data association problem coupled to the spline smoothing problem. We call this the smoothing-data association (SDA) problem. We assume $k$ is fixed and known.

The aim of this paper is to construct a sequence of estimators $\mu^n$ of $\mu^\dagger$ from the data $\{(t_i,y_i)\}_{i=1}^n$ and study the asymptotic behavior as $n \to \infty$. For each $n$ our estimate is given as the minimizer of $f_n : \Theta \to \mathbb{R}$ defined by

$$f_n(\mu) = \frac{1}{n}\sum_{i=1}^n \bigwedge_{j=1}^k |y_i - \mu_j(t_i)|^2 + \lambda \sum_{j=1}^k \|\nabla^s \mu_j\|^2_{L^2} \tag{2}$$

where $|\cdot|$ is the Euclidean norm on $\mathbb{R}^d$, $\bigwedge_{j=1}^k z_j = \min\{z_1,\dots,z_k\}$ and $\lambda$ is a positive constant. Penalizing the $s$th derivative ensures that the problem is well posed. Optimizing this functional can be interpreted as seeking a hard data association: given $\mu \in \Theta$ each observation $(t_i, y_i)$ is associated with the trajectory closest to it, so the corresponding data association solution is given by

$$\varphi_\mu(i) = \underset{j=1,2,\dots,k}{\operatorname{argmin}}\, |\mu_j(t_i) - y_i|.$$
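Minimizing (2) lends itself to a Lloyd-type alternation between the hard association $\varphi_\mu$ and a per-cluster spline fit. The sketch below is one such variant, not the authors' implementation: it assumes $d = 1$, $s = 2$, data from a sampler such as the one above, and SciPy $\ge$ 1.10, whose cubic smoothing spline make_smoothing_spline stands in for the $H^s$ penalty in (2) (its lam weights an unnormalized residual sum, so it plays roughly the role of $n\lambda$); the iteration count, the value of lam and the small-cluster guard are all illustrative.

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

def fit_sda(t, y, k, lam=1e-3, n_iter=20, seed=1):
    """Alternate between hard data association and smoothing-spline refits (k-means style)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, k, size=len(t))       # random initial association
    splines = [None] * k
    for _ in range(n_iter):
        # Refit each cluster center as a cubic smoothing spline (the 'mean' update).
        for j in range(k):
            idx = np.flatnonzero(labels == j)
            if idx.size < 5:                       # guard against (near-)empty clusters
                idx = rng.choice(len(t), size=5, replace=False)
            idx = idx[np.argsort(t[idx])]          # abscissas must increase (t_i assumed distinct)
            splines[j] = make_smoothing_spline(t[idx], y[idx], lam=lam)
        # Reassign every observation to its closest trajectory: phi_mu(i) in the text.
        sq_residuals = np.stack([(y - s(t)) ** 2 for s in splines])   # shape (k, n)
        labels = sq_residuals.argmin(axis=0)
    return splines, labels

splines, labels = fit_sda(t, y, k=3)   # using the simulated (t, y) from the sketch above
```

As with the finite dimensional k-means method, such an alternation can only be expected to find a local minimizer of (2).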

Here, we focus upon characterizing the solution of this problem rather than computational methods to obtain it. However, a variant of the k-means method would be readily applicable; for this reason we term the $\mu_j$ cluster centers. The choice of regularization scheme and, in particular, of $\lambda$ is not straightforward. For $k = 1$ there are many results in the spline literature on the selection of $\lambda = \lambda_n$ and the resulting asymptotic behavior as $n \to \infty$, see for example [1, 9-11, 20, 24, 28, 30-32, 35-37, 39]. In this case one has $\lambda_n \to 0$ and can expect $\mu^n$ to converge to $\mu^\dagger$. Convergence is either with respect to a Hilbert scale, e.g. $L^2$, or in the dual space, i.e. weak convergence. Using a Hilbert scale in effect measures the convergence in a norm weaker than $H^s$.

The approach we take is to penalize the $s$th derivative. By Taylor's Theorem we can write $H^s = H_0 \oplus H_1$ where

$$H_0 = \mathrm{span}\left\{\zeta_i(t) = \frac{t^i}{i!} : i = 0,1,\dots,s-1\right\}, \qquad H_1 = \left\{g \in H^s : \nabla^i g(0) = 0 \text{ for all } i = 0,1,\dots,s-1\right\}.$$
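For example, when $s = 2$ this decomposition reads $H^2 = H_0 \oplus H_1$ with $H_0 = \mathrm{span}\{1, t\}$: the coarse component of $\mu$ is the affine function $\mu(0) + \nabla\mu(0)\,t$, while the $H_1$ component vanishes to first order at $0$ and carries all of the penalized curvature $\|\nabla^2\mu\|_{L^2}$.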

We use $\|\cdot\|_1 = \|\nabla^s\cdot\|_{L^2}$ as the norm on $H_1$ and denote the $H_0$ norm by $\|\cdot\|_0$. Since $H_0$ is finite dimensional we are free to use any norm we choose without changing the topology. We can view $H^s = H_0 \oplus H_1$ as a multiscale decomposition of $H^s$: the polynomial component represents a coarse approximation, and the regularization penalizes oscillations on the fine scale, i.e. in $H_1$.

In the case $k = 1$, $f_n$ is quadratic and one can find an explicit representation of $\mu^n$, i.e. there exists a function $G_{n,\lambda}$ such that $\mu^n = G_{n,\lambda}\mu^\dagger$. When $k > 1$ the problem is no longer convex and the methodology used in the $k = 1$ case fails. The authors know of no method which would allow $\lambda \to 0$ for $k > 1$ and therefore we treat $\lambda$ as a constant in this paper. A consequence of this regularization is that we cannot expect to recover the true cluster centers, even in the large data limit.

The first result of this paper (Theorem 3.1) is to show that there exists $\mu^\infty \in \Theta$ such that (up to subsequences) $\mu^n \rightharpoonup \mu^\infty$ a.s. in $H^s$ and $\mu^\infty$ is a minimizer of $f_\infty$ defined by

$$f_\infty(\mu) = \int_0^1\!\int_{\mathbb{R}^d} \bigwedge_{j=1}^k |y - \mu_j(t)|^2\, \phi_Y(y|t)\,\phi_T(t)\, dy\, dt + \lambda\sum_{j=1}^k \|\nabla^s\mu_j\|^2_{L^2}. \tag{3}$$

Considering the law of large numbers, the limit $f_\infty$ is natural: $f_\infty$ can be seen as a limit of $f_n$, the nature of which will be made rigorous in Section 3. We recall that the motivation for the minimization problem (2) is to embed the problem into a framework that allows the application of the k-means method. Large data limits for the k-means method have been studied extensively in finite dimensions, see for example [2, 3, 7, 16, 25-27]. There are fewer results for the infinite dimensional case, with [5, 18, 19, 21, 34] the only results known to the authors. Of these, only [34] can be applied to finite dimensional data and infinite dimensional cluster centers, but it requires bounded noise. The first contribution of this paper is to extend this convergence result to unbounded noise for the SDA problem.

By the compact embedding of $H^s$ into $L^2$ we have that (up to subsequences) $\mu^n \to \mu^\infty$ a.s. in $L^2$. The second result of this paper is to show in Theorem 4.1 that the rate of convergence in $L^2$ is of order $\frac{1}{\sqrt{n}}$ in probability, i.e.

$$\|\mu^n - \mu^\infty\|_{L^2} = O_p\left(\frac{1}{\sqrt{n}}\right).$$

This is closely related to the central limit theorem first proved for the k-means method by Pollard [27] for Euclidean data. We extend his methodology to cluster centers in $H^s$ to prove our rate of convergence result. In the next section we remind the reader of some preliminary material which underpins our main results. Section 3 contains the convergence results. The rate of convergence results are in Section 4.

2 Preliminaries

2.1 Notation

The set of probability measures on $[0,1]\times\mathbb{R}^d$ is denoted $\mathcal{P}([0,1]\times\mathbb{R}^d)$ and the Borel $\sigma$-algebra by $\mathcal{B}([0,1]\times\mathbb{R}^d)$. Our main results concern sequences of data $\{(t_i,y_i)\}_{i=1}^\infty$ sampled independently from a measure $P \in \mathcal{P}([0,1]\times\mathbb{R}^d)$ with Lebesgue density $\phi((t,y)) = \phi_Y(y|t)\phi_T(t)$. We assume throughout the existence of a probability space $(\Omega,\mathcal{F},\mathbb{P})$, where $(t_i,y_i):\Omega\to[0,1]\times\mathbb{R}^d$, rich enough to support a countably infinite sequence of such observations. All random elements are defined upon this common probability space and all stochastic quantifiers are to be understood as acting with respect to $\mathbb{P}$ unless otherwise stated. With a small abuse of notation we say $(t_i,y_i)\in[0,1]\times\mathbb{R}^d$. We will define the space $\Theta\subset(H^s)^k$ in Section 3. The Sobolev space $H^s$ is given by

$$H^s := \left\{\mu : [0,1]\to\mathbb{R}^d \text{ s.t. } \nabla^i\mu \text{ is absolutely continuous for } i = 0,1,\dots,s-1 \text{ and } \nabla^s\mu\in L^2\right\}.$$

Note that data is of the form $\{(t_i,y_i)\}_{i=1}^n \subset [0,1]\times\mathbb{R}^d$. We denote weak convergence by $\rightharpoonup$, i.e. if $\nu^n,\nu\in H^s$ satisfy $F(\nu^n)\to F(\nu)$ for all $F\in(H^s)^*$ then $\nu^n\rightharpoonup\nu$. A sequence of probability measures $P_n$ is said to converge weakly to $P$ if for all bounded and continuous functions $h$ we have $P_n h\to Ph$, where we write $Ph = \int h(x)\,P(dx)$. If $P_n$ converges weakly to $P$ then we write $P_n\Rightarrow P$.

We use the following standard definitions for rates of convergence.

Definition 2.1. We define the following.

(i) For deterministic sequences $a_n$ and $r_n$, where the $r_n$ are positive and real valued, we write $a_n = O(r_n)$ if $\frac{a_n}{r_n}$ is bounded. If $\frac{a_n}{r_n}\to 0$ as $n\to\infty$ we write $a_n = o(r_n)$.

(ii) For random sequences $a_n$ and $r_n$, where the $r_n$ are positive and real valued, we write $a_n = O_p(r_n)$ if $\frac{a_n}{r_n}$ is bounded in probability: for all $\epsilon > 0$ there exist deterministic constants $M_\epsilon, N_\epsilon$ such that

$$\mathbb{P}\left(\left|\frac{a_n}{r_n}\right| \ge M_\epsilon\right) \le \epsilon \qquad \forall n\ge N_\epsilon.$$

If $\frac{a_n}{r_n}\to 0$ in probability, i.e. for all $\epsilon>0$

$$\mathbb{P}\left(\left|\frac{a_n}{r_n}\right|\ge\epsilon\right)\to 0 \qquad \text{as } n\to\infty,$$

we write $a_n = o_p(r_n)$.
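For example, if $X_1,\dots,X_n$ are i.i.d. real random variables with finite variance then, by Chebyshev's inequality, $\bar X_n - \mathbb{E}X_1 = O_p(1/\sqrt{n})$; this elementary case is the prototype for the rate obtained in Theorem 4.1.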

2.2 Γ-Convergence

Our proof of convergence will use a variational approach. In particular, the natural convergence for a sequence of minimization problems is Γ-convergence. The Γ-limit can be understood as the 'limiting lower semi-continuous envelope'. It is particularly useful when studying highly oscillatory functionals, when there will often be no strong limit and the weak limit (if it exists) will average out oscillations and therefore change the behavior of the minimum and minimizers. See [6, 14] for an introduction to Γ-convergence. We will apply the following definition and theorem to $H = \Theta\subset(H^s)^k$.

Definition 2.2 (Γ-convergence [6, Definition 1.5]). A sequence $f_n : H\to\mathbb{R}\cup\{\pm\infty\}$ is said to Γ-converge on the Banach space $H$ to $f_\infty : H\to\mathbb{R}\cup\{\pm\infty\}$ with respect to weak convergence on $H$, and we write $f_\infty = \Gamma\text{-}\lim_n f_n$, if for all $\nu\in H$ we have

(i) (lim inf inequality) for every sequence $(\nu^n)$ weakly converging to $\nu$,

$$f_\infty(\nu) \le \liminf_n f_n(\nu^n);$$

(ii) (recovery sequence) there exists a sequence $(\nu^n)$ weakly converging to $\nu$ such that

$$f_\infty(\nu) \ge \limsup_n f_n(\nu^n).$$

When it exists, the Γ-limit is always weakly lower semi-continuous [6, Proposition 1.31] and therefore achieves its minimum on any weakly compact set. An important property of Γ-convergence is that it implies the convergence of minimizers. In particular, we will make use of the following result, which can be found in [6, Theorem 1.21].

Theorem 2.3 (Convergence of Minimizers). Let $f_n : H\to\mathbb{R}\cup\{\pm\infty\}$ be a sequence of functionals on a Banach space $(H,\|\cdot\|)$. Assume there exists a weakly compact subset $K\subset H$ with

$$\inf_H f_n = \inf_K f_n \qquad \forall n\in\mathbb{N}.$$

If $f_\infty = \Gamma\text{-}\lim_n f_n$ and $f_\infty$ is not identically $\pm\infty$ then

$$\min_H f_\infty = \lim_n \inf_H f_n.$$

Furthermore, if $\mu^n\in K$ minimizes $f_n$ then any weak limit point is a minimizer of $f_\infty$.

2.3 The Gâteaux Derivative

As in Section 2.2 we will apply the following to $H = \Theta\subset(H^s)^k$.

Definition 2.4. We say that $f : H\to\mathbb{R}$ is Gâteaux differentiable at $\mu\in H$ in direction $\nu\in H$ if the limit

$$\partial f(\mu;\nu) = \lim_{r\to 0}\frac{f(\mu+r\nu) - f(\mu)}{r}$$

exists. We may define second order derivatives by

$$\partial^2 f(\mu;\nu,\omega) = \lim_{r\to 0}\frac{\partial f(\mu+r\omega;\nu) - \partial f(\mu;\nu)}{r}$$

for $\mu,\nu,\omega\in H$, and higher order derivatives may be defined similarly. To simplify notation, we write $\partial^s f(\mu;\nu) := \partial^s f(\mu;\nu,\dots,\nu)$ with $\nu$ repeated $s$ times.

Theorem 2.5 (Taylor's Theorem). If $f : H\to\mathbb{R}$ is $s$ times continuously Gâteaux differentiable on a convex subset $K\subset H$ then

$$f(\nu) = f(\mu) + \partial f(\mu;\nu-\mu) + \frac{1}{2!}\partial^2 f(\mu;\nu-\mu,\nu-\mu) + \dots + \frac{1}{(s-1)!}\partial^{s-1}f(\mu;\nu-\mu,\dots,\nu-\mu) + R_s$$

where

$$R_s(\mu,\nu-\mu) = \frac{1}{(s-1)!}\int_0^1 (1-t)^{s-1}\,\partial^s f\big((1-t)\mu + t\nu;\nu-\mu\big)\,dt.$$
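As a simple example, for the quadratic functional $f(\mu) = \|\nabla^s\mu\|^2_{L^2}$ one obtains $\partial f(\mu;\nu) = 2(\nabla^s\mu,\nabla^s\nu)$ and $\partial^2 f(\mu;\nu,\omega) = 2(\nabla^s\nu,\nabla^s\omega)$, with all higher derivatives vanishing; this is the calculation behind the regularization terms appearing in Lemmata 4.2 and 4.4.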

3 Convergence

To show convergence we apply Theorem 2.3. The following two subsections prove that the conditions required to apply this theorem, i.e. that $f_\infty$ is the Γ-limit of $f_n$ and that the minimizers $\mu^n$ are uniformly bounded, hold with probability one. Using the compact embedding of $H^s$ into $L^2$ we are able to conclude that, up to subsequences, convergence is strong in $L^2$. For a fixed $\delta > 0$ we define the space $\Theta$ to be the space of functions in $(H^s)^k$ which have minimum separation distance $\delta$:

$$\Theta = \left\{\mu\in(H^s)^k : |\mu_j(t) - \mu_l(t)| \ge \delta\ \ \forall t\in[0,1] \text{ and } j\ne l\right\}. \tag{4}$$

For $d = 1$ this is a strong assumption as we restrict ourselves to trajectories that do not intersect. However, for larger $d$ the assumption is much less stringent.
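For instance, with $k = 2$, $d = 1$ and $\delta = 1$ the pair $\mu_1(t)\equiv 0$, $\mu_2(t)\equiv 1$ belongs to $\Theta$, whereas any two trajectories that cross at some $t\in[0,1]$ are excluded, however smooth they are.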

First let us show that $\Theta$ is weakly closed in $(H^s)^k$. Take any sequence $\mu^n\in\Theta$ such that $\mu^n\rightharpoonup\mu\in(H^s)^k$. We have to show $\mu\in\Theta$. Pick $t\in[0,1]$, $j\ne l$ and define $F : \Theta\to\mathbb{R}^d$ by $F : \nu\mapsto \nu_j(t) - \nu_l(t)$; note that $F$ is in the dual space of $(H^s)^k$. Hence

$$\delta \le |\mu^n_j(t) - \mu^n_l(t)| = |F(\mu^n)| \to |F(\mu)| = |\mu_j(t) - \mu_l(t)|.$$

Therefore $\mu\in\Theta$. Furthermore we can show that $f_n, f_\infty$ are weakly lower semi-continuous [34, Propositions 4.8 and 4.9], hence they attain their minimizers on $\Theta$.

We now state our assumptions.

Assumptions.

1. The data sequence $(t_i, y_i)$ is independent and identically distributed in accordance with the model (1), with $\varphi(i)\sim\mathrm{Cat}(p)$, $\epsilon_i\sim\phi_0$, $t_i\sim\phi_T$ and $\varphi(i), \epsilon_j, t_k$ independent for all $i, j$ and $k$. We assume $\phi_0$ and $\phi_T$ are densities with respect to the Lebesgue measure on $\mathbb{R}^d$ and $[0,1]$ respectively, and use the same symbols to refer to these densities and to their associated measures.

2. The density $\phi_0$ is centered and has finite second moment.

3. For all $\epsilon\in\mathbb{R}^d$ we assume $\phi_0(\epsilon) > 0$.

4. We can bound $\phi_T$ away from $0$, i.e. $\inf_{t\in[0,1]}\phi_T(t) > 0$.

Observe that

$$f_n(\mu^\dagger) = \frac{1}{n}\sum_{i=1}^n\bigwedge_{j=1}^k |\mu^\dagger_j(t_i) - y_i|^2 + \lambda\sum_{j=1}^k\|\nabla^s\mu^\dagger_j\|^2_{L^2} \le \frac{1}{n}\sum_{i=1}^n |y_i - \mu^\dagger_{\varphi(i)}(t_i)|^2 + \lambda\sum_{j=1}^k\|\nabla^s\mu^\dagger_j\|^2_{L^2} = \frac{1}{n}\sum_{i=1}^n\epsilon_i^2 + \lambda\sum_{j=1}^k\|\nabla^s\mu^\dagger_j\|^2_{L^2} \to \mathrm{Var}(\epsilon_i) + \lambda\sum_{j=1}^k\|\nabla^s\mu^\dagger_j\|^2_{L^2} =: \alpha < \infty$$

where the convergence is almost sure by the strong law of large numbers. Hence Assumption 2 implies that there exists $N$ such that $\min_{\mu\in\Theta} f_n(\mu) < \alpha + 1$ for $n\ge N$, and $N < \infty$ with probability one (although $N$ could depend on the sequence $\{(t_i,y_i)\}_{i=1}^n$ and so we could have $\sup_{\omega\in\Omega} N = \infty$). Assumption 3 is not necessary, however it makes our proofs notationally cleaner; this assumption is used in bounding the minimizers of $f_n$. Clearly if $\phi_0$ has bounded support then each $y_i$ is uniformly bounded (a.s.) and one can show that $|\mu^n(t)|$ is bounded uniformly in $n$ and $t$ (a.s.). When the support is unbounded, but Assumption 3 does not hold, our proofs hold with some trivial but notationally messy modifications. Assumption 4 will be used in the rate of convergence section, where it is used to show that $f_\infty$ is positive definite.

We now state the main result for this section. The proof is an application of Theorem 2.3, once we have shown the Γ-limit (Theorem 3.2) and the uniform bound on the set of minimizers (Theorem 3.4).

Theorem 3.1. Define $f_n, f_\infty : \Theta\to\mathbb{R}$ by (2) and (3) respectively, where $\Theta\subset(H^s)^k$ for $s\ge 1$ is given by (4). Under Assumptions 1-3 any sequence of minimizers $\mu^n$ of $f_n$ is, with probability one, weakly compact and any weak limit $\mu^\infty$ is a minimizer of $f_\infty$. Furthermore, if $\mu^{n_m}\rightharpoonup\mu^\infty$ in $H^s$ then $\mu^{n_m}\to\mu^\infty$ in $L^2$.

3.1 The Γ-Limit

We claim the Γ-limit of $(f_n)$ is given by (3).

Theorem 3.2. Define $f_n, f_\infty : \Theta\to\mathbb{R}$ by (2) and (3) respectively, where $\Theta\subset(H^s)^k$ for $s\ge 1$ is given by (4). Under Assumptions 1-2,

$$f_\infty = \Gamma\text{-}\lim_n f_n$$

for almost every sequence of observations $(t_1,y_1), (t_2,y_2), \dots$.

Proof. We are required to show that the two inequalities in Definition 2.2 hold with probability 1. In order to do this we follow [34] and consider a subset of $\Omega$ of full measure, $\Omega'$, and show that both statements hold for every data sequence obtained from that set.

For clarity let $P(d(t,y)) = \phi_Y(dy|t)\phi_T(dt)$. Define $g_\mu(t,y) = \bigwedge_{j=1}^k |y - \mu_j(t)|^2$. Let $P_n^{(\omega)}$ be the associated empirical measure arising from the particular elementary event $\omega$, which we define via its action on any continuous bounded function $h : [0,1]\times\mathbb{R}^d\to\mathbb{R}$: $P_n^{(\omega)} h = \frac{1}{n}\sum_{i=1}^n h\big(t_i^{(\omega)}, y_i^{(\omega)}\big)$, where $\big(t_i^{(\omega)}, y_i^{(\omega)}\big)$ emphasizes that these are the observations associated with elementary event $\omega$. To highlight the dependence of $f_n$ on $\omega$ we write $f_n^{(\omega)}$. We can write

$$f_n^{(\omega)}(\mu) = P_n^{(\omega)} g_\mu + \lambda\sum_{j=1}^k\|\nabla^s\mu_j\|^2_{L^2} \qquad\text{and}\qquad f_\infty(\mu) = P g_\mu + \lambda\sum_{j=1}^k\|\nabla^s\mu_j\|^2_{L^2}.$$

We define

$$\Omega' = \big\{\omega\in\Omega : P_n^{(\omega)}\Rightarrow P\big\} \cap \big\{\omega\in\Omega : P_n^{(\omega)}\big((B(0,q))^c\big)\to P\big((B(0,q))^c\big)\ \forall q\in\mathbb{N}\big\} \cap \left\{\omega\in\Omega : \int_{(B(0,q))^c}|y|^2\,P_n^{(\omega)}(d(t,y))\to\int_{(B(0,q))^c}|y|^2\,P(d(t,y))\ \forall q\in\mathbb{N}\right\};$$

then $\mathbb{P}(\Omega') = 1$ by the almost sure weak convergence of the empirical measure and the strong law of large numbers. Fix $\omega\in\Omega'$; we start with the lim inf inequality. Let $\mu^n\rightharpoonup\mu$. By Theorem 1.1 in [15] we have

$$\int_{[0,1]\times\mathbb{R}^d}\ \liminf_{n\to\infty,\,(t',y')\to(t,y)} g_{\mu^n}(t',y')\, P(d(t,y)) \le \liminf_{n\to\infty}\int_{[0,1]\times\mathbb{R}^d} g_{\mu^n}(t,y)\, P_n^{(\omega)}(d(t,y)).$$

By the same argument as in Proposition 4.8.ii in [34] we have

$$\liminf_{n\to\infty,\,(t',y')\to(t,y)} |y' - \mu^n_j(t')|^2 \ge |y - \mu_j(t)|^2.$$

Taking the minimum over $j$ we have

$$\liminf_{n\to\infty,\,(t',y')\to(t,y)} g_{\mu^n}(t',y') \ge g_\mu(t,y),$$

and as a consequence of the Hahn-Banach theorem $\liminf_{n\to\infty}\|\nabla^s\mu^n_j\|^2_{L^2}\ge\|\nabla^s\mu_j\|^2_{L^2}$. Therefore

$$\liminf_{n\to\infty} f_n^{(\omega)}(\mu^n) \ge f_\infty(\mu)$$

as required.

We now establish the existence of a recovery sequence for every $\omega\in\Omega'$ and every $\mu\in\Theta$. Let $\mu^n = \mu\in\Theta$. Let $\zeta_q$ be a sequence of $C^\infty(\mathbb{R}^{d+1})$ functions such that $0\le\zeta_q(t,y)\le 1$ for all $(t,y)\in\mathbb{R}^{d+1}$, $\zeta_q(t,y) = 1$ for $(t,y)\in B(0,q-1)$ and $\zeta_q(t,y) = 0$ for $(t,y)\notin B(0,q)$. Then the function $\zeta_q(t,y)g_\mu(t,y)$ is continuous for all $q$. We also have, for any $(t,y)\in[0,1]\times\mathbb{R}^d$,

$$\zeta_q(t,y)g_\mu(t,y) \le \zeta_q(t,y)|y-\mu_1(t)|^2 \le 2\zeta_q(t,y)\big(|y|^2 + |\mu_1(t)|^2\big) \le 2\zeta_q(t,y)\big(|y|^2 + \|\mu_1\|^2_{L^\infty([0,1])}\big) \le 2q^2 + 2\|\mu_1\|^2_{L^\infty([0,1])} < \infty,$$

so $\zeta_q g_\mu$ is a continuous and bounded function; hence by the weak convergence of $P_n^{(\omega)}$ to $P$ we have

$$P_n^{(\omega)}\zeta_q g_\mu \to P\zeta_q g_\mu \qquad\text{as } n\to\infty \text{ for all } q\in\mathbb{N}.$$

For all $q\in\mathbb{N}$ we have

$$\limsup_{n\to\infty}|P_n^{(\omega)}g_\mu - Pg_\mu| \le \limsup_{n\to\infty}|P_n^{(\omega)}g_\mu - P_n^{(\omega)}\zeta_q g_\mu| + \limsup_{n\to\infty}|P_n^{(\omega)}\zeta_q g_\mu - P\zeta_q g_\mu| + \limsup_{n\to\infty}|P\zeta_q g_\mu - Pg_\mu| = \limsup_{n\to\infty}|P_n^{(\omega)}g_\mu - P_n^{(\omega)}\zeta_q g_\mu| + |P\zeta_q g_\mu - Pg_\mu|.$$

Therefore,

$$\limsup_{n\to\infty}|P_n^{(\omega)}g_\mu - Pg_\mu| \le \liminf_{q\to\infty}\limsup_{n\to\infty}|P_n^{(\omega)}g_\mu - P_n^{(\omega)}\zeta_q g_\mu|$$

by the dominated convergence theorem. We now show that the right hand side of the above expression is equal to zero. We have

$$|P_n^{(\omega)}g_\mu - P_n^{(\omega)}\zeta_q g_\mu| \le P_n^{(\omega)}\mathbb{I}_{(B(0,q-1))^c}\,g_\mu \le \int_{[0,1]\times\mathbb{R}^d}\mathbb{I}_{(B(0,q-1))^c}(t,y)\,|y-\mu_1(t)|^2\, P_n^{(\omega)}(d(t,y))$$
$$\le 2\int_{[0,1]\times\mathbb{R}^d}\mathbb{I}_{(B(0,q-1))^c}(t,y)\,|y|^2\, P_n^{(\omega)}(d(t,y)) + 2\|\mu_1\|^2_{L^\infty([0,1])}\int_{[0,1]\times\mathbb{R}^d}\mathbb{I}_{(B(0,q-1))^c}(t,y)\, P_n^{(\omega)}(d(t,y))$$
$$\to 2\int_{[0,1]\times\mathbb{R}^d}\mathbb{I}_{(B(0,q-1))^c}(t,y)\,|y|^2\, P(d(t,y)) + 2\|\mu_1\|^2_{L^\infty([0,1])}\int_{[0,1]\times\mathbb{R}^d}\mathbb{I}_{(B(0,q-1))^c}(t,y)\, P(d(t,y)) \qquad\text{as } n\to\infty$$
$$\to 0 \qquad\text{as } q\to\infty,$$

where the last limit follows by the monotone convergence theorem and Assumption 2. We have shown

$$\lim_{n\to\infty}|P_n^{(\omega)}g_\mu - Pg_\mu| = 0.$$

Hence $f_n^{(\omega)}(\mu)\to f_\infty(\mu)$ as required.

3.2 Boundedness

The aim of this subsection is to show that the minimizers of $f_n$ are uniformly bounded in $n$ for almost every sequence of observations. We divide this into two parts, bounding each of the $H_0$ and $H_1$ norms. The $H_1$ bound follows easily from the regularization. For the $H_0$ bound we exploit the equivalence of norms on finite-dimensional vector spaces to choose a convenient norm on $H_0$.

By the argument which follows the assumptions we have, for $n$ sufficiently large and with probability one, $\min_{\mu\in\Theta}f_n(\mu)\le\alpha + 1 < \infty$. Now we let $\mu^n$ be a sequence of minimizers. Then there exists $\hat\Omega\subset\Omega$ such that $\mathbb{P}(\hat\Omega) = 1$ and for all $\omega\in\hat\Omega$ we have

$$f_n(\mu^\dagger) = P_n^{(\omega)}g_{\mu^\dagger} + \lambda\sum_{j=1}^k\|\nabla^s\mu^\dagger_j\|^2_{L^2} \to Pg_{\mu^\dagger} + \lambda\sum_{j=1}^k\|\nabla^s\mu^\dagger_j\|^2_{L^2} \le \alpha.$$

Therefore for all $\omega\in\hat\Omega$ there exists $N = N(\omega)$ such that for $n\ge N$ we have

$$\lambda\|\mu^n\|_1^2 \le f_n(\mu^n) \le f_n(\mu^\dagger) \le \alpha + 1.$$

Therefore $\|\mu^n\|_1$ is bounded almost surely. We are left to show the corresponding result for $\|\mu^n\|_0$.

The following lemma will be used to establish the main result of this subsection, Theorem 3.4. It shows that if a sequence $\nu^n\in H^s$ satisfies $\|\nabla^s\nu^n\|_{L^2}\le\sqrt{\alpha}$ and $\|\nu^n\|_0\to\infty$, then $|\nu^n(t)|\to\infty$ with the exception of at most finitely many $t\in[0,1]$. When applied to $\mu^n_j$ this will be used to show that, in the limit, if any center is unbounded then the minimization can be achieved over $k-1$ clusters, and hence to provide a contradiction.

Lemma 3.3. Let $\nu^n\in H^s$ satisfy $\|\nabla^s\nu^n\|_{L^2}\le\sqrt{\alpha}$ and $\|\nu^n\|_0\to\infty$. Then, with the exception of at most finitely many $t\in[0,1]$, we have $|\nu^n(t)|\to\infty$. Furthermore, for each $t\in(0,1)$ with $|\nu^n(t)|\to\infty$ there exists $c > 0$ such that $|\nu^n(r)|\to\infty$ uniformly for $r\in[t-c,t+c]$.

Proof. Let the norm on $H_0$ be given by

$$\|\nu\|_0 := \sum_{i=0}^{s-1}\frac{|\nabla^i\nu(0)|}{i!}. \tag{5}$$

By Taylor's theorem and the bound on $\|\nabla^s\nu^n\|_{L^2}$ we have

$$\left|\nu^n(t) - \sum_{i=0}^{s-1}\frac{\nabla^i\nu^n(0)}{i!}t^i\right| \le \sqrt{\alpha}.$$

Now let $Q_n(t) = \sum_{i=0}^{s-1}\frac{\nabla^i\nu^n(0)}{i!}t^i$. If $\|\nu^n\|_0\to\infty$ then at least one of the $\left|\frac{\nabla^i\nu^n(0)}{i!}\right|\to\infty$. If all but one of the terms stay bounded then it is easy to see that for $t > 0$ we have $|Q_n(t)|\to\infty$, and the convergence is uniform over all intervals $[t,1]$. Now if $\frac{\nabla^{i_1}\nu^n(0)}{i_1!}\to\infty$ and $\frac{\nabla^{i_2}\nu^n(0)}{i_2!}\to-\infty$ and all other terms are bounded by $M$, i.e. $\left|\frac{\nabla^i\nu^n(0)}{i!}t^i\right|\le M$ for all $i\ne i_1, i_2$, then

$$\gamma_n(t) = \frac{\nabla^{i_1}\nu^n(0)}{i_1!}t^{i_1} + \frac{\nabla^{i_2}\nu^n(0)}{i_2!}t^{i_2}$$

can remain bounded for at most two values of $t$. Assume two values exist for which $\gamma_n$ is bounded, which we denote by $t_1^* = 0$ and $t_2^* > 0$. Without loss of generality we assume that $\gamma_n(t)\to\infty$ for $t > t_2^*$ and $\gamma_n(t)\to-\infty$ for $0 < t < t_2^*$ (equivalently we assume that $i_1 > i_2$). One can show (by differentiating $\gamma_n$) that the stationary points $\tau_i^n$ of $\gamma_n$ satisfy

$$\tau_1^n = 0 \qquad\text{and}\qquad \tau_2^n\to\frac{i_2}{i_1}t_2^* < t_2^*.$$

Pick $t\in(t_2^*,1)$; we immediately have that $|\nu^n(t)|\to\infty$ from

$$|\nu^n(t) - \gamma_n(t)| \le (s-2)M + \sqrt{\alpha}.$$

Furthermore, we choose $c$ such that $t - c > t_2^*$; then, since $\gamma_n$ is eventually increasing on $[t_2^*,1]$, for $n$ sufficiently large we have

$$\nu^n(r) \ge \gamma_n(t-c) - (s-2)M - \sqrt{\alpha}$$

for all $r\in[t-c,1]$. Hence $\nu^n(r)\to\infty$ uniformly on $[t-c,1]$. Now we pick $t\in(0,t_2^*)$ and find $c$ such that $[t-c,t+c]\subset(0,t_2^*)$. The same argument implies

$$\nu^n(r) \le \min\{\gamma_n(t-c),\gamma_n(t+c)\} + (s-2)M + \sqrt{\alpha},$$

and therefore $\nu^n(r)\to-\infty$ uniformly on $[t-c,t+c]$. Similar arguments hold if $\gamma_n$ has one or zero bounded values. An analogous argument can be employed if there are three unbounded terms in (5), in which case we consider $\gamma_n$ of the form

$$\gamma_n(t) = \frac{\nabla^{i_1}\nu^n(0)}{i_1!}t^{i_1} + \frac{\nabla^{i_2}\nu^n(0)}{i_2!}t^{i_2} + \frac{\nabla^{i_3}\nu^n(0)}{i_3!}t^{i_3}.$$

This sequence of functions can be bounded at at most three values of the argument, $t_1^*, t_2^*, t_3^*$, and we repeat the previous argument. Such a process can be continued iteratively until we have considered the case where $\gamma_n$ has $s$ unbounded terms.

We proceed to the main result of this subsection.

Theorem 3.4. Define $f_n, f_\infty : \Theta\to\mathbb{R}$, where $\Theta\subset(H^s)^k$ for $s\ge 1$ is given by (4), by (2) and (3) respectively. Let $\mu^n$ be the minimizer of $f_n$; then, under Assumptions 1-3, for almost every sequence of observations there exists a constant $M < \infty$ such that $\|\mu^n\|_{H^s}\le M$ for all $n$.

Proof. As in the proof of Theorem 3.2 we let $\omega\in\Omega''$ where

$$\Omega'' = \left\{\omega\in\Omega' : \frac{1}{n}\sum_{i=1}^n\epsilon_i^2\to\mathrm{Var}(\epsilon_1)\right\}\cap\bigcap_{c\in\mathbb{Q}}\left\{\omega\in\Omega : P_n^{(\omega)}\left(B\left(c,\tfrac{\delta}{4}\right)\right)\to P\left(B\left(c,\tfrac{\delta}{4}\right)\right)\right\}$$

where $\Omega'$ is defined in the proof of Theorem 3.2. We have $\mathbb{P}(\Omega'') = 1$. For the remainder of the proof we assume $\omega\in\Omega''$. Then there exists $N(\omega) < \infty$ such that $f_n^{(\omega)}(\mu^n)\le\alpha+1$ for all $n\ge N(\omega)$. Hence

$$\lambda\|\mu^n\|_1^2 \le f_n^{(\omega)}(\mu^n) \le \alpha + 1.$$

It remains to show the $H_0$ bound. The structure of the proof is similar to [19, Lemma 2.1]. We will argue by contradiction. In particular we argue that if a cluster center is unbounded then in the limit the minimum is achieved over the remaining $k-1$ cluster centers.

Step 1: The minimization is achieved over $k-1$ cluster centers. Assume there exists $j^*$ such that $\|\mu^n_{j^*}\|_0\to\infty$; then by Lemma 3.3 $|\mu^n_{j^*}(t)|\to\infty$ for all but finitely many $t\in[0,1]$, and for each $t$ with $|\mu^n_{j^*}(t)|\to\infty$ there exists $c$ such that $|\mu^n_{j^*}(r)|\to\infty$ uniformly for $r\in[t-c,t+c]$. Pick $t$ such that $|\mu^n_{j^*}(t)|\to\infty$ and find such a $c$. Let $t_m\to t$; then there exists an $M > 0$ such that for all $m > M$ we have $|t_m - t| < c$. It follows that

$$\lim_{m\to\infty,\,n\to\infty}|\mu^n_{j^*}(t_m)| = \infty.$$

This easily implies

$$\lim_{n\to\infty,\,(t',y')\to(t,y)}|\mu^n_{j^*}(t') - y'|^2 = \infty$$

for any $y\in\mathbb{R}^d$. Therefore

$$\liminf_{n\to\infty,\,(t',y')\to(t,y)}\left(\bigwedge_{j=1}^k |\mu^n_j(t') - y'|^2 - \bigwedge_{j\ne j^*}|\mu^n_j(t') - y'|^2\right) = 0.$$

Note that the above expression holds for $P$-almost every $(t,y)\in[0,1]\times\mathbb{R}^d$. By Theorem 1.1 in [15] and the above we have

$$\liminf_{n\to\infty}\int_{[0,1]\times\mathbb{R}^d}\left(\bigwedge_{j=1}^k|\mu^n_j(t)-y|^2 - \bigwedge_{j\ne j^*}|\mu^n_j(t)-y|^2\right) P_n^{(\omega)}(dt,dy) \ge 0.$$

Hence

$$\liminf_{n\to\infty}\left(f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*}\big) - \lambda\|\nabla^s\mu^n_{j^*}\|^2_{L^2}\right) \ge 0,$$

where we interpret $f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*}\big)$ accordingly. So,

$$\liminf_{n\to\infty}\left(f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*}\big)\right) \ge 0.$$

Step 2: The contradiction. If we can show that there exists $\epsilon > 0$ such that

$$\limsup_{n\to\infty}\left(f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*}\big)\right) \le -\epsilon$$

(i.e. we can do strictly better by fitting $k$ centers than by fitting $k-1$ centers), then this contradicts Step 1 and we can conclude.

Now,

$$f_n^{(\omega)}(\mu^n) \le f_n^{(\omega)}(\hat\mu^n) = \frac{1}{n}\sum_{i=1}^n\bigwedge_{j=1}^k|\hat\mu^n_j(t_i) - y_i|^2 + \lambda\sum_{j\ne j^*}\|\nabla^s\hat\mu^n_j\|^2_{L^2},$$

where

$$\hat\mu^n_j(t) = \begin{cases}\mu^n_j(t) & \text{for } j\ne j^*\\ c_n & \text{for } j = j^*\end{cases}$$

for a constant $c_n$. Now each $\hat\mu^n_j$ must have a minimum separation distance of $\delta$; for now we assume that we can choose $c_n$ such that this criterion is fulfilled. So if $|y_i - c_n|\le\frac{\delta}{4}$ then

$$|y_i - c_n| + \frac{\delta}{4} \le |\mu^n_j(t_i) - y_i|$$

for all $j\ne j^*$, and therefore $|y_i - c_n|^2 + \frac{\delta^2}{16} \le |\mu^n_j(t_i) - y_i|^2$, which implies

$$f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*}\big) = \frac{1}{n}\sum_{i=1}^n\bigwedge_{j\ne j^*}|\mu^n_j(t_i) - y_i|^2 + \lambda\sum_{j\ne j^*}\|\nabla^s\mu^n_j\|^2_{L^2}$$
$$= \frac{1}{n}\sum_{i=1}^n\bigwedge_{j\ne j^*}|\mu^n_j(t_i) - y_i|^2\,\mathbb{I}_{(t_i,y_i)\not\sim_n j^*} + \frac{1}{n}\sum_{i=1}^n\bigwedge_{j\ne j^*}|\mu^n_j(t_i) - y_i|^2\,\mathbb{I}_{(t_i,y_i)\sim_n j^*} + \lambda\sum_{j\ne j^*}\|\nabla^s\mu^n_j\|^2_{L^2}$$
$$\ge \frac{1}{n}\sum_{i=1}^n\bigwedge_{j\ne j^*}|\mu^n_j(t_i) - y_i|^2\,\mathbb{I}_{(t_i,y_i)\not\sim_n j^*} + \frac{1}{n}\sum_{i=1}^n|c_n - y_i|^2\,\mathbb{I}_{(t_i,y_i)\sim_n j^*} + \frac{\delta^2}{16}P_n^{(\omega)}\left(B\left(c_n,\frac{\delta}{4}\right)\right) + \lambda\sum_{j\ne j^*}\|\nabla^s\mu^n_j\|^2_{L^2}$$
$$= f_n^{(\omega)}(\hat\mu^n) + \frac{\delta^2}{16}P_n^{(\omega)}\left(B\left(c_n,\frac{\delta}{4}\right)\right).$$

Here $(t_i,y_i)\sim_n j$ means that the coordinate $(t_i,y_i)$ is associated to the center $\hat\mu^n_j$, in the sense that $(t,y)\sim_n j \Leftrightarrow j = \operatorname{argmin}_{i=1,\dots,k}|y - \hat\mu^n_i(t)|$ (and if the minimum is not uniquely achieved then we take the smallest $j$ such that $j\in\operatorname{argmin}_{i=1,\dots,k}|y-\hat\mu^n_i(t)|$). If we can show that $P_n^{(\omega)}\big(B\big(c_n,\frac{\delta}{4}\big)\big)$ is bounded away from zero, then the result follows. Since we assumed $\epsilon_1$ has unbounded support on $\mathbb{R}^d$, if we can show that $|c_n|\le M$ for a constant $M$ and $n$ sufficiently large (a.s.) then

$$P_n^{(\omega)}\left(B\left(c_n,\frac{\delta}{4}\right)\right) \ge \inf_{c\in[-M,M]\cap\mathbb{Q}}P_n^{(\omega)}\left(B\left(c,\frac{\delta}{4}\right)\right) \to \inf_{c\in[-M,M]}P\left(B\left(c,\frac{\delta}{4}\right)\right) = \inf_{c\in[-M,M]}\int_0^1\!\int_{\mathbb{R}^d}\mathbb{I}_{|y-c|\le\frac{\delta}{4}}\,\phi_Y(y|t)\phi_T(t)\,dy\,dt.$$

By Assumption 3, there exists $\epsilon' > 0$ such that $\phi_Y(y|t)\ge\epsilon'$ for all $y\in[-M,M]$ and $t\in[0,1]$. Hence we may bound the final expression above by

$$\inf_{c\in[-M,M]}\int_0^1\!\int_{\mathbb{R}^d}\mathbb{I}_{|y-c|\le\frac{\delta}{4}}\,\phi_Y(y|t)\phi_T(t)\,dy\,dt \ge \epsilon'\frac{\delta}{4}.$$

We are left to show such an $M$ exists. If there exists $M_{k-1}$ such that for all $j\ne j^*$ we have $\|\mu^n_j\|_{H^s}\le M_{k-1}$, then we know each center has size of range at most $2M_{k-1}$. So the $k-1$ centers have size of range at most $2M_{k-1}(k-1) + \delta(k-2) =: C$. Hence there exists $c_n\in[0,C+\delta]^d$ such that $\hat\mu^n_{j^*}(t) = c_n$ and $\hat\mu^n\in\Theta$.

Now if no such $M_{k-1}$ exists then there exists a second cluster such that $\|\mu^n_{j^{**}}\|_{H^s}\to\infty$ where $j^{**}\ne j^*$. By the same argument

$$\liminf_{n\to\infty}\left(f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*,j^{**}}\big)\right)\ge 0,$$
$$f_n^{(\omega)}(\mu^n) - f_n^{(\omega)}\big((\mu^n_j)_{j\ne j^*,j^{**}}\big) \le -\frac{\delta^2}{16}P_n^{(\omega)}\left(B\left(c_n,\frac{\delta}{4}\right)\right) - \frac{\delta^2}{16}P_n^{(\omega)}\left(B\left(c'_n,\frac{\delta}{4}\right)\right)$$

for a constant $c'_n$. By induction it is clear that we can find $M_{k-l}$ such that $k-l$ cluster centers are bounded. The result then follows.

Remark 3.5. Note that in the above theorem we did not need to assume a correct choice of $k$. If the true number of cluster centers is $k'$ and we incorrectly use $k\ne k'$, then the resulting cluster centers are still bounded. In fact, for all the results of this paper the correct choice of $k$ is not necessary: although the minimizers of $f_\infty$ may no longer make physical sense, the problem is still robust in that the conclusions of Theorems 3.1 and 4.1 hold.

4 Rate of Convergence

In Pollard [27], a central limit theorem was shown to hold for the k-means method in Euclidean spaces. We extend that argument to show that the squared $L^2$ difference between the minimizers $\mu^n$ and $\mu^\infty$ of $f_n$ and $f_\infty$ respectively is of order $\frac{1}{n}$. We state the main result of this section now but leave the proof to the end.

Theorem 4.1. Define $f_n, f_\infty : \Theta\to\mathbb{R}$, where $\Theta$ is given by (4), by (2) and (3) respectively. Under Assumptions 1-4 we have

$$\|\mu^n - \mu^\infty\|^2_{L^2} = O_p\left(\frac{1}{n}\right),$$

where $\mu^n$ and $\mu^\infty$ minimize $f_n$ and $f_\infty$ respectively.

We let $Y_n(\mu) = \sqrt{n}\,(f_n(\mu) - f_\infty(\mu))$ and then, by Taylor expanding around $\mu^\infty$, we have

$$Y_n(\mu^n) = Y_n(\mu^\infty) + \partial Y_n(\mu^\infty;\mu^n - \mu^\infty) + \text{h.o.t.}$$

In Lemma 4.5, using Chebyshev's inequality, we bound the Gâteaux derivative of $Y_n$ in probability. Similarly one can Taylor expand $f_\infty$ around $\mu^\infty$. After some manipulation of the Taylor expansion one has

$$\frac{1}{2}\partial^2 f_\infty(\mu^\infty;\mu^n - \mu^\infty) = f_n(\mu^n) - f_n(\mu^\infty) + O_p\left(\frac{1}{\sqrt{n}}\|\mu^n - \mu^\infty\|_{L^2}\right),$$

where we leave the details until the proof of Theorem 4.1 at the end of the section. We note that $f_n(\mu^n) - f_n(\mu^\infty)\le 0$. Therefore

$$\frac{\kappa}{2}\|\mu^n - \mu^\infty\|^2_{L^2} \le O_p\left(\frac{1}{\sqrt{n}}\|\mu^n - \mu^\infty\|_{L^2}\right),$$

where $\kappa > 0$ exists as a consequence of $f_\infty$ being positive definite (Lemma 4.4).

Lemmata 4.2 and 4.4 provide the first and second Gâteaux derivatives of $f_\infty$; it is also shown that the second derivative is positive definite.

Lemma 4.2. Define $f_\infty$ by (3) and $\Theta\subset(H^s)^k$ for $s\ge 1$ by (4). Then for $\mu\in\Theta$, $\nu\in(H^s)^k$ we have that $f_\infty$ is Gâteaux differentiable at $\mu$ in the direction $\nu$ with

$$\partial f_\infty(\mu;\nu) = -2\int_0^1\!\int_{\mathbb{R}^d}\big(y - \mu_{j(t,y)}(t)\big)\cdot\nu_{j(t,y)}(t)\,\phi_Y(y|t)\phi_T(t)\,dy\,dt + 2\lambda\sum_{j=1}^k(\nabla^s\nu_j,\nabla^s\mu_j)$$

where $j(t,y)$ is chosen arbitrarily from the set

$$j(t,y)\in\underset{j}{\operatorname{argmin}}\,|y - \mu_j(t)|. \tag{6}$$

Remark 4.3. Since the $\mu_j$ are continuous, the boundary between each element of the resulting partition is itself continuous and has Lebesgue measure zero. The set on which $j(t,y)$ is not uniquely defined therefore has measure zero. Hence we will treat $j(t,y)$ as though it were uniquely defined.

Proof of Lemma 4.2. Fix $\mu\in\Theta$, $\nu\in(H^s)^k$ and $r > 0$. Define

$$j_r(t,y) = \underset{j}{\operatorname{argmin}}\,|y - \mu_j(t) - r\nu_j(t)|, \qquad j(t,y) = j_0(t,y).$$

Then for $(t,y)$ in the interior of the partition associated with $\mu_j$ we have $j_r(t,y) = j(t,y)$ for $r$ sufficiently small. Since $\mu\in\Theta$ the boundary of each partition is a $(d-1)$-manifold which has mass $0$ with respect to the Lebesgue measure; therefore we only need to consider points in the interior of each partition. We have

$$\partial f_\infty(\mu;\nu) = \lim_{r\to 0}\frac{f_\infty(\mu+r\nu) - f_\infty(\mu)}{r}$$
$$= \lim_{r\to 0}\frac{1}{r}\Bigg\{\int_0^1\!\int_{\mathbb{R}^d}\Big(|y-\mu_{j_r(t,y)}(t)|^2 - |y-\mu_{j(t,y)}(t)|^2 + r^2|\nu_{j_r(t,y)}(t)|^2 - 2r\big(y-\mu_{j_r(t,y)}(t)\big)\cdot\nu_{j_r(t,y)}(t)\Big)\,\phi_Y(y|t)\phi_T(t)\,dy\,dt + \lambda\sum_{j=1}^k\Big(2r(\nabla^s\nu_j,\nabla^s\mu_j) + r^2\|\nabla^s\nu_j\|^2_{L^2}\Big)\Bigg\}$$
$$= -2\int_0^1\!\int_{\mathbb{R}^d}\big(y-\mu_{j(t,y)}(t)\big)\cdot\nu_{j(t,y)}(t)\,\phi_Y(y|t)\phi_T(t)\,dy\,dt + 2\lambda\sum_{j=1}^k(\nabla^s\nu_j,\nabla^s\mu_j),$$

which proves the result.

Lemma 4.4. Under the conditions of Lemma 4.2 and additionally Assumption 4, we have that $f_\infty$ has a second Gâteaux derivative at $\mu\in\Theta$ with respect to $\nu,\eta\in(H^s([0,1]))^k$ given by

$$\partial^2 f_\infty(\mu;\nu,\eta) = 2\int_0^1\!\int_{\mathbb{R}^d}\eta_{j(t,y)}(t)\cdot\nu_{j(t,y)}(t)\,\phi_Y(y|t)\phi_T(t)\,dy\,dt + 2\lambda\sum_{j=1}^k(\nabla^s\nu_j,\nabla^s\eta_j)$$

where $j(t,y)$ is defined by (6). Furthermore, under Assumption 4, $\partial^2 f_\infty$ is positive definite at $\mu^\infty$, the minimizer of $f_\infty$ in $\Theta$, i.e. there exists $\kappa > 0$ such that for all $\nu\in(H^s([0,1]))^k$ we have $\partial^2 f_\infty(\mu^\infty;\nu)\ge\kappa\|\nu\|^2_{L^2}$.

Proof. The proof of the first statement follows the same reasoning as that of Lemma 4.2. To establish the second, observe that

$$\partial^2 f_\infty(\mu^\infty;\nu) = 2\sum_{j=1}^k\int_0^1\!\int_{\mathbb{R}^d}|\nu_j(t)|^2\,\mathbb{I}_{\{(t,y)\sim j\}}(t,y)\,\phi_Y(y|t)\phi_T(t)\,dy\,dt + 2\lambda\sum_{j=1}^k\|\nabla^s\nu_j\|^2_{L^2} \ge \hat\kappa\sum_{j=1}^k\int_0^1|\nu_j(t)|^2\,\phi_T(t)\,dt \ge \kappa\sum_{j=1}^k\|\nu_j\|^2_{L^2}$$

where

$$\hat\kappa = 2\inf_{t\in[0,1]}\min_{j=1,2,\dots,k}\int_{\mathbb{R}^d}\mathbb{I}_{\{(t,y)\sim j\}}(t,y)\,\phi_Y(y|t)\,dy \qquad\text{and}\qquad \kappa = \hat\kappa\inf_{t\in[0,1]}\phi_T(t),$$

and where we recall that $(t,y)\sim j$ means the coordinate $(t,y)$ is associated with center $\mu^\infty_j$. It remains to show $\hat\kappa > 0$. Recall that there exists a uniform bound on $\mu^\infty$, so that $|\mu^\infty(t)|\le M$ for all $t\in[0,1]$. Define $\epsilon'$ by

$$\epsilon' = \inf_{t\in[0,1]}\min_{|y|\le M}\phi_Y(y|t).$$

By Assumptions 1, 3 and 4 we have $\epsilon' > 0$. So

$$\hat\kappa \ge 2\inf_{t\in[0,1]}\min_{j=1,2,\dots,k}\mathrm{Vol}\big(A^j_t\cap B(0,M)\big)\,\epsilon',$$

where $A^j_t$ is the element of the partition of $\mathbb{R}^d$ corresponding to center $\mu^\infty_j$ at time $t$. The minimum separation distance assumptions on $\Theta$ imply that $\mathrm{Vol}(A^j_t)\ge\mathrm{Vol}\big(B\big(0,\frac{\delta}{2}\big)\big)$, which implies that $\hat\kappa > 0$. The result then follows.

We now consider $Y_n$. In particular we want to bound $\partial Y_n(\mu^\infty;\mu^n - \mu^\infty)$.

Lemma 4.5. Define $f_n, f_\infty : \Theta\to\mathbb{R}$ by (2) and (3) respectively, where $\Theta$ is given by (4). Take Assumption 1 and define

$$Y_n : \Theta\to\mathbb{R}, \qquad Y_n(\mu) = \sqrt{n}\,(f_n(\mu) - f_\infty(\mu)).$$

Then for $\mu\in\Theta$, $\nu\in(H^s)^k$ we have that $Y_n$ is Gâteaux differentiable at $\mu$ in the direction $\nu$ with

$$\partial Y_n(\mu;\nu) = 2\sqrt{n}\left(\int_0^1\!\int_{\mathbb{R}^d}\big(y-\mu_{j(t,y)}(t)\big)\cdot\nu_{j(t,y)}(t)\,\phi_Y(y|t)\phi_T(t)\,dy\,dt - \frac{1}{n}\sum_{i=1}^n\big(y_i - \mu_{j(t_i,y_i)}(t_i)\big)\cdot\nu_{j(t_i,y_i)}(t_i)\right)$$

where $j(t,y)$ is defined by (6). Furthermore, for a sequence $\nu^n$ with

$$\|\nu^n\|_{L^2} = o_p(1) \qquad\text{and}\qquad \|\nu^n\|_{H^s} = O_p(1)$$

we have $\partial Y_n(\mu;\nu^n) = O_p(\|\nu^n\|_{L^2})$.

Proof. Calculating the Gâteaux derivative is similar to Lemma 4.2 and is omitted. By linearity and continuity of $\partial Y_n$ we can write

$$\partial Y_n\left(\mu;\frac{\nu^n}{\|\nu^n\|_{L^2}}\right) = \sum_m\frac{(\nu^n,e_m)}{\|\nu^n\|_{L^2}}\,\partial Y_n(\mu;e_m)$$

where $(e_m)$ is the Fourier basis for $(L^2)^k$ (we assume $e_m = (\hat e_{m_1},\dots,\hat e_{m_k})$ where $(\hat e_m)$ is a Fourier basis for $L^2$). Let $V_m = \mathbb{E}\big(\partial Y_n(\mu;e_m)\big)^2$ and $Z_i = \big(y_i - \mu_{j(t_i,y_i)}(t_i)\big)\cdot\hat e_m$; then

$$V_m = \frac{4}{n}\,\mathbb{E}\left(\sum_{i=1}^n(Z_i - \mathbb{E}Z_i)\right)^2 = 4\,\mathbb{E}(Z_1 - \mathbb{E}Z_1)^2 \le 4\,\mathbb{E}|\epsilon_1|^2 =: C.$$

Here $C$ is finite by Assumption 2. Therefore, by Chebyshev's inequality, for any $M > 0$

$$\mathbb{P}\big(|\partial Y_n(\mu;e_m)|\ge M\big) \le \frac{C}{M^2}.$$

Hence

$$\mathbb{P}\left(\left|\partial Y_n\left(\mu;\frac{\nu^n}{\|\nu^n\|_{L^2}}\right)\right|\ge M\right) \le \sum_m\frac{|(\nu^n,e_m)|}{\|\nu^n\|_{L^2}}\,\mathbb{P}\big(|\partial Y_n(\mu;e_m)|\ge M\big) \le \sum_m\frac{|(\nu^n,e_m)|}{\|\nu^n\|_{L^2}}\,\frac{C}{M^2} \le \frac{C}{M^2},$$

which implies $\partial Y_n\big(\mu;\frac{\nu^n}{\|\nu^n\|_{L^2}}\big) = O_p(1)$.

Proof of Theorem 4.1. By Theorem 3.1 we have that (up to subsequences) kµn − µ∞ kL2 = op (1) and kµn kH s = Op (1). By Taylor’s Theorem and the fact that f∞ is quadratic we have 1 f∞ (µn ) = f∞ (µ∞ ) + ∂f∞ (µ∞ ; µn − µ∞ ) + ∂ 2 f∞ (µ∞ ; µn − µ∞ ) . 2 Since µ∞ minimizes f∞ the linear term above must be zero. Hence 1 f∞ (µn ) = f∞ (µ∞ ) + ∂ 2 f∞ (µ∞ ; µn − µ∞ ) . 2 Similarly, and using Lemma 4.5 Yn (µn ) = Yn (µ∞ ) + Op (∂Yn (µ∞ ; µn − µ∞ )) = Yn (µ∞ ) + Op (kµn − µ∞ kL2 ) . From the definition of Yn we also have 1 fn (µn ) = f∞ (µn ) + √ Yn (µn ). n Substituting into the above we obtain   1 2 1 1 ∞ n ∞ ∞ n ∞ fn (µ ) = f∞ (µ ) + ∂ f∞ (µ ; µ − µ ) + √ Yn (µ ) + Op √ kµ − µ kL2 2 n n   1 1 = fn (µ∞ ) + ∂ 2 f∞ (µ∞ ; µn − µ∞ ) + Op √ kµn − µ∞ kL2 . 2 n n



Rearranging for ∂ 2 f∞ and using fn (µn ) ≤ fn (µ∞ ) we have

1 ∂ 2 f∞ (µ∞ ; µn − µ∞ ) = 2 (fn (µn ) − fn (µ∞ )) + Op ( √ kµn − µ∞ kL2 ) n 1 ≤ Op ( √ kµn − µ∞ kL2 ). n

Recall that by Lemma 4.4 we have that ∂ 2 f∞ is positive definite at µ∞ . Therefore: 1 κkµn − µ∞ k2L2 ≤ Op ( √ kµn − µ∞ kL2 ). n Which implies

1 kµn − µ∞ kL2 = Op ( √ ). n

16

Acknowledgments

MT is part of MASDOC at the University of Warwick and was supported by an EPSRC Industrial CASE Award PhD Studentship with Selex ES Ltd.

References

[1] M. Aerts, G. Claeskens, and M. P. Wand. Some theory for penalized spline generalized additive models. Journal of Statistical Planning and Inference, 103(1-2):455–470, 2002.

[2] A. Antos. Improved minimax bounds on the test and training distortion of empirically designed vector quantizers. IEEE Transactions on Information Theory, 51(11):4022–4032, 2005.

[3] P. L. Bartlett, T. Linder, and G. Lugosi. The minimax distortion redundancy in empirical quantizer design. IEEE Transactions on Information Theory, 44(5):1802–1813, 1998.

[4] S. Ben-David, D. Pál, and H. U. Simon. Stability of k-means clustering. In Proceedings of the Twentieth Annual Conference on Computational Learning Theory, pages 20–34, 2007.

[5] G. Biau, L. Devroye, and G. Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.

[6] A. Braides. Γ-Convergence for Beginners. Oxford University Press, 2002.

[7] P. A. Chou. The distortion of vector quantizers trained on n vectors decreases to the optimum as Op(1/n). In Proceedings of the 1994 IEEE International Symposium on Information Theory, page 457, 1994.

[8] J. B. Conway. A Course in Functional Analysis. Graduate Texts in Mathematics. Springer, 1990.

[9] D. D. Cox. Asymptotics for M-type smoothing splines. The Annals of Statistics, 11(2):530–551, 1983.

[10] D. D. Cox. Approximation of method of regularization estimators. The Annals of Statistics, 16(2):694–712, 1988.

[11] P. Craven and G. Wahba. Smoothing noisy data with spline functions. Numerische Mathematik, 31(4):377–403, 1979.

[12] J. A. Cuesta and C. Matran. The strong law of large numbers for k-means and best possible nets of Banach valued random variables. Probability Theory and Related Fields, 78(4):523–534, 1988.

[13] J. A. Cuesta-Albertos and R. Fraiman. Impartial trimmed k-means for functional data. Computational Statistics & Data Analysis, 51(10):4864–4877, 2007.

[14] G. Dal Maso. An Introduction to Γ-Convergence. Springer, 1993.

[15] E. A. Feinberg, P. O. Kasyanov, and N. V. Zadoianchuk. Fatou's lemma for weakly converging probabilities. Theory of Probability & Its Applications, 58(4):683–689, 2014.

[16] J. Hartigan. Asymptotic distributions for clustering criteria. The Annals of Statistics, 6(1):117–131, 1978.

[17] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2008.

[18] T. Laloë. L1-quantization and clustering in Banach spaces. Mathematical Methods of Statistics, 19(2):136–150, 2010.

[19] J. Lember. On minimizing sequences for k-centres. Journal of Approximation Theory, 120:20–35, 2003.

[20] K.-C. Li. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: Discrete index set. The Annals of Statistics, 15(3):958–975, 1987.

[21] T. Linder. Learning-theoretic methods in lossy data compression. In Principles of Nonparametric Learning, pages 163–210. Springer, 2002.

[22] T. Linder, G. Lugosi, and K. Zeger. Rates of convergence in the source coding theorem, in empirical quantizer design, and in universal lossy source coding. IEEE Transactions on Information Theory, 40(6):1728–1740, 1994.

[23] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2):129–137, 1982.

[24] D. W. Nychka and D. D. Cox. Convergence rates for regularized solutions of integral equations from discrete noisy data. The Annals of Statistics, 17(2):556–572, 1989.

[25] D. Pollard. Strong consistency of k-means clustering. The Annals of Statistics, 9(1):135–140, 1981.

[26] D. Pollard. A central limit theorem for empirical processes. Journal of the Australian Mathematical Society (Series A), 33(2):235–248, 1982.

[27] D. Pollard. A central limit theorem for k-means clustering. The Annals of Probability, 10(4):919–926, 1982.

[28] D. L. Ragozin. Error bounds for derivative estimates based on spline smoothing of exact or noisy data. Journal of Approximation Theory, 37(4):335–355, 1983.

[29] H. L. Royden. Real Analysis. Macmillan, 3rd edition, 1988.

[30] P. Speckman. Spline smoothing and optimal rates of convergence in nonparametric regression models. The Annals of Statistics, 13(3):970–983, 1985.

[31] P. L. Speckman and D. Sun. Asymptotic properties of smoothing parameter selection in spline smoothing. Technical report, Department of Statistics, University of Missouri, 2001.

[32] C. J. Stone. Optimal global rates of convergence for nonparametric regression. The Annals of Statistics, 10(4):1040–1053, 1982.

[33] T. Tarpey and K. K. J. Kinateder. Clustering functional data. Journal of Classification, 2003.

[34] M. Thorpe, F. Theil, A. Johansen, and N. Cade. Convergence of the k-means minimization problem using Γ-convergence. arXiv:1501.01320v2, 2015.

[35] F. I. Utreras. Optimal smoothing of noisy data using spline functions. SIAM Journal on Scientific and Statistical Computing, 2(3):349–362, 1981.

[36] F. I. Utreras. Natural spline functions, their associated eigenvalue problem. Numerische Mathematik, 42(1):107–117, 1983.


[37] F. I. Utreras. Smoothing noisy data under monotonicity constraints: existence, characterization and convergence rates. Numerische Mathematik, 47(4):611–625, 1985.

[38] G. Wahba. Spline Models for Observational Data. Society for Industrial and Applied Mathematics (SIAM), 1990.

[39] M. P. Wand. On the optimal amount of smoothing in penalised spline regression. Biometrika, 86(4):936–940, 1999.

[40] C. F. J. Wu. On the convergence properties of the EM algorithm. The Annals of Statistics, 11(1):95–103, 1983.

