Sequential Learning of Principal Curves: Summarizing Data Streams on the Fly

Benjamin Guedj* and Le Li†

*Inria, France. Web: https://bguedj.github.io, email [email protected], corresponding author.
†Université d'Angers & iAdvize, France. Email [email protected].

arXiv:1805.07418v1 [stat.ML] 18 May 2018

May 22, 2018

Abstract. When confronted with massive data streams, summarizing data with dimension reduction methods such as PCA raises theoretical and algorithmic pitfalls. Principal curves act as a nonlinear generalization of PCA and the present paper proposes a novel algorithm to automatically and sequentially learn principal curves from data streams. We show that our procedure is supported by regret bounds with optimal sublinear remainder terms. A greedy local search implementation that incorporates both sleeping experts and multi-armed bandit ingredients is presented, along with its regret bound and performance on a toy example and seismic data.

Keywords: sequential learning, principal curves, data streams, regret bounds, greedy algorithm, sleeping experts. MSC 2010: 68T10, 62L10, 62C99.

1 Introduction

Numerous methods have been proposed in the statistics and machine learning literature to sum up information and represent data by condensed and simpler-to-understand quantities. Among those methods, Principal Component Analysis (PCA) aims at identifying the maximal-variance axes of data. This serves as a way to represent data in a more compact fashion and hopefully reveal their variability as well as possible. PCA was introduced by Pearson (1901) and Spearman (1904) and further developed by Hotelling (1933). It is one of the most widely used procedures in multivariate exploratory analysis targeting dimension reduction or feature extraction. Nonetheless, PCA is a linear procedure and the need for more sophisticated nonlinear techniques has led to the notion of principal curve. Principal curves may be seen as a nonlinear generalization of the first principal component. The goal is to obtain a curve which passes "in the middle" of the data, as illustrated by Figure 1. This notion has been at the heart of numerous applications in many different domains, such as physics (Brunsdon, 2007; Friedsam and Oren, 1989), character and speech recognition (Kégl and Krzyżak, 2002; Reinhard and Niranjan, 1999), and mapping and geology (Banfield and Raftery, 1992; Brunsdon, 2007; Stanford and Raftery, 2000), to name but a few.

1.1 Earlier works on principal curves

The original definition of a principal curve dates back to Hastie and Stuetzle (1989). A principal curve is a smooth ($\mathcal{C}^\infty$) parameterized curve $\mathbf{f}(s) = (f_1(s), \dots, f_d(s))$ in $\mathbb{R}^d$ which does not intersect itself, has finite length inside any bounded subset of $\mathbb{R}^d$ and is self-consistent.


Figure 1: An example of a principal curve.

Figure 2: A principal curve and projections of data onto it.

This last requirement means that $\mathbf{f}(s) = \mathbb{E}[X \mid s_{\mathbf{f}}(X) = s]$, where $X \in \mathbb{R}^d$ is a random vector and the so-called projection index $s_{\mathbf{f}}(x)$ is the largest real number $s$ minimizing the squared Euclidean distance between $\mathbf{f}(s)$ and $x$, defined by
$$s_{\mathbf{f}}(x) = \sup\Big\{ s : \|x - \mathbf{f}(s)\|_2^2 = \inf_{\tau} \|x - \mathbf{f}(\tau)\|_2^2 \Big\}.$$

Self-consistency means that each point of $\mathbf{f}$ is the average (under the distribution of $X$) of all data points projected on $\mathbf{f}$, as illustrated by Figure 2. However, an unfortunate consequence of this definition is that existence is not guaranteed in general for a particular distribution, let alone for an online sequence of observations for which no probabilistic assumption is made. Kégl (1999) proposed a new concept of principal curves which ensures existence for a large class of distributions. Principal curves $\mathbf{f}^\star$ are defined as the curves minimizing the expected squared distance over a class $\mathcal{F}_L$ of curves whose length is smaller than $L > 0$, namely,
$$\mathbf{f}^\star \in \arg\inf_{\mathbf{f} \in \mathcal{F}_L} \Delta(\mathbf{f}), \quad \text{where} \quad \Delta(\mathbf{f}) = \mathbb{E}[\Delta(\mathbf{f}, X)] = \mathbb{E}\Big[\inf_s \|\mathbf{f}(s) - X\|_2^2\Big].$$
If $\mathbb{E}\|X\|_2^2 < \infty$, $\mathbf{f}^\star$ always exists but may not be unique. In practical situations where only i.i.d. copies $X_1, \dots, X_n$ of $X$ are observed, Kégl (1999) considers classes $\mathcal{F}_{k,L}$ of all polygonal lines with $k$ segments and length not exceeding $L$, and chooses an estimator $\hat{\mathbf{f}}_{k,n}$ of $\mathbf{f}^\star$ as the one within $\mathcal{F}_{k,L}$ which minimizes the empirical counterpart
$$\Delta_n(\mathbf{f}) = \frac{1}{n}\sum_{i=1}^{n} \Delta(\mathbf{f}, X_i)$$
of $\Delta(\mathbf{f})$. It is proved in Kégl et al. (2000) that if $X$ is almost surely bounded and $k$ is proportional to $n^{1/3}$, then
$$\Delta\big(\hat{\mathbf{f}}_{k,n}\big) - \Delta\big(\mathbf{f}^\star\big) = O\big(n^{-1/3}\big).$$
As the task of finding a polygonal line with $k$ segments and length at most $L$ that minimizes $\Delta_n(\mathbf{f})$ is computationally costly, Kégl et al. (2000) propose the Polygonal Line algorithm. This iterative algorithm fits a polygonal line with $k$ segments and considerably speeds up the exploration part by resorting to gradient descent. The two steps (projection and optimization) are similar to what is done by the $k$-means algorithm.
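To make the quantities $\Delta(\mathbf{f}, x)$ and $\Delta_n(\mathbf{f})$ concrete, here is a minimal NumPy sketch (ours, not taken from the papers cited above; the function names are hypothetical) that computes the squared distance from a point to a polygonal line given by its ordered vertices, and the resulting empirical loss.

```python
import numpy as np

def sq_dist_point_segment(x, v, w):
    """Squared Euclidean distance from point x to the segment [v, w]."""
    d = w - v
    denom = np.dot(d, d)
    if denom == 0.0:               # degenerate segment: both endpoints coincide
        return np.dot(x - v, x - v)
    # projection index of x onto the line through v and w, clipped to the segment
    t = np.clip(np.dot(x - v, d) / denom, 0.0, 1.0)
    proj = v + t * d
    return float(np.dot(x - proj, x - proj))

def sq_dist_to_polyline(x, vertices):
    """Delta(f, x): squared distance from x to the polygonal line with the given vertices."""
    return min(sq_dist_point_segment(x, vertices[i], vertices[i + 1])
               for i in range(len(vertices) - 1))

def empirical_loss(vertices, X):
    """Delta_n(f): average squared distance of the sample X (an n x d array) to the line."""
    return float(np.mean([sq_dist_to_polyline(x, vertices) for x in X]))
```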

However, the Polygonal Line algorithm is not supported by theoretical bounds and its performance varies with the distribution of the observations.

Figure 3: Principal curves with different numbers $k$ of segments: (a) a too small $k$, (b) an appropriate $k$, (c) a too large $k$.

As the number $k$ of segments plays a crucial role (a too small $k$ leads to a rough summary of the data while a too large $k$ yields overfitting, see Figure 3), Biau and Fischer (2012) aim to fill this gap by selecting an optimal $k$ from both theoretical and practical perspectives. Their approach relies strongly on the theory of model selection by penalization introduced by Barron et al. (1999) and further developed by Birgé and Massart (2007). By considering countable classes $\{\mathcal{F}_{k,\ell}\}_{k,\ell}$ of polygonal lines with $k$ segments, total length $\ell \leq L$ and whose vertices are on a lattice, the optimal $(\hat{k}, \hat{\ell})$ is obtained as the minimizer of the criterion
$$\mathrm{crit}(k, \ell) = \Delta_n\big(\hat{\mathbf{f}}_{k,\ell}\big) + \mathrm{pen}(k, \ell),$$

where
$$\mathrm{pen}(k, \ell) = c_0\sqrt{\frac{k}{n}} + c_1\frac{\ell}{n} + \frac{c_2}{\sqrt{n}} + \delta^2\sqrt{\frac{w_{k,\ell}}{2n}}$$
is a penalty function, where $\delta$ stands for the diameter of the observations, $w_{k,\ell}$ denotes the weight attached to the class $\mathcal{F}_{k,\ell}$, and the constants $c_0, c_1, c_2$ depend on $\delta$, the maximum length $L$ and the dimension of the observations. Biau and Fischer (2012) then prove that
$$\mathbb{E}\big[\Delta(\hat{\mathbf{f}}_{\hat{k},\hat{\ell}}) - \Delta(\mathbf{f}^\star)\big] \leq \inf_{k,\ell}\Big\{ \mathbb{E}\big[\Delta(\hat{\mathbf{f}}_{k,\ell}) - \Delta(\mathbf{f}^\star)\big] + \mathrm{pen}(k,\ell) \Big\} + \frac{\delta^2 \Sigma}{n^{3/2}}\sqrt{\frac{\pi}{2}}, \qquad (1)$$
where $\Sigma$ is a numerical constant. The expected loss of the final polygonal line $\hat{\mathbf{f}}_{\hat{k},\hat{\ell}}$ is close to the minimal loss achievable over $\mathcal{F}_{k,\ell}$ up to a remainder term decaying as $1/\sqrt{n}$.
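For illustration, the model-selection step above can be sketched as follows. The constants $c_0, c_1, c_2$ and the weights $w_{k,\ell}$ are placeholders for the quantities in Biau and Fischer (2012), not their actual values, and the empirical losses $\Delta_n(\hat{\mathbf{f}}_{k,\ell})$ are assumed to be precomputed.

```python
import math

def penalty(k, ell, n, delta, w_kl, c0=1.0, c1=1.0, c2=1.0):
    """pen(k, l) = c0*sqrt(k/n) + c1*l/n + c2/sqrt(n) + delta^2*sqrt(w_kl/(2n))."""
    return (c0 * math.sqrt(k / n) + c1 * ell / n
            + c2 / math.sqrt(n) + delta ** 2 * math.sqrt(w_kl / (2 * n)))

def select_k_ell(emp_losses, n, delta, weights):
    """emp_losses, weights: dicts mapping (k, ell) to Delta_n(f_hat_{k,ell}) and w_{k,ell}."""
    return min(emp_losses,
               key=lambda kl: emp_losses[kl] + penalty(*kl, n, delta, weights[kl]))
```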

1.2 Motivation

The big data paradigm, where collecting, storing and analyzing massive amounts of large and complex data becomes the new standard, calls for revisiting some of the classical statistical and machine learning techniques. The tremendous improvement of data acquisition infrastructures generates new continuous streams of data, rather than batch datasets. This has drawn considerable interest to sequential learning. Existing theoretical works and practical implementations of principal curves are designed for the batch setting (Biau and Fischer, 2012; Kégl, 1999; Kégl and Krzyżak, 2002; Kégl et al., 2000; Sandilya and Kulkarni, 2002). To the best of our knowledge, very little effort has been put so far into extending principal curve algorithms to the sequential context (with the exception of Laparra and Malo, 2016, in a fairly different setting and with no theoretical results). The present paper aims at filling this gap: our goal is to propose an online perspective on principal curves by automatically and sequentially learning the best principal curve summarizing a data stream. This represents a clear improvement over the batch setting since, when data streams are collected,


running a batch algorithm at each collection time is likely to be resource-demanding. Sequential learning takes advantage of the latest collected (set of) observations and therefore incurs a much smaller computational cost. Sequential learning operates as follows: a blackbox reveals at each time $t$ some deterministic value $x_t$, $t = 1, 2, \dots$, and a forecaster attempts to predict sequentially the next value based on past observations (and possibly other available information). The performance of the forecaster is no longer evaluated by its generalization error (as in the batch setting) but rather by a regret bound which quantifies the cumulative loss of the forecaster in the first $T$ rounds with respect to some reference minimal loss. In sequential learning, the velocity of algorithms may be favored over statistical precision. An immediate use of the aforementioned techniques (Biau and Fischer, 2012; Kégl et al., 2000; Sandilya and Kulkarni, 2002) at each time round $t$ (treating data collected until $t$ as a batch dataset) would result in a prohibitive algorithmic cost. Rather, we propose a novel algorithm which adapts to the sequential nature of data, i.e., which takes advantage of previous computations. We refer the reader to the monograph Cesa-Bianchi and Lugosi (2006) for a thorough introduction to sequential learning.

The contributions of the present paper are twofold. We first propose a sequential principal curve algorithm, for which we derive regret bounds. We then move towards an implementation, illustrated on a toy dataset and a real-life dataset (seismic data). Our algorithm proceeds as follows: at each time round $t$, the number $k_t$ of segments is chosen automatically, and the number of segments $k_{t+1}$ in the next round is obtained using only information about $k_t$ and a small amount of past observations. The core of our procedure relies on computing a quantity which is linked to the mode of the so-called Gibbs quasi-posterior and is inspired by quasi-Bayesian learning. The use of quasi-Bayesian estimators is especially advocated by the PAC-Bayesian theory, which originates in the machine learning community in the late 1990s, in the seminal works of Shawe-Taylor and Williamson (1997) and McAllester (1999a,b). In a recent preprint, Li et al. (2016) discuss the use of PAC-Bayesian tools for sequential learning.

The rest of the paper proceeds as follows. Section 2 presents our notation and our online principal curve algorithm, for which we provide regret bounds with sublinear remainder terms in Section 3. A practical implementation is proposed in Section 4 and we illustrate its performance on a toy dataset and seismic data in Section 5. Finally, we collect in Section 6 the proofs of the original results claimed in the paper.

2 Notation

A parameterized curve in $\mathbb{R}^d$ is a continuous function $\mathbf{f} : I \longrightarrow \mathbb{R}^d$ where $I = [a, b]$ is a closed interval of the real line. The length of $\mathbf{f}$ is given by
$$\mathcal{L}(\mathbf{f}) = \lim_{M \to \infty} \sup\Big\{ \sum_{i=1}^{M} \|\mathbf{f}(s_i) - \mathbf{f}(s_{i-1})\|_2 \Big\},$$
where the supremum is taken over all subdivisions $a = s_0 < s_1 < \dots < s_M = b$, $M \geq 1$. The observations are assumed to lie in a ball $B(0, \sqrt{d}R) \subset \mathbb{R}^d$ with $R > 0$. Let $Q_\delta$ be a grid over $B(0, \sqrt{d}R)$, i.e., $Q_\delta = B(0, \sqrt{d}R) \cap \Gamma_\delta$ where $\Gamma_\delta$ is a lattice in $\mathbb{R}^d$ with spacing $\delta > 0$. Let $L > 0$ and define for each $k \in \llbracket 1, p \rrbracket$ the collection $\mathcal{F}_{k,L}$ of polygonal lines $\mathbf{f}$ with $k$ segments whose vertices are in $Q_\delta$ and such that $\mathcal{L}(\mathbf{f}) \leq L$. Denote by $\mathcal{F}_p = \cup_{k=1}^{p} \mathcal{F}_{k,L}$ the set of all polygonal lines with a number of segments $\leq p$, whose vertices are in $Q_\delta$ and whose length is at most $L$. Finally, let $\mathcal{K}(\mathbf{f})$ denote the number of segments of $\mathbf{f} \in \mathcal{F}_p$. This strategy is illustrated by Figure 4.


Figure 4: An example of a lattice $\Gamma_\delta$ in $\mathbb{R}^2$ with $\delta = 1$ (spacing between blue points) and $B(0, 10)$ (black circle). The red polygonal line has its vertices in $Q_\delta = B(0, 10) \cap \Gamma_\delta$.

Our goal is to learn a time-dependent polygonal line which passes through the "middle" of data and gives a summary of all available observations $x_1, \dots, x_{t-1}$ (denoted by $(x_s)_{1:(t-1)}$ hereafter) before time $t$. Our output at time $t$ is a polygonal line $\hat{\mathbf{f}}_t \in \mathcal{F}_p$ depending on past information $(x_s)_{1:(t-1)}$ and past predictions $(\hat{\mathbf{f}}_s)_{1:(t-1)}$. When $x_t$ is revealed, the instantaneous loss at time $t$ is computed as
$$\Delta\big(\hat{\mathbf{f}}_t, x_t\big) = \inf_{s \in I} \big\|\hat{\mathbf{f}}_t(s) - x_t\big\|_2^2. \qquad (2)$$

In what follows, we investigate regret bounds for the cumulative loss based on (2). Given a measurable space $\Theta$ (embedded with its Borel $\sigma$-algebra), we let $\mathcal{P}(\Theta)$ denote the set of probability distributions on $\Theta$, and for some reference measure $\pi$, we let $\mathcal{P}_\pi(\Theta)$ be the set of probability distributions absolutely continuous with respect to $\pi$.

For any $k \in \llbracket 1, p \rrbracket$, let $\pi_k$ denote a probability distribution on $\mathcal{F}_{k,L}$. We define the prior $\pi$ on $\mathcal{F}_p = \cup_{k=1}^{p} \mathcal{F}_{k,L}$ as
$$\pi(\mathbf{f}) = \sum_{k \in \llbracket 1, p \rrbracket} w_k\, \pi_k(\mathbf{f})\, \mathbb{1}_{\{\mathbf{f} \in \mathcal{F}_{k,L}\}}, \qquad \mathbf{f} \in \mathcal{F}_p,$$
where $w_1, \dots, w_p \geq 0$ and $\sum_{k \in \llbracket 1, p \rrbracket} w_k = 1$.

A quasi-Bayesian procedure would now consider the Gibbs quasi-posterior (note that this is not a proper posterior in all generality, hence the term "quasi")
$$\hat{\rho}_t(\cdot) \propto \exp(-\lambda S_t(\cdot))\,\pi(\cdot),$$
where
$$S_t(\mathbf{f}) = S_{t-1}(\mathbf{f}) + \Delta(\mathbf{f}, x_t) + \frac{\lambda}{2}\big(\Delta(\mathbf{f}, x_t) - \Delta(\hat{\mathbf{f}}_t, x_t)\big)^2,$$
as advocated by Audibert (2009) and Li et al. (2016), who then consider realisations from this quasi-posterior. In the present paper, we will rather consider a quantity linked to the mode of this quasi-posterior. Indeed, the mode of the quasi-posterior $\hat{\rho}_{t+1}$ is
$$\arg\min_{\mathbf{f} \in \mathcal{F}_p} \Bigg\{ \underbrace{\sum_{s=1}^{t} \Delta(\mathbf{f}, x_s)}_{(i)} + \underbrace{\frac{\lambda}{2}\sum_{s=1}^{t}\big(\Delta(\mathbf{f}, x_s) - \Delta(\hat{\mathbf{f}}_s, x_s)\big)^2}_{(ii)} \; \underbrace{-\ \frac{\ln \pi(\mathbf{f})}{\lambda}}_{(iii)} \Bigg\},$$
where (i) is a cumulative loss term, (ii) is a term controlling the variance of the prediction $\mathbf{f}$ with respect to past predictions $\hat{\mathbf{f}}_s$, $s \leq t$, and (iii) can be regarded as a penalty function on the complexity

of $\mathbf{f}$ if $\pi$ is well chosen. This mode hence has a similar flavor to "follow the best expert" or "follow the perturbed leader" in the setting of prediction with experts (see Hutter and Poland, 2005, or Cesa-Bianchi and Lugosi, 2006, Chapters 3 and 4) if we consider each $\mathbf{f} \in \mathcal{F}_p$ as an expert which always delivers constant advice. These remarks yield Algorithm 1.

Algorithm 1 An algorithm to sequentially learn principal curves
1: Input parameters: $p > 0$, $\eta > 0$, $\pi(z) = e^{-z}\mathbb{1}_{\{z > 0\}}$ and penalty function $h : \mathcal{F}_p \to \mathbb{R}_+$
2: Initialization: for each $\mathbf{f} \in \mathcal{F}_p$, draw $z_{\mathbf{f}} \sim \pi$ and set $\Delta_{\mathbf{f},0} = \frac{1}{\eta}(h(\mathbf{f}) - z_{\mathbf{f}})$
3: For $t = 1, \dots, T$
4:   Get the data $x_t$
5:   Obtain $\hat{\mathbf{f}}_t = \arg\inf_{\mathbf{f} \in \mathcal{F}_p} \sum_{s=0}^{t-1} \Delta_{\mathbf{f},s}$, where $\Delta_{\mathbf{f},s} = \Delta(\mathbf{f}, x_s)$, $s \geq 1$
6: End for
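A minimal sketch of Algorithm 1, viewed as follow-the-perturbed-leader over a finite family of candidate polygonal lines. Here `candidates` (a list of vertex arrays standing in for $\mathcal{F}_p$) and the penalty `h` are supplied by the user, and `sq_dist_to_polyline` is the helper sketched in Section 1.1; this is an illustration under those assumptions, not the authors' implementation.

```python
import numpy as np

def algorithm1(stream, candidates, h, eta, seed=0):
    """Follow the perturbed (penalised) leader over a finite family of polygonal lines."""
    rng = np.random.default_rng(seed)
    # Initialization: one exponential perturbation z_f per candidate f
    z = rng.exponential(1.0, size=len(candidates))
    cum = np.array([(h(f) - z_i) / eta for f, z_i in zip(candidates, z)])  # Delta_{f,0}
    predictions, total_loss = [], 0.0
    for x_t in stream:
        i_hat = int(np.argmin(cum))          # f_hat_t = arg inf of the perturbed cumulative loss
        predictions.append(i_hat)
        losses = np.array([sq_dist_to_polyline(x_t, f) for f in candidates])
        total_loss += losses[i_hat]          # instantaneous loss Delta(f_hat_t, x_t)
        cum += losses                        # add Delta(f, x_t) for every f in F_p
    return predictions, total_loss
```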

3 Regret bounds for sequential learning of principal curves

We now present our main theoretical results.

Theorem 1. For any sequence $(x_t)_{1:T} \in B(0, \sqrt{d}R)$, $R \geq 0$, and any penalty function $h : \mathcal{F}_p \to \mathbb{R}_+$, let $\pi(z) = e^{-z}\mathbb{1}_{\{z > 0\}}$ and let $0 < \eta \leq \frac{1}{d(2R+\delta)^2}$. Then the procedure described in Algorithm 1 satisfies
$$\sum_{t=1}^{T} \mathbb{E}_\pi\big[\Delta(\hat{\mathbf{f}}_t, x_t)\big] \leq \big(1 + c_0(e-1)\eta\big) S_{T,h,\eta} + \frac{1}{\eta}\Big(1 + \ln \sum_{\mathbf{f} \in \mathcal{F}_p} e^{-h(\mathbf{f})}\Big),$$
where $c_0 = d(2R+\delta)^2$ and
$$S_{T,h,\eta} = \inf_{k \in \llbracket 1, p \rrbracket} \; \inf_{\substack{\mathbf{f} \in \mathcal{F}_p \\ \mathcal{K}(\mathbf{f}) = k}} \Big\{ \sum_{t=1}^{T} \Delta(\mathbf{f}, x_t) + \frac{h(\mathbf{f})}{\eta} \Big\}.$$

The expectation of the cumulative loss of the polygonal lines $\hat{\mathbf{f}}_1, \dots, \hat{\mathbf{f}}_T$ is upper-bounded by the smallest penalised cumulative loss over all $k \in \{1, \dots, p\}$, up to a multiplicative term $(1 + c_0(e-1)\eta)$ which can be made arbitrarily close to 1 by choosing a small enough $\eta$. However, this will lead to both a large $h(\mathbf{f})/\eta$ in $S_{T,h,\eta}$ and a large $\frac{1}{\eta}\big(1 + \ln \sum_{\mathbf{f} \in \mathcal{F}_p} e^{-h(\mathbf{f})}\big)$. In addition, another important issue is the choice of the penalty function $h$. For each $\mathbf{f} \in \mathcal{F}_p$, $h(\mathbf{f})$ should be large enough to ensure a small $\sum_{\mathbf{f} \in \mathcal{F}_p} e^{-h(\mathbf{f})}$, yet not too large to avoid overpenalization and a larger value of $S_{T,h,\eta}$. We therefore set
$$h(\mathbf{f}) \geq \ln(pe) + \ln\big|\{\mathbf{f} \in \mathcal{F}_p,\ \mathcal{K}(\mathbf{f}) = k\}\big| \qquad (3)$$
for each $\mathbf{f}$ with $k$ segments (where $|M|$ denotes the cardinality of a set $M$), since it leads to
$$\sum_{\mathbf{f} \in \mathcal{F}_p} e^{-h(\mathbf{f})} = \sum_{k \in \llbracket 1, p \rrbracket} \; \sum_{\substack{\mathbf{f} \in \mathcal{F}_p \\ \mathcal{K}(\mathbf{f}) = k}} e^{-h(\mathbf{f})} \leq \sum_{k \in \llbracket 1, p \rrbracket} \frac{1}{pe} \leq \frac{1}{e}.$$
The penalty function $h(\mathbf{f}) = c_1 \mathcal{K}(\mathbf{f}) + c_2 L + c_3$ satisfies (3), where $c_1, c_2, c_3$ are constants depending on $R$, $d$, $\delta$, $p$ (this is proven in Lemma 3). We therefore obtain the following corollary.

Corollary 1. Under the assumptions of Theorem 1, let
$$\eta = \min\Bigg\{ \frac{1}{d(2R+\delta)^2},\ \sqrt{\frac{c_1 p + c_2 L + c_3}{c_0(e-1)\inf_{\mathbf{f} \in \mathcal{F}_p}\sum_{t=1}^{T}\Delta(\mathbf{f}, x_t)}} \Bigg\}.$$
Then
$$\sum_{t=1}^{T} \mathbb{E}\big[\Delta(\hat{\mathbf{f}}_t, x_t)\big] \leq \inf_{k \in \llbracket 1, p \rrbracket} \; \inf_{\substack{\mathbf{f} \in \mathcal{F}_p \\ \mathcal{K}(\mathbf{f}) = k}} \Big\{ \sum_{t=1}^{T} \Delta(\mathbf{f}, x_t) + \sqrt{c_0(e-1)r_{T,k,L}} \Big\} + \sqrt{c_0(e-1)r_{T,p,L}} + c_0(e-1)(c_1 p + c_2 L + c_3),$$
where $r_{T,k,L} = \inf_{\mathbf{f} \in \mathcal{F}_p} \sum_{t=1}^{T} \Delta(\mathbf{f}, x_t)\,(c_1 k + c_2 L + c_3)$.

Proof. Note that
$$\sum_{t=1}^{T} \mathbb{E}\big[\Delta(\hat{\mathbf{f}}_t, x_t)\big] \leq S_{T,h,\eta} + \eta\, c_0(e-1) \inf_{\mathbf{f} \in \mathcal{F}_p}\sum_{t=1}^{T}\Delta(\mathbf{f}, x_t) + c_0(e-1)(c_1 p + c_2 L + c_3),$$
and we conclude by setting
$$\eta = \sqrt{\frac{c_1 p + c_2 L + c_3}{c_0(e-1)\inf_{\mathbf{f} \in \mathcal{F}_p}\sum_{t=1}^{T}\Delta(\mathbf{f}, x_t)}}.$$

However, Corollary 1 is not usable in practice since the optimal value for $\eta$ depends on $\inf_{\mathbf{f} \in \mathcal{F}_p}\sum_{t=1}^{T}\Delta(\mathbf{f}, x_t)$, which is obviously unknown, even more so at time $t = 0$. We therefore provide an adaptive refinement of Algorithm 1 in the following Algorithm 2.

Algorithm 2 An adaptive algorithm to sequentially learn principal curves
1: Input parameters: $p > 0$, $L > 0$, $\pi$, $h$ and $\eta_0 = \frac{\sqrt{c_1 p + c_2 L + c_3}}{c_0\sqrt{e-1}}$
2: Initialization: for each $\mathbf{f} \in \mathcal{F}_p$, draw $z_{\mathbf{f}} \sim \pi$, set $\Delta_{\mathbf{f},0} = \frac{1}{\eta_0}(h(\mathbf{f}) - z_{\mathbf{f}})$ and $\hat{\mathbf{f}}_0 = \arg\inf_{\mathbf{f} \in \mathcal{F}_p} \Delta_{\mathbf{f},0}$
3: For $t = 1, \dots, T$
4:   Compute $\eta_t = \frac{\sqrt{c_1 p + c_2 L + c_3}}{\sqrt{c_0(e-1)t}}$
5:   Get data $x_t$ and compute $\Delta_{\mathbf{f},t} = \Delta(\mathbf{f}, x_t) + \big(\frac{1}{\eta_t} - \frac{1}{\eta_{t-1}}\big)(h(\mathbf{f}) - z_{\mathbf{f}})$
6:   Obtain
$$\hat{\mathbf{f}}_t = \arg\inf_{\mathbf{f} \in \mathcal{F}_p} \sum_{s=0}^{t-1} \Delta_{\mathbf{f},s}. \qquad (4)$$
7: End for
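The only change with respect to Algorithm 1 is the time-varying learning rate $\eta_t$ and the corresponding correction of the perturbation term; a sketch under the same assumptions as before (the constants $c_0, c_1, c_2, c_3$ are passed in, e.g. as obtained in Lemma 3).

```python
import math
import numpy as np

def algorithm2(stream, candidates, h, c0, c1, c2, c3, p, L, seed=0):
    """Algorithm 1 with the adaptive learning rate eta_t of Algorithm 2."""
    rng = np.random.default_rng(seed)
    z = rng.exponential(1.0, size=len(candidates))
    hs = np.array([h(f) for f in candidates])
    eta_prev = math.sqrt(c1 * p + c2 * L + c3) / (c0 * math.sqrt(math.e - 1))  # eta_0
    cum = (hs - z) / eta_prev                          # Delta_{f,0}
    predictions = []
    for t, x_t in enumerate(stream, start=1):
        eta_t = math.sqrt(c1 * p + c2 * L + c3) / math.sqrt(c0 * (math.e - 1) * t)
        i_hat = int(np.argmin(cum))                    # f_hat_t over the past rounds only
        predictions.append(i_hat)
        losses = np.array([sq_dist_to_polyline(x_t, f) for f in candidates])
        # Delta_{f,t} = Delta(f, x_t) + (1/eta_t - 1/eta_{t-1}) (h(f) - z_f)
        cum += losses + (1.0 / eta_t - 1.0 / eta_prev) * (hs - z)
        eta_prev = eta_t
    return predictions
```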

Theorem 2. For any sequence $(x_t)_{1:T} \in B(0, \sqrt{d}R)$, $R \geq 0$, let $h(\mathbf{f}) = c_1\mathcal{K}(\mathbf{f}) + c_2 L + c_3$, where $c_1, c_2, c_3$ are constants depending on $R$, $d$, $\delta$, $\ln p$. Let $\pi(z) = e^{-z}\mathbb{1}_{\{z>0\}}$ and
$$\eta_0 = \frac{\sqrt{c_1 p + c_2 L + c_3}}{c_0\sqrt{e-1}}, \qquad \eta_t = \frac{\sqrt{c_1 p + c_2 L + c_3}}{\sqrt{c_0(e-1)t}},\ t \geq 1,$$
where $c_0 = d(2R+\delta)^2$. Then the procedure described in Algorithm 2 satisfies
$$\sum_{t=1}^{T} \mathbb{E}\big[\Delta(\hat{\mathbf{f}}_t, x_t)\big] \leq \inf_{k \in \llbracket 1, p \rrbracket} \; \inf_{\substack{\mathbf{f} \in \mathcal{F}_p \\ \mathcal{K}(\mathbf{f}) = k}} \Big\{ \sum_{t=1}^{T} \Delta(\mathbf{f}, x_t) + \sqrt{c_0(e-1)T(c_1 k + c_2 L + c_3)} \Big\} + 2\sqrt{c_0(e-1)T(c_1 p + c_2 L + c_3)}.$$

The message of this regret bound is that the expected cumulative loss of the polygonal lines $\hat{\mathbf{f}}_1, \dots, \hat{\mathbf{f}}_T$ is upper-bounded by the minimal cumulative loss over all $k \in \{1, \dots, p\}$, up to an additive term which is sublinear in $T$. The actual magnitude of this remainder term is $\sqrt{kT}$. When $L$ is fixed, the number $k$ of segments is a measure of complexity of the retained polygonal line. This bound therefore has the same magnitude as (1), which is the most refined bound in the literature so far (Biau and Fischer, 2012, where the optimal values for $k$ and $L$ are obtained in a model selection fashion).

4 Implementation

The argument of the infimum in Algorithm 2 is taken over $\mathcal{F}_p = \cup_{k=1}^{p}\mathcal{F}_{k,L}$, which has a cardinality of order $|Q_\delta|^p$, making any greedy search largely time-consuming. We instead turn to the following strategy: given a polygonal line $\hat{\mathbf{f}}_t \in \mathcal{F}_{k_t,L}$ with $k_t$ segments, we consider, with a certain proportion, candidates for $\hat{\mathbf{f}}_{t+1}$ within a neighbourhood $\mathcal{U}(\hat{\mathbf{f}}_t)$ (see the formal definition below) of $\hat{\mathbf{f}}_t$. This is well suited to the principal curve setting, since if an observation $x_t$ is close to $\hat{\mathbf{f}}_t$, one can expect that the polygonal line which fits the observations $x_s$, $s = 1, \dots, t$, well lies in a neighbourhood of $\hat{\mathbf{f}}_t$. In addition, if each polygonal line $\mathbf{f}$ is regarded as an action, we no longer assume that all actions are available at all times, and we allow the set of available actions to vary at each time. This is a model known as "sleeping experts (or actions)" in prior work (Auer et al., 2003; Kleinberg et al., 2008). In this setting, defining the regret with respect to the best action in the whole set of actions in hindsight is difficult, since that action might sometimes be unavailable. It is hence natural to define the regret with respect to the best ranking of all actions in hindsight according to their losses or rewards, and at each round to choose, among the available actions, the one which ranks highest in this ordering. Kleinberg et al. (2008) introduced this notion of regret and studied both the full-information (best action) and partial-information (multi-armed bandit) settings with stochastic and adversarial rewards and adversarial action availability. They pointed out that the EXP4 algorithm (Auer et al., 2003) attains the optimal regret in the adversarial rewards case but runs in time exponential in the number of all actions. Kanade et al. (2009) considered full and partial information with stochastic action availability and proposed an algorithm that runs in polynomial time. In what follows, we will realise our implementation by resorting to "sleeping experts", i.e., a special set of available actions that adapts to the setting of principal curves.

Let $\sigma$ denote an ordering of the $|\mathcal{F}_p|$ actions, and $A_t$ an available subset of the actions at round $t$. We let $\sigma(A_t)$ denote the highest ranked action in $A_t$. In addition, for any action $\mathbf{f} \in \mathcal{F}_p$ we define the reward $r_{\mathbf{f},t}$ of $\mathbf{f}$ at round $t$, $t \geq 0$, by $r_{\mathbf{f},t} = c_0 - \Delta(\mathbf{f}, x_t)$. It is clear that $r_{\mathbf{f},t} \in (0, c_0)$. The conversion from losses to gains is done in order to facilitate the subsequent performance analysis. The reward of an ordering $\sigma$ is the cumulative reward of the selected action at each time,
$$\sum_{t=1}^{T} r_{\sigma(A_t),t},$$
and the reward of the best ordering is $\max_\sigma \sum_{t=1}^{T} r_{\sigma(A_t),t}$ (or $\mathbb{E}\big[\max_\sigma \sum_{t=1}^{T} r_{\sigma(A_t),t}\big]$ when $A_t$ is stochastic).
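In code, the loss-to-gain conversion and the selection rule $\sigma(A_t)$ read, for instance, as follows (a small illustrative sketch; `ranking` is any ordering of candidate indices).

```python
def reward(c0, loss):
    """r_{f,t} = c0 - Delta(f, x_t), so that rewards lie in (0, c0)."""
    return c0 - loss

def pick(ranking, available):
    """sigma(A_t): the highest-ranked action among the available ones."""
    available = set(available)
    return next(i for i in ranking if i in available)
```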

Our procedure starts with a partition step, which identifies the "relevant" neighbourhood of an observation $x \in \mathbb{R}^d$ with respect to a given polygonal line; we then define the neighbourhood of an action $\mathbf{f}$, and finally we give the details of our implementation and its regret bound.

Partition. For any polygonal line $\mathbf{f}$ with $k$ segments, we denote by $V = (v_1, \dots, v_{k+1})$ its vertices and by $s_i$, $i = 1, \dots, k$, the line segments connecting $v_i$ and $v_{i+1}$. In the sequel, we use $\mathbf{f}(V)$ to represent the polygonal line formed by connecting consecutive vertices in $V$ if no confusion is possible. In addition, for $x \in \mathbb{R}^d$, let $\Delta(x, v_i) = \|x - v_i\|_2^2$ be the squared distance between $x$ and $v_i$, and let $\Delta(x, s_i)$ be the squared distance between $x$ and the line segment $s_i$, i.e.,
$$\Delta(x, s_i) = \begin{cases} \|x - v_i\|_2^2 & \text{if } s_{s_i}(x) = v_i, \\ \|x - v_{i+1}\|_2^2 & \text{if } s_{s_i}(x) = v_{i+1}, \\ \|x - v_i\|_2^2 - \Big( (x - v_i)^\top \dfrac{v_{i+1} - v_i}{\|v_{i+1} - v_i\|_2} \Big)^2 & \text{otherwise}, \end{cases}$$
where $s_{s_i}(x)$ is the projection index of $x$ onto the line segment $s_i$. In this step, $\mathbb{R}^d$ can be partitioned into at most $(2k+1)$ disjoint sets $V_i$, $i = 1, \dots, k+1$, and $S_i$, $i = 1, \dots, k$, defined as
$$V_i = \Big\{ x \in \mathbb{R}^d : \Delta(x, v_i) = \Delta(\mathbf{f}, x),\ x \notin \cup_{s=1}^{i-1} V_s \Big\}, \qquad S_i = \Big\{ x \in \mathbb{R}^d : \Delta(x, s_i) = \Delta(\mathbf{f}, x),\ x \notin \big(\cup_{m=1}^{i-1} S_m\big) \cup \big(\cup_{j=1}^{k+1} V_j\big) \Big\},$$
where $V_i$ is the set of all $x \in \mathbb{R}^d$ whose closest vertex of $\mathbf{f}$ is $v_i$, and $S_i$ is the set of all $x \in \mathbb{R}^d$ whose closest segment of $\mathbf{f}$ is $s_i$. In the sequel, we use $V_{i:j}$, $i \leq j$ (resp., $S_{i:j}$ and $v_{i:j}$) to abbreviate the sequence of partitions $V_i, \dots, V_j$ (resp., $S_i, \dots, S_j$ and $v_i, \dots, v_j$). For any $x \in \mathbb{R}^d$, let us denote by $\mathcal{V}(x)$ the set of vertices of $\mathbf{f}$ connecting to the projection $\mathbf{f}(s_{\mathbf{f}}(x))$ and by $\mathcal{N}(x)$ the neighbourhood of $x$ with respect to $\mathbf{f}$, i.e.,
$$\mathcal{V}(x) = \begin{cases} v_{1:2} & \text{if } x \in V_1 \cup S_1, \\ v_{1:3} & \text{if } x \in V_2, \\ v_{i:i+1} & \text{if } x \in S_i,\ i = 2, \dots, k-1, \\ v_{i-1:i+1} & \text{if } x \in V_i,\ i = 3, \dots, k-1, \\ v_{k-1:k+1} & \text{if } x \in V_k, \\ v_{k:k+1} & \text{if } x \in S_k \cup V_{k+1}, \end{cases} \qquad \mathcal{N}(x) = \begin{cases} V_{1:2} \cup S_{1:2} & \text{if } x \in V_1 \cup S_1, \\ V_{1:3} \cup S_{1:3} & \text{if } x \in V_2, \\ V_{i:i+1} \cup S_{i-1:i+1} & \text{if } x \in S_i,\ i = 2, \dots, k-1, \\ V_{i-1:i+1} \cup S_{i-2:i+1} & \text{if } x \in V_i,\ i = 3, \dots, k-1, \\ V_{k-1:k+1} \cup S_{k-2:k} & \text{if } x \in V_k, \\ V_{k:k+1} \cup S_{k-1:k} & \text{if } x \in S_k \cup V_{k+1}. \end{cases}$$
Note that those definitions of $\mathcal{V}(x)$ and $\mathcal{N}(x)$ hold whenever $k \geq 4$; if $1 \leq k < 4$, $\mathcal{V}(x)$ and $\mathcal{N}(x)$ are defined similarly. Next, let
$$\mathcal{N}_t(x) = \big\{ x_s,\ s = 1, \dots, t,\ x_s \in \mathcal{N}(x) \big\}$$
be the set of observations $x_{1:t}$ belonging to $\mathcal{N}(x)$. We finally let $\bar{\mathcal{N}}_t(x)$ stand for the average of all observations in $\mathcal{N}_t(x)$, and $D(M) = \sup_{x,y \in M}\|x - y\|_2$ denote the diameter of a set $M \subset \mathbb{R}^d$. We initiate the principal curve $\hat{\mathbf{f}}_1$ as the first-component line segment whose vertices are the two farthest projections of the data $x_{1:t_0}$ onto the first principal component line. Given $\hat{\mathbf{f}}_t$, $t \geq 1$, with $k_t$ segments, we first obtain the Voronoi-like partition $V_{1:k_t+1}$ and $S_{1:k_t}$, then identify the adjacent vertices set $\mathcal{V}(x_{t+t_0}) = v_{i_t:j_t}$ and the set $\mathcal{N}_{t+t_0}(x_{t+t_0})$ of observations $x_{1:t+t_0}$ lying in $\mathcal{N}(x_{t+t_0})$.

Neighbourhood. For each $t$, let us first define the local vertices set $Q_{\delta,t}(x)$ around $x \in \mathbb{R}^d$ as
$$Q_{\delta,t}(x) = B\big(\bar{\mathcal{N}}_t(x), D(\mathcal{N}_t(x))\big) \cap Q_\delta,$$


where $B(x, r)$ is the $\ell_2$-ball in $\mathbb{R}^d$ centered at $x$ with radius $r > 0$. Then for $k \in \{k_t - 1, k_t, k_t + 1\}$, a candidate $V$ (equivalently, a candidate $\mathbf{f}(V)$) with $k+1$ vertices is formed as
$$V = \big( v_{1:i_t-1},\ v'_{1:m},\ v_{j_t+1:k_t+1} \big),$$
where $m = k + 1 - k_t + j_t - i_t$ and $v'_{1:m}$ are $m$ distinct vertices chosen from the local vertices set $Q_{\delta,t+t_0}(x_{t+t_0})$ around $x_{t+t_0}$. In other words, a candidate $V$ is built by keeping all the vertices of $\hat{\mathbf{f}}_t$ that do not belong to $\mathcal{V}(x_{t+t_0})$ and by updating $m$ vertices drawn from $Q_{\delta,t+t_0}(x_{t+t_0})$. We define the neighbourhood $\mathcal{U}(\hat{\mathbf{f}}_t)$ of $\hat{\mathbf{f}}_t$ by
$$\mathcal{U}(\hat{\mathbf{f}}_t) = \Big\{ \mathbf{f}(V),\ V \text{ such that } v'_{1:m} \in Q_{\delta,t+t_0}(x_{t+t_0}) \text{ for } k \in \{k_t - 1, k_t, k_t + 1\} \Big\}. \qquad (5)$$
The cardinality $|\mathcal{U}(\hat{\mathbf{f}}_t)|$ is upper bounded by $|\mathcal{F}_p|^{3/p}$, since all elements of $\mathcal{U}(\hat{\mathbf{f}}_t)$ differ from $\hat{\mathbf{f}}_t$ by at most three vertices.
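A possible way to enumerate such candidates is sketched below, assuming the grid $Q_\delta$ is stored as an array of points and reusing the NumPy conventions of the earlier sketches; the function names and the exhaustive enumeration are ours, and a real implementation would prune this set.

```python
import numpy as np
from itertools import combinations

def local_vertex_set(neighbourhood_obs, grid):
    """Q_{delta,t}(x): grid points in the ball centred at the local mean, radius = local diameter."""
    obs = np.asarray(neighbourhood_obs)
    centre = obs.mean(axis=0)
    diam = max(float(np.linalg.norm(a - b)) for a in obs for b in obs)
    return grid[np.linalg.norm(grid - centre, axis=1) <= diam]

def candidate_lines(vertices, i_t, j_t, neighbourhood_obs, grid):
    """Candidates keeping the vertices outside v_{i_t:j_t} and redrawing m vertices locally."""
    k_t = len(vertices) - 1                           # current number of segments
    local = local_vertex_set(neighbourhood_obs, grid)
    head, tail = vertices[:i_t - 1], vertices[j_t:]   # 1-based indices, as in the text
    for k in (k_t - 1, k_t, k_t + 1):                 # one segment less, equal, or more
        m = k + 1 - k_t + j_t - i_t                   # number of redrawn vertices
        if m < 1 or m > len(local):
            continue
        for new_vs in combinations(range(len(local)), m):
            yield np.vstack([head, local[list(new_vs)], tail])
```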

Finally, we give in Algorithm 3 our implementation, in which the set of available actions may vary over time.

Algorithm 3 A locally greedy algorithm to sequentially learn principal curves
1: Input parameters: $p > 0$, $R > 0$, $L > 0$, $\epsilon > 0$, $\alpha > 0$, $1 > \beta > 0$ and any penalty function $h$
2: Initialization: given $(x_t)_{1:t_0}$, obtain $\hat{\mathbf{f}}_1$ as the first principal component
3: For $t = 2, \dots, T$
4:   Draw $I_t \sim \mathrm{Bernoulli}(\epsilon)$ and $z_{\mathbf{f}} \sim \pi$
5:   Let $\hat{\sigma}^t = \mathrm{sort}\big(\mathbf{f}, \sum_{s=1}^{t-1}\hat{r}_{\mathbf{f},s} - \frac{1}{\eta_{t-1}}h(\mathbf{f}) + \frac{1}{\eta_{t-1}}z_{\mathbf{f}}\big)$, i.e., sort all $\mathbf{f} \in \mathcal{F}_p$ in descending order of their perturbed cumulative reward up to time $t-1$
6:   If $I_t = 1$, set $A_t = \mathcal{F}_p$ and $\hat{\mathbf{f}}_t = \hat{\sigma}^t(A_t)$, observe $r_{\hat{\mathbf{f}}_t,t}$ and set $\hat{r}_{\mathbf{f},t} = r_{\mathbf{f},t}$ for all $\mathbf{f} \in \mathcal{F}_p$
7:   If $I_t = 0$, set $A_t = \mathcal{U}(\hat{\mathbf{f}}_{t-1})$ and $\hat{\mathbf{f}}_t = \hat{\sigma}^t(A_t)$, observe $r_{\hat{\mathbf{f}}_t,t}$ and set
$$\hat{r}_{\mathbf{f},t} = \begin{cases} \dfrac{r_{\mathbf{f},t}}{\mathbb{P}(\hat{\mathbf{f}}_t = \mathbf{f} \mid \mathcal{H}_t)} & \text{if } \mathbf{f} \in \mathcal{U}(\hat{\mathbf{f}}_{t-1}) \cap \mathrm{cond}(t) \text{ and } \hat{\mathbf{f}}_t = \mathbf{f}, \\ \alpha & \text{otherwise}, \end{cases}$$
   where $\mathcal{H}_t$ denotes all the randomness before time $t$ and $\mathrm{cond}(t) = \big\{\mathbf{f} \in \mathcal{F}_p : \mathbb{P}(\hat{\mathbf{f}}_t = \mathbf{f} \mid \mathcal{H}_t) > \beta\big\}$. In particular, when $t = 1$, we set $\hat{r}_{\mathbf{f},1} = r_{\mathbf{f},1}$ for all $\mathbf{f} \in \mathcal{F}_p$, $\mathcal{U}(\hat{\mathbf{f}}_0) = \emptyset$ and $\hat{r}_{\hat{\sigma}^1(\mathcal{U}(\hat{\mathbf{f}}_0)),1} \equiv 0$
8: End for
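A condensed sketch of the exploration/exploitation loop of Algorithm 3 follows. The probability $\mathbb{P}(\hat{\mathbf{f}}_t = \mathbf{f} \mid \mathcal{H}_t)$ is replaced here by a crude placeholder estimate (our simplification; the paper notes that an explicit form is available for exponential perturbations), and `neighbourhood(i)` is assumed to return the indices of $\mathcal{U}(\hat{\mathbf{f}}_{t-1})$ within `candidates`.

```python
import numpy as np

def algorithm3(stream, candidates, h, neighbourhood, eta, eps, alpha, beta, c0, seed=0):
    """Locally greedy sequential learning with sleeping experts (simplified sketch)."""
    rng = np.random.default_rng(seed)
    n = len(candidates)
    z = rng.exponential(1.0, size=n)
    hs = np.array([h(f) for f in candidates])
    cum_rhat = np.zeros(n)                            # cumulative estimated rewards
    f_prev = 0                                        # index of f_hat_{t-1}
    for x_t in stream:
        scores = cum_rhat - hs / eta + z / eta        # perturbed cumulative rewards (step 5)
        explore = rng.random() < eps                  # I_t ~ Bernoulli(eps)
        avail = np.arange(n) if explore else np.asarray(sorted(neighbourhood(f_prev)))
        f_hat = int(avail[np.argmax(scores[avail])])  # highest-ranked available action
        losses = np.array([sq_dist_to_polyline(x_t, f) for f in candidates])
        rewards = c0 - losses
        if explore:
            cum_rhat += rewards                       # exploration: all rewards observed
        else:
            rhat = np.full(n, alpha)                  # unobserved rewards estimated by alpha
            p_hat = max(beta, 1.0 / len(avail))       # placeholder for P(f_hat_t = f | H_t) (assumption)
            rhat[f_hat] = rewards[f_hat] / p_hat      # importance-weighted reward of the chosen action
            cum_rhat += rhat
        f_prev = f_hat
    return candidates[f_prev]
```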

Algorithm 3 has an exploration phase (when $I_t = 1$) and an exploitation phase ($I_t = 0$). In the exploration phase, the rewards of all actions may be observed, and an optimal perturbed action is chosen from the set $\mathcal{F}_p$ of all actions. In the exploitation phase, only the rewards of a subset of actions can be accessed, the rewards of the other actions are estimated by a constant, and the action is updated within the neighbourhood $\mathcal{U}(\hat{\mathbf{f}}_{t-1})$ of the previous action $\hat{\mathbf{f}}_{t-1}$. This local update (or search) hence reduces the computational complexity. In addition, this local search is enough to account for the case when $x_t$ is located near $\hat{\mathbf{f}}_{t-1}$. The parameter $\beta$ needs to be carefully calibrated. On the one hand, $\beta$ should not be too large, so that the set $\mathrm{cond}(t)$ is non-empty; otherwise, all rewards are estimated by the same constant, leading to the same descending ordering for the tuples $\big(\sum_{s=1}^{t-1}\hat{r}_{\mathbf{f},s},\ \mathbf{f} \in \mathcal{F}_p\big)$ and $\big(\sum_{s=1}^{t}\hat{r}_{\mathbf{f},s},\ \mathbf{f} \in \mathcal{F}_p\big)$, and we may then face the risk of having $\hat{\mathbf{f}}_{t+1}$ in the neighbourhood of $\hat{\mathbf{f}}_t$ even if we are in the exploration phase at time $t+1$. On the other hand, a very small $\beta$ could result in a large bias of the estimate $r_{\mathbf{f},t}/\mathbb{P}(\hat{\mathbf{f}}_t = \mathbf{f} \mid \mathcal{H}_t)$ of $r_{\mathbf{f},t}$. In addition, the exploitation phase is similar to, yet different from, label efficient prediction (Cesa-Bianchi et al., 2005, Remark 1.1), since we allow the action at time $t$ to differ from the previous one. Moreover, Neu and Bartók (2013) proposed the Geometric Resampling method to estimate the conditional probability $\mathbb{P}(\hat{\mathbf{f}}_t = \mathbf{f} \mid \mathcal{H}_t)$, since this quantity often does not have an explicit form. However, due to the simple exponential distribution of $z_{\mathbf{f}}$ chosen in our case, an explicit form of $\mathbb{P}(\hat{\mathbf{f}}_t = \mathbf{f} \mid \mathcal{H}_t)$ is straightforward.

α=

p η1 = η2 = . . . , = η T =

c0 , β

c1 p + c2 L + c3 , p T(e − 1) cˆ0

cˆ0 =

2c 0 , β

¯ ¯1−3 1 ² = 1 − ¯F p ¯ 2 p T − 4 .

Then the procedure described in Algorithm 3 satisfies the regret bound " # T T X X £ ¡ ¢¤ 3 E ∆ fˆ t , x t ≤ inf E ∆ (f, t) + O (T 4 ). t=1

f∈F p

t=1

Proof. With the given value of parameters, the assumption of Lemma 4, Lemma 5 and Lemma 6 are satisfied; Combining their inequalities lead to the following ( " )# i h T T T ¯ ¡ X X X ¢¯ 1 r σ(At ),t − h (σ (A t )) − 2αβ(1 − ²) ¯U fˆ t−1 ¯ E r fˆ t ,t ≥ E max σ η t=1 t=1 t=1

− cˆ20 (e − 1)ηT − cˆ0 (e − 1) (c 1 p + c 2 L + c 3 ) v " # µ ¶ u ¯ ¯ ¯ ¯ ¢u c20 ¡ 1 2 t 2 ¯ ¯ − ¯F p ¯ β c 0 T − 1 − Fp β 2T + α (1 − β) + (c 0 + 2α) ln β β " ( )# T X ¯ ¯3 1 ≥E max r σ(At ),t − h (σ (A t )) − (1 − ²) ¯F p ¯ p c 0 T σ η t=1 − cˆ20 (e − 1)ηT − cˆ0 (e − 1) (c 1 p + c 2 L + c 3 ) v " # µ ¶ u ¯ ¯ ¢u ¯ ¯ c20 ¡ 1 2 t 2 − 1 − ¯F p ¯ β 2T + α (1 − β) + (c 0 + 2α) ln − ¯F p ¯ β c 0 T β β " ( )# ³¯ ¯ 1 3 ´ T X 1 ≥E max r σ(At ),t − h (σ (A t )) − O ¯F p ¯ 2 T 4 , σ η t=1 ¯ ¡ ¢¯ where the second inequality is due to the fact that the cardinality ¯U fˆ t−1 ¯ is upper bounded ¯ ¯3 by ¯F p ¯ p for t ≥ 1. In addition, using the definition of r f,t that r f,t = c 0 − ∆(f, x t ) terminates

the proof of Theorem 3.


Figure 5: A principal curve (green) sequentially learned at $t = 25, 50, 75, 100$ (black dots represent the data $x_{1:t}$, red is the new observation $x_{t+1}$).

One can see that the regret is upper bounded by a term of order $|\mathcal{F}_p|^{\frac{1}{2}} T^{\frac{3}{4}}$, sublinear in $T$. The term $(1-\epsilon)|\mathcal{F}_p|^{3/p} c_0 T = c_0|\mathcal{F}_p|^{\frac{1}{2}} T^{\frac{3}{4}}$ is the price to pay for the local search (with proportion $1-\epsilon$) of the polygonal line $\hat{\mathbf{f}}_t$ in the neighbourhood of the previous $\hat{\mathbf{f}}_{t-1}$. If $\epsilon = 1$, we would have $\hat{c}_0 = c_0$ and the last two terms in the first inequality of Theorem 3 would disappear, hence the upper bound would reduce to the one in Theorem 2. In addition, our algorithm achieves a better order (in terms of both the number $|\mathcal{F}_p|$ of actions and the number of rounds $T$) than Kanade et al. (2009), since at each time the set of available actions for our algorithm can be either the whole action set or a neighbourhood of the previous action, whereas Kanade et al. (2009) consider at each time only a partial and independent stochastic set of available actions generated from a predefined distribution.

5 Numerical experiments

Finally, we illustrate the performance of Algorithm 3 on a toy example and on seismic data (from Engdahl and Villaseñor, 2002).

A toy example. The data set $\{x_t \in \mathbb{R}^2,\ t = 1, \dots, 100\}$ attached to this example is generated uniformly along the curve $y = -0.05 \times (x-5)^3$, $x \in [0, 10]$. We set $p = 20$, $\delta = 0.5$, $R = 10$, $d = 2$, $L = 0.01 \times p \times R \times \sqrt{d}$. Figure 5 presents the sequentially learned principal curve $\hat{\mathbf{f}}_{t+1}$ at times $t = 25, 50, 75, 100$. A zoom on what happens between two consecutive time stamps is presented in Figure 6. We see that collecting a new data point only impacts a local vertex, whereas the rest of the principal curve remains unchanged. The implementation is conducted in the R language and the code is available upon request.
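For reproducibility, the toy data can be generated, for instance, as follows (the uniform sampling of $x$ on $[0, 10]$ and the absence of noise reflect our reading of the description above; the authors' own code is in R and available upon request).

```python
import numpy as np

def toy_data(n=100, seed=42):
    """n points drawn along y = -0.05 * (x - 5)^3 with x uniform on [0, 10]."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 10.0, size=n)
    y = -0.05 * (x - 5.0) ** 3
    return np.column_stack([x, y])
```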


Figure 6: Zooming in ($t = 97$ and $t = 98$): how a new data point changes the principal curve.

Seismic data. Seismic data spanning long periods of time are essential for a thorough understanding of earthquakes. The "Centennial Earthquake Catalog" (Engdahl and Villaseñor, 2002) aims at providing a realistic picture of the seismicity distribution on Earth. It consists of a global catalog of locations and magnitudes of instrumentally recorded earthquakes from 1900 to 2008. Figure 7 is taken from the USGS website (https://earthquake.usgs.gov/data/centennial/) and shows the global earthquake locations for the period 1900–1999. The seismic data (latitude, longitude, magnitude of earthquakes, etc.) used in the present paper may be downloaded from this website. We focus on a particularly representative seismically active zone (a lithospheric border close to Australia) with longitude between E130° and E180° and latitude between S70° and N30°. The number of seismic recordings inside this area is $T = 218$. Figure 8 presents the learned principal curve: our procedure nicely recovers the tectonic plate boundary.

Figure 7: Seismic data from https://earthquake.usgs.gov/data/centennial/

6 Proofs

This section contains the proof of Theorem 2 (note that Theorem 1 is a straightforward consequence, with $\eta_t = \eta$ for $t = 0, \dots, T$) and the lemmas needed for Theorem 3. Let us first


Figure 8: Seismic data at $t = 50, 75, 100, 125$: the learned principal curve (green) recovers the tectonic plate boundary (black dots represent the data $x_{1:t}$, red is the new observation $x_{t+1}$).

define for each $t = 0, \dots, T$ the following forecaster sequence $(\hat{\mathbf{f}}^\star_t)_t$:
$$\hat{\mathbf{f}}^\star_0 = \arg\inf_{\mathbf{f} \in \mathcal{F}_p} \Delta_{\mathbf{f},0} = \arg\inf_{\mathbf{f} \in \mathcal{F}_p}\Big\{\frac{1}{\eta_0}h(\mathbf{f}) - \frac{1}{\eta_0}z_{\mathbf{f}}\Big\}, \qquad \hat{\mathbf{f}}^\star_t = \arg\inf_{\mathbf{f} \in \mathcal{F}_p} \sum_{s=0}^{t}\Delta_{\mathbf{f},s} = \arg\inf_{\mathbf{f} \in \mathcal{F}_p}\Big\{\sum_{s=1}^{t}\Delta(\mathbf{f}, x_s) + \frac{1}{\eta_{t-1}}h(\mathbf{f}) - \frac{1}{\eta_{t-1}}z_{\mathbf{f}}\Big\}, \quad t \geq 1.$$
Note that $\hat{\mathbf{f}}^\star_t$ is an "illegal" forecaster since it peeks into the future. In addition, denote by
$$\mathbf{f}^\star = \arg\inf_{\mathbf{f} \in \mathcal{F}_p}\Big\{\sum_{t=1}^{T}\Delta(\mathbf{f}, x_t) + \frac{1}{\eta_T}h(\mathbf{f})\Big\}$$
the polygonal line in $\mathcal{F}_p$ which minimizes the cumulative loss in the first $T$ rounds plus a penalty term. Note that $\mathbf{f}^\star$ is deterministic while $\hat{\mathbf{f}}^\star_t$ is a random quantity (since it depends on $z_{\mathbf{f}}$, $\mathbf{f} \in \mathcal{F}_p$, drawn from $\pi$). If several $\mathbf{f}$ attain the infimum, we choose $\mathbf{f}^\star$ as the one having the smallest complexity. We now enunciate the first (out of three) intermediary technical result.

Lemma 1. For any sequence $x_1, \dots, x_T$ in $B(0, \sqrt{d}R)$,
$$\sum_{t=0}^{T} \Delta_{\hat{\mathbf{f}}^\star_t, t} \leq \sum_{t=0}^{T} \Delta_{\hat{\mathbf{f}}^\star_T, t}, \qquad \pi\text{-almost surely}. \qquad (6)$$

∆fˆ? ,t ≤ t

TX −1 t=0

∆fˆ?

T −1

,t .

Adding ∆fˆ? ,T to both sides of the above inequality concludes the proof. T

14

By (6) and the definition of fˆ? , for k ≥ 1, we have π-almost surely that T T X

T X

¶ ´ T µ 1 X 1 ˆ? 1 ³ ˆ? 1 ? ˆ ∆(fT , x t ) + − h(f t ) − Zfˆ? h(fT ) − Zfˆ? + t ηT η T T t=0 η t−1 η t t=1 ¶ ´ T T µ 1 X X 1 1 1 ³ ˆ? ∆(f? , x t ) + ≤ h(f? ) − Zf? + − h(f t ) − Zfˆ? t ηT ηT ηt t=1 t=0 η t−1 ) ( µ ¶ ´ T T X X 1 1 1 1 ³ ˆ? ∆(f, x t ) + h(f) − Zf? + − h(f t ) − Zfˆ? , = inf t f∈F p t=1 ηT ηT ηt t=0 η t−1

∆(fˆ? t , xt ) ≤

t=1

where 1/η −1 = 0 by convention. The second and third inequality is due to respectively the definition of fˆ? and f? . Hence T T # ( ) ·µ ¶ ´¸ T T X X ¡? ¢ 1 1 1 1 ³ ∆(f, x t ) + ∆ fˆ t , x t ≤ inf E h(f) − E[Zf? ] + E − − h(fˆ? ) + Z ? t fˆ t T f∈F p t=1 ηT ηT η t η t−1 t=1 t=0 ) " # ( ¶ T µ 1 T X X 1 1 h(f) + − E sup (− h(f) + Zf ) ≤ inf ∆(f, x t ) + f∈F p t=1 ηT η t−1 f∈F p t=1 η t ( ) " # T X 1 1 ∆(f, x t ) + = inf h(f) + E sup (− h(f) + Zf ) , f∈F p t=1 ηT η T f∈F p "

T X

´ ³ where the second inequality is due to E[Zf? ] = 0 and η1t − η t1−1 > 0 for t = 0, 1, . . . , T since η t T is decreasing in t in Theorem 2. In addition, for y ≥ 0, one has P (− h(f) + Zf > y) = e−h(f)− y .

Hence, for any y ≥ 0 Ã

!

P sup (− h(f) + Zf ) > y ≤ f∈F p

where u =

− h(f) . f∈F p e

P

"

X

P (Zf ≥ h(f) + y) =

f∈F p

X

e−h(f) e− y = ue− y ,

f∈F p

Therefore, we have #

"

Ã

!#

E sup (− h(f) + Zf ) − ln u ≤ E max 0, sup (− h(f) + Zf − ln u) f∈F p

f∈F p ∞

Z



0



0



0

!

!

P max 0, sup (− h(f) + Zf − ln u) > y dy Ã

!

P sup (− h(f) + Zf ) > y + ln u dy f∈F p



Z

à f∈F p



Z

Ã

ue−( y+ln u) dy = 1.

We thus obtain
$$\sum_{t=1}^{T}\mathbb{E}\big[\Delta(\hat{\mathbf{f}}^\star_t, x_t)\big] \leq \inf_{\mathbf{f} \in \mathcal{F}_p}\Big\{\sum_{t=1}^{T}\Delta(\mathbf{f}, x_t) + \frac{1}{\eta_T}h(\mathbf{f})\Big\} + \frac{1}{\eta_T}\Big(1 + \ln\sum_{\mathbf{f} \in \mathcal{F}_p}e^{-h(\mathbf{f})}\Big). \qquad (7)$$
Next, we control the regret of Algorithm 2.

Lemma 2. Assume that zf is sampled from the symmetric exponential distribution in R, i.e., π(z) = e− z 1{ z>0} . Assume that sup t=1,...,T η t−1 ≤ d (2R1+δ)2 , and define c 0 = d(2R + δ)2 . Then for p any sequence (x t ) ∈ B(0, dR), t = 1, . . . , T, T ¡ T X ¢ £ ¡ ¢¤ £ ¡ ¢¤ X 1 + η t−1 c 0 (e − 1) E ∆ fˆ? E ∆ fˆ t , x t ≤ t , xt .

(8)

t=1

t=1

Proof. Let us denote by à à ! ! tX −1 ¡ ¢ 1 1 F t (Zf ) = ∆ fˆ t , x t = ∆ arg inf ∆(f, xs ) + h(f) − Zf , x t η t−1 η t−1 f∈F s=1

the instantaneous loss suffered by the polygonal line fˆ t when x t is obtained. We have Z ¡? ¢ ¡ ¢ ˆ E[∆ f t , x t ] = F t z − η t−1 ∆ (f, x t ) π(z)dz Z ¡ ¢ = F t (z)π z + η t−1 ∆(f, x t ) dz Z = F t (z)e−( z+η t−1 ∆(f,xt )) dz Z 2 ≥ e−η t−1 d (2R +δ) F t (z)e− z dz ¡ ¢ 2 = e−η t−1 d (2R +δ) E[∆ fˆ t , x t ], where the inequality is due to the fact that ∆(f, x) ≤ d(2R + δ)2 holds uniformly for any f ∈ F p p and x ∈ B(0, dR). Finally, summing on t on both sides and using the elementary inequality e x ≤ 1 + (e − 1)x if x ∈ (0, 1) concludes the proof. © ª Lemma 3. For k ∈ ‚1, pƒ, we control the cardinality of set f ∈ F p , K (f) = k as ! Ãp µ ¶ ´ ¯© ª¯ ³ 3 ln 2 d d(2R + δ) ¯ ¯ ln f ∈ F p , K (f) = k ≤ ln(8peVd ) + 3d 2 − d k + p + L + d ln δ δ d δ ∆

= c1 k + c2 L + c3 ,

where Vd denotes the volume of the unit ball in Rd . Proof. First, let Nk,δ denote the set of polygonal lines with k segments and whose vertices are in Qδ . Notice that Nk,δ is different from {f ∈ F p , K (f) = k} and that à ! ¯ ¯ ¯ ¯ ¯{f ∈ F p , K (f) = k}¯ ≤ p ¯ N k,δ ¯ . k Hence à ! ¯ ¯ ¯ ¯ p ¯ ¯ ln {f ∈ F p , K (f) = k} ≤ ln + ln ¯ N k,δ ¯ k Ãp ! ¶ ³ ´ µ ln 2 3 pe d d(2R + δ) + k ln 8Vd + 3d 2 − d + p + L + d ln ≤ k ln k δ dδ δ Ãp ! µ ¶ ³ ´ 3 ln 2 d d(2R + δ) ≤ k ln(pe) + k ln 8Vd + 3d 2 − d + p + L + d ln , δ dδ δ

where the second inequality is a consequence to the elementary inequality bined with Lemma 2 in Kégl (1999). 16

¡ p¢ k



¡ pe ¢ k k

com-

We now have all the ingredients to prove Theorem 1 and Theorem 2. First, combining (7) and (8) yields that ( ) Ã ! T T X X X −h(f) £ ¤ 1 1 1 ˆ E ∆(f t , x t ) ≤ inf ∆(f, x t ) + h(f) + + ln e f∈F p t=1 ηT ηT 2 t=1 f∈F p + c 0 (e − 1)

T X

£ ¤ η t−1 E ∆(fˆ? t , xt )

t=1

≤ inf

  

(

inf

k∈‚1,pƒ   f∈F p K (f)= k

T X

∆(f, x t ) +

t=1

+ c 0 (e − 1)

T X

 ) h(f)  ηT

à ! X −h(f) 1 1 + + ln e   ηT 2 f∈F p

£ ¤ η t−1 E ∆(fˆ? t , xt ) .

t=1

³ ´ P Assume that η t = η, t = 0, . . . , T and h(f) = c 1 K (f)+ c 2 L+ c 3 for f ∈ F p , then 21 + f∈F p e−h(f) ≤ 0 and moreover à ! T T X X −h(f) X £ ¤ £ ¤ 1 1 ˆ E ∆(f t , x t ) ≤ S T,h,η + + ln e + c 0 (e − 1)η E ∆(fˆ? t , xt ) η 2 t=1 t=1 f∈F p

≤ S T,h,η + c 0 (e − 1)ηS T,h,η ≤ S T,h,η + η c 0 (e − 1) inf

T X

f∈F p t=1

where S T,h,η = inf

  

(

inf

k∈‚1,pƒ   f∈F p K (f)= k

∆(f, x t ) + c 0 (e − 1)(c 1 p + c 2 L + c 3 ),

T X t=1

∆(f, x t ) +

 ) h(f)  η

 

and the second inequality is obtained with Lemma 1. By setting s c1 p + c2 L + c3 η= P c 0 (e − 1) inff∈F p T t=1 ∆(f, x t ) we obtain T X t=1

£ ¤ E ∆(fˆ t , x t ) ≤ inf

  

(

inf

k∈‚1,pƒ   f∈F p K (f)= k

T X

∆(f, x t ) +

q

 

+

where r T,k,L = inff∈F p

c 0 (e − 1)r T,k,L

t=1

 ) 

PT

q

t=1 ∆(f, x t )(c 1 k + c 2 L + c 3 ).

c 0 (e − 1)L T,p,L + c 0 (e − 1)c 1 p + c 2 L + c 3 ,

This proves Theorem 1.

Finally, assume that p p c1 p + c2 L + c3 c1 p + c2 L + c3 η0 = and η t = , t = 1, . . . , T. p p c 0 (e − 1) c 0 (e − 1)t £ ¤ Since E ∆(fˆ? t , x t ) ≤ c 0 for any t = 1, . . . , T, we have   ( ) à !    1 T T T X X X −h(f) X £ ¤ h(f) E ∆(fˆ t , x t ) ≤ inf inf ∆(f, x t ) + + 1 + ln e + c20 (e − 1) η t−1 k∈‚1,pƒ  ηT   f∈F p t=1  ηT t=1 t=1 f∈F p K (f)= k

17

  

≤ inf

(

inf

k∈‚1,pƒ   f∈F p K (f)= k

 )  p ∆(f, x t ) + c 0 (e − 1)T(c 0 k + c 2 L + c 3 )   t=1 p + 2c 0 (e − 1)T(c 0 p + c 2 L + c 3 ), T X

which concludes the proof of Theorem 2. (1−β) c 0

Lemma 4. Under the Algorithm 3, if 1 ≥ ² > 0, 1 > β > 0, α ≥ β ¯ ¡ ¢ ¢¯ ¡ t ≥ 2, where ¯U fˆ t−1 ¯ is the cardinality of U fˆ t−1 , then we have

¯ ¡ ¢¯ and ¯U fˆ t−1 ¯ ≥ 2 for all

h i X T T T ¯ ¡ X X £ ¤ ¢¯ E r fˆ t ,t ≥ E rˆ σˆ t (At ),t − 2(1 − ²)αβ ¯U fˆ t−1 ¯ .

t=1

t=1

t=1

¢ ¡ Proof. First notice that A t = U fˆ t−1 if I t = 0, and that for t ≥ 2 ¯ ¯ h i h i ¯ ¯ E r fˆ t ,t ¯H t , I t = 0 =E r σˆ t (At ),t ¯H t , I t = 0 ¯ ´ ³ X ¯ = r f,t P σˆ t (A t ) = f ¯H t + f∈A t ∩ cond ( t)

X



f∈A t ∩ cond ( t) c

f∈A t ∩ cond ( t)

− (1 − β)

X

¯ ´ ¯ αP σˆ t (A t ) = f ¯H t X

r f,t −

f∈A t ∩ cond ( t)

f∈A t ∩ cond ( t) c

¯ ´ ³ ¯ r f,t P σˆ t (A t ) = f ¯H t

³

X

r f,t +

X

f∈A t ∩ cond ( t) c

¯ i ¯ =E rˆ σˆ t (At ),t ¯H t , I t = 0 − (1 − β) h

¯ ´ ¡ ¢ ³ ¯ α − r f,t P σˆ t (A t ) = f ¯H t

X

r f,t −

f∈A t ∩ cond ( t)

X f∈A t ∩ cond ( t) c

¯ ´ ¡ ¢ ³ ¯ α − r f,t P σˆ t (A t ) = f ¯H t

¯ i ¯ ≥E rˆ σˆ t (At ),t ¯H t , I t = 0 − (1 − β)c 0 |A t | − αβ |A t | ¯ h i ¯ ≥E rˆ σˆ t (At ),t ¯H t , I t = 0 − 2αβ |A t | , h

where cond(t) c denotes the complement of set cond(t); ³the first inequality above is due to ¯ ´ ¯ t the assumption that for all f ∈ A t ∩ cond(t), we have P σˆ (A t ) = f ¯H t ≥ β. For t = 1, the above inequality is trivial since rˆ σˆ 1 ¡U ¡fˆ0 ¢¢,1 ≡ 0 by its definition. Hence, for t ≥ 1, one has ¯ i ¯ ¯ h i h i h ¯ ¯ ¯ E r fˆ t ,t ¯H t = ²E r σˆ t (F p ),t ¯H t , I t = 1 + (1 − ²)E r σˆ t (At ),t ¯H t , I t = 0 ¯ i h ¯ ≥ E rˆ fˆ t ,t ¯H t − 2αβ |A t | .

(9)

Summing on both side of inequality (9) over t terminates the proof of Lemma 4.

Lemma 5. Let cˆ0 = "

(

T X

c0 β

+ α. If 0 < η 1 = η 2 = · · · = η T = η
0,

à P

T X

Var(Yt | H t ) ≤ a2 ,

Yt − E [Yt | H t ] ≤ b.

!

µ Yt > Y0 + λ ≤ exp −

t=1

λ2

2T(a2 + b2 )



.

The proof can be found in Chung and Lu (2006, Theorem 7.3). Lemma 6. Assume that 0 < β < |F1 | , α ≥ p "

(

T X

"

and η > 0, then we have

)# 1 − E max rˆ σˆ (At ),t − h (σˆ (A t )) σˆ η t=1 v " # µ ¶ u ¯ ¯ ¢u ¯ ¯ c20 ¡ 1 2 t ≤ 1 − ¯F p ¯ β + α2 (1 − β) + (c 0 + 2α) ln + ¯F p ¯ β c 0 T. 2T β β

1 E max r σ(At ),t − h (σ (A t )) σ η t=1

)#

c0 β

(

T X

Proof. First, we have almost surely that ( ) ( ) T T T ¡ X X X ¢ 1 1 max r σ(At ),t − h (σ (A t )) − max rˆ σˆ (At ),t − h (σˆ (A t )) ≤ max r f,t − rˆ f,t . σ ˆ σ f∈F p t=1 η η t=1 t=1 Denote by Yf,t = r f,t − rˆ f,t . Since ( ¡ ¡ ¢¢ ¯ i r f,t + (1 − ²)α 1 − P fˆ t = f|H t ¯ E rˆ f,t ¯H t = ² r f,t + (1 − ²)α h

if f ∈ U (fˆ t−1 ) ∩ cond(t), otherwise,

and α > c 0 ≥ r f,t uniformly for any f and t, then we have uniformly that E [Yt |H t ] ≤ 0, hence satisfying the first condition. ¡ ¢ For the second condition, if f ∈ U fˆ t−1 ∩ cond(t), then h i ¡ £ ¤¢2 Var(Yt |H t ) =E rˆ 2f,t |H t − E rˆ f,t |H t

19

r 2f,t

"

≤² r 2f,t + (1 − ²)

# ¡ ¡ ¢¢ ¢ + α 1 − P fˆ t = f|H t

¡ P fˆ t = f|H t £ ¡ ¡ ¢¢¤2 − r f,t + (1 − ²)α 1 − P fˆ t = f|H t



r 2f,t β

2

+ α (1 − β) ≤

c20 β

+ α2 (1 − β).

¢ ¡ Similarly, for f 6∈ U fˆ t−1 ∩ cond(t), one can have Var(Yt |H t ) ≤ α2 .

Moreover, for the third condition, since £ ¤ E Yf,t |H t ≥ −2α,

then £ ¤ Yf,t − E Yf,t |H t ≤ r f,t + 2α ≤ c 0 + 2α. r

Setting λ =

2T

h

c20 β

i ³ ´ + α2 (1 − β) + (c 0 + 2α)2 ln β1 leads to à P

T X

!

Yf,t ≥ λ ≤ β.

t=1

¯ ¯ ¯ ¯ Hence the following inequality holds with probability 1 − ¯F p ¯β

max

T ¡ X

f∈F p t=1

r f,t − rˆ f,t

Finally, noticing that maxf∈F p Lemma 6.

¢

v " # µ ¶ u u c20 1 2 t + α2 (1 − β) + (c 0 + 2α) ln . ≤ 2T β β

PT ¡ t=1

¢ r f,t − rˆ f,t ≤ c 0 T almost surely, we terminate the proof of

References J.-Y. Audibert. Fast learning rates in statistical inference through aggregation. The Annals of Statistics, 37(4):1591–1646, 2009. 5 P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32(1):48–77, 2003. 8 J. D. Banfield and A. E. Raftery. Ice floe identification in satellite images using mathematical morphology and clustering about principal curves. Journal of the American Statistical Association, 87(417):7–16, 1992. 1 A. Barron, L. Birgé, and P. Massart. Risk bounds for model selection via penalization. Probability Theory and Related Fields, 113:301–413, 1999. 3 G. Biau and A. Fischer. Parameter selection for principal curves. IEEE Transactions on Information Theory, 58(3):1924–1939, 2012. 3, 4, 8 L. Birgé and P. Massart. Minimal penalties for gaussian model selection. Probability Theory and Related Fields, 183:33–73, 2007. 3 20

C. Brunsdon. Path estimation from GPS tracks. In Proceedings of the 9th International Conference on GeoComputation, National Centre for Geocomputation, National University of Ireland, Maynooth, Eire, 2007. 1 N. Cesa-Bianchi and G. Lugosi. Prediction, Learning and Games. Cambridge University Press, New York, 2006. 4, 6 N. Cesa-Bianchi, G. Lugosi, and G. Stoltz. Minimizing regret with label-efficient prediction. IEEE Transactions on Information Theory, 51:2152–2162, 2005. 11 F. Chung and L. Lu. Concentration inequalities and martingale inequalities: A survey. Internet Mathematics, 3:79–127, 2006. 19 E. R. Engdahl and A. Villaseñor. 41 global seismicity: 1900–1999. International Geophysics, 81:665–690, 2002. 12, 13 H. Friedsam and W. A. Oren. The application of the principal curve analysis technique to smooth beamlines. In Proceedings of the 1st International Workshop on Accelerator Alignment, 1989. 1 T. Hastie and W. Stuetzle. Principal curves. Journal of the American Statistical Association, 84:502–516, 1989. 1 H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal of educational psychology, 24(6):417, 1933. 1 M. Hutter and J. Poland. Adaptive online prediction by following the perturbed leader. Journal of Machine Learning Research, 6:639–660, 2005. 6 V. Kanade, B. McMahan, and B. Bryan. Sleeping experts and bandits with stochastic action availability and adversarial rewards. AISTATS, 3:1137–1155, 2009. 8, 12 B. Kégl. Principal curves: learning, design, and applications. PhD thesis, Concordia University Montreal, Quebec, 1999. 2, 3, 16 B. Kégl and A. Krzyz˙ ak. Piecewise linear skeletonization using principal curves. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(1):59–74, 2002. 1, 3 B. Kégl, A. Krzyz˙ ak, T. Linder, and K. Zeger. Learning and design of principal curves. IEEE transactions on pattern analysis and machine intelligence, 22(3):281–297, 2000. 2, 3, 4 R. D. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret Bounds for Sleeping Experts and Bandits. In COLT. Springer, 2008. 8 Valero Laparra and Jesús Malo. Sequential principal curves analysis. arXiv preprint, 2016. URL https://arxiv.org/abs/1606.00856. 3 L. Li, B. Guedj, and S. Loustau. A Quasi-Bayesian Perspective to Online Clustering. arXiv preprint, 2016. URL https://arxiv.org/abs/1602.00522. 4, 5 D. A. McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999a. 4 D. A. McAllester. PAC-Bayesian model averaging. In Proceedings of the 12th annual conference on Computational Learning Theory, pages 164–170. ACM, 1999b. 4

21

G. Neu and G. Bartók. An efficient algorithm for learning with semi-bandit feedback. In Lecture Notes in Computer Science, volume 8139, pages 234–248. Springer, Berlin, Heidelberg, 2013. 11 K Pearson. On lines and planes of closest fit to systems of point in space. Philosophical Magazine, 2(11):559–572, 1901. 1 K. Reinhard and M. Niranjan. Parametric subspace modeling of speech transitions. Speech Communication, 27:19–42, 1999. 1 S. Sandilya and S. R. Kulkarni. Principal curves with bounded turn. IEEE Transactions on Information Theory, 48:2789–2793, 2002. 3, 4 J. Shawe-Taylor and R. C. Williamson. A PAC analysis of a Bayes estimator. In Proceedings of the 10th annual conference on Computational Learning Theory, pages 2–9. ACM, 1997. doi: 10.1145/267460.267466. 4 C. Spearman. "General Intelligence", Objectively Determined and Measured. The American Journal of Psychology, 15(2):201–292, 1904. 1 D. C. Stanford and A. E. Raftery. Finding curvilinear features in spatial point patterns: principal curve clustering with noise. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(6):601–609, 2000. 1

22
