On the Error of Random Fourier Features


arXiv:1506.02785v1 [cs.LG] 9 Jun 2015

Dougal J. Sutherland Carnegie Mellon University Pittsburgh, PA [email protected]

Jeff Schneider Carnegie Mellon University Pittsburgh, PA [email protected]

Abstract

Rahimi and Recht (2007) give two such embeddings, based on the Fourier transform P(ω) of the kernel k: one of the form

    z̃(x) := √(2/D) ( sin(ω₁ᵀx), cos(ω₁ᵀx), …, sin(ω_{D/2}ᵀx), cos(ω_{D/2}ᵀx) )ᵀ,   ωᵢ iid∼ P(ω),   (1)

Kernel methods give powerful, flexible, and theoretically grounded approaches to solving many problems in machine learning. The standard approach, however, requires pairwise evaluations of a kernel function, which can lead to scalability issues for very large datasets. Rahimi and Recht (2007) suggested a popular approach to handling this problem, known as random Fourier features. The quality of this approximation, however, is not well understood. We improve the uniform error bound of that paper, as well as giving novel understandings of the embedding's variance, approximation error, and use in some machine learning methods. We also point out that, surprisingly, of the two main variants of those features, the more widely used is strictly higher-variance for the Gaussian kernel and has worse bounds.


and another of the form

    z̆(x) := √(2/D) ( cos(ω₁ᵀx + b₁), …, cos(ω_Dᵀx + b_D) )ᵀ,   ωᵢ iid∼ P(ω),  bᵢ iid∼ Unif[0, 2π].   (2)

Bochner's theorem (1959) guarantees that for any continuous positive-definite function k(x − y), its Fourier transform will be a nonnegative measure; if k(0) = 1, it will be properly normalized. Letting s̃ be the reconstruction based on z̃ and s̆ that for z̆, we have that:

1 INTRODUCTION

Kernel methods provide an elegant, theoretically wellfounded, and powerful approach to solving many learning problems. Since traditional algorithms require the computation of a full N × N pairwise kernel matrix to solve learning problems on N input instances, however, scaling these methods to large-scale datasets containing more than thousands of data points has proved challenging. Rahimi and Recht (2007) spurred interest in one very attractive approach: approximating a continuous shift-invariant kernel k : X × X → R by

    s̃(x, y) = (1/(D/2)) Σ_{i=1}^{D/2} cos( ωᵢᵀ(x − y) ),

    s̆(x, y) = (1/D) Σ_{i=1}^{D} [ cos( ωᵢᵀ(x − y) ) + cos( ωᵢᵀ(x + y) + 2bᵢ ) ].

Letting ∆ := x − y, we have:

    E cos(ωᵀ∆) = Re ∫ e^{√−1 ωᵀ∆} dP(ω) = k(∆).

Proposition 1. Let k be a continuous shift-invariant positive-definite function k(x, y) = k(∆) on X ⊆ Rᵈ, with k(0) = 1 and such that ∇²k(0) exists. Suppose X is compact with diameter ℓ, and let σ_p² := E‖ω‖². Define z̃ by (1), and f̃(x, y) := z̃(x)ᵀz̃(y) − k(x, y). For any ε > 0, let

    α_ε := min( 1, sup_{x,y∈X} [ 1/2 + (1/2) k(2x, 2y) − k(x, y)² + ε/3 ] ),

    β_d := ( (d/2)^{−d/(d+2)} + (d/2)^{2/(d+2)} ) 2^{(6d+2)/(d+2)}.

Figure 2: The coefficient β_d of Proposition 1 (blue, for z̃) and β'_d of Proposition 2 (orange, for z̆). Rahimi and Recht (2007) used a constant of 256 for z̃.

For the Gaussian kernel, α_ε ≤ 1/2 + ε/3 and σ_p² = d/σ²; the Bernstein bound is tighter when ε < 3/2. For z̆, since the embedding s̆ is not shift-invariant, we must instead place the ε-net on X². The additional noise in s̆ also increases the expected Lipschitz constant and gives looser bounds on each term in the sum, though there are twice as many such terms. The corresponding bound is as follows:

Then, assuming only for the second statement that ε ≤ σ_p ℓ,

    Pr( ‖f̃‖∞ ≥ ε ) ≤ β_d ( σ_p ℓ / ε )^{2/(1 + 2/d)} exp( − Dε² / (8(d + 2) α_ε) )
                   ≤ 66 ( σ_p ℓ / ε )² exp( − Dε² / (8(d + 2)) ).

Proposition 2. Let k, X, ℓ, P(ω), and σ_p be as in Proposition 1. Define z̆ by (2), and f̆(x, y) := z̆(x)ᵀz̆(y) − k(x, y). For any ε > 0, define

    α'_ε := min( 1, sup_{x,y∈X} [ 1/4 + (1/8) k(2x, 2y) − (1/4) k(x, y)² + ε/6 ] ),

Thus, we can achieve an embedding with pointwise error no more than ε with probability at least 1 − δ as long as

    D ≥ (8(d + 2) α_ε / ε²) [ 2/(1 + 2/d) · log( σ_p ℓ / ε ) + log( β_d / δ ) ].
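The bound is easy to evaluate directly; the following sketch (ours, conservatively taking α_ε = 1 by default since α_ε ≤ 1 always) computes β_d and the implied minimal D:

```python
import numpy as np

def beta(d):
    # beta_d from Proposition 1; maximized at d = 64 (value ~66), limit 64
    return ((d / 2) ** (-d / (d + 2)) + (d / 2) ** (2 / (d + 2))) * 2 ** ((6 * d + 2) / (d + 2))

def min_D(eps, delta, d, sigma_p, ell, alpha=1.0):
    """Smallest D satisfying the displayed condition (alpha = 1 is always valid)."""
    return (8 * (d + 2) * alpha / eps**2) * (
        2 / (1 + 2 / d) * np.log(sigma_p * ell / eps) + np.log(beta(d) / delta)
    )

print(beta(64))  # ~66, the maximum over d
print(min_D(eps=0.1, delta=0.01, d=8, sigma_p=np.sqrt(8), ell=6.0))
```

The D ∝ ε⁻² log(1/δ) scaling is visible immediately from the formula.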

    β'_d := ( d^{−d/(d+1)} + d^{1/(d+1)} ) 2^{(5d+1)/(d+1)} 3^{d/(d+1)}.

Then, assuming only for the second statement that ε ≤ σ_p ℓ,

    Pr( ‖f̆‖∞ ≥ ε ) ≤ β'_d ( σ_p ℓ / ε )^{2/(1 + 1/d)} exp( − Dε² / (32(d + 1) α'_ε) )
                   ≤ 98 ( σ_p ℓ / ε )² exp( − Dε² / (32(d + 1)) ).

The proof strategy is very similar to that of Rahimi and Recht (2007): place an ε-net with radius r over X_∆ := {x − y : x, y ∈ X}, bound the error f̃ by ε/2 at the centers of the net via Hoeffding's inequality (1963), and bound the Lipschitz constant of f̃, which is at most that of s̃, by ε/(2r) with Markov's inequality. The constant α_ε arises from replacing Hoeffding's inequality with Bernstein's (1924) when the latter is tighter, using the variance from (5). The constant β_d is obtained by exactly optimizing the value of r, rather than the algebraically simpler value originally used; β_64 = 66 is its maximum, and lim_{d→∞} β_d = 64, though it is lower for small d, as shown in Figure 2. The additional hypothesis, that ∇²k(0) exists, is equivalent to the existence of the first two moments of P(ω); a finite first moment is used in the proof, and of course without a finite second moment the bound is vacuous. The full proof is given in Appendix A.1.

Thus, we can achieve an embedding with pointwise error no more than ε with probability at least 1 − δ as long as

    D ≥ (32(d + 1) α'_ε / ε²) [ 2/(1 + 1/d) · log( σ_p ℓ / ε ) + log( β'_d / δ ) ].

β'_48 = 98 is the maximum of β'_d, and lim_{d→∞} β'_d = 96, also shown in Figure 2. The full proof is given in Appendix A.2.

For the Gaussian kernel, α'_ε ≤ 1/4 + ε/6, so that the Bernstein bound is essentially always superior.

2.2.2 Expected Max Error

Noting that E‖f‖∞ = ∫₀^∞ Pr( ‖f‖∞ ≥ ε ) dε, one could consider bounding E‖f‖∞ via Propositions 1 and 2. Unfortunately, that integral diverges on (0, γ) for any γ > 0.

¹ Note that our D is half of the D in Rahimi and Recht (2007), since we want to compare embeddings of the same dimension.



If we instead integrate the minimum of that bound and 1, the result depends on a solution to a transcendental equation, so analytical manipulation is difficult.
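Numerically, though, the truncated integral is straightforward to evaluate; this sketch (ours, using the simpler second form of Proposition 1's tail bound and our own integration grid) mirrors the computation used for Figure 4:

```python
import numpy as np

def tail_bound(eps, D, d=1, sigma_p=1.0, ell=6.0):
    # simpler form of Proposition 1: 66 (sigma_p l / eps)^2 exp(-D eps^2 / (8(d+2)))
    return 66.0 * (sigma_p * ell / eps) ** 2 * np.exp(-D * eps**2 / (8 * (d + 2)))

def integrated_bound(D, **kw):
    # E||f||_inf = int_0^inf Pr(||f||_inf >= eps) d eps; integrate min(1, bound),
    # since the raw bound diverges as eps -> 0
    eps = np.linspace(1e-4, 4.0, 40001)
    return np.trapz(np.minimum(1.0, tail_bound(eps, D, **kw)), eps)

print(integrated_bound(500), integrated_bound(5000))  # decreases with D
```

The point where the bound crosses 1 is the transcendental obstruction mentioned above; numerically it causes no difficulty.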

Bousquet’s inequality (2002) can be used to show exponential concentration of sup f about its mean.

We can, however, use a slight generalization of Dudley’s entropy integral (1967) to obtain the following bound:

We consider f̃ first. Let

    f̃_ω(∆) := (2/D) ( cos(ωᵀ∆) − k(∆) ),

so that f̃(∆) = Σ_{i=1}^{D/2} f̃_{ω_i}(∆). Define the "wimpy variance" of f̃/2 (which we use so that |f̃/2| ≤ 1) as

Proposition 3. Let k, X, ℓ, and P(ω) be as in Proposition 1. Define z̃ by (1), and f̃(x, y) := z̃(x)ᵀz̃(y) − k(x, y). Let X_∆ := {x − y | x, y ∈ X}; suppose k is L-Lipschitz on X_∆. Let R := E max_{i=1,…,D/2} ‖ω_i‖. Then

    E[ ‖f̃‖∞ ] ≤ 24 γ √d ℓ (R + L) / √D,

    σ²_{f̃/2} := sup_{∆∈X_∆} Σ_{i=1}^{D/2} Var[ (1/2) f̃_{ω_i}(∆) ]
              = (1/D) sup_{∆∈X_∆} ( 1 + k(2∆) − 2 k(∆)² )
              =: σ_w² / D,

using (7). Clearly 1 ≤ σ_w² ≤ 2; for the Gaussian kernel, it is 1.

Proposition 5. Let k, X, and P(ω) be as in Proposition 1, and z̃ be defined by (1). Let f̃(∆) = z̃(x)ᵀz̃(y) − k(∆) for ∆ = x − y, and σ_w² := sup_{∆∈X_∆} ( 1 + k(2∆) − 2 k(∆)² ). Then

    Pr( ‖f̃‖∞ − E‖f̃‖∞ ≥ ε ) ≤ 2 exp( − Dε² / ( D E‖f̃‖∞ + σ_w²/2 + Dε/6 ) ).

where γ ≈ 0.964. The proof is given in Appendix A.3. In order to apply the method of Dudley (1967), we must work around ‖ω_i‖ (which appears in the covariance of the error process) being potentially unbounded. To do so, we bound a process with truncated ‖ω_i‖, and then relate that bound to f̃.

For the Gaussian kernel, L = 1/(σ√e) and²

    R ≤ ( √2 Γ((d + 1)/2) / Γ(d/2) + √(2 log(D/2)) ) / σ
      ≤ ( √d + √(2 log(D/2)) ) / σ.

Thus E‖f̃‖∞ is less than

    ( 24 γ √d ℓ / (σ √D) ) ( e^{−1/2} + √d + √(2 log(D/2)) ).   (8)

2.2.3 Concentration About Mean

Proof. We use the Bernstein-style form of Theorem 12.5 of Boucheron et al. (2013) on f̃(∆)/2 to obtain that Pr( sup f̃ − E sup f̃ ≥ ε ) is at most

    exp( − ε² / ( E sup f̃ + (1/2) σ²_{f̃/2} + ε/6 ) ).

We can also prove an analogous bound for the z̆ features:

Proposition 4. Let k, X, ℓ, and P(ω) be as in Proposition 1. Define z̆ by (2), and f̆(x, y) := z̆(x)ᵀz̆(y) − k(x, y). Suppose k(∆) is L-Lipschitz. Let R := E max_{i=1,…,D} ‖ω_i‖. Then, for X and D not extremely small,

    E[ ‖f̆‖∞ ] ≤ 48 γ'_X ℓ √d (R + L) / √D,

The same holds for −f̃, and E sup f̃ ≤ E‖f̃‖∞, E sup(−f̃) ≤ E‖f̃‖∞. The claim follows by a union bound. A bound on the lower tail, unfortunately, is not available in the same form.

For f̆, note that |f̆| ≤ 3, so we use f̆/3. Letting f̆_ω(x, y) := (1/D) ( cos(ωᵀ(x − y)) + cos(ωᵀ(x + y) + 2b) − k(x, y) ), we have σ²_{f̆/3} = (σ_w² + 1)/(18D). Thus the same argument gives us:

Proposition 6. Let k and X be as in Proposition 1, with P(ω) defined as there. Let z̆ be as in (2), f̆(x, y) = z̆(x)ᵀz̆(y) − k(x, y), and define σ_w as above. Then

    Pr( ‖f̆‖∞ − E‖f̆‖∞ ≥ ε ) ≤ 2 exp( − Dε² / ( (9/4) D E‖f̆‖∞ + (σ_w² + 1)/81 + (2/27) Dε ) ).

In Proposition 4, 0.803 < γ'_X < 1.542; see Appendix A.4 for details on γ'_X and the "not extremely small" assumption.

The proof of Proposition 4, given in Appendix A.4, is similar to that for Proposition 3, but the lack of shift invariance increases some constants and otherwise slightly complicates matters. Note also that the R of Proposition 4 is somewhat larger than that of Proposition 3.

² By the Gaussian concentration inequality (Boucheron et al. 2013, Theorem 5.6), each ‖ω‖ − E‖ω‖ is sub-Gaussian with variance factor σ⁻²; the claim follows from their Section 2.5.

Note that Proposition 6 actually gives a somewhat tighter concentration than Proposition 5. This is most likely because, between the space of possible errors being larger and the higher base variance illustrated in Figure 1, the f̆ error function has more "opportunities" to achieve its maximal error. The experimental results (Figure 5) show that, at least in one case, ‖f̆‖∞ does concentrate about its mean more tightly, but its mean is sufficiently higher than that of ‖f̃‖∞ that ‖f̆‖∞ stochastically dominates ‖f̃‖∞.

The second version of the bound is simpler, but somewhat looser for D ≫ 1; asymptotically, the coefficient of the denominator becomes 128. Similarly, the variation of ‖f̆‖²_µ is bounded by at most 32(D + 1)/D² · µ(X²) (shown in Appendix B.2). Thus:

Proposition 8. Let k, µ, ‖·‖_µ, and M be as in Proposition 7. Define z̆ as in (2) and let f̆(x, y) = z̆(x)ᵀz̆(y) − k(x, y). Then

    Pr( | ‖f̆‖²_µ − E‖f̆‖²_µ | ≥ ε ) ≤ 2 exp( − D³ε² / ( 512 (D + 1)² M² ) )
                                    ≤ 2 exp( − Dε² / ( 2048 M² ) ).

2.3 L2 ERROR BOUND

L∞ bounds provide useful guarantees, but are very strict. It can also be useful to consider a less stringent error measure. Let µ be a σ-finite measure on X × X; define

    ‖f‖²_µ := ∫_{X²} f(x, y)² dµ(x, y).   (9)

The cost of a simpler dependence on D is higher here; the asymptotic coefficient of the denominator is 512.


First, we have that

    E‖f̃‖²_µ = E ∫_{X²} f̃(x, y)² dµ(x, y)
             = ∫_{X²} E f̃(x, y)² dµ(x, y)   (10)
             = (1/D) ∫_{X²} ( 1 + k(2x, 2y) − 2 k(x, y)² ) dµ(x, y)
             = (1/D) ( µ(X²) + ∫_{X²} k(2x, 2y) dµ(x, y) − 2 ‖k‖²_µ ),

    E‖f̆‖²_µ = (1/D) ( µ(X²) + (1/2) ∫_{X²} k(2x, 2y) dµ(x, y) − ‖k‖²_µ ),
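These expectations are exact and cheap to evaluate for a concrete µ. A sketch of ours: with µ the uniform probability distribution on X² for X = [−3, 3] and the Gaussian kernel with σ = 1, the bracketed coefficients reproduce the ≈ 0.66/D and ≈ 0.83/D figures quoted in Section 4.1:

```python
import numpy as np

# coefficient c in E||f||_mu^2 = c / D, for mu uniform on X^2, X = [-3, 3], sigma = 1
x = np.linspace(-3.0, 3.0, 601)
X, Y = np.meshgrid(x, x)
w = 1.0 / X.size                             # uniform probability weights, mu(X^2) = 1
k = np.exp(-((X - Y) ** 2) / 2.0)            # k(x, y)
k2 = np.exp(-((2 * X - 2 * Y) ** 2) / 2.0)   # k(2x, 2y)
c_tilde = 1.0 + (k2 * w).sum() - 2.0 * ((k**2) * w).sum()
c_breve = 1.0 + 0.5 * (k2 * w).sum() - ((k**2) * w).sum()
print(c_tilde, c_breve)  # ~0.66 and ~0.83
```

Note that c_breve > c_tilde always, since the f̆ expression keeps half the k(2x, 2y) mass but only subtracts ‖k‖²_µ once.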


3 DOWNSTREAM ERROR

Rahimi and Recht (2008a; 2008b) give a bound on the L2 distance between any given function in the reproducing kernel Hilbert space (RKHS) induced by k and the closest function in the RKHS of s: results invaluable for the study of learning rates. In some situations, however, it is useful not to consider the learning-theoretic convergence of hypotheses to the assumed "true" function, but rather to directly consider the difference in predictions due to using the z embedding instead of the exact kernel k.

where (10) is justified by Tonelli's theorem. If µ = P_X × P_Y is a joint distribution of independent variables, then ∫_{X²} k(2x, 2y) dµ(x, y) = MMK(P_{2X}, P_{2Y}), where MMK is the mean map kernel (see Section 3.3) and P_{2X} denotes the distribution of 2x for x ∼ P_X. Likewise, ‖k‖²_µ = MMK(P_X, P_Y) using the kernel k².³

3.1 KERNEL RIDGE REGRESSION

We first consider kernel ridge regression (KRR; Saunders et al. 1998). Suppose we are given n training pairs (x_i, y_i) ∈ Rᵈ × R as well as a regularization parameter λ = nλ₀ > 0. We construct the training Gram matrix K by K_ij = k(x_i, x_j). KRR gives predictions h(x) = αᵀk_x, where α = (K + λI)⁻¹y and k_x is the vector with ith component k(x_i, x).⁴ When using Fourier features, one would not use α, but instead a primal weight vector w; still, it will be useful for us to analyze the situation in the dual.
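The primal/dual correspondence is easy to check numerically. A sketch of ours (problem sizes arbitrary): ridge regression on the z̃ features is exactly KRR with the approximated kernel z̃(x)ᵀz̃(y), so its predictions should track exact-kernel KRR for moderate D:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, sigma, lam0, D = 200, 2, 1.0, 0.1, 2000
Xtr = rng.uniform(-2, 2, size=(n, d))
ytr = np.sin(Xtr.sum(axis=1)) + 0.1 * rng.normal(size=n)
ytr -= ytr.mean()  # centered labels, as assumed in Proposition 9
lam = n * lam0

def K(A, B):  # Gaussian kernel matrix
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2))

alpha = np.linalg.solve(K(Xtr, Xtr) + lam * np.eye(n), ytr)  # dual (exact kernel)

omega = rng.normal(scale=1 / sigma, size=(D // 2, d))
def z(A):  # z~ features, eq. (1)
    P = A @ omega.T
    return np.sqrt(2.0 / D) * np.hstack([np.sin(P), np.cos(P)])

Z = z(Xtr)
w = np.linalg.solve(Z.T @ Z + lam * np.eye(D), Z.T @ ytr)  # primal (features)

Xte = rng.uniform(-2, 2, size=(50, d))
h = K(Xte, Xtr) @ alpha   # exact-kernel predictions
h_hat = z(Xte) @ w        # random-feature predictions
print(np.abs(h - h_hat).max())
```

In practice the observed prediction gap is far smaller than the worst case allowed by the bound (11) below.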

Viewing ‖f̃‖²_µ as a function of ω₁, …, ω_{D/2}, changing ω_i to a different ω̂_i changes the value of ‖f̃‖²_µ by at most 4(4D + 1)/D² · µ(X²); this can be seen by simple algebra and is shown in Appendix B.1. Thus McDiarmid (1989) gives us an exponential concentration bound:

Proposition 1 of Cortes et al. (2010) bounds the change in KRR predictions from approximating the kernel matrix K by K̂, in terms of ‖K̂ − K‖₂. They assume, however, that the kernel evaluations at test time k_x are unapproximated, which is certainly not the case when using Fourier features. We therefore extend their result to Proposition 9 before using it to analyze the performance of Fourier features.

Proposition 7. Let k be a continuous shift-invariant positive-definite function k(x, y) = k(∆) defined on X ⊆ Rᵈ, with k(0) = 1. Let µ be a σ-finite measure on X², and define ‖·‖²_µ as in (9). Define z̃ as in (1) and let f̃(x, y) = z̃(x)ᵀz̃(y) − k(x, y). Let M := µ(X²). Then

    Pr( | ‖f̃‖²_µ − E‖f̃‖²_µ | ≥ ε ) ≤ 2 exp( − D³ε² / ( 8 (4D + 1)² M² ) )
                                    ≤ 2 exp( − Dε² / ( 200 M² ) ).

⁴ If a bias term is desired, we can use k'(x, x') = k(x, x') + 1 by appending a constant feature 1 to the embedding z. Because this change is accounted for exactly, it affects the error analysis here only in that we must use sup|k(x, y)| ≤ 2, in which case the first factor of (11) becomes (λ₀ + 2)/λ₀².

³ k² is also a PSD kernel, by the Schur product theorem.



Proposition 9. Given a training set {(x_i, y_i)}_{i=1}^n, with x_i ∈ Rᵈ and y_i ∈ R, let h(x) denote the result of kernel ridge regression using the PSD training kernel matrix K and test kernel values k_x. Let ĥ(x) be the same using a PSD approximation to the training kernel matrix K̂ and test kernel values k̂_x. Further assume that the training labels are centered, Σ_{i=1}^n y_i = 0, and let σ_y² := (1/n) Σ_{i=1}^n y_i². Also suppose ‖k_x‖∞ ≤ κ. Then:

    | ĥ(x) − h(x) | ≤ ( σ_y / (√n λ₀) ) ‖k̂_x − k_x‖ + ( κ σ_y / (n λ₀²) ) ‖K̂ − K‖₂.


Consider a Support Vector Machine (SVM) classifier with no offset, such that h(x) = ⟨w, Φ(x)⟩ for a kernel embedding Φ : X → H, where w is found by

    argmin_{w∈H} (1/2) ‖w‖² + (C₀/n) Σ_{i=1}^n max( 0, 1 − y_i ⟨w, Φ(x_i)⟩ ),

where {(x_i, y_i)}_{i=1}^n is our training set with y_i ∈ {−1, 1}, and the decision function is h(x) = ⟨w, Φ(x)⟩.⁵ For a given x, Cortes et al. (2010) consider an embedding in H = R^{n+1} which is equivalent on the given set of points. They bound ĥ(x) − h(x) in terms of ‖K̂ − K‖₂ in their Proposition 2, but again assume that the test-time kernel values k_x are exact. We again extend their result, in Proposition 10:

Proof. Let α = (K + λI)⁻¹y and α̂ = (K̂ + λI)⁻¹y. Using M̂⁻¹ − M⁻¹ = −M̂⁻¹(M̂ − M)M⁻¹, we have

    α̂ − α = −(K̂ + λI)⁻¹ (K̂ − K) (K + λI)⁻¹ y,
    ‖α̂ − α‖ ≤ ‖(K̂ + λI)⁻¹‖₂ ‖K̂ − K‖₂ ‖(K + λI)⁻¹‖₂ ‖y‖

3.2 SUPPORT VECTOR MACHINES

    ≤ (1/λ²) ‖K̂ − K‖₂ ‖y‖,

Proposition 10. Given a training set {(x_i, y_i)}_{i=1}^n, with x_i ∈ Rᵈ and y_i ∈ {−1, 1}, let h(x) denote the decision function of an SVM classifier using the PSD training matrix K and test kernel values k_x. Let ĥ(x) be the same using a PSD approximation to the training kernel matrix K̂ and test kernel values k̂_x. Suppose sup_x k(x, x) ≤ κ. Then:

since the smallest eigenvalues of K̂ + λI and K + λI are at least λ. Since ‖k_x‖ ≤ √n κ and ‖α̂‖ ≤ ‖y‖/λ:

    | ĥ(x) − h(x) | = | α̂ᵀ k̂_x − αᵀ k_x |
                   = | α̂ᵀ (k̂_x − k_x) + (α̂ − α)ᵀ k_x |
                   ≤ ‖α̂‖ ‖k̂_x − k_x‖ + ‖α̂ − α‖ ‖k_x‖
                   ≤ ( ‖y‖ / λ ) ‖k̂_x − k_x‖ + ( √n κ ‖y‖ / λ² ) ‖K̂ − K‖₂.

The claim follows from λ = nλ₀ and ‖y‖ = √n σ_y.

    | ĥ(x) − h(x) | ≤ √2 κ^{3/4} C₀ ( ‖K̂ − K‖₂ + ‖k̂_x − k_x‖ + |f_x| )^{1/4}
                    + √κ C₀ ( ‖K̂ − K‖₂ + ‖k̂_x − k_x‖ + |f_x| )^{1/2},

Suppose that, per the uniform error bounds of Section 2.2, sup |k(x, y) − s(x, y)| ≤ ε. Then ‖k̂_x − k_x‖ ≤ √n ε and ‖K̂ − K‖₂ ≤ ‖K̂ − K‖_F ≤ nε, and Proposition 9 gives (using κ = 1, since k(0) = 1)

    | ĥ(x) − h(x) | ≤ ( σ_y / (√n λ₀) ) √n ε + ( σ_y / (n λ₀²) ) nε
                   ≤ ( (λ₀ + 1) / λ₀² ) σ_y ε.   (11)

where f_x = k̂(x, x) − k(x, x).

Proof. Use the setup of Section 2.2 of Cortes et al. (2010). In particular, we will use ‖w‖ ≤ √κ C₀ and their (16)–(17):

Thus

    Pr( | ĥ(x) − h(x) | ≥ ε ) ≤ Pr( ‖f‖∞ ≥ λ₀² ε / ( (λ₀ + 1) σ_y ) ),

    Φ(x_i) = K_x^{1/2} e_i,
    ‖ŵ − w‖² ≤ 2 C₀² √κ ‖K̂_x^{1/2} − K_x^{1/2}‖,

which we can bound with Proposition 1 or 2. We can therefore guarantee | ĥ(x) − h(x) | ≤ ε with probability at least 1 − δ if

    D = Ω( d ( (λ₀ + 1) σ_y / (λ₀² ε) )² [ log(1/δ) + log σ_p ℓ − log( λ₀² ε / ((λ₀ + 1) σ_y) ) ] ).



where

    K_x = [ K, k_x ; k_xᵀ, k(x, x) ]

and e_i is the ith standard basis vector.

Further, Lemma 1 of Cortes et al. (2010) says that ‖K̂_x^{1/2} − K_x^{1/2}‖₂ ≤ ‖K̂_x − K_x‖₂^{1/2}. Let f_x := k̂(x, x) − k(x, x); then, by Weyl's inequality for singular values,

    ‖K̂_x − K_x‖₂ = ‖ [ K̂ − K, k̂_x − k_x ; (k̂_x − k_x)ᵀ, f_x ] ‖₂

Note that this rate does not depend on n. If we want ĥ(x) → h(x) at least as fast as h(x)'s convergence rate of O(1/√n) (Bousquet and Elisseeff 2001), ignoring the logarithmic terms, we thus need D to be linear in n, matching the conclusion of Rahimi and Recht (2008a).


    ≤ ‖K̂ − K‖₂ + ‖k̂_x − k_x‖ + |f_x|.

⁵ We again assume there is no bias term for simplicity; adding a constant feature again changes the analysis only in that it makes the κ of Proposition 10 equal to 2 instead of 1.


The distance kϕ(P ) − ϕ(Q)k is known as the maximum mean discrepancy (MMD), which can be estimated with:

Thus

    | ĥ(x) − h(x) | = | (ŵ − w)ᵀ Φ̂(x) + wᵀ ( Φ̂(x) − Φ(x) ) |

    ‖ϕ(P) − ϕ(Q)‖² = ⟨ϕ(P), ϕ(P)⟩ + ⟨ϕ(Q), ϕ(Q)⟩ − 2 ⟨ϕ(P), ϕ(Q)⟩.

    ≤ ‖ŵ − w‖ ‖Φ̂(x)‖ + ‖w‖ ‖Φ̂(x) − Φ(x)‖
    ≤ √2 κ^{1/4} C₀ ‖K̂_x^{1/2} − K_x^{1/2}‖^{1/2} κ^{1/2} + √κ C₀ ‖( K̂_x^{1/2} − K_x^{1/2} ) e_{n+1}‖
    ≤ √2 κ^{3/4} C₀ ‖K̂_x − K_x‖₂^{1/4} + √κ C₀ ‖K̂_x − K_x‖₂^{1/2}
    ≤ √2 κ^{3/4} C₀ ( ‖K̂ − K‖₂ + ‖k̂_x − k_x‖ + |f_x| )^{1/4} + √κ C₀ ( ‖K̂ − K‖₂ + ‖k̂_x − k_x‖ + |f_x| )^{1/2},

MMK (X, X) is a biased estimator, because of the k(Xi , Xi ) and k(Yi , Yi ) terms; removing them gives an unbiased estimator (Gretton et al. 2012). The MMK can be used in standard kernel methods to perform learning on probability distributions, such as when images are treated as sets of local patch descriptors (Muandet et al. 2012) or documents as sets of word descriptors (Yoshikawa et al. 2014). The MMD has strong applications to two-sample testing, where it serves as the statistic for testing the hypothesis that X and Y are sampled from the same distribution (Gretton et al. 2012); this has applications in, for example, comparing microarray data from different experimental situations or in matching attributes when merging databases.

as claimed.

Suppose that sup |k(x, y) − s(x, y)| ≤ ε. Then, as in the last section, ‖k̂_x − k_x‖ ≤ √n ε and ‖K̂ − K‖₂ ≤ nε. Then, letting γ be 0 for z̃ and 1 for z̆, Proposition 10 gives

    | ĥ(x) − h(x) | ≤ √2 C₀ ( (n + √n + γ) ε )^{1/4} + C₀ ( (n + √n + γ) ε )^{1/2}.

The MMK estimate can clearly be approximated with an explicit embedding: if k(x, y) ≈ z(x)ᵀz(y), then

    MMK_z(X, Y) := (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m z(X_i)ᵀ z(Y_j)

Then | ĥ(x) − h(x) | ≥ u only if

    ε ≥ ( 2C₀² + 4C₀u + u² − 2(C₀ + u) √( C₀ (C₀ + 2u) ) ) / ( C₀² (n + √n + γ) ).

    = z̄(X)ᵀ z̄(Y),  where z̄(X) := (1/n) Σ_{i=1}^n z(X_i).

Thus the biased estimator of MMK(X, X) is just ‖z̄(X)‖²; the unbiased estimator is

    ( n² / (n² − n) ) ( ‖z̄(X)‖² − (1/n²) Σ_{i=1}^n ‖z(X_i)‖² ).
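A sketch of ours (distributions and sizes arbitrary) comparing the feature-based biased MMD² estimate against the exact biased estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, D, sigma = 500, 2, 4000, 1.0
X = rng.normal(size=(n, d))
Y = rng.normal(size=(n, d)) + 0.5  # samples from a shifted distribution

omega = rng.normal(scale=1 / sigma, size=(D // 2, d))
def z(A):  # z~ features, eq. (1)
    P = A @ omega.T
    return np.sqrt(2.0 / D) * np.hstack([np.sin(P), np.cos(P)])

zX, zY = z(X).mean(axis=0), z(Y).mean(axis=0)   # zbar(X), zbar(Y)
mmd2_z = zX @ zX + zY @ zY - 2.0 * zX @ zY      # ||zbar(X) - zbar(Y)||^2

def mmk(A, B):  # exact mean map kernel estimate
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma**2)).mean()

mmd2 = mmk(X, X) + mmk(Y, Y) - 2.0 * mmk(X, Y)  # exact biased MMD^2 estimate
print(mmd2_z, mmd2)
```

The feature version is O((n + m)D) rather than O(nm) in the sample sizes, which is the point of the approximation.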

This bound has the unfortunate property of requiring the approximation to be more accurate as the training set size increases, and thus can prove only a very loose upper bound on the number of features needed to achieve a given approximation accuracy, due to the looseness of Proposition 10. Analyses of generalization error in the induced RKHS, such as Rahimi and Recht (2008a) and Yang et al. (2012), are more useful in this case.

When z(x)ᵀz(x) = 1, as with z̃, this simplifies to (n/(n − 1)) ‖z̄(X)‖² − 1/(n − 1). When that is not necessarily true, as with z̆, that simplification holds only in expectation. This has been noticed a few times in the literature, e.g. by Li and Tsang (2011). Gretton et al. (2012) give different linear-time test statistics based on subsampling the sum over pairs; this version instead avoids reducing the amount of data used, in favor of approximating the kernel. Additionally, when using the MMK in a kernel method this approximation allows the use of linear solvers, whereas the other linear approximations must still perform some pairwise computation. Zhao and Meng (2014) compare the empirical performance of an approximation equivalent to z̆ against other linear-time approximations for two-sample testing. They find it is slower than the MMD-linear approximation but far more accurate, while being more accurate

3.3 MAXIMUM MEAN DISCREPANCY

Another area of application for random Fourier embeddings is the mean embedding of distributions, which uses some kernel k to represent a probability distribution P in the RKHS induced by k as ϕ(P) = E_{x∼P}[k(x, ·)]. For samples {X_i}_{i=1}^n ∼ P and {Y_j}_{j=1}^m ∼ Q, we can estimate the inner product in the embedding space, the mean map kernel (MMK), by

    MMK(X, Y) := (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m k(X_i, Y_j) ≈ ⟨ϕ(P), ϕ(Q)⟩.

and comparable in speed to a block-based B-test (Zaremba et al. 2013).
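The z(x)ᵀz(x) = 1 property used above is mechanical to verify (sketch ours; dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
d, D = 3, 100
x = rng.normal(size=d)

p = rng.normal(size=(D // 2, d)) @ x
zt = np.sqrt(2.0 / D) * np.concatenate([np.sin(p), np.cos(p)])  # z~(x)

b = rng.uniform(0, 2 * np.pi, size=D)
zb = np.sqrt(2.0 / D) * np.cos(rng.normal(size=(D, d)) @ x + b)  # z˘(x)

print(zt @ zt)  # exactly 1: (2/D) * sum of sin^2 + cos^2 pairs
print(zb @ zb)  # random, equal to 1 only in expectation
```

This is why the unbiased MMK(X, X) simplification is exact for z̃ but holds only in expectation for z̆.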

Specifically, we took 1 000 evenly spaced points on [−5, 5] and approximated the kernel matrix using both embeddings at D ∈ {50, 100, 200, …, 900, 1 000, 2 000, …, 9 000, 10 000}, repeating each trial 1 000 times and estimating ‖f‖∞ and ‖f‖_µ at those points. We do not consider d > 1 here, because obtaining a reliable estimate of sup|f| becomes very computationally expensive even for d = 2.

Zhao and Meng (2014) also state a simple uniform error bound on the quality of this approximation. Specifically, since we can write |MMKz (X, Y ) − MMK(X, Y )| as the mean of |f (Xi , Yj )|, uniform error bounds on f apply directly to MMKz , including to the unbiased version of MMKz (X, X). Moreover, since MMD2 (X, Y ) = MMK (X, X) + MMK (Y, Y ) − 2 MMK (X, Y ), its error is at most 4 times kf k∞ . The advantage of this bound is that it applies uniformly to all sample sets on the input space X , which is useful when we use MMK for a kernel method.

Figure 3 shows the behavior of Ekf k∞ as b increases for various values of D. As expected, the z˜ embeddings have almost no error near 0. The error increases out to one or two bandwidths, after which the curve appears approximately linear in `/σ, as predicted by Proposition 3.

For a single two-sample test, however, we can get a tighter bound. Consider X and Y fixed for now. Note that E MMK_z(X, Y) = MMK(X, Y), by linearity of expectation. The variance of MMK_z(X, Y) is exactly

    (1/(n² m²)) Σ_{i,j} Σ_{i',j'} Cov( s(X_i, Y_j), s(X_{i'}, Y_{j'}) ),   (12)

which can be evaluated using the formulas of Section 2.1 and so, viewed only as a function of D, is O(1/D). Alternatively, we can use a bounded difference approach: viewing MMK_z̃(X, Y) as a function of the ω_i's, changing ω_i to ω̂_i changes the MMK estimate by

    (1/(nm)) Σ_{i=1}^n Σ_{j=1}^m (2/D) ( cos( ω̂ᵀ(X_i − Y_j) ) − cos( ωᵀ(X_i − Y_j) ) ),

which is at most 4/D. The bound for z̆ is in fact the same here. Thus McDiarmid's inequality tells us that for fixed sets X and Y and either z,

    Pr( | MMK_z(X, Y) − MMK(X, Y) | ≥ ε ) ≤ 2 exp( − Dε² / 8 ).

Thus E | MMK_z(X, Y) − MMK(X, Y) | ≤ 2 √(2π/D). Similarly, MMD_z can be changed by at most 16/D, giving

    Pr( | MMD_z(X, Y) − MMD(X, Y) | ≥ ε ) ≤ 2 exp( − Dε² / 128 )

and expected absolute error of at most 8 √(2π/D). Now, if we consider the distributions P and Q to be fixed but the sample sets random, Theorems 7 and 10 of Gretton et al. (2012) give exponential convergence bounds for the biased and unbiased population estimators of MMD, which can easily be combined with the above bounds. Note that this approach allows the domain X to be unbounded, unlike the other bound. One could extend this to a bound uniform over some smoothness class of distributions using the techniques of Section 2.2, though we do not do so here.

Figure 3: The maximum error within a given radius in R, averaged over 1 000 evaluations. Solid lines represent z̃ and dashed lines z̆; black is D = 50, blue is D = 100, red D = 500, and cyan D = 1 000.

Figure 4 fixes b = 3 and shows the expected maximal error as a function of D. It also plots the expected error obtained by numerically integrating the bounds of Propositions 1 and 2 (using the minimum of 1 and the bound). We can see that all of the bounds are fairly loose, but that the first version of the bound in the propositions (with β_d, the exponent depending on d, and α_ε) is substantially tighter than the second version when d = 1.

The bounds on E‖f‖∞ of Propositions 3 and 4 are unfortunately too loose to show on the same plot. However, one important property does hold. For a fixed X, (8) predicts that E‖f‖∞ = O(1/√D). This holds empirically: performing linear regression of log E‖f̃‖∞ against log D yields a model of E‖f̃‖∞ = e^c D^m, with a 95% confidence interval for m of [−0.502, −0.496]; ‖f̆‖∞ gives [−0.503, −0.497]. The integrated bounds of Propositions 1 and 2 do not fit the scaling as a function of D nearly as well.
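The O(1/√D) prediction is cheap to reproduce at small scale (sketch ours: d = 1, a coarser grid and far fewer repetitions than the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 1.0
grid = np.linspace(-6.0, 6.0, 241)   # X_Delta for X = [-3, 3]
k = np.exp(-grid**2 / (2 * sigma**2))

Ds, mean_sup = [100, 200, 400, 800, 1600], []
for D in Ds:
    sups = []
    for _ in range(30):
        omega = rng.normal(scale=1 / sigma, size=D // 2)
        s = np.cos(np.outer(grid, omega)).mean(axis=1)  # s~(Delta) on the grid
        sups.append(np.abs(s - k).max())
    mean_sup.append(np.mean(sups))

slope = np.polyfit(np.log(Ds), np.log(mean_sup), 1)[0]
print(slope)  # should come out near -1/2
```

The fitted exponent lands close to the −0.5 reported above, despite the much smaller simulation.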



4 NUMERICAL EVALUATION

Figure 5 shows the empirical survival function of the max error for D = 500, along with the bounds of Propositions 1 and 2 and those of Propositions 5 and 6 using the empirical mean. The latter bounds are tighter than the former for low ε, especially for low D, but have a lower slope.

4.1 APPROXIMATION ON AN INTERVAL

We first conduct a detailed study of the approximations on the interval X = [−b, b].

Figure 5: Pr( ‖f‖∞ > ε ) for the Gaussian kernel on [−3, 3] with σ = 1 and D = 500, based on 1 000 evaluations (black), numerical integration of the bounds from Propositions 1 and 2 (same colors as Figure 4), and the bounds of Propositions 5 and 6 using the empirical mean (yellow).

Figure 4: Ekf k∞ for the Gaussian kernel on [−3, 3] with σ = 1, based on the mean of 1 000 evaluations and on numerical integration of the bounds from Propositions 1 and 2. (“Tight” refers to the bound with constants depending on d, and “loose” the second version; “old” is the version from Rahimi and Recht (2007).)

The mean of the mean squared error, on the other hand, exactly follows the expectation of Section 2.3 using µ as the uniform distribution on X²: in this case, E‖f̃‖²_µ ≈ 0.66/D and E‖f̆‖²_µ ≈ 0.83/D. (This is natural, as the expectation is exact.) Convergence to that mean, however, is substantially faster than guaranteed by the McDiarmid bounds of Propositions 7 and 8. We omit the plot due to space constraints.

4.2 MAXIMUM MEAN DISCREPANCY

We now turn to the problem of computing the MMD with a Fourier embedding. Specifically, we consider the problem of distinguishing the standard normal distribution N(0, I₂) from the two-dimensional mixture 0.95 N(0, I₂) + 0.05 N(0, (1/4) I₂). We take fixed sample sets X and Y, each of size 1 000, and compute the biased MMD estimate with varying D for both z̃ and z̆, using a Gaussian kernel of bandwidth 1. The mean absolute errors of the resulting estimates are shown in Figure 6; z̃ performs mildly better than z̆.

Figure 6: Mean absolute error of the biased estimator for MMD(X, Y), based on 100 evaluations.

Among the new results are an analytic bound on E‖f‖∞ and exponential concentration about its mean, as well as an exact form for E‖f‖²_µ and exponential concentration in that case as well. We also extend previous results on the change in learned models due to kernel approximation, and verify some aspects of these bounds empirically for the Gaussian kernel. Finally, we point out that, of the two embeddings provided by Rahimi and Recht (2007), the z̃ embedding (with half as many sampled frequencies, but no additional noise due to phase shifts) is superior in the most common case of the Gaussian kernel.

Again, the McDiarmid bound of Section 3.3 predicts that the mean absolute error decays as O(1/√D), but with too high a multiplicative constant; the 95% confidence interval for the exponent of D is [−0.515, −0.468] for z̃ and [−0.520, −0.486] for z̆. We also know that the expected root mean squared error decays like O(1/√D) via (12).


5 DISCUSSION

We provide a novel investigation of the approximation error of the popular random Fourier features, tightening existing bounds and showing new ones.

Acknowledgments

This work was funded in part by DARPA grant FA87501220324. DJS is also supported by a Sandia Campus Executive Program fellowship.

References

Bernstein, Sergei (1924). "On a modification of Chebyshev's inequality and of the error formula of Laplace". Russian. In: Ann. Sci. Inst. Savantes Ukraine, Sect. Math. 1, pp. 38–49.
Bochner, Salomon (1959). Lectures on Fourier Integrals. Princeton University Press.
Boucheron, Stéphane, Gábor Lugosi, and Pascal Massart (2013). Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford, UK: Oxford University Press.
Bousquet, Olivier (2002). "A Bennett concentration inequality and its application to suprema of empirical processes". In: Comptes Rendus Mathematique 334, pp. 495–500.
Bousquet, Olivier and André Elisseeff (2001). "Algorithmic Stability and Generalization Performance". In: Advances in Neural Information Processing Systems, pp. 196–202.
Cheng, Steve (2013). Differentiation under the integral sign. Version 16. URL: http://planetmath.org/differentiationundertheintegralsign.
Cortes, Corinna, M. Mohri, and A. Talwalkar (2010). "On the impact of kernel approximation on learning accuracy". In: International Conference on Artificial Intelligence and Statistics, pp. 113–120.
Cucker, Felipe and Steve Smale (2001). "On the mathematical foundations of learning". In: Bulletin of the American Mathematical Society 39.1, pp. 1–49.
Dai, Bo et al. (2014). "Scalable Kernel Methods via Doubly Stochastic Gradients". In: Advances in Neural Information Processing Systems, pp. 3041–3049.
Dudley, Richard M. (1967). "The sizes of compact subsets of Hilbert space and continuity of Gaussian processes". In: Journal of Functional Analysis 1.3, pp. 290–330.
Gretton, Arthur, Karsten M. Borgwardt, Malte J. Rasch, Bernhard Schölkopf, and Alex J. Smola (2012). "A Kernel Two-Sample Test". In: The Journal of Machine Learning Research 13.
Hoeffding, Wassily (1963). "Probability inequalities for sums of bounded random variables". In: Journal of the American Statistical Association 58.301, pp. 13–30.
Joachims, Thorsten (2006). "Training linear SVMs in linear time". In: ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
Li, Shukai and Ivor W. Tsang (2011). "Learning to Locate Relative Outliers". In: Asian Conference on Machine Learning. Vol. 20. JMLR: Workshop and Conference Proceedings, pp. 47–62.
McDiarmid, Colin (1989). "On the method of bounded differences". In: Surveys in Combinatorics 141.1, pp. 148–188.
Muandet, Krikamol, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf (2012). "Learning from distributions via support measure machines". In: Advances in Neural Information Processing Systems.
Pedregosa, F. et al. (2011). "Scikit-learn: Machine Learning in Python". In: Journal of Machine Learning Research 12, pp. 2825–2830.
Raff, Edward (2011–15). JSAT: Java Statistical Analysis Tool. https://code.google.com/p/java-statistical-analysis-tool/.
Rahimi, Ali and Benjamin Recht (2007). "Random Features for Large-Scale Kernel Machines". In: Advances in Neural Information Processing Systems. MIT Press.
Rahimi, Ali and Benjamin Recht (2008a). "Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning". In: Advances in Neural Information Processing Systems. MIT Press, pp. 1313–1320.
Rahimi, Ali and Benjamin Recht (2008b). "Uniform approximation of functions with random bases". In: 46th Annual Allerton Conference on Communication, Control, and Computing, pp. 555–561.
Saunders, C., A. Gammerman, and V. Vovk (1998). "Ridge Regression Learning Algorithm in Dual Variables". In: Proceedings of the 15th International Conference on Machine Learning, pp. 515–521.
Sonnenburg, Sören et al. (2010). "The SHOGUN Machine Learning Toolbox". In: Journal of Machine Learning Research 11, pp. 1799–1802.
Yang, Tianbao, Yu-Feng Li, Mehrdad Mahdavi, Rong Jin, and Zhi-Hua Zhou (2012). "Nyström Method vs Random Fourier Features: A Theoretical and Empirical Comparison". In: Advances in Neural Information Processing Systems. MIT Press.
Yoshikawa, Yuya, Tomoharu Iwata, and Hiroshi Sawada (2014). "Latent Support Measure Machines for Bag-of-Words Data Classification". In: Advances in Neural Information Processing Systems, pp. 1961–1969.
Zaremba, Wojciech, Arthur Gretton, and Matthew Blaschko (2013). "B-tests: Low Variance Kernel Two-Sample Tests". In: Advances in Neural Information Processing Systems.
Zhao, Ji and Deyu Meng (May 2014). "FastMMD: Ensemble of Circular Discrepancy for Efficient Two-Sample Test". In: arXiv:1405.2664.

A PROOFS FOR UNIFORM ERROR BOUND (SECTION 2.2)

A.1 PROOF OF PROPOSITION 1

The proof strategy closely follows that of Rahimi and Recht (2007); we fill in some (important) details, tightening some parts of the proof as we go. Let $X_\Delta := \{x - y \mid x, y \in X\}$. It is compact, with diameter at most $2\ell$, so we can find an $r$-net covering $X_\Delta$ with at most $T = (4\ell/r)^d$ balls of radius $r$ (Cucker and Smale 2001, Proposition 5). Let $\{\Delta_i\}_{i=1}^T$ denote their centers, and $L_{\tilde f}$ be the Lipschitz constant of $\tilde f$. If $|\tilde f(\Delta_i)| < \varepsilon/2$ for all $i$ and $L_{\tilde f} < \varepsilon/(2r)$, then $|\tilde f(\Delta)| < \varepsilon$ for all $\Delta \in X_\Delta$.

Let $\tilde z_i(x) := \begin{bmatrix} \sin(\omega_i^T x) & \cos(\omega_i^T x) \end{bmatrix}^T$, so that
\[
  \tilde z(x)^T \tilde z(y) = \frac{1}{D/2} \sum_{i=1}^{D/2} \tilde z_i(x)^T \tilde z_i(y).
\]
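To fix ideas, here is a small numerical sketch (not from the paper) of the quantity being bounded: the uniform error of the $\tilde z$ features for a one-dimensional Gaussian kernel with bandwidth 1, with the sup over $X_\Delta$ approximated by a fine grid. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, ell = 2000, 1.0                       # number of features; X has diameter ell, d = 1

# Gaussian kernel k(Delta) = exp(-Delta^2 / 2); its spectral measure is N(0, 1).
omega = rng.standard_normal(D // 2)

def f_tilde(delta):
    """Approximation error s_tilde(Delta) - k(Delta): z~(x)^T z~(y) reduces to a
    mean of cos(omega_i * Delta) over the D/2 sampled frequencies."""
    s = np.mean(np.cos(np.outer(delta, omega)), axis=1)
    return s - np.exp(-delta ** 2 / 2)

# A fine grid over X_Delta = [-ell, ell] stands in for the true sup.
grid = np.linspace(-ell, ell, 2001)
sup_err = float(np.max(np.abs(f_tilde(grid))))
print(sup_err)                           # small for D = 2000
```

The grid maximum plays the role of the net centers $\{\Delta_i\}$ in the proof; the Lipschitz argument is what licenses passing from a finite net to the full continuum.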

A.1.1 Regularity Condition

We will first need to establish that $\mathbb{E}\nabla \tilde s(\Delta) = \nabla \mathbb{E} \tilde s(\Delta) = \nabla k(\Delta)$. This can be proved via the following form of the Leibniz rule, quoted verbatim from Cheng (2013):

Theorem (Cheng 2013, Theorem 2). Let $X$ be an open subset of $\mathbb{R}$, and $\Omega$ be a measure space. Suppose $f : X \times \Omega \to \mathbb{R}$ satisfies the following conditions:

1. $f(x, \omega)$ is a Lebesgue-integrable function of $\omega$ for each $x \in X$.
2. For almost all $\omega \in \Omega$, the derivative $\partial f(x, \omega) / \partial x$ exists for all $x \in X$.
3. There is an integrable function $\Theta : \Omega \to \mathbb{R}$ such that $\left| \partial f(x, \omega) / \partial x \right| \le \Theta(\omega)$ for all $x \in X$.

Then for all $x \in X$,
\[
  \frac{d}{dx} \int_\Omega f(x, \omega) \, d\omega = \int_\Omega \frac{\partial}{\partial x} f(x, \omega) \, d\omega.
\]

Define the function $\tilde g^i_{x,y} : \mathbb{R} \times \Omega \to \mathbb{R}$ by $\tilde g^i_{x,y}(t, \omega) = \tilde s_\omega(x + t e_i, y)$, where $e_i$ is the $i$th standard basis vector and $\omega$ is the tuple of all the $\omega_i$ used in $\tilde z$. $\tilde g^i_{x,y}(t, \cdot)$ is Lebesgue integrable in $\omega$, since
\[
  \int \tilde g^i_{x,y}(t, \omega) \, d\omega = \mathbb{E} \tilde s(x + t e_i, y) = k(x + t e_i, y) < \infty.
\]
For any $\omega \in \Omega$, $\frac{\partial}{\partial t} \tilde g^i_{x,y}(t, \omega)$ exists, and satisfies
\begin{align*}
  \left| \frac{\partial}{\partial t} \tilde g^i_{x,y}(t, \omega) \right|
  &= \left| \frac{2}{D} \sum_{j=1}^{D/2} \frac{\partial}{\partial t} \sin(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) + \frac{\partial}{\partial t} \cos(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) \right| \\
  &= \left| \frac{2}{D} \sum_{j=1}^{D/2} \omega_{ji} \sin(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) - \omega_{ji} \cos(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) \right| \\
  &\le \frac{2}{D} \sum_{j=1}^{D/2} \left( \left| \omega_{ji} \sin(\omega_j^T y) \cos(\omega_j^T x + t \omega_{ji}) \right| + \left| \omega_{ji} \cos(\omega_j^T y) \sin(\omega_j^T x + t \omega_{ji}) \right| \right) \\
  &\le \frac{2}{D} \sum_{j=1}^{D/2} 2 \, |\omega_{ji}| =: \Theta(\omega),
\end{align*}
and $\mathbb{E}_\omega \Theta(\omega) \le 2 \, \mathbb{E}_\omega \|\omega\|$, which is finite since the first moment of $\omega$ is assumed to exist.

Thus we have $\frac{\partial}{\partial x_i} \mathbb{E} \tilde s(x, y) = \mathbb{E} \frac{\partial}{\partial x_i} \tilde s(x, y)$. The same holds for $y$ by symmetry. Combining the results for each component, we get as desired that $\mathbb{E} \nabla_\Delta \tilde s(x, y) = \nabla_\Delta \mathbb{E} \tilde s(x, y)$.
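As a quick sanity check on this regularity condition, one can verify by Monte Carlo that the expected gradient of a single random feature matches $\nabla k$ for the Gaussian kernel. This is an illustrative sketch with assumed parameters ($d = 1$, bandwidth 1), not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
delta = 0.7                     # a fixed Delta = x - y (d = 1 for simplicity)
n = 200_000
omega = rng.standard_normal(n)  # spectral distribution of the Gaussian kernel

# d/dDelta cos(omega * Delta) = -omega * sin(omega * Delta); averaging over omega
# should recover k'(Delta) = -Delta * exp(-Delta^2 / 2).
mc_grad = float(np.mean(-omega * np.sin(omega * delta)))
exact_grad = -delta * np.exp(-delta ** 2 / 2)
print(mc_grad, exact_grad)      # agree to Monte Carlo accuracy
```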

A.1.2 Lipschitz Constant

Since $\tilde f$ is differentiable, $L_{\tilde f} = \|\nabla \tilde f(\Delta^*)\|$, where $\Delta^* = \operatorname{argmax}_{\Delta \in X_\Delta} \|\nabla \tilde f(\Delta)\|$. Via Jensen's inequality, $\mathbb{E} \|\nabla \tilde s(\Delta)\| \ge \|\mathbb{E} \nabla \tilde s(\Delta)\|$. Now, letting $\Delta^* = x^* - y^*$:
\begin{align}
  \mathbb{E}[L_{\tilde f}^2]
  &= \mathbb{E}\left[ \left\| \nabla \tilde s(\Delta^*) - \nabla k(\Delta^*) \right\|^2 \right] \notag \\
  &= \mathbb{E}_{\Delta^*}\left[ \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - 2 \|\nabla k(\Delta^*)\| \, \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\| \right] + \|\nabla k(\Delta^*)\|^2 \right] \notag \\
  &\le \mathbb{E}_{\Delta^*}\left[ \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - 2 \|\nabla k(\Delta^*)\|^2 + \|\nabla k(\Delta^*)\|^2 \right] \notag \\
  &= \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] - \mathbb{E}_{\Delta^*}\left[ \|\nabla k(\Delta^*)\|^2 \right] \notag \\
  &\le \mathbb{E}\left[ \|\nabla \tilde s(\Delta^*)\|^2 \right] \notag \\
  &= \mathbb{E}\left\| \nabla \, \tilde z(x^*)^T \tilde z(y^*) \right\|^2 \notag \\
  &= \mathbb{E}\left\| \nabla \, \frac{1}{D/2} \sum_{i=1}^{D/2} \tilde z_i(x^*)^T \tilde z_i(y^*) \right\|^2 \notag \\
  &\le \mathbb{E}\left\| \nabla \, \tilde z_i(x^*)^T \tilde z_i(y^*) \right\|^2 \notag \\
  &= \mathbb{E}\left\| \nabla \cos(\omega^T \Delta^*) \right\|^2 \notag \\
  &= \mathbb{E}\left\| -\sin(\omega^T \Delta^*) \, \omega \right\|^2 \notag \\
  &= \mathbb{E}\left[ \sin^2(\omega^T \Delta^*) \, \|\omega\|^2 \right] \notag \\
  &\le \mathbb{E} \|\omega\|^2 = \sigma_p^2. \tag{13}
\end{align}
We can thus use Markov's inequality:
\[
  \Pr\left( L_{\tilde f} \ge \frac{\varepsilon}{2r} \right)
  = \Pr\left( L_{\tilde f}^2 \ge \left( \frac{\varepsilon}{2r} \right)^2 \right)
  \le \sigma_p^2 \left( \frac{2r}{\varepsilon} \right)^2.
\]

A.1.3 Anchor Points

For any fixed $\Delta = x - y$, $\tilde f(\Delta)$ is a mean of $D/2$ terms with expectation $k(x, y)$, each bounded by $\pm 1$. Applying Hoeffding's inequality and a union bound:
\[
  \Pr\left( \bigcup_{i=1}^T |\tilde f(\Delta_i)| \ge \tfrac{1}{2}\varepsilon \right)
  \le T \Pr\left( |\tilde f(\Delta)| \ge \tfrac{1}{2}\varepsilon \right)
  \le 2T \exp\left( - \frac{2 \cdot \frac{D}{2} \left( \frac{\varepsilon}{2} \right)^2}{(1 - (-1))^2} \right)
  = 2T \exp\left( - \frac{D \varepsilon^2}{16} \right).
\]
Since we know the variance of each term from (5), we could alternatively use Bernstein's inequality:
\[
  T \Pr\left( |\tilde f(\Delta)| > \tfrac{1}{2}\varepsilon \right)
  \le 2T \exp\left( - \frac{\frac{D}{2} \left( \frac{\varepsilon}{2} \right)^2 / 2}{\operatorname{Var}[\cos(\omega^T \Delta)] + \frac{2}{3} \cdot \frac{\varepsilon}{2}} \right)
  = 2T \exp\left( - \frac{D \varepsilon^2}{16 \left( \operatorname{Var}[\cos(\omega^T \Delta)] + \frac{1}{3}\varepsilon \right)} \right).
\]
This is a better bound when $\operatorname{Var}[\cos(\omega^T \Delta)] + \frac{1}{3}\varepsilon < 1$; for the RBF kernel, this is true whenever $\varepsilon < \frac{3}{2}$, and the improvement is bigger when $k(\Delta)$ is large or $\varepsilon$ is small.

To unify the two, let
\[
  \alpha_\varepsilon := \min\left( 1, \; \max_{\Delta \in X_\Delta} \tfrac{1}{2} + \tfrac{1}{2} k(2\Delta) - k(\Delta)^2 + \tfrac{1}{3}\varepsilon \right).
\]
Then
\[
  \Pr\left( \bigcup_{i=1}^T |\tilde f(\Delta_i)| \ge \tfrac{1}{2}\varepsilon \right) \le 2T \exp\left( - \frac{D \varepsilon^2}{16 \alpha_\varepsilon} \right).
\]
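To see the Hoeffding/Bernstein comparison concretely, the following sketch evaluates both tail bounds for a Gaussian kernel with bandwidth 1 (an assumed setting; $D$ and $\varepsilon$ are arbitrary illustrative values), using $\operatorname{Var}[\cos(\omega^T \Delta)] = \frac{1}{2} + \frac{1}{2} k(2\Delta) - k(\Delta)^2$ as above.

```python
import math

def var_cos(k1, k2):
    """Var[cos(w^T Delta)] written in terms of k(Delta) = k1 and k(2 Delta) = k2."""
    return 0.5 + 0.5 * k2 - k1 ** 2

# Gaussian kernel with bandwidth 1: k(Delta) = exp(-t) and k(2 Delta) = exp(-4 t)
# for t = ||Delta||^2 / 2.  The variance increases toward 1/2 as t grows.
worst_var = max(var_cos(math.exp(-t), math.exp(-4 * t))
                for t in (x / 100 for x in range(1, 1001)))

D, eps = 1000, 0.1
hoeffding = 2 * math.exp(-D * eps ** 2 / 16)
alpha_eps = min(1.0, worst_var + eps / 3)
bernstein = 2 * math.exp(-D * eps ** 2 / (16 * alpha_eps))
print(worst_var, hoeffding, bernstein)  # Bernstein is smaller: worst_var + eps/3 < 1
```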

A.1.4 Optimizing Over r

Combining these two bounds, we have a bound in terms of $r$:
\[
  \Pr\left( \sup_{\Delta \in X_\Delta} |\tilde f(\Delta)| \le \varepsilon \right) \ge 1 - \kappa_1 r^{-d} - \kappa_2 r^2,
\]
letting $\kappa_1 = 2 (4\ell)^d \exp\left( - \frac{D \varepsilon^2}{16 \alpha_\varepsilon} \right)$, $\kappa_2 = 4 \sigma_p^2 \varepsilon^{-2}$.

If we choose $r = (\kappa_1 / \kappa_2)^{1/(d+2)}$, as did Rahimi and Recht (2007), the bound again becomes $1 - 2 \kappa_1^{2/(d+2)} \kappa_2^{d/(d+2)}$. But we could instead maximize the bound by choosing $r$ such that $d \kappa_1 r^{-d-1} - 2 \kappa_2 r = 0$, i.e. $r = \left( \frac{d \kappa_1}{2 \kappa_2} \right)^{1/(d+2)}$. Then the bound becomes $1 - \left( \left( \frac{d}{2} \right)^{-d/(d+2)} + \left( \frac{d}{2} \right)^{2/(d+2)} \right) \kappa_1^{2/(d+2)} \kappa_2^{d/(d+2)}$:
\begin{align}
  \Pr\left( \sup_{\Delta \in X_\Delta} |\tilde f(\Delta)| > \varepsilon \right)
  &\le \left( \left( \tfrac{d}{2} \right)^{-\frac{d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right) \left( 2 (4\ell)^d \exp\left( - \frac{D \varepsilon^2}{16 \alpha_\varepsilon} \right) \right)^{\frac{2}{d+2}} \left( 4 \sigma_p^2 \varepsilon^{-2} \right)^{\frac{d}{d+2}} \tag{14} \\
  &= \left( \left( \tfrac{d}{2} \right)^{-\frac{d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right) 2^{\frac{6d+2}{d+2}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2d}{d+2}} \exp\left( - \frac{D \varepsilon^2}{8 (d+2) \alpha_\varepsilon} \right) \notag \\
  &= \left( \left( \tfrac{d}{2} \right)^{-\frac{d}{d+2}} + \left( \tfrac{d}{2} \right)^{\frac{2}{d+2}} \right) 2^{\frac{6d+2}{d+2}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 2/d}} \exp\left( - \frac{D \varepsilon^2}{8 (d+2) \alpha_\varepsilon} \right). \tag{15}
\end{align}
For $\varepsilon \le \sigma_p \ell$, we can loosen the exponent on the middle term to 2, though in low dimensions we have a somewhat sharper bound. We no longer need the $\ell > 1$ assumption of the original proof. To prove the final statement of Proposition 1, simply set (14) to be at most $\delta$ and solve for $D$.
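That last step can be made concrete: setting the simplified form of (15) (with the middle exponent loosened to 2) equal to $\delta$ and solving for $D$ gives an explicit feature count. The following sketch does this for illustrative values of $d$, $\varepsilon$, $\delta$, $\sigma_p$, $\ell$, and $\alpha_\varepsilon$ — all assumptions, not values from the paper.

```python
import math

def bound(D, eps, d, ell, sigma_p, alpha):
    """Failure-probability bound (15), middle exponent loosened to 2
    (valid for eps <= sigma_p * ell)."""
    c_d = ((d / 2) ** (-d / (d + 2)) + (d / 2) ** (2 / (d + 2))) \
          * 2 ** ((6 * d + 2) / (d + 2))
    return c_d * (sigma_p * ell / eps) ** 2 \
        * math.exp(-D * eps ** 2 / (8 * (d + 2) * alpha))

d, eps, delta, ell, sigma_p, alpha = 5, 0.1, 0.01, 10.0, 1.0, 1.0
c_d = ((d / 2) ** (-d / (d + 2)) + (d / 2) ** (2 / (d + 2))) \
      * 2 ** ((6 * d + 2) / (d + 2))

# bound <= delta  <=>  D >= 8 (d+2) alpha / eps^2 * (log(c_d/delta) + 2 log(sigma_p ell / eps))
D = math.ceil(8 * (d + 2) * alpha / eps ** 2
              * (math.log(c_d / delta) + 2 * math.log(sigma_p * ell / eps)))
print(D, bound(D, eps, d, ell, sigma_p, alpha))  # the bound at this D is at most delta
```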

A.2 PROOF OF PROPOSITION 2

We will follow the proof strategy of Proposition 1 as closely as possible. Our approximation is now $\breve s(x, y) = \breve z(x)^T \breve z(y)$, and the error is $\breve f(x, y) = \breve s(x, y) - k(x, y)$.

Note that $\breve s$ and $\breve f$ are not shift-invariant: for example, with $D = 1$, $\breve s(x, y) = \cos(\omega^T \Delta) + \cos(\omega^T (x + y) + 2b)$ but $\breve s(\Delta, 0) = \cos(\omega^T \Delta) + \cos(\omega^T \Delta + 2b)$. Let $q = \begin{bmatrix} x \\ y \end{bmatrix} \in X^2$ denote the argument to these functions. $X^2$ is a compact set in $\mathbb{R}^{2d}$ with diameter $\sqrt{2}\,\ell$, so we can cover it with an $r$-net using at most $T = \left( 2\sqrt{2}\,\ell / r \right)^{2d}$ balls of radius $r$. Let $\{q_i\}_{i=1}^T$ denote their centers, and $L_{\breve f}$ be the Lipschitz constant of $\breve f : \mathbb{R}^{2d} \to \mathbb{R}$.

A.2.1 Regularity Condition

To show $\mathbb{E} \nabla \breve s(q) = \nabla \mathbb{E} \breve s(q)$, we can define $\breve g^i_{x,y}(t, \omega)$ analogously to Appendix A.1.1, where here $\omega$ contains all the $\omega_i$ and $b_i$ variables used in $\breve z$. We then have
\[
  \left| \frac{\partial \breve g^i_{x,y}(t, \omega)}{\partial t} \right|
  = \left| \frac{2}{D} \sum_{j=1}^{D} \omega_{ji} \cos(\omega_j^T y + b_j) \sin(\omega_j^T x + t \omega_{ji} + b_j) \right|
  \le \frac{2}{D} \sum_{j=1}^{D} |\omega_{ji}|,
\]
whose expectation is at most $2 \, \mathbb{E}_\omega \|\omega\|$, which we have assumed to be finite.

A.2.2 Lipschitz Constant

The argument follows that of Appendix A.1.2 up to (13), using $q^*$ in place of $\Delta^*$. Then:
\begin{align*}
  \mathbb{E}[L_{\breve f}^2]
  &\le \mathbb{E} \|\nabla \breve s(q^*)\|^2 \\
  &= \mathbb{E} \left\| \nabla_q \, 2 \cos(\omega^T x + b) \cos(\omega^T y + b) \right\|^2 \\
  &= \mathbb{E}\left[ \left\| \nabla_x \, 2 \cos(\omega^T x + b) \cos(\omega^T y + b) \right\|^2 + \left\| \nabla_y \, 2 \cos(\omega^T x + b) \cos(\omega^T y + b) \right\|^2 \right] \\
  &= \mathbb{E}\left[ \left\| -2 \sin(\omega^T x^* + b) \cos(\omega^T y^* + b) \, \omega \right\|^2 + \left\| -2 \cos(\omega^T x^* + b) \sin(\omega^T y^* + b) \, \omega \right\|^2 \right] \\
  &= \mathbb{E}\left[ 4 \left( \sin^2(\omega^T x^* + b) \cos^2(\omega^T y^* + b) + \cos^2(\omega^T x^* + b) \sin^2(\omega^T y^* + b) \right) \|\omega\|^2 \right] \\
  &= \mathbb{E}_\omega \mathbb{E}_b \left[ \left( 2 - \cos(2 \omega^T (x^* - y^*)) - \cos(2 \omega^T (x^* + y^*) + 4b) \right) \|\omega\|^2 \right] \\
  &= \mathbb{E}_\omega \left[ \left( 2 - \cos(2 \omega^T (x^* - y^*)) \right) \|\omega\|^2 \right] \\
  &\le 3 \, \mathbb{E} \|\omega\|^2 = 3 \sigma_p^2.
\end{align*}
Following through with Markov's inequality:
\[
  \Pr\left( L_{\breve f} \ge \varepsilon/(2r) \right) \le 3 \sigma_p^2 (2r/\varepsilon)^2 = 12 (\sigma_p r / \varepsilon)^2.
\]

A.2.3 Anchor Points

For any fixed $x, y$, $\breve s(x, y)$ is a mean of $D$ terms with expectation $k(x, y)$, each bounded by $\pm 2$. Using Hoeffding's inequality:
\[
  \Pr\left( \bigcup_{i=1}^T |\breve f(q_i)| \ge \tfrac{1}{2}\varepsilon \right)
  \le T \Pr\left( |\breve f(q)| \ge \tfrac{1}{2}\varepsilon \right)
  \le 2T \exp\left( - \frac{2 D \left( \frac{\varepsilon}{2} \right)^2}{(2 - (-2))^2} \right)
  = 2T \exp\left( - \frac{D \varepsilon^2}{32} \right).
\]
Since the variance of each term is given by (6), we can instead use Bernstein's inequality:
\[
  T \Pr\left( |\breve f(q)| > \tfrac{1}{2}\varepsilon \right)
  \le 2T \exp\left( - \frac{D \left( \frac{\varepsilon}{2} \right)^2 / 2}{\operatorname{Var}[\cos(\omega^T \Delta)] + \frac{1}{2} + \frac{4}{3} \cdot \frac{\varepsilon}{2}} \right)
  = 2T \exp\left( - \frac{D \varepsilon^2}{4 + 8 \operatorname{Var}[\cos(\omega^T \Delta)] + \frac{16}{3}\varepsilon} \right).
\]
Thus Bernstein's gives us a tighter bound if $4 + 8 \operatorname{Var}[\cos(\omega^T \Delta)] + \frac{16}{3}\varepsilon < 32$, i.e.
\[
  2 \operatorname{Var}[\cos(\omega^T \Delta)] + \frac{4}{3}\varepsilon < 7.
\]
To unify the bounds, define $\alpha'_\varepsilon = \min\left( 1, \; \max_\Delta \tfrac{1}{8} + \tfrac{1}{4} \operatorname{Var}[\cos(\omega^T \Delta)] + \tfrac{1}{6}\varepsilon \right)$; then
\[
  \Pr\left( \bigcup_{i=1}^T |\breve f(q_i)| \ge \tfrac{1}{2}\varepsilon \right) \le 2T \exp\left( - \frac{D \varepsilon^2}{32 \alpha'_\varepsilon} \right).
\]

A.2.4 Optimizing Over r

Our bound is now of the form
\[
  \Pr\left( \sup_{q \in X^2} |\breve f(q)| \le \varepsilon \right) \ge 1 - \kappa_1 r^{-2d} - \kappa_2 r^2,
\]
with $\kappa_1 = 2 \left( 2\sqrt{2}\,\ell \right)^{2d} \exp\left( - \frac{D \varepsilon^2}{32 \alpha'_\varepsilon} \right)$ and $\kappa_2 = 12 \sigma_p^2 \varepsilon^{-2}$.

This is maximized by $r$ when $2 d \kappa_1 r^{-2d-1} - 2 \kappa_2 r = 0$, i.e. $r = \left( \frac{d \kappa_1}{\kappa_2} \right)^{\frac{1}{2d+2}}$. Substituting that value of $r$ into the bound yields $1 - \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) \kappa_1^{\frac{1}{d+1}} \kappa_2^{\frac{d}{d+1}}$, and thus:
\begin{align}
  \Pr\left( \sup_{q \in X^2} |\breve f(q)| > \varepsilon \right)
  &\le \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) \left( 2 \left( 2\sqrt{2}\,\ell \right)^{2d} \exp\left( - \frac{D \varepsilon^2}{32 \alpha'_\varepsilon} \right) \right)^{\frac{1}{d+1}} \left( 12 \sigma_p^2 \varepsilon^{-2} \right)^{\frac{d}{d+1}} \tag{16} \\
  &= \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) 2^{\frac{5d+1}{d+1}} \, 3^{\frac{d}{d+1}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2d}{d+1}} \exp\left( - \frac{D \varepsilon^2}{32 (d+1) \alpha'_\varepsilon} \right) \notag \\
  &= \left( d^{-\frac{d}{d+1}} + d^{\frac{1}{d+1}} \right) 2^{\frac{5d+1}{d+1}} \, 3^{\frac{d}{d+1}} \left( \frac{\sigma_p \ell}{\varepsilon} \right)^{\frac{2}{1 + 1/d}} \exp\left( - \frac{D \varepsilon^2}{32 (d+1) \alpha'_\varepsilon} \right). \notag
\end{align}

As before, when $\varepsilon \le \sigma_p \ell$ we can loosen the exponent on the middle term to 2; it is slightly worse than the corresponding exponent of (15) for small $d$. To prove the final statement of Proposition 2, set (16) to be at most $\delta$ and solve for $D$.

A.3 PROOF OF PROPOSITION 3

Consider the $\tilde z$ features, and recall that we supposed $k$ is $L$-Lipschitz over $X_\Delta := \{x - y \mid x, y \in X\}$. Our primary tool will be the following slight generalization of Dudley's entropy integral, which is a special case of Lemma 13.1 of Boucheron et al. (2013). (The only difference from their Corollary 13.2 is that we maintain the variance factor $v$.)

Theorem (Boucheron et al. 2013). Let $T$ be a finite pseudometric space and let $(X_t)_{t \in T}$ be a collection of random variables such that for some constant $v > 0$,
\[
  \log \mathbb{E} e^{\lambda (X_t - X_{t'})} \le \tfrac{1}{2} v \lambda^2 d^2(t, t')
\]
for all $t, t' \in T$ and all $\lambda > 0$. Let $\delta = \sup_{t \in T} d(t, t_0)$. Then, for any $t_0 \in T$,
\[
  \mathbb{E}\left[ \sup_{t \in T} X_t - X_{t_0} \right] \le 12 \sqrt{v} \int_0^{\delta/2} \sqrt{H(u, T)} \, du.
\]

Note that, although stated for finite pseudometric spaces, the result is extensible to separable pseudometric spaces (such as $X_\Delta$) by standard arguments. Here $H(\delta, T) = \log N(\delta, T)$, where $N$ is the $\delta$-packing number, is known as the $\delta$-entropy number. The $\delta$-packing number is at most the $\frac{\delta}{2}$-covering number, which Proposition 5 of Cucker and Smale (2001) bounds. Thus, picking $\Delta_0 = 0$ gives $\delta = \ell$, $H(\delta, X_\Delta) \le d \log(8\ell/\delta)$, and
\[
  \int_0^{\ell/2} \sqrt{H(u, X_\Delta)} \, du \le \int_0^{\ell/2} \sqrt{d \log(8\ell/u)} \, du = \gamma \ell \sqrt{d},
\]
where $\gamma := 4\sqrt{\pi} \operatorname{erfc}\left( 2 \sqrt{\log 2} \right) + \sqrt{\log 2} \approx 0.964$.

Now, $\frac{2}{D} \left( \cos(\omega_i^T \Delta) - k(\Delta) - \cos(\omega_i^T \Delta') + k(\Delta') \right)$ has mean zero, and absolute value bounded by
\begin{align}
  \frac{2}{D} \left| \cos(\omega_i^T \Delta) - k(\Delta) - \cos(\omega_i^T \Delta') + k(\Delta') \right|
  &\le \frac{2}{D} \left( \left| \cos(\omega_i^T \Delta) - \cos(\omega_i^T \Delta') \right| + \left| k(\Delta) - k(\Delta') \right| \right) \notag \\
  &\le \frac{2}{D} \left( \left| \omega_i^T \Delta - \omega_i^T \Delta' \right| + L \|\Delta - \Delta'\| \right) \notag \\
  &\le \frac{2}{D} \left( \|\omega_i\| + L \right) \|\Delta - \Delta'\|. \tag{17}
\end{align}
Thus, via Hoeffding's lemma (Boucheron et al. 2013, Lemma 2.2), each such term has log moment generating function at most $\frac{2}{D^2} (\|\omega_i\| + L)^2 \lambda^2 \|\Delta - \Delta'\|^2$. This is almost in the form required by Dudley's entropy integral, except that $\omega_i$ is a random variable. Thus, for any $r > 0$, define the random process $\tilde g_r$, which is distributed as $\tilde f$ except that we require $\|\omega_1\| = r$ and $\|\omega_i\| \le r$ for all $i > 1$. Since log mgfs of independent variables are additive, we thus have
\[
  \log \mathbb{E} e^{\lambda (\tilde g_r(\Delta) - \tilde g_r(\Delta'))}
  \le \left( \frac{2}{D^2} \sum_{i=1}^{D/2} (\|\omega_i\| + L)^2 \right) \lambda^2 \|\Delta - \Delta'\|^2
  \le \frac{1}{D} (r + L)^2 \lambda^2 \|\Delta - \Delta'\|^2,
\]
so $\tilde g_r$ satisfies the conditions of the theorem with $v = \frac{1}{D} (r + L)^2$. Now, $\tilde g_r(0) = 0$, so we have
\[
  \mathbb{E}\left[ \sup_{\Delta \in X_\Delta} \tilde g_r(\Delta) \right] \le \frac{12 \gamma \sqrt{d}\, \ell}{\sqrt{D}} (r + L).
\]
But the distribution of $\tilde f$ conditioned on the event $\max_i \|\omega_i\| = r$ is the same as the distribution of $\tilde g_r$. Thus
\[
  \mathbb{E}\left[ \sup \tilde f \right]
  = \mathbb{E}_r\left[ \mathbb{E}\left[ \sup \tilde g_r \right] \right]
  \le \mathbb{E}_r\left[ \frac{12 \gamma \sqrt{d}\, \ell}{\sqrt{D}} (r + L) \right]
  = \frac{12 \gamma \sqrt{d}\, \ell}{\sqrt{D}} (R + L),
\]
where $R := \mathbb{E} \max_{i=1}^{D/2} \|\omega_i\|$. The same holds for $\mathbb{E} \sup(-\tilde f)$. Since we have $\sup \tilde f \ge 0$ and $\sup(-\tilde f) \ge 0$, the claim follows from $\mathbb{E}\left[ \max\left( \sup \tilde f, \sup(-\tilde f) \right) \right] \le \mathbb{E}\left[ \sup \tilde f + \sup(-\tilde f) \right]$.
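The closed form for $\gamma$ can be checked numerically; this is an illustrative sketch using the standard-library complementary error function, with the entropy integral evaluated at $\ell = d = 1$ by a midpoint rule.

```python
import math

# gamma = 4 sqrt(pi) erfc(2 sqrt(log 2)) + sqrt(log 2), per the text.
gamma_closed = (4 * math.sqrt(math.pi) * math.erfc(2 * math.sqrt(math.log(2)))
                + math.sqrt(math.log(2)))

# With ell = d = 1, the entropy integral is int_0^{1/2} sqrt(log(8/u)) du.
n = 200_000
h = 0.5 / n
gamma_numeric = sum(math.sqrt(math.log(8 / ((i + 0.5) * h))) * h for i in range(n))
print(gamma_closed, gamma_numeric)  # both approximately 0.964
```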

A.4 PROOF OF PROPOSITION 4

For the $\breve z$ features, the error process again must be defined over $X^2$, due to the non-shift-invariant noise. We still assume that $k$ is $L$-Lipschitz over $X_\Delta$, however.

Compared to the argument of Appendix A.3, we have $H(u, X^2) \le 2d \log\left( 4\sqrt{2}\,\ell / u \right)$. Unlike $X_\Delta$, however, $X^2$ does not necessarily contain an obvious point $q_0$ to minimize $\sup_{q \in X^2} d(q, q_0)$, nor an obvious minimal value. We rather consider the "radius" $\rho := \sup_{x \in X} d(x, x_0)$, achieved by any convenient point $x_0$; then $\sup_{q \in X^2} d(q, (x_0, x_0)) = \sqrt{2}\,\rho$. Note that $\frac{1}{2}\ell \le \rho \le \ell$, where the lower bound is achieved by $X$ a ball, and the upper bound by $X$ a sphere. The integral in the bound is then
\begin{align}
  \int_0^{\rho/\sqrt{2}} \sqrt{H(u, X^2)} \, du
  &\le \int_0^{\rho/\sqrt{2}} \sqrt{2d \log\left( 4\sqrt{2}\,\ell / u \right)} \, du \notag \\
  &= 4\sqrt{\pi}\,\ell\sqrt{d}\, \operatorname{erfc}\left( \sqrt{\tfrac{1}{2}\log 2 + \log\left( 4\sqrt{2}\,\tfrac{\ell}{\rho} \right)} \right) + \rho\sqrt{d}\, \sqrt{\tfrac{5}{2}\log 2 + \log \tfrac{\ell}{\rho}} \notag \\
  &= \left( 4\sqrt{\pi}\, \operatorname{erfc}\left( \sqrt{\tfrac{1}{2}\log 2 + \log\left( 4\sqrt{2}\,\tfrac{\ell}{\rho} \right)} \right) + \tfrac{\rho}{\ell} \sqrt{\tfrac{5}{2}\log 2 + \log \tfrac{\ell}{\rho}} \right) \ell\sqrt{d}. \tag{18}
\end{align}
Calling the term in parentheses $\gamma'_{\ell/\rho}$, we have that $\gamma'_1 \approx 1.541$, $\gamma'_2 \approx 0.803$, and it decreases monotonically in between, as shown in Figure 7.

Figure 7: The coefficient of (18) as a function of $\ell/\rho$.

We will again use the notation $q = (x, y) \in X^2$, $\Delta = x - y$, $t = x + y$. Each term in the sum of $\breve f(q) - \breve f(q')$ has mean zero and absolute value at most
\begin{align*}
  \frac{1}{D} &\left| \cos(\omega_i^T \Delta) + \cos(\omega_i^T t + 2 b_i) - k(\Delta) - \cos(\omega_i^T \Delta') - \cos(\omega_i^T t' + 2 b_i) + k(\Delta') \right| \\
  &\le \frac{1}{D} \left( \left| \cos(\omega_i^T \Delta) - \cos(\omega_i^T \Delta') \right| + \left| \cos(\omega_i^T t + 2 b_i) - \cos(\omega_i^T t' + 2 b_i) \right| + \left| k(\Delta) - k(\Delta') \right| \right) \\
  &\le \frac{1}{D} \left( \|\omega_i\| \|\Delta - \Delta'\| + \|\omega_i\| \|t - t'\| + L \|\Delta - \Delta'\| \right).
\end{align*}
Now, in order to cast this in terms of distance on $X^2$, let $\delta_x = x - x'$, $\delta_y = y - y'$. Then $\|q - q'\|^2 = \|\delta_x\|^2 + \|\delta_y\|^2$, and
\begin{align*}
  \left( \|\Delta - \Delta'\| + \|t - t'\| \right)^2
  &= \left( \sqrt{\|\delta_x\|^2 + \|\delta_y\|^2 - 2 \delta_x^T \delta_y} + \sqrt{\|\delta_x\|^2 + \|\delta_y\|^2 + 2 \delta_x^T \delta_y} \right)^2 \\
  &= 2 \|\delta_x\|^2 + 2 \|\delta_y\|^2 + 2 \sqrt{\left( \|\delta_x\|^2 + \|\delta_y\|^2 \right)^2 - 4 \left( \delta_x^T \delta_y \right)^2} \\
  &\le 4 \left( \|\delta_x\|^2 + \|\delta_y\|^2 \right),
\end{align*}
so that $\|\Delta - \Delta'\| + \|t - t'\| \le 2 \|q - q'\|$ and $\|\Delta - \Delta'\| \le 2 \|q - q'\|$, and so each term in the sum of $\breve f(q) - \breve f(q')$ has absolute value at most $\frac{2}{D} (\|\omega_i\| + L) \|q - q'\|$. Note that this agrees exactly with (17), but the sum in $\breve f(q) - \breve f(q')$ has $D$ terms rather than $D/2$. Defining $\breve g_r$ analogously to $\tilde g_r$, we thus get that
\[
  \log \mathbb{E} e^{\lambda (\breve g_r(q) - \breve g_r(q'))}
  \le \left( \frac{2}{D^2} \sum_{i=1}^{D} (\|\omega_i\| + L)^2 \right) \lambda^2 \|q - q'\|^2
  \le \frac{2}{D} (r + L)^2 \lambda^2 \|q - q'\|^2,
\]
and the conditions of the theorem hold with $v = \frac{4}{D} (r + L)^2$. Note that $\mathbb{E} \breve g_r(q_0) = 0$. Carrying out the rest of the argument, we get that
\[
  \mathbb{E}\left[ \sup \breve f \right]
  = \mathbb{E}_r\left[ \mathbb{E}\left[ \sup \breve g_r \right] \right]
  \le \mathbb{E}_r\left[ \frac{24 \gamma'_{\ell/\rho}\, \ell \sqrt{d}}{\sqrt{D}} (r + L) \right]
  = \frac{24 \gamma'_{\ell/\rho}\, \ell \sqrt{d}}{\sqrt{D}} (R + L),
\]
and similarly for $\mathbb{E} \sup(-\breve f)$. We do not have a guarantee that $\breve f(q)$ does not have a consistent sign, and so our bound becomes
\begin{align*}
  \mathbb{E} \|\breve f\|_\infty
  &\le \mathbb{E}\left[ \|\breve f\|_\infty \mid \breve f \text{ crosses } 0 \right] \Pr\left( \breve f \text{ crosses } 0 \right) + 3 \Pr\left( \breve f \text{ does not cross } 0 \right) \\
  &\le \frac{48 \gamma'_{\ell/\rho}\, \ell \sqrt{d}}{\sqrt{D}} (R + L) \Pr\left( \breve f \text{ crosses } 0 \right) + 3 \Pr\left( \breve f \text{ does not cross } 0 \right).
\end{align*}
$\Pr(\breve f \text{ crosses } 0)$ is extremely close to 1 in "usual" situations.
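The geometric step used in this proof — $\|\Delta - \Delta'\| + \|t - t'\| \le 2 \|q - q'\|$ — is easy to spot-check numerically. This is an illustrative sketch with an assumed dimension and random Gaussian pairs.

```python
import numpy as np

rng = np.random.default_rng(2)
# Check ||Delta - Delta'|| + ||t - t'|| <= 2 ||q - q'|| for random q = (x, y), q' = (x', y').
ok = True
for _ in range(1000):
    x, y, xp, yp = rng.standard_normal((4, 5))    # d = 5
    lhs = (np.linalg.norm((x - y) - (xp - yp))
           + np.linalg.norm((x + y) - (xp + yp)))
    rhs = 2 * np.linalg.norm(np.concatenate([x - xp, y - yp]))
    ok = ok and lhs <= rhs + 1e-9
print(ok)
```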

B PROOFS FOR L2 ERROR BOUND (SECTION 2.3)

B.1 BOUNDED CHANGE IN $\|\tilde f\|_\mu^2$

Viewing $\|\tilde f\|_\mu^2$ as a function of $\omega_1, \ldots, \omega_{D/2}$, we can bound the change due to replacing $\omega_1$ by $\hat\omega_1$. (By symmetry, the change is the same for any $\omega_i$.)
\begin{align*}
  \Big| \|\tilde f\|_\mu^2&(\omega_1, \omega_2, \ldots, \omega_{D/2}) - \|\tilde f\|_\mu^2(\hat\omega_1, \omega_2, \ldots, \omega_{D/2}) \Big| \\
  &= \Bigg| \int_{X^2} \left( \frac{2}{D} \cos(\omega_1^T (x - y)) + \sum_{i=2}^{D/2} \frac{2}{D} \cos(\omega_i^T (x - y)) - k(x, y) \right)^2 d\mu(x, y) \\
  &\qquad - \int_{X^2} \left( \frac{2}{D} \cos(\hat\omega_1^T (x - y)) + \sum_{i=2}^{D/2} \frac{2}{D} \cos(\omega_i^T (x - y)) - k(x, y) \right)^2 d\mu(x, y) \Bigg| \\
  &= \Bigg| \frac{4}{D^2} \int_{X^2} \left( \cos^2(\omega_1^T (x - y)) - \cos^2(\hat\omega_1^T (x - y)) \right) d\mu(x, y) \\
  &\qquad + \frac{4}{D} \int_{X^2} \left( \cos(\omega_1^T (x - y)) - \cos(\hat\omega_1^T (x - y)) \right) \left( \sum_{i=2}^{D/2} \frac{2}{D} \cos(\omega_i^T (x - y)) - k(x, y) \right) d\mu(x, y) \Bigg| \\
  &\le \frac{4}{D^2} \int_{X^2} d\mu(x, y) + \frac{4}{D} \int_{X^2} 4 \, d\mu(x, y) \\
  &= \left( \frac{4}{D^2} + \frac{16}{D} \right) \mu(X^2) = \frac{16 D + 4}{D^2} \mu(X^2).
\end{align*}

B.2 BOUNDED CHANGE IN $\|\breve f\|_\mu^2$

We can do essentially the same thing for $\breve f$:
\begin{align*}
  \Big| \|\breve f\|_\mu^2&(\omega_1, \omega_2, \ldots) - \|\breve f\|_\mu^2(\hat\omega_1, \omega_2, \ldots) \Big| \\
  &= \Bigg| \int_{X^2} \left( \frac{2}{D} \left( \cos(\omega_1^T (x - y)) + \cos(\omega_1^T (x + y) + 2 b_1) \right) + \sum_{i=2}^{D} \frac{2}{D} \left( \cos(\omega_i^T (x - y)) + \cos(\omega_i^T (x + y) + 2 b_i) \right) - k(x, y) \right)^2 d\mu(x, y) \\
  &\qquad - \int_{X^2} \left( \frac{2}{D} \left( \cos(\hat\omega_1^T (x - y)) + \cos(\hat\omega_1^T (x + y) + 2 b_1) \right) + \sum_{i=2}^{D} \frac{2}{D} \left( \cos(\omega_i^T (x - y)) + \cos(\omega_i^T (x + y) + 2 b_i) \right) - k(x, y) \right)^2 d\mu(x, y) \Bigg| \\
  &= \Bigg| \frac{4}{D^2} \int_{X^2} \left( \left( \cos(\omega_1^T (x - y)) + \cos(\omega_1^T (x + y) + 2 b_1) \right)^2 - \left( \cos(\hat\omega_1^T (x - y)) + \cos(\hat\omega_1^T (x + y) + 2 b_1) \right)^2 \right) d\mu(x, y) \\
  &\qquad + \frac{4}{D} \int_{X^2} \left( \cos(\omega_1^T (x - y)) + \cos(\omega_1^T (x + y) + 2 b_1) - \cos(\hat\omega_1^T (x - y)) - \cos(\hat\omega_1^T (x + y) + 2 b_1) \right) \\
  &\qquad\qquad \times \left( \sum_{i=2}^{D} \frac{2}{D} \left( \cos(\omega_i^T (x - y)) + \cos(\omega_i^T (x + y) + 2 b_i) \right) - k(x, y) \right) d\mu(x, y) \Bigg| \\
  &\le \frac{4}{D^2} \int_{X^2} 8 \, d\mu(x, y) + \frac{4}{D} \int_{X^2} 8 \, d\mu(x, y) \\
  &= \frac{32}{D^2} \mu(X^2) + \frac{32}{D} \mu(X^2) = 32 \, \frac{D + 1}{D^2} \mu(X^2).
\end{align*}
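These bounded-difference computations can be exercised numerically: with $\mu$ a probability measure ($\mu(X^2) = 1$) approximated by sampled pairs, swapping out $\omega_1$ changes the empirical $\|\tilde f\|_\mu^2$ by far less than the $(16D + 4)/D^2$ bound. All parameter values below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
D, d, n = 20, 3, 500                     # features, dimension, sample pairs standing in for mu

X = rng.uniform(-1, 1, (n, d))
Y = rng.uniform(-1, 1, (n, d))
deltas = X - Y
k_vals = np.exp(-np.sum(deltas ** 2, axis=1) / 2)   # Gaussian kernel, bandwidth 1

def sq_err(omegas):
    """Empirical ||f~||^2_mu, with mu uniform over the n sampled pairs (mu(X^2) = 1)."""
    s = np.mean(np.cos(deltas @ omegas.T), axis=1)  # (2/D) sum_i cos(w_i^T Delta)
    return float(np.mean((s - k_vals) ** 2))

omegas = rng.standard_normal((D // 2, d))
replaced = omegas.copy()
replaced[0] = rng.standard_normal(d)     # replace omega_1 with an independent draw
change = abs(sq_err(omegas) - sq_err(replaced))
print(change, (16 * D + 4) / D ** 2)     # the observed change respects the bound
```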
