IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 3, MARCH 2013

Learning With Kernel Smoothing Models and Low-Discrepancy Sampling

Cristiano Cervellera and Danilo Macciò

Abstract— This brief presents an analysis of the performance of kernel smoothing models used to estimate an unknown target function, addressing the case where the choice of the training set is part of the learning process. In particular, we consider a choice of the points at which the function is observed based on low-discrepancy sequences, a family of sampling methods commonly employed for efficient numerical integration. We prove that, under suitable regularity assumptions, consistency of the empirical risk minimization is guaranteed with a good rate of convergence of the estimation error, as well as convergence of the approximation error. Simulation results confirm, in practice, the good theoretical properties given by the combination of kernel smoothing models with low-discrepancy sampling.

Index Terms— Empirical risk minimization, function learning, kernel smoothing models, low-discrepancy sequences.

I. INTRODUCTION

Consider a learning context in which the input data are not generated by an external source. In this case, the points at which the unknown function is observed need to be chosen by some specific algorithm. This condition (referred to as design of experiments in statistics) can be found in applications from, among others, neural networks, robotics, optimal control, and statistics (see [1]–[4] and the references therein). Within such a framework, we address the case where kernel-based local models are employed for the empirical risk minimization (ERM) principle, which is typical of machine learning problems (see [5]–[7]).

Kernel smoothing models have been routinely employed in the literature for many applications, both in contexts where input data are provided by an external source and in those where the choice of the training set is an issue. Statistical regression, classification, and clustering are, for instance, typical problems where kernel models are popular (two good textbooks that treat the subjects in detail are [8] and [9]). Other important contexts in which kernel models are applied include, among others, approximate dynamic programming (ADP) and reinforcement learning, density estimation, control, and image processing (see [10]–[13] and the references therein).

In general, the learning case where observations can be chosen freely leads to the problem of generating a good sampling of the input space that can positively affect the rate of estimation of the best element within the class of approximating structures (sometimes referred to as sample complexity). This should be done without resorting to iterative point selection procedures that may lead to an infeasible computational burden [3].

Manuscript received October 18, 2011; revised October 26, 2012; accepted December 19, 2012. Date of publication January 14, 2013; date of current version January 30, 2013. The authors are with the Institute of Intelligent Systems for Automation, National Research Council, Genova 16149, Italy (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNNLS.2012.2236353

The issue is even more crucial with local models such as kernel smoothing approximators, since their structure depends directly on the observed data. The kind of sampling we focus on as "good" in the terms defined above is the class of so-called low-discrepancy sequences (see [14]). Such methods, introduced in the literature to outperform the classic Monte Carlo algorithms for numerical integration, have the advantage that asymptotic coverage of the input space is achieved deterministically, with a rate of convergence that is better than the O(L^(−1/2)) rate typical of a random extraction of points.

The general formulation of the addressed learning problem consists in estimating, inside a given class of parameterized models Ψ = {ψ(x, α), α ∈ Λ ⊂ R^k}, the element (i.e., the value of α) that best approximates a functional dependence of the form y = g(x), where x ∈ X ⊂ R^n and y ∈ Y ⊂ R. This can be done by choosing a finite sample of points in X, Ξ_L = {x_1, …, x_L}, at which we are able to observe the output of the function g; we then obtain the set of output observations in Y, Υ_L = {y_1, …, y_L}, with y_l = g(x_l) + η_l, where the terms η_l are independent identically distributed (i.i.d.) realizations of a random noise η with density p(η), having zero mean and bounded values (i.e., η ∈ E = [η_min, η_max]). In this context, where the input sample is generated by a deterministic rule (i.e., not by a random process with a given probability), it is reasonable to evaluate the goodness of the approximation by a functional risk that weights the distance of ψ from y uniformly over X:

R(α) = ∫_{X×E} ℓ(y, ψ(x, α)) p(η) dx dη

where the loss function ℓ : Y² → R has the properties ℓ(t, t) = 0 and ℓ(t, s) > 0 for every t ≠ s. Because of its simplicity and mathematical tradition, the most used loss function is the squared error ℓ(t, s) = (t − s)²; in this case, R is commonly referred to as the least-squares error.

Concerning the class of models Ψ, as mentioned, we investigate the use of kernel-based smoothing structures that perform a local estimation of the unknown function. This means that we assume to know the target function g in a set of points in X, Ξ_K = {ξ_1, …, ξ_K}, different from Ξ_L, and define the estimation ψ at a generic point x as an average of g evaluated at the points of Ξ_K, weighted by the distance of x from every ξ_k. More specifically, given the set U_K = {y_1, …, y_K}, with y_k = g(ξ_k) + η_k, we define the estimation of g(x) as

ψ(x, α) = Σ_{k=1}^{K} K_α(x, ξ_k) y_k / Σ_{k=1}^{K} K_α(x, ξ_k)     (1)

where K_α(x, ξ_k) = G(‖x − ξ_k‖/α) is called the kernel and G(t) is a nonincreasing function for t ≥ 0. The most commonly employed is the Gaussian kernel G(t) = e^(−πt²). The term α ∈ (0, ∞), also called the bandwidth, defines the range of influence of the kernel and plays the role of the parameter to be optimized in the minimization of the risk. The estimation problem is thus reduced to finding the α* such that R(α*) = min_{α>0} R(α).
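As an illustration, the estimator (1) with the Gaussian kernel can be sketched in a few lines of Python. The target function and parameter values below are our own illustrative choices, not taken from the brief:

```python
import numpy as np

def gaussian_kernel(t):
    # G(t) = exp(-pi * t^2): nonincreasing for t >= 0
    return np.exp(-np.pi * t ** 2)

def kernel_smoother(x, centers, values, alpha):
    # Estimate (1): average of observed values weighted by K_alpha(x, xi_k)
    d = np.linalg.norm(np.asarray(x, dtype=float) - centers, axis=1)
    w = gaussian_kernel(d / alpha)
    return float(np.sum(w * values) / np.sum(w))
```

For a noise-free target such as g(ξ) = ξ₁ + ξ₂ observed on a symmetric grid, the estimate at the grid center recovers the exact value by symmetry of the weights.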

2162–237X/$31.00 © 2013 IEEE


Since the minimization of the risk would require knowledge of the target function g at every point of the domain, the ERM principle must be employed. This relies on the concept of empirical risk, an approximation of the true risk based on the set of input/output observations (Ξ_L, Υ_L), defined as

R_emp(α) = (1/L) Σ_{l=1}^{L} ℓ(y_l, ψ(x_l, α)).     (2)
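Since the only free parameter is the scalar bandwidth α, minimizing (2) reduces to a 1-D search. A minimal sketch, assuming the Gaussian-kernel model of (1) and an illustrative grid of candidate bandwidths (the grid and data are our own choices):

```python
import numpy as np

def psi(x, alpha, centers, center_y):
    # Kernel smoothing model (1) with the Gaussian kernel G(t) = exp(-pi t^2)
    w = np.exp(-np.pi * (np.linalg.norm(x - centers, axis=1) / alpha) ** 2)
    return np.sum(w * center_y) / np.sum(w)

def empirical_risk(alpha, train_x, train_y, centers, center_y):
    # R_emp(alpha) = (1/L) sum_l (y_l - psi(x_l, alpha))^2 (squared-error loss)
    return float(np.mean([(y - psi(x, alpha, centers, center_y)) ** 2
                          for x, y in zip(train_x, train_y)]))

def erm_bandwidth(train_x, train_y, centers, center_y, grid):
    # 1-D minimization of the empirical risk over a grid of bandwidths
    risks = [empirical_risk(a, train_x, train_y, centers, center_y) for a in grid]
    return grid[int(np.argmin(risks))]
```

A finer search (e.g., golden-section) could replace the grid, but the point is that training costs no more than a handful of 1-D evaluations.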

Then, the learning problem can be stated as follows: find α*_L such that R_emp(α*_L) = min_{α>0} R_emp(α). The ERM procedure is said to be consistent when the estimation error Δ_R(α*_L) = R(α*_L) − R(α*) tends to 0 as L → ∞. Notice that such training, corresponding to a 1-D minimization, makes kernel-based models very attractive from a computational point of view.

In the following, we analyze the use of low-discrepancy sequences in the local kernel learning context, showing that, under some regularity conditions on the involved functions, they allow us to obtain universal approximation capabilities of the kernel models (i.e., convergence of R(α*), also called the approximation error, as the number of kernels K grows) and consistency of the ERM principle, with an advantageous asymptotic rate of convergence of the estimation error Δ_R(α*_L).

II. LOW-DISCREPANCY SEQUENCES

Low-discrepancy sequences are special families of algorithms that generate points scattered uniformly, according to a measure called discrepancy. Consider the number C(B, Ξ_L) of points of Ξ_L belonging to a set B. In the following, we will assume that X = [0, 1]^n, i.e., the unit n-dimensional hypercube. This is not a limitation, since the results can be extended to more complex subsets of R^n by suitable transformations [15].

Definition 4: When γ is the family of all the closed subintervals of X that can be written as ∏_{i=1}^{n} [0, b_i], the star discrepancy D*(Ξ_L) is defined as

D*(Ξ_L) = sup_{B∈γ} | C(B, Ξ_L)/L − λ(B) |     (3)

where λ(B) is the Lebesgue measure of B.

The idea underlying the low-discrepancy concept is that we can obtain samples that are well spread over X by ensuring that every basic subset into which the set can be subdivided contains a number of points of the sequence that is as proportional as possible to the volume of the subset itself. This can be achieved by employing a family of sequences that implement such a concept, i.e., (t, n)-sequences [14].
The key feature of (t, n)-sequences is that they satisfy deterministically [14, p. 95]

D*(Ξ_L) = O(L^(−1) (log L)^(n−1)).     (4)

Then, the rate at which the discrepancy of the sample points coming from (t, n)-sequences converges as L grows, i.e., how fast and uniformly the space is covered by adding new points, is almost linear (ignoring the logarithmic factor). For comparison, it can be proved [15] that the discrepancy of a sample of L points i.i.d. with uniform distribution on X, which is typical of Monte Carlo methods, is O(L^(−1/2)).
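Computing the exact star discrepancy is expensive in general; for n = 2 a simple approximation evaluates the sup in (3) only at boxes whose corners take coordinates from the sample (the counting below uses half-open boxes, a common convention). The golden-ratio lattice used for illustration is our own choice of a low-discrepancy-style construction and is not discussed in the brief:

```python
import numpy as np

def star_discrepancy_2d(points):
    # Approximate D*(points) for n = 2: the sup in (3) is evaluated only at
    # boxes [0, b1) x [0, b2) whose corners take coordinates from the sample
    pts = np.asarray(points, dtype=float)
    L = len(pts)
    xs = np.unique(np.append(pts[:, 0], 1.0))
    ys = np.unique(np.append(pts[:, 1], 1.0))
    worst = 0.0
    for b1 in xs:
        for b2 in ys:
            count = np.sum((pts[:, 0] < b1) & (pts[:, 1] < b2))  # C(B, L)
            worst = max(worst, abs(count / L - b1 * b2))         # |C/L - lambda(B)|
    return worst

# Illustrative comparison: a golden-ratio lattice versus seeded i.i.d. points
L = 128
k = np.arange(1, L + 1)
lattice = np.column_stack([(k - 0.5) / L, (k * 0.6180339887498949) % 1.0])
random_pts = np.random.default_rng(0).random((L, 2))
```

On samples of this size the lattice typically attains a visibly smaller discrepancy than the i.i.d. points, mirroring the almost-linear versus O(L^(−1/2)) rates discussed above.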

Fig. 1. (a) Sobol', (b) Halton, and (c) purely random sequence.

Different algorithms can lead to (t, n)-sequences. For instance, the Halton sequence, the Sobol' sequence, and the Niederreiter sequence are all examples of (t, n)-sequences. The procedure to obtain this kind of sequence varies from method to method. As an example, the construction of a Halton sequence [15, p. 27] is briefly reported. Any natural number k has a unique representation in base m of the form k = q_0 + q_1 m + q_2 m² + ⋯ + q_r m^r, where m is a natural number ≥ 2 and m^r ≤ k < m^(r+1). Then, the radical inverse of k with base m is defined as

y_m(k) = q_0 m^(−1) + q_1 m^(−2) + ⋯ + q_r m^(−r−1) ∈ (0, 1).

Consider n different prime numbers p_i, i = 1, …, n. Then, the kth point of a Halton sequence is defined as

z_k = (y_{p_1}(k), …, y_{p_n}(k)),     k = 1, 2, ….
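The radical-inverse construction above translates directly into code; this sketch follows the base-m digit expansion (function names are our own):

```python
def radical_inverse(k, m):
    # y_m(k): reflect the base-m digits q_0 q_1 ... q_r of k about the radix point
    y, f = 0.0, 1.0 / m
    while k > 0:
        k, digit = divmod(k, m)
        y += digit * f
        f /= m
    return y

def halton_point(k, primes):
    # k-th point of the Halton sequence with the given prime bases
    return tuple(radical_inverse(k, p) for p in primes)
```

For instance, in base 2 the sequence y_2(1), y_2(2), y_2(3), … starts 1/2, 1/4, 3/4, filling the unit interval ever more finely.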

References [14] and [15] are good textbooks where methods to obtain (t, n)-sequences are described, together with their properties. Nowadays, such sampling schemes are already implemented in many popular software packages (such as, e.g., MATLAB) and in libraries available on the Web.

Fig. 1 illustrates the sampling of a bidimensional space with two different low-discrepancy sequences, namely a Halton and a Sobol' sequence. For comparison, an i.i.d. random sequence with uniform distribution is also depicted. It can be clearly noticed how the (t, n)-sequences ensure better uniformity.

III. ANALYSIS OF THE ESTIMATION ERROR

Here we analyze the convergence to 0 of the estimation error Δ_R(α*_L) when L → ∞, assuming the loss function is quadratic, i.e., ℓ(t, s) = (t − s)². The class of kernel smoothing models Ψ = {ψ(x, α), α > 0} is employed, with ψ defined as in (1). We first provide some definitions of quantities that will be useful in the following. For each vertex of a given subinterval B = ∏_{i=1}^{n} [a_i, b_i] of X, we can define a binary label which assigns "0" to every a_i and "1" to every b_i. Given a function ϕ : X → R, denote by Δ(ϕ, B) the alternating sum of ϕ computed at the vertexes of B, i.e., Δ(ϕ, B) = Σ_{x∈e_B} ϕ(x) − Σ_{x∈o_B} ϕ(x), where e_B is the set of vertexes with an even number of "1"s in their label, and o_B is the set of vertexes with an odd number of "1"s.

Definition 5: The variation of ϕ on X in the sense of Vitali is defined by [14]

V^(n)(ϕ) = sup_℘ Σ_{B∈℘} |Δ(ϕ, B)|     (5)

where ℘ is any partition of X into subintervals.
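Definition 5 can be checked numerically: for ϕ(x₁, x₂) = x₁x₂ every alternating sum over a cell [a₁, b₁] × [a₂, b₂] equals (b₁ − a₁)(b₂ − a₂), so the sums over any partition of [0, 1]² total exactly 1. A small sketch (the test function is our own illustrative choice):

```python
def alternating_sum(phi, a1, b1, a2, b2):
    # Delta(phi, B) for B = [a1, b1] x [a2, b2]: vertexes with an even number
    # of "1" labels enter with +, those with an odd number with -
    return phi(b1, b2) + phi(a1, a2) - phi(a1, b2) - phi(b1, a2)

def vitali_variation_on_grid(phi, m):
    # The sum in (5) for the uniform m x m partition of [0, 1]^2
    # (a lower bound on V^(2)(phi); tight here since phi is smooth)
    total = 0.0
    for i in range(m):
        for j in range(m):
            total += abs(alternating_sum(phi, i / m, (i + 1) / m, j / m, (j + 1) / m))
    return total
```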


When the partial derivatives of ϕ are continuous on X, V^(n)(ϕ) can be written in an easier way as [14, p. 19]

V^(n)(ϕ) = ∫_0^1 ⋯ ∫_0^1 | ∂^n ϕ / (∂x_1 ⋯ ∂x_n) | dx_1 ⋯ dx_n     (6)

where x_i is the ith component of x.

Definition 6: The variation of ϕ on X in the sense of Hardy and Krause is defined by [14]

V_HK(ϕ) = Σ_{k=1}^{n} Σ_{1 ≤ i_1 < ⋯ < i_k ≤ n} V^(k)(ϕ; i_1, …, i_k)     (7)

where V^(k)(ϕ; i_1, …, i_k) denotes the variation in the sense of Vitali of the restriction of ϕ to the k-dimensional face of X indexed by i_1, …, i_k.

Then there exists M* such that ψ belongs to W_{M*}(X). As to point 2), we have seen from (4) that low-discrepancy sequences satisfy the requirement. Then, sampling of the input space based on this kind of sequence guarantees consistency of the ERM principle with efficient error rates.

IV. ANALYSIS OF THE APPROXIMATION CAPABILITIES

The following assumptions are introduced to prove the main results of the section concerning the approximation error R(α*).

Assumption 2: The target function g : X → R is continuous on X.

Assumption 3: The dispersion of the set Ξ_K, defined [14] as δ*(Ξ_K) = sup_{x∈X} min_{k=1,…,K} ‖x − ξ_k‖, is such that lim_{K→∞} δ*(Ξ_K) = 0.

Dispersion is a measure of uniformity strictly related to discrepancy. In fact, Assumption 3 is verified when low-discrepancy sequences are employed for the construction of the set Ξ_K, since not only the discrepancy of such sequences but also their dispersion tends to 0 as K → ∞ [14].

The next assumption uses the following notation, which will also be useful for the proof of Theorem 4. Define δ_min(x) = min_{k=1,…,K} ‖x − ξ_k‖ and k_min^j, j = 1, …, N, the corresponding indexes such that δ_min(x) = ‖x − ξ_{k_min^j}‖. Then, set δ̃(x) = min_{k ≠ k_min^j} ‖x − ξ_k‖ and k̃ such that δ̃(x) = ‖x − ξ_k̃‖, assuming, for simplicity and without loss of generality, that such an index is unique.

Assumption 4: The kernel function is such that, for every x ∈ X and j = 1, …, N,

lim_{α↓0} K_α(x, ξ_k̃) / K_α(x, ξ_{k_min^j}) = lim_{α↓0} G(δ̃(x)/α) / G(δ_min(x)/α) = 0.

Notice that Assumption 4 is fulfilled by most common kernels, such as the Gaussian.

Assumption 5: The set of points Ξ_K and the kernel K_α are


such that, for every x ∈ X and α > 0,

lim_{K→∞} K · sup_{1≤i≤K} ( K_α(x, ξ_i) / Σ_{k=1}^{K} K_α(x, ξ_k) )² = 0.
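For intuition, the quantity in Assumption 5 can be evaluated numerically for the Gaussian kernel with equispaced centers in [0, 1] (a setting of our own choosing): it decreases roughly like 1/(Kα²) as K grows:

```python
import numpy as np

def assumption5_quantity(K, x=0.5, alpha=0.3):
    # K * sup_i ( K_alpha(x, xi_i) / sum_k K_alpha(x, xi_k) )^2
    centers = (np.arange(K) + 0.5) / K   # equispaced xi_k in [0, 1]
    w = np.exp(-np.pi * ((x - centers) / alpha) ** 2)
    return float(K * np.max(w / np.sum(w)) ** 2)

vals = [assumption5_quantity(K) for K in (10, 100, 1000)]
```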


We now introduce a technical lemma that will be useful for the main result.

Lemma 3: Let η = {η_1, …, η_K} be a sample of i.i.d. realizations of a random variable with zero mean belonging to the set E = [η_min, η_max]. Then Σ_{k=1}^{K} β_k η_k converges to zero in probability for K → ∞, i.e.,

lim_{K→∞} Prob_{η∈E^K} { | Σ_{k=1}^{K} β_k η_k | ≥ ε } = 0   for every ε > 0

provided the sequence of values {β_1, β_2, …} is such that K sup_{1≤i≤K} β_i² → 0 as K → ∞.

Theorem 4: If Assumptions 2 and 3 hold, then there exists α*(K) such that, for every ε > 0 and every x ∈ X,

lim_{K→∞} Prob_{η∈E^K} { |g(x) − ψ(x, α*(K))| ≥ ε } = 0.
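Lemma 3 can be illustrated by simulation: with β_k = 1/K we have K sup_i β_i² = 1/K → 0, so the weighted noise sum should concentrate at 0 as K grows. A seeded Monte Carlo sketch (the noise distribution and sample sizes are our own choices):

```python
import numpy as np

def max_weighted_noise_sum(K, trials, rng):
    # beta_k = 1/K, eta_k uniform on [-1, 1]: K sup_i beta_i^2 = 1/K -> 0,
    # so Lemma 3 predicts sum_k beta_k eta_k -> 0 in probability
    sums = (rng.uniform(-1.0, 1.0, size=(trials, K)) / K).sum(axis=1)
    return float(np.max(np.abs(sums)))

rng = np.random.default_rng(0)
small_K = max_weighted_noise_sum(10, 200, rng)
large_K = max_weighted_noise_sum(10_000, 200, rng)
```

Even the worst case over 200 trials shrinks by more than an order of magnitude between K = 10 and K = 10 000.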

By inspecting the proof, it can be noticed that Assumption 5 is required by Theorem 4 only to annihilate the probabilistic part of the error (12). If this condition is not met, a bias in the approximation is introduced, whose value depends on the intensity of the noise. In this case, a tradeoff of the bias–variance kind, typical of most learning algorithms, arises. In the noise-free case, i.e., when η_k = 0, k = 1, …, K, the following result is a consequence of the dominated convergence theorem.

Corollary 1: If Assumptions 2 and 3 hold and η_k = 0, k = 1, …, K, then R(α*(K)) → 0 as K → ∞.

V. SIMULATION TESTS

In order to show the advantages of low-discrepancy sequences with kernel models in a context where input samples have to be chosen, two test cases of real applicative interest have been addressed through simulation.

Test Case 1: The first problem concerns the estimation, through a learning procedure, of the behavior of a complex system as a function of parameters that we can control. For the actual test, a classic system representing a trolley supporting a payload through a cable (which can be used to model, e.g., the dynamics of a crane loading and/or unloading objects) was considered. The dynamics of the system were simulated through the well-known Lagrange equations

(M + m) ÿ − m l cos θ θ̈ + m l sin θ θ̇² = F_y
cos θ ÿ − l θ̈ + g sin θ = 0     (9)

where y and θ are the position of the trolley and the angle of the cable, respectively, M is the mass of the trolley, m is the mass of the payload, l is the length of the cable, g = 9.8 m/s² is the acceleration due to gravity, and F_y is a horizontal force that can be applied to move the trolley. The response we want to model is the final position y_T of the trolley, starting from equilibrium, when a given force F_y is applied for T seconds with a payload of m kilograms. The 3-D input vector is then x = [m, F_y, T], with m ∈ [0, 1] kg, F_y ∈ [0, 10] N, and T ∈ [0, 1] s. The input/output pairs for the ERM learning are formed by the values of x coming from the chosen sampling scheme and the corresponding final positions y_T(x), computed by discretizing (9) with M = 2 kg and l = 1.5 m, and adding a measurement noise with Gaussian distribution, zero mean, and 0.01 variance.

TABLE I
MAE OF THE ERRORS (IN 10⁻² UNITS) FOR THE TROLLEY PROBLEM AND p-VALUES FOR THE TESTS VERSUS RANDOM SEQUENCES

K, L          200, 50     500, 100    1000, 200   2000, 400
Sobol'        3.7, p ∼ 0  2.5, p ∼ 0  1.9, p ∼ 0  1.6, p ∼ 0
Halton        3.4, p ∼ 0  2.5, p ∼ 0  1.8, p ∼ 0  1.5, p ∼ 0
Niederreiter  3.7, p ∼ 0  2.4, p ∼ 0  1.9, p ∼ 0  1.6, p ∼ 0
Random        4.7         3.2         2.5         2.0

Test Case 2: The second test involves estimation of the value function in ADP [4]. Specifically, a 9-D inventory forecasting problem, a classic test bed for ADP (see [16], [17]), has been considered. In such an inventory problem, the demand for three items must be satisfied over a finite horizon of temporal stages, while the storage levels must be kept as close to 0 as possible. The components of the state vector x_{t,j} are the item storage in period t for j = 1, 2, 3 (taken in the ranges [−10, 10], [−12, 12], [−7.5, 7.5]), the forecasts for the demand of each item in period t + 1 for j = 4, 5, 6 (ranges [0, 10], [0, 12], [0, 7.5]), and the forecasts for the demand of each item in period t + 2 for j = 7, 8, 9 (ranges [0, 6.5], [0, 8], [0, 5]). The control variable u_{t,j} denotes the amount of product j ordered at stage t. Also define the 9-D random vector ζ_t, representing a correction between the forecast and the true demand of the items, which is assumed to have a lognormal distribution. The single-stage cost function has the following V-shaped structure:

h(x_t, u_t, ζ_t) = Σ_{j=1}^{3} [ β_j max{x_{t+1,j}, 0} − π_j min{0, x_{t+1,j}} ]

where x_{t+1,j} follows the state equation

x_{t+1,j} = x_{t,j} + u_{t,j} − x_{t,j+N} · ζ_{t,j}   for j = 1, 2, 3
x_{t+1,j} = x_{t+1,j+N} · ζ_{t,j}                     for j = 4, 5, 6
x_{t+1,j} = d_j · ζ_{t,j}                             for j = 7, 8, 9.

The coefficients β_j ≥ 0 and π_j ≥ 0 are the holding cost and the backorder cost parameters for item j, respectively, while d_j is a constant. The components of the control vector u_t are constrained so that u_{t,1} + u_{t,2} ≤ 10, u_{t,2} + u_{t,3} ≤ 9, and u_{t,j} ≥ 0 for every j. The well-known ADP paradigm involves the computation, at each stage t and for each state point x_t, of the value function defined as

J_t(x_t) = min_{u_t ∈ U} E_{ζ_t} [ h(x_t, u_t, ζ_t) + Ĵ_{t+1}(x_{t+1}) ]     (10)
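The trolley dynamics (9) can be discretized by solving the two equations for the accelerations ÿ and θ̈ at each step. A minimal semi-implicit Euler sketch (the step size and the noise-free inputs are our own illustrative choices; the brief does not specify the integration scheme):

```python
import numpy as np

def final_position(m, Fy, T, M=2.0, l=1.5, g=9.8, dt=1e-3):
    # Semi-implicit Euler discretization of (9): at each step, solve the
    # linear 2x2 system for the accelerations (ydd, thdd)
    y = yd = th = thd = 0.0  # start from equilibrium
    for _ in range(round(T / dt)):
        A = np.array([[M + m, -m * l * np.cos(th)],
                      [np.cos(th), -l]])
        b = np.array([Fy - m * l * np.sin(th) * thd ** 2,
                      -g * np.sin(th)])
        ydd, thdd = np.linalg.solve(A, b)
        yd += dt * ydd
        thd += dt * thdd
        y += dt * yd
        th += dt * thd
    return y  # y_T, the final trolley position
```

The 2×2 system is always solvable, since det A = −l(M + m sin²θ) never vanishes.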

TABLE II
MAE OF THE ERRORS FOR THE INVENTORY PROBLEM AND p-VALUES FOR THE TESTS VERSUS RANDOM SEQUENCES

K, L          400, 100            2000, 500           4000, 1000          8000, 2000
Sobol'        1.05, p = 1.4·10⁻⁴  0.78, p = 5.9·10⁻⁶  0.70, p = 2.7·10⁻³  0.63, p = 0.42
Halton        1.03, p = 4.1·10⁻⁵  0.78, p = 3·10⁻³    0.70, p = 4.5·10⁻³  0.62, p = 1.1·10⁻⁶
Niederreiter  1.05, p = 4.6·10⁻¹⁰ 0.80, p = 1.8·10⁻⁷  0.70, p = 5.2·10⁻³  0.60, p = 0.12
Random        1.09                0.82                0.72                0.63

where Ĵ_{t+1} is an approximation of the value function that quantifies the cost to be paid from state x_{t+1} to the end of the horizon. Obtaining such an approximation is a typical ERM learning problem. Notice that a good approximation leads to better policies when the actual control vector [i.e., the argument of (10)] is computed. In this case, the L input/output pairs are formed, at stage t, by the state x_t and the corresponding J_t computed by (10) (estimating the mean over ζ_t with an average over a finite set of Q realizations). For the tests, we considered learning the value function for stage T, assuming, by definition, that J_{T+1} = 0.

Test Procedure and Results: For both test cases, three different kinds of low-discrepancy sequences, namely 1) a Sobol' sequence; 2) a Halton sequence; and 3) a Niederreiter sequence, have been employed for the ERM procedure by minimizing (2), with ψ a kernel smoothing model approximating the final position of the trolley and the value function, respectively. Then, for a comparison with a possible alternative sampling scheme, random i.i.d. points drawn with uniform distribution have also been employed. For each sampling scheme, repeated tests have been performed by choosing 20 different instances of the sequence, leading to different approximations ŷ_{T,j} and Ĵ_{T,j}, j = 1, …, 20, for the first and the second test case, respectively. Then, the performance of each approximation has been evaluated by computing the absolute difference e_ij between the output of the trained model and the true final position y_T (test case 1) or the true value function J_T (test case 2) at 1000 test input points drawn with uniform probability (i.e., e_ij = |ŷ_{T,j}(x_i) − y_T(x_i)| and e_ij = |Ĵ_{T,j}(x_{T,i}) − J_T(x_{T,i})|, respectively).
The total average errors ē over the 1000 test points and the 20 different instances, i.e.,

ē = (1/1000) Σ_{i=1}^{1000} (1/20) Σ_{j=1}^{20} e_ij

are reported in Tables I and II for various combinations of K and L in both test problems. The tables also show, for each low-discrepancy kind, the p-values for the test of the hypothesis that the errors e_ij from random sequences of the same size have larger mean (using a t-test with unequal population variances). The results are consistent with the theory presented in the previous sections. In particular, they show that the low-discrepancy sequences lead to better accuracy as the number of sample points grows, and they always yield the best performance in both test problems. This is more evident in the first test problem, which is characterized by a smoother cost. Yet, also in the DP example, low-discrepancy sets turn out to be the best choice. In fact, the p-values show that only in one
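The aggregation of the errors e_ij and the unequal-variance t-test can be sketched as follows (a minimal reimplementation of the protocol; Welch's statistic is computed without the associated p-value, and all function names are our own):

```python
import numpy as np

def mean_absolute_error(errors):
    # e_bar = (1/1000) sum_i (1/20) sum_j e_ij, i.e., the overall mean of e_ij
    return float(np.mean(errors))

def welch_t(a, b):
    # Welch's t statistic for two samples with unequal variances
    va = np.var(a, ddof=1) / len(a)
    vb = np.var(b, ddof=1) / len(b)
    return float((np.mean(a) - np.mean(b)) / np.sqrt(va + vb))
```

A negative statistic for (low-discrepancy errors, random errors) indicates that the low-discrepancy sample achieved the smaller mean error.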

Fig. 3. MAE for the corrupted Sobol' sequences in (a) the trolley and (b) the inventory case, for increasing level of corruption (number of clusters).

case (Sobol’ with K = 8000, L = 2000) the performance of the random sequence is statistically similar, while in all other cases low-discrepancy sequences win with large confidence. All the tests have been performed on a 1.8-GHz Intel Core2 machine, with 2 GB of memory. Then, another round of tests has been set up to investigate the importance of ensuring that the discrepancy of the training set converges as the number of points grow. In particular, two of the low-discrepancy sets used for the first round have been chosen as a starting base for  K and L (namely, the Sobol’ sets with K = 1000, L = 200 and K = 4000, L = 1000 for test case 1 and 2, respectively), and their points have been perturbed by moving them far from randomly selected forbidden zones, in order to form clusters that corrupt the lowdiscrepancy structure of the set and prevent convergence of the discrepancy. Kernel approximations of the final trolley position and value function have been computed as before, using three new training sets defined in this way, obtained by employing an increasing number of forbidden zones. The results are shown in Fig. 3 and prove that the more a sampling scheme deviates from uniformity, the worse is the performance of the kernel learning task. A PPENDIX Proof of Theorem 2: Because the noise has zero mean, the risk can be easily written as η2 p(η) dη. R(α) = (g(x) − ψ(x, α))2 p(η) d x + X

E

Analogously, the empirical risk can be decomposed as

R_emp(α) = (1/L) Σ_{l=1}^{L} (g(x_l) − ψ(x_l, α))² + (1/L) Σ_{l=1}^{L} η_l² + (2/L) Σ_{l=1}^{L} η_l (g(x_l) − ψ(x_l, α)).
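This decomposition is a purely algebraic identity, given y_l = g(x_l) + η_l, and can be verified numerically with arbitrary values (the seeded data below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
L = 50
g_vals = rng.random(L)            # g(x_l)
psi_vals = rng.random(L)          # psi(x_l, alpha)
eta = rng.uniform(-0.1, 0.1, L)   # bounded zero-mean noise realizations
y = g_vals + eta                  # y_l = g(x_l) + eta_l

lhs = np.mean((y - psi_vals) ** 2)                  # R_emp(alpha)
rhs = (np.mean((g_vals - psi_vals) ** 2)            # deterministic part
       + np.mean(eta ** 2)                          # noise part
       + 2.0 * np.mean(eta * (g_vals - psi_vals)))  # cross term
```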


The assertion then follows from the application of the Koksma–Hlawka inequality (8) and the definitions of α* and α*_L, similarly to [3, Th. 2].

Proof of Theorem 3: Let us rewrite the kernel smoothing model ψ(x, α) as a composition ψ(x, α) = Φ(μ_1(x), …, μ_K(x)), where μ_p(·) = K_α(·, ξ_p) : X → [a_p(α), b_p(α)] for 0 < a_p(α) < b_p(α), and Φ(z) = ( Σ_{p=1}^{K} u_p z_p ) / ( Σ_{p=1}^{K} z_p ), with u_p = g(ξ_p) + η_p, p = 1, …, K. From the computation of the partial derivatives of Φ(z), we have that | ∂^k Φ(z) / (∂z_{i_1} ⋯ ∂z_{i_k}) |, for 1 ≤ i_k ≤ K, is bounded by a constant M_1 depending on k, the noise bound η_max, the observed values y_j, and the kernel bounds a_j(α), b_j(α). Then, Φ belongs to the class W_{M_1}( ∏_{p=1}^{K} [a_p(α), b_p(α)] ). The conclusion directly follows from [3, Lemma 1].

Proof of Lemma 3: Since the terms β_k η_k are independent bounded random variables with zero mean, the application of Hoeffding's inequality [5] yields

Prob_{{η_1,…,η_K}} { | Σ_{k=1}^{K} β_k η_k | ≥ ε } ≤ exp( −2ε² / (K R²) )

where R = sup_{1≤i≤K} |β_i| (η_max − η_min). Since K sup_{1≤i≤K} β_i² → 0 as K → ∞ by assumption, the conclusion follows.

Proof of Theorem 4: The following notation will be adopted: δ_k(x) = ‖x − ξ_k‖, Δ_k(x) = |g(x) − g(ξ_k)|, Δ̄ = max_{x,z∈X} |g(x) − g(z)|. First we notice that the error can be decomposed into two terms, one completely deterministic, the other containing the dependence on the random noise:

|g(x) − ψ(x, α)| ≤ | g(x) − Σ_{k=1}^{K} K_α(x, ξ_k) g(ξ_k) / Σ_{k=1}^{K} K_α(x, ξ_k) |     (11)
  + | Σ_{k=1}^{K} K_α(x, ξ_k) η_k / Σ_{k=1}^{K} K_α(x, ξ_k) |.     (12)

We first verify that there exists α*(K) such that (11) vanishes on X when K increases indefinitely in the case where there is no noise [i.e., when the term (12) is null]. Since g(ξ_k) = g(x) ± Δ_k(x), in the absence of noise we can write

ψ(x, α) = g(x) ± Σ_{k=1}^{K} K_α(x, ξ_k) Δ_k(x) / Σ_{k=1}^{K} K_α(x, ξ_k).

For given α and K, consider the set Ξ_N(x) of the N points such that ‖x − ξ_n‖ ≤ 2δ* for ξ_n ∈ Ξ_N(x), where δ* is the dispersion of Ξ_K. Then, we can write |g(x) − ψ(x, α)| as

Σ_{ξ_k ∈ Ξ_N(x)} K_α(x, ξ_k) Δ_k(x) / Σ_{k=1}^{K} K_α(x, ξ_k) + Σ_{ξ_j ∉ Ξ_N(x)} K_α(x, ξ_j) Δ_j(x) / Σ_{k=1}^{K} K_α(x, ξ_k)
  ≤ max_{ξ_j ∈ Ξ_N(x)} Δ_j(x) + ( K_α(x, ξ_k̃) / K_α(x, ξ_{k_min^1}) ) (K − N) Δ̄     (13)


where ξ_k̃ is the point outside Ξ_N(x) closest to x. The first term in the last expression is due to the convexity of the kernel structure. Now, let ε be a positive quantity. Because X is compact, g is uniformly continuous on X, i.e., there exists δ̂ such that Δ_j(x) ≤ ε/2 when ‖x − ξ_j‖ ≤ δ̂, for every x and every ξ_j ∈ Ξ_N(x). Since the dispersion δ*(Ξ_K) tends to 0 as K grows, we can take K̂ sufficiently large such that ‖x − ξ_j‖ < 2δ*(Ξ_K) < δ̂ when K > K̂, for all x ∈ X and all ξ_j ∈ Ξ_N(x). Then, because of Assumption 4, we can take α* > 0 sufficiently small such that the second addend of (13) is less than ε/2 for every x ∈ X. Then, we obtain |g(x) − ψ(x, α*)| < ε/2 + ε/2, i.e., for any ε > 0 there exist K̂ sufficiently large and α*(K) sufficiently small such that |g(x) − ψ(x, α*(K))| < ε for K > K̂.

Let us now focus on the probabilistic part of the error, i.e., the quantity defined in (12). We know by Assumption 5 that K sup_{1≤i≤K} ( K_{α*}(x, ξ_i) / Σ_{k=1}^{K} K_{α*}(x, ξ_k) )² → 0 as K → ∞, and that {η_1, …, η_K} is a set of realizations of a bounded random variable. Then, we can invoke Lemma 3 to conclude that (12) also tends to zero in probability as K increases.

REFERENCES

[1] K. Fukumizu, "Statistical active learning in multilayer perceptrons," IEEE Trans. Neural Netw., vol. 11, no. 1, pp. 17–26, Jan. 2000.
[2] R. King, K. Whelan, F. Jones, P. Reiser, C. Bryant, S. Muggleton, D. Kell, and S. Oliver, "Functional genomic hypothesis generation and experimentation by a robot scientist," Nature, vol. 427, pp. 247–252, Jan. 2004.
[3] C. Cervellera and M. Muselli, "Deterministic design for neural network learning: An approach based on discrepancy," IEEE Trans. Neural Netw., vol. 15, no. 3, pp. 533–543, May 2004.
[4] W. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, 2nd ed. New York: Wiley, 2011.
[5] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[6] C. Zhang, W. Bian, D. Tao, and W. Lin, "Discretized-Vapnik–Chervonenkis dimension for analyzing complexity of real function classes," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 9, pp. 1461–1472, Sep. 2012.
[7] W. Bian and D. Tao, "Constrained empirical risk minimization framework for distance metric learning," IEEE Trans. Neural Netw. Learn. Syst., vol. 23, no. 8, pp. 1194–1205, Aug. 2012.
[8] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, 2nd ed. New York: Springer-Verlag, 2009.
[9] L. Devroye, L. Györfi, and G. Lugosi, A Probabilistic Theory of Pattern Recognition. New York: Springer-Verlag, 1996.
[10] D. Ernst, M. Glavic, F. Capitanescu, and L. Wehenkel, "Reinforcement learning versus model predictive control: A comparison on a power system problem," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 2, pp. 517–529, Apr. 2009.
[11] M. A. Jácome, I. Gijbels, and R. Cao, "Comparison of presmoothing methods in kernel density estimation under censoring," Comput. Statist., vol. 23, no. 3, pp. 381–406, 2008.
[12] D. Macciò and C. Cervellera, "Local models for data-driven learning of control policies for complex systems," Expert Syst. Appl., vol. 39, no. 18, pp. 13399–13408, 2012.
[13] D. Comaniciu, V. Ramesh, and P. Meer, "Kernel-based object tracking," IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 5, pp. 564–577, May 2003.
[14] H. Niederreiter, Random Number Generation and Quasi-Monte Carlo Methods. Philadelphia, PA: SIAM, 1992.
[15] K. T. Fang and Y. Wang, Number-Theoretic Methods in Statistics. London, U.K.: Chapman & Hall, 1994.
[16] C. Cervellera, A. Wen, and V. Chen, "Neural network and regression spline value function approximations for stochastic dynamic programming," Comput. Oper. Res., vol. 34, no. 1, pp. 70–90, 2007.
[17] C. Cervellera and D. Macciò, "A comparison of global and semi-local approximation in T-stage stochastic optimization," Eur. J. Oper. Res., vol. 208, pp. 109–118, Jan. 2011.
