A Deterministic Learning Approach Based on Discrepancy

Cristiano Cervellera (1) and Marco Muselli (2)

(1) Istituto di Studi sui Sistemi Intelligenti per l'Automazione - CNR
via De Marini, 6 - 16149 Genova, Italy
[email protected]

(2) Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni - CNR
via De Marini, 6 - 16149 Genova, Italy
[email protected]

Abstract. The general problem of reconstructing an unknown function from a finite collection of samples is considered, in the case where the position of each input vector in the training set is not fixed beforehand but is part of the learning process. In particular, the consistency of the Empirical Risk Minimization (ERM) principle is analyzed when the points in the input space are generated by a purely deterministic algorithm (deterministic learning). When the output generation is not affected by noise, classical number-theoretic results involving discrepancy and variation allow us to establish a sufficient condition for the consistency of the ERM principle. In addition, the adoption of low-discrepancy sequences makes it possible to achieve a learning rate of O(1/L), where L is the size of the training set. An extension to the noisy case is discussed.

1 Introduction

Neural networks are recognized to be universal approximators, i.e., structures that can approximate arbitrarily well general classes of functions [1, 2]. Once the set of neural networks for the approximation of a given unknown function is chosen, the problem of estimating the best network inside such a set (i.e., the network that is "closest" to the true function) from the available data (the training set) can be effectively analyzed in the context of Statistical Learning Theory (SLT). Consistency of the classical Empirical Risk Minimization (ERM) method is guaranteed by proper theorems [3], whose hypotheses generally require that the samples be generated by i.i.d. realizations of a random variable having an unknown density p. When the target function f represents a relation between an input space X and an output space Y, as in classification and regression problems, training samples are formed by input-output pairs (x_l, y_l), for l = 0, . . . , L − 1, where x_l ∈ X and y_l is a possibly noisy evaluation of f(x_l). If the location of the input

patterns x_l is not fixed beforehand, but is part of the learning process, the term active learning (or query learning) is used in the literature. Most existing active learning methods use an optimization procedure to generate the input sample x_{l+1} on the basis of the information contained in the previous training points [4], which can lead to a heavy computational burden. In addition, strong assumptions are typically introduced on the observation noise y − f(x) (e.g., normally distributed noise [5]) or on the class of learning models. In the present paper a deterministic learning framework is discussed, which ensures consistency for the worst-case error under very general conditions. In particular, the observation noise can be described by any probability distribution, provided that it does not depend on x and has zero mean, and the true function and the model can have any form (it is sufficient that they satisfy mild regularity conditions). Number-theoretic results on integral approximation can be employed to obtain upper bounds for the generalization error, which depend on the size of the training set. In the case where the output is unaffected by noise, the corresponding learning rates can be almost linear, better than those obtained with the passive learning approach [3]. Furthermore, they are intrinsically deterministic and do not involve a confidence level. In the case of noisy output, the consistency of the method can still be proven, though the stochastic nature of the noise spoils the advantages of a deterministic design, resulting in a quadratic rate of convergence of the estimation error (which, however, is not worse than classical SLT bounds [3]). Since the generation of samples for learning and the entire training process are intrinsically deterministic, the formal approach introduced in this paper will be named deterministic learning, to distinguish it from the widely accepted statistical learning.

2 The deterministic learning problem

Inside a family of neural networks Γ = {ψ(x, α) : α ∈ Λ ⊂ R^k} we want to estimate the device (i.e., the parameter vector α) that best approximates a given functional dependence of the form y = g(x), where x ∈ X ⊂ R^d and y ∈ Y ⊂ R, starting from a set of samples (x^L, y^L) ∈ (X^L × Y^L). Suppose that the output for a given input is observed without noise; the extension to the "noisy" case is discussed in Section 5. In the following we will assume that X is the d-dimensional semi-closed unit cube [0, 1)^d. Suitable transformations can be employed in order to extend the results to other intervals of R^d or more complex input spaces. A proper deterministic algorithm will be considered to generate the sample of points x^L ∈ X^L, x^L = {x_0, . . . , x_{L−1}}; since its behavior is fixed a priori, the obtained sequence x^L is uniquely determined and is not the realization of some random variable.

The goodness of the approximation is evaluated at any point of X by a loss function ℓ : (Y × Y) → R that measures the difference between the function g and the output of the network. The risk functional R(α), which measures the difference between the true function and the model over X, is defined as

   R(α) = ∫_X ℓ(g(x), ψ(x, α)) dx    (1)
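To make the risk functional (1) concrete, here is a minimal numerical sketch. The target, model and loss are our own toy choices, not the paper's: g(x) = x on X = [0, 1), the linear "network" ψ(x, α) = αx, and a quadratic loss, for which R(α) = (1 − α)²/3 in closed form.

```python
import numpy as np

# Toy instance of the risk functional (1) (illustrative, not from the paper):
# g(x) = x, psi(x, a) = a*x, quadratic loss, so that
# R(a) = integral_0^1 (x - a*x)^2 dx = (1 - a)^2 / 3, minimized at a = 1.

def risk_closed_form(a):
    return (1.0 - a) ** 2 / 3.0

def risk_numeric(a, n=100_000):
    # Approximate the integral over [0, 1) with a midpoint rule.
    x = (np.arange(n) + 0.5) / n
    return np.mean((x - a * x) ** 2)

print(risk_numeric(0.5))   # close to (1 - 0.5)^2 / 3 = 1/12
```

The quadrature here is only for checking the closed form; the point of the paper is precisely to replace such dense grids with much smaller deterministic samples.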

Then, the estimation problem can be stated as

Problem E. Find α* ∈ Λ such that R(α*) = min_{α∈Λ} R(α).

The choice of the Lebesgue measure in (1) for evaluating the estimation performance is not a limitation: in fact, under mild hypotheses, the consistency of learning for the risk computed with the Lebesgue measure implies consistency also for the risk computed with any other measure which is absolutely continuous with respect to the uniform one [6]. Since we know g only in correspondence with the points of the sequence x^L, following classical SLT literature we consider the minimization of R on the basis of the empirical risk given L observation samples

   R_emp(α, L) = (1/L) Σ_{l=0}^{L−1} ℓ(y_l, ψ(x_l, α))
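As a sketch of computing R_emp on a deterministically generated input sequence (again with our own toy target g(x) = x and model ψ(x, α) = αx, which are not the paper's), using the van der Corput sequence in base 2, a standard low-discrepancy sequence of the kind discussed in Section 3.1:

```python
import numpy as np

def psi(x, a):                 # toy linear model psi(x, alpha) = alpha * x
    return a * x

def empirical_risk(a, x, y):   # R_emp(alpha, L) with quadratic loss
    return np.mean((y - psi(x, a)) ** 2)

def van_der_corput(L, base=2):
    # Deterministic low-discrepancy sequence on [0, 1): the l-th point is
    # obtained by mirroring the base-b digits of l about the radix point.
    seq = np.empty(L)
    for l in range(L):
        v, denom, n = 0.0, 1.0, l
        while n > 0:
            n, digit = divmod(n, base)
            denom *= base
            v += digit / denom
        seq[l] = v
    return seq

x = van_der_corput(1024)
y = x                          # noiseless observations of g(x) = x
print(empirical_risk(0.5, x, y))   # close to the true risk R(0.5) = 1/12
```

With these 1024 deterministic points the empirical risk already matches the true risk to about three decimal places, which anticipates the convergence result of Theorem 1.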

In order to obtain a minimum point for the actual risk R(α), we adopt an optimization algorithm Π_L aimed at finding R*_emp(L) = min_{α∈Λ} R_emp(α, L). α_L is the parameter vector obtained after L samples have been extracted and the minimization has been performed. r(α_L) will denote the difference between the actual value of the risk and the best achievable risk: r(α_L) = R(α_L) − R(α*).

Definition 1. We say that the learning procedure is deterministically consistent if r(α_L) → 0 as L → ∞.

The next theorem gives sufficient conditions for a sequence x^L to be deterministically consistent. It can be viewed as the deterministic-learning counterpart of the key theorem [3] of classical SLT. The proof can be found in [6].

Theorem 1. Suppose the following two conditions are verified:

1. The deterministic sequence x^L is such that

   sup_{α∈Λ} |R_emp(α, L) − R(α)| → 0 as L → ∞.    (2)

2. The learning algorithm Π_L is deterministically convergent, i.e., for all ε > 0 and all L, it is possible to obtain R_emp(α_L, L) − R*_emp(L) < ε after a "sufficient" training.

Then the learning procedure is deterministically consistent (i.e., r(α_L) → 0 as L → ∞).
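Condition 1 of Theorem 1 can be illustrated numerically. The following sketch (our own, not from the paper) uses the Weyl sequence x_l = frac(l·φ), with φ the golden ratio, which is a classical deterministic equidistributed sequence; the setting is the toy one above (g(x) = x, ψ(x, α) = αx, quadratic loss), with the parameter held fixed at α = 0.5 rather than taking the supremum over Λ, so R(α) = (1 − α)²/3 = 1/12.

```python
import math

PHI = (math.sqrt(5) - 1) / 2   # fractional part of the golden ratio

def gap(L, a=0.5):
    # |R_emp(a, L) - R(a)| on the deterministic Weyl sequence frac(l * PHI)
    emp = sum((x - a * x) ** 2
              for x in (math.fmod(l * PHI, 1.0) for l in range(L))) / L
    return abs(emp - (1 - a) ** 2 / 3)

for L in (100, 1000, 10000):
    print(L, gap(L))   # the gap shrinks roughly like log(L)/L
```

No randomness is involved anywhere: rerunning the script reproduces the same numbers exactly, which is the sense in which the convergence in (2) is deterministic and carries no confidence level.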

3 Discrepancy-based learning rates

Since we are using the Lebesgue measure, which corresponds to a uniform distribution, to weight the loss function on the input space X, we must ensure a good spread of the points of the deterministic sequence x^L over X. A measure of the spread of the points of x^L is given by the star discrepancy, widely employed in numerical analysis [7] and probability [8]. Consider a multisample x^L ∈ X^L. If c_B is the characteristic function of a subset B ⊂ X (i.e., c_B(x) = 1 if x ∈ B, c_B(x) = 0 otherwise), we define C(B, x^L) = Σ_{l=0}^{L−1} c_B(x_l).

Definition 2. If β is the family of all closed subintervals of X of the form Π_{i=1}^{d} [0, b_i], the star discrepancy D*_L(x^L) is defined as

   D*_L(x^L) = sup_{B∈β} | C(B, x^L)/L − λ(B) |    (3)

where λ indicates the Lebesgue measure.

The smaller the star discrepancy, the more uniformly distributed the sequence of points in the space. By employing the Koksma-Hlawka inequality [9], it is possible to prove that the rate of (uniform) convergence of the empirical risk to the true risk is closely related to the rate of convergence of the star discrepancy of x^L. In particular, it is possible to prove [6] that

   sup_{α∈Λ} |R_emp(α, L) − R(α)| ≤ V D*_L(x^L)    (4)

where V is a bound on the variation in the sense of Hardy and Krause [10] of the loss function. This parameter, which takes into account the regularity of the involved functions, plays the role that the VC dimension has in the SLT literature. It is interesting to note that the structure of (4) makes it possible to deal separately with the issues of model complexity and sample complexity.

3.1 Low-discrepancy sequences

Since the rate of convergence of the learning procedure can be controlled by the rate of convergence of the star discrepancy of the sequence x^L, in this section we present a special family of deterministic sequences which turns out to yield an almost linear convergence of the estimation error. Such sequences are usually referred to as low-discrepancy sequences, and are commonly employed in quasi-random (or quasi-Monte Carlo) integration methods (see [7] for a survey). An elementary interval in base b (where b ≥ 2 is an integer) is a subinterval E of X of the form E = Π_{i=1}^{d} [a_i b^{−p_i}, (a_i + 1) b^{−p_i}), where a_i, p_i ∈ Z, p_i > 0, 0 ≤ a_i < b^{p_i} for 1 ≤ i ≤ d.

Definition 3. Let t, m be two integers verifying 0 ≤ t ≤ m. A (t, m, d)-net in base b is a set P of b^m points in X such that C(E, P) = b^t for every elementary interval E in base b with λ(E) = b^{t−m}.

If the multisample x^L is a (t, m, d)-net in base b, every elementary interval of measure b^{t−m} into which we divide X must contain exactly b^t points of x^L. It is clear that this is a property of good "uniform spread" for our multisample in X.

Definition 4. Let t ≥ 0 be an integer. A sequence {x_0, . . . , x_{L−1}} of points in X is a (t, d)-sequence in base b if, for all integers k ≥ 0 and m ≥ t (with (k + 1)b^m ≤ L − 1), the point set {x_{kb^m}, . . . , x_{(k+1)b^m − 1}} is a (t, m, d)-net in base b.

Explicit non-asymptotic bounds for the star discrepancy of a (t, d)-sequence in base b can be given. As far as the asymptotic behaviour is concerned, special optimized sequences can be obtained that yield a convergence rate for D*_L(x^L) of order O((log L)^{d−1}/L). Consequently, for the learning problem, by employing (t, d)-sequences we obtain

   sup_{α∈Λ} |R_emp(α, L) − R(α)| ≤ O( V (log L)^{d−1} / L )    (5)

If we compare the bound in (5) with classic results from SLT [3], we can see that, for a fixed dimension d, the use of deterministic low-discrepancy sequences permits a faster asymptotic convergence. Specifically, if we ignore logarithmic factors, we have a rate of O(1/L) for a (T(v, d), d)-sequence, and a rate of O(1/L^{1/2}) for a random extraction of points.
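The contrast between the O((log L)^{d−1}/L) and O(1/L^{1/2}) rates can be observed directly in dimension d = 1, where the star discrepancy admits the exact formula D*_L = max_i max(i/L − x_(i), x_(i) − (i−1)/L) over the sorted points x_(1) ≤ . . . ≤ x_(L). The following sketch (our own, not from the paper) compares the van der Corput sequence in base 2, which is a (0, 1)-sequence in base 2, against pseudo-random points:

```python
import numpy as np

def star_discrepancy_1d(points):
    # Exact 1-d star discrepancy via the standard sorted-points formula.
    x = np.sort(np.asarray(points, dtype=float))
    L = len(x)
    i = np.arange(1, L + 1)
    return max(np.max(i / L - x), np.max(x - (i - 1) / L))

def van_der_corput(L, base=2):
    # l-th point: base-b digits of l mirrored about the radix point.
    out = []
    for l in range(L):
        v, denom, n = 0.0, 1.0, l
        while n > 0:
            n, d = divmod(n, base)
            denom *= base
            v += d / denom
        out.append(v)
    return out

L = 1024
rng = np.random.default_rng(0)
d_lds = star_discrepancy_1d(van_der_corput(L))   # about 1/1024, i.e. O(1/L)
d_rnd = star_discrepancy_1d(rng.random(L))       # markedly larger, O(1/sqrt(L))
print(d_lds, d_rnd)
```

The first 2^m van der Corput points in base 2 are exactly the grid {k/2^m}, so their star discrepancy is 1/2^m, while a uniform random sample of the same size typically lands more than an order of magnitude higher.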

4 Experimental results

In order to present experimental results on the use of low-discrepancy sequences, three different test functions taken from [11] have been considered and approximated by means of one-hidden-layer feedforward neural networks with sigmoidal activation function. For each function, training sets based on low-discrepancy sequences (LDS) and training sets formed by i.i.d. samples randomly extracted with uniform probability (URS) have been compared. The LDS are Niederreiter sequences with different prime-power bases, which enjoy the almost linear convergence property described in Subsection 3.1. The empirical risk R_emp, computed using a quadratic loss function, has been minimized up to the same level of accuracy for each function. The generalization error has been estimated by computing the root mean square (RMS) error over a set of points obtained by a uniform discretization of the input space.

Function 1) (Highly oscillatory behavior)

   g(x) = sin(a · b),  where a = e^{2x_1 sin(πx_4)}, b = e^{2x_2 sin(πx_3)},  x ∈ [0, 1]^4

Thirty-six training sets with length L = 3000, eighteen of which random and eighteen based on Niederreiter's low-discrepancy sequences, have been used to build a neural network with ν = 50 hidden units. In order to construct a learning curve showing the improvement achieved by increasing the number of training samples, the results obtained with subsets containing L = 500, 1000, 1500, 2000 and 2500 points of the basic sequences are also presented. Table 1 contains the average RMS for the two kinds of sequences, computed over a fixed uniform grid of 15^4 = 50625 points.

Training        Sample size L
Set        500     1000    1500    2000    2500    3000
URS        0.366   0.302   0.256   0.235   0.220   0.199
LDS        0.323   0.291   0.251   0.228   0.209   0.193

Table 1. Function 1: Average RMS for random and low-discrepancy sequences.

Function 2) (Multiplicative)

   g(x) = 4 (x_1 − 1/2)(x_4 − 1/2) sin(2π √(x_2² + x_3²)),  x ∈ [0, 1]^4

For this function, the same network with ν = 50 hidden units and the same input vectors contained in the 36 training sets used for function 1 were employed. Table 2 contains the average RMS (in 10^{−3} units) for the different sequences, computed over the same uniform grid of 15^4 = 50625 points used for function 1.

Training        Sample size L
Set        500     1000    1500    2000    2500    3000
URS        0.657   0.493   0.271   0.213   0.179   0.157
LDS        0.402   0.336   0.221   0.157   0.136   0.124

Table 2. Function 2: Average RMS for random and low-discrepancy sequences.

Function 3) (Additive)

   g(x) = 10 sin(πx_1 x_2) + 20 (x_3 − 1/2)² + 10x_4 + 5x_5 + x_6,  x ∈ [0, 1]^6

36 new training sets with L = 3000 points were employed for this six-dimensional function. Again, 18 of them are based on a random extraction with uniform probability and the other 18 are based on low-discrepancy sequences. For each set, the same network with ν = 40 hidden units was trained by minimizing the empirical risk. The RMS errors are computed over a uniform grid of 6^6 = 46656 points. Table 3 contains the average values of the RMS (in 10^{−3} units) for the two kinds of sampling.

Training        Sample size L
Set        500     1000    1500    2000    2500    3000
URS        0.140   0.108   0.093   0.087   0.086   0.083
LDS        0.130   0.112   0.088   0.080   0.077   0.073

Table 3. Function 3: Average RMS for random and low-discrepancy sequences.
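For reference, the three test functions can be transcribed directly in code. This sketch reproduces only the target functions as given in the text (the reading of the exponents in Function 1 follows the formulas above); the network training pipeline itself is not reproduced.

```python
import math

def g1(x):
    # Function 1: highly oscillatory behavior, x in [0, 1]^4
    a = math.exp(2 * x[0] * math.sin(math.pi * x[3]))
    b = math.exp(2 * x[1] * math.sin(math.pi * x[2]))
    return math.sin(a * b)

def g2(x):
    # Function 2: multiplicative, x in [0, 1]^4
    return (4 * (x[0] - 0.5) * (x[3] - 0.5)
            * math.sin(2 * math.pi * math.sqrt(x[1] ** 2 + x[2] ** 2)))

def g3(x):
    # Function 3: additive, x in [0, 1]^6
    return (10 * math.sin(math.pi * x[0] * x[1]) + 20 * (x[2] - 0.5) ** 2
            + 10 * x[3] + 5 * x[4] + x[5])

print(g3([0.5] * 6))   # 10*sin(pi/4) + 0 + 5 + 2.5 + 0.5
```

Evaluating these functions on a uniform grid, as done for the RMS columns of Tables 1-3, is then a straightforward loop over the grid points.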

5 Comments

About the variation. In order for the bound in (5) to hold, the loss function ℓ must satisfy suitable regularity conditions. In particular, its variation in the sense of Hardy and Krause must be finite. It can be shown [6] that this implies finiteness of the variation for each element of the family of neural networks Γ and for the unknown function g. In general, computing the variation of a function can be a very difficult task [10]. However, in the case of "well-behaved" functions (having continuous derivatives), the computation of upper bounds for the variation can be much simpler. For this reason, it is possible to prove [6] that commonly used neural networks, such as feedforward neural networks and radial basis functions, satisfy the required regularity conditions.

Consistency in case of noisy output. Suppose that the value of the output y, for a given input x, is affected by a random noise ε ∈ E ⊂