
Deterministic Design for Neural Network Learning: An Approach Based on Discrepancy Cristiano Cervellera and Marco Muselli, Member, IEEE

Abstract— The general problem of reconstructing an unknown function from a finite collection of samples is considered, in the case where the position of each input vector in the training set is not fixed beforehand but is part of the learning process. In particular, the consistency of the Empirical Risk Minimization (ERM) principle is analyzed when the points in the input space are generated by employing a purely deterministic algorithm (deterministic learning). When the output generation is not subject to noise, classical number-theoretic results involving discrepancy and variation allow a sufficient condition for the consistency of the ERM principle to be established. In addition, the adoption of low-discrepancy sequences permits a learning rate of O(1/L) to be achieved, where L is the size of the training set. An extension to the noisy case is provided, showing that the good properties of deterministic learning are preserved if the level of noise at the output is not high. Simulation results confirm the validity of the proposed approach.

Index Terms— Deterministic learning, empirical risk minimization, discrepancy, variation, learning rate.

NOTATION

$X \subset \mathbb{R}^d$, $Y \subset \mathbb{R}$: input and output space
$x \in X$, $y \in Y$: input vector and scalar output
$x_i \in \mathbb{R}$: $i$th component of the input vector $x$
$g(x)$: unknown function to be estimated from the training set
$L$: number of points in the training set
$(x_l, y_l)$: $l$th example of the training set, with $l = 1, \ldots, L$
$x^L \in X^L$: collection of all the input vectors $x_l$ in the training set
$p(x)$, $q(x)$: probability density functions for the training and the test phase
$Q(x) \in \mathcal{Q}$: probability distribution for the test phase
$\Gamma$: family of neural network models
$\psi(x, \alpha)$: neural network model belonging to $\Gamma$
$\Lambda \subset \mathbb{R}^k$: parameter space for the family $\Gamma$
$\alpha \in \Lambda$: parameter vector of the neural network model $\psi(x, \alpha)$
$\Psi(L)$: deterministic function for the selection of $x^L$
$\ell(\cdot, \cdot)$: loss function
$R(\alpha)$: expected risk for the neural network $\psi(x, \alpha)$
$r(\alpha)$: difference between the risk $R(\alpha)$ and the best achievable risk $R(\alpha^*)$
$R_Q(\alpha)$: risk computed on the basis of the probability $Q$
$R_{emp}(\alpha, x^L)$: empirical risk computed on the sample $x^L$
$A_L(x^L, y^L)$: training algorithm for the neural network
$A_L^{(m)}(x^L, y^L)$: training algorithm with $m$ iterations
$\alpha_L^{(m)}$: parameter vector obtained by $A_L^{(m)}(x^L, y^L)$
$\lambda(B)$: Lebesgue measure of the set $B \subset \mathbb{R}^d$
$c_B(x)$: characteristic function of the set $B \subset \mathbb{R}^d$
$D_L(x^L)$, $D_L^*(x^L)$: discrepancy and star discrepancy of the sample $x^L$
$\Delta(\varphi, B)$: alternating sum of the function $\varphi$ at the vertices of the interval $B$
$V^{(d)}(\varphi)$: variation in the sense of Vitali of the function $\varphi$
$V_{HK}(\varphi)$: variation in the sense of Hardy and Krause of the function $\varphi$
$\partial_{i_1,\ldots,i_k}\varphi$: $k$th partial derivative of $\varphi(x)$ with respect to the components $x_{i_1}, \ldots, x_{i_k}$
$W_M(B)$: class of functions $\varphi$ such that $\partial_{i_1,\ldots,i_k}\varphi$ is continuous and bounded
$\eta \in \mathbb{R}$: random noise with zero mean

Cristiano Cervellera is with the Istituto di Studi sui Sistemi Intelligenti per l'Automazione, Consiglio Nazionale delle Ricerche, via De Marini, 6 - 16149 Genova, Italy. Marco Muselli is with the Istituto di Elettronica e di Ingegneria dell'Informazione e delle Telecomunicazioni, Consiglio Nazionale delle Ricerche, via De Marini, 6 - 16149 Genova, Italy.

I. INTRODUCTION

NEURAL networks are recognized to be universal approximators, i.e., structures that can approximate arbitrarily well general classes of functions [1], [2], [3], [4]. These theoretical properties, together with successful applications in a wide variety of real-world situations, make such nonlinear architectures excellent tools for the problem of learning an unknown functional dependence in high-dimensional contexts.

Once the set of neural networks for the approximation of a given unknown function is chosen, the problem of estimating the best network inside such a set (i.e., the network that is "closest" to the true function) from the available data (the training set) can be effectively analyzed in the context of learning theory. Most results in learning theory are based on a common statistical framework, which arose in the pattern recognition community [5], [6], [7], [8] and has been naturally extended to other inductive problems, like regression estimation [9], [10], [11] and probability density reconstruction [10]. In this framework, the available data are viewed as realizations of a random variable, generated according to an unknown (but fixed) probability density p. The learning problem is then equivalent to finding the optimal function g that minimizes a measure of the error committed when g is employed. In most cases this measure has the form of a suitably defined cost functional, called the expected risk, which depends on a specific density q assigning the probability of dealing with any point in the feasible space. Since the density q is usually unknown, the minimization cannot be performed directly; alternative methods that use the available data, such as Empirical Risk Minimization (ERM), must


then be employed to achieve the solution of the given learning problem. Consistency of these methods is guaranteed by proper theorems [7], [8], [10], [11], [12], whose hypotheses generally include the following two: 1) samples are generated by i.i.d. realizations of the unknown density p; 2) the densities p and q, used in the training and in the application phase, respectively, are equal.

The first requirement is rarely verified in real-world situations; nevertheless, the removal of the i.i.d. hypothesis limits the applicability of the available theoretical results [12], which are heavily based on Hoeffding's inequality [8], [13]. An attempt in this direction is described in [14], but its validity is restricted to nonlinear FIR models. The equality condition for p and q cannot be ensured in practice; one can only hope that the mechanism involved in obtaining the samples for the training phase remains almost unchanged when new data are generated. On the other hand, if p and q are radically different from each other, the indirect minimization of the expected risk can lead to poor results.

When the target function g represents a relation between an input space X and an output space Y, as in classification and regression problems, training samples are formed by input-output pairs (x_l, y_l), for l = 0, ..., L − 1, where x_l ∈ X and y_l is a possibly noisy evaluation of g(x_l). If the location of the input patterns x_l is not fixed beforehand but is part of the learning process, the term active learning (or query learning) is used in the literature. Most existing active learning methods use an optimization procedure for the generation of the input sample x_{l+1} on the basis of the information contained in previous training points [15], [16], [17], [18], which can lead to a heavy computational burden. In addition, strong assumptions on the observation noise y − g(x) (typically, noise with normal density [16], [19]) or on the class of learning models are introduced.

A discussion of different active learning techniques for training multilayer perceptrons is contained in [19]. There the observation noise is supposed to be Gaussian with zero mean and the probability q is supposed to be known. Furthermore, the true function must belong to the family of models (zero-error case), even if this assumption can be relaxed by suitable heuristics. It must be emphasized that the target of the methods described in [19], as for most of the active learning literature, is the minimization of the average generalization error, unlike classical Statistical Learning Theory (SLT) [6], [10], [12], which deals with the worst-case error.

In the present paper a first attempt in this direction is proposed, by proving consistency for the worst-case error under very general conditions:
1) the probability q, by which the samples are drawn after the training phase, can be unknown (distribution-free case);
2) if q is known, it can be suitably taken into account (distribution-dependent case);
3) the observation noise can be described by any probability distribution, provided that it does not depend on x

and its mean is zero¹;
4) the true function and the model can have any form; it is sufficient that they satisfy mild regularity conditions. It will be shown that popular neural network structures, such as multilayer perceptrons and radial basis function networks, can be successfully adopted.

Samples (x_0, y_0), ..., (x_l, y_l) of the training set are not taken into account for the selection of the next sample x_{l+1}. The same approach is also successfully adopted in other analogous situations, such as experimental design in statistics [20]. This can be seen as a lack of optimization, but on the other hand it results in a very simple and fast generation of the training set, thus avoiding the computational burden of other active learning techniques [16], [17].

In the distribution-free case we assume that no information about the behavior of q is available; thus, a uniform probability on the set X will be employed to evaluate the expected risk functional. This corresponds to supposing the extraction of each feasible input pattern to be equally probable. A proper theorem will show that, in the zero-error case, consistency of methods adopting a uniform probability measure implies consistency with any other bounded density q.

Number-theoretic results on integral approximation can then be employed to obtain upper bounds for the generalization error, which depend on the size of the training set. Characteristic quantities, such as variation and discrepancy, play a basic role in determining the positions of samples in the input space which guarantee consistency. In the case where the output is unaffected by noise, the corresponding learning rates can be significantly better than those obtained with the passive learning approach [8], [10], [12]. In fact, by employing special kinds of low-discrepancy sequences, an almost linear convergence can be achieved. It is important to note that the upper bounds on the generalization error are intrinsically deterministic and do not involve a confidence level for the probability of obtaining a consistent result.

In the case of noisy output, the consistency of the method can still be proved, though the stochastic nature of the noise may spoil the advantages of a deterministic design, resulting in a final quadratic rate of convergence of the estimation error (which, however, is not worse than classical SLT bounds [10]). Anyway, in many practical situations, the effect of the noise term on the estimation rate can reasonably be considered small.

Since the generation of samples for learning and the entire training process are intrinsically deterministic, the mathematical framework introduced in this paper will be named deterministic learning, to distinguish it from the widely accepted statistical learning.

The paper is organized as follows. In Section II the main distribution-free deterministic learning problem is presented. The ERM principle is introduced and its general consistency proved; bounds on the learning rates based on discrepancy are derived, together with the regularity conditions required by the

¹ This property is sometimes referred to as homoscedasticity in the statistics literature.


involved functions, with a focus on two popular neural network structures. In Section III the distribution-dependent context is discussed, while Section IV is devoted to the case of noisy output. Section V contains experimental results based on the deterministic learning of different test functions. Finally, the Appendix contains the proofs of all the theorems presented in the work.

II. THE DISTRIBUTION-FREE CASE

We want to estimate, inside our family of neural networks $\Gamma = \{\psi(x, \alpha) : \alpha \in \Lambda \subset \mathbb{R}^k\}$, the device (i.e., the parameter vector α) that best approximates a given functional dependence of the form y = g(x), where $x \in X \subset \mathbb{R}^d$ and $y \in Y \subset \mathbb{R}$, starting from a set of samples $(x^L, y^L) \in (X^L \times Y^L)$. Suppose at first that the output for a given input is observed without noise; the extension to the noisy case will be considered in Section IV.

In the following we will assume that X is the d-dimensional semi-closed unit cube $[0, 1)^d$. Suitable transformations can be employed to extend the results to other intervals of $\mathbb{R}^d$ or to more complex input spaces. In particular, it is possible to consider spheres and other compact convex domains such as simplexes [21]. If the input space X is not compact, by Ulam's theorem [22, p. 176] it is always possible to find a compact $K \subset X$ such that the probability measure of the difference $X \setminus K$ is smaller than any fixed positive value ε. Then the smallest interval I including K can be considered as the input space, by simply assigning null probability to the measurable set $I \setminus K$ and defining g(x) = 0 for $x \in I \setminus K$.

A proper deterministic algorithm will be considered to generate the sample of points $x^L \in X^L$, $x^L = \{x_0, \ldots, x_{L-1}\}$; since its behavior is fixed a priori, the obtained sequence $x^L$ is uniquely determined and is not the realization of some random variable. To this purpose, we introduce a function $\Psi : \mathbb{N} \mapsto \bigcup_{l=1}^{\infty} X^l$ which acts as a deterministic input sequence selector. Then, the particular sequence $x^L$ can be written as Ψ(L). Accordingly, $\Psi_l(L)$ denotes the single point $x_l$ of the sequence.

The goodness of the approximation is evaluated at any point of X by a loss function $\ell : Y^2 \mapsto \mathbb{R}$ that measures the difference between the function g and the output of the network. The risk functional R(α), which measures the difference between the true function and the model over X, is defined as
$$R(\alpha) = \int_X \ell(g(x), \psi(x, \alpha))\, dx \quad (1)$$

Then, the estimation problem can be stated as follows.

Problem E: Find $\alpha^* \in \Lambda$ such that
$$R(\alpha^*) = \min_{\alpha \in \Lambda} R(\alpha)$$

In defining Problem E we have supposed the existence of the minimum of R(α) over Λ; if this is not the case, the target of our problem can be to find $\alpha^* \in \Lambda$ such that $R(\alpha^*) < \inf_{\alpha \in \Lambda} R(\alpha) + \varepsilon$ for some fixed ε > 0.

The choice of the Lebesgue measure in (1) for evaluating the estimation performance is dictated by the assumptions that (i) the behavior of the function in the various regions of the input space is totally unknown, and (ii) we have no hint about the probability measure by which future input samples will be drawn. Under these hypotheses, it would be unreasonable to privilege certain regions of the input space over others. The use of the uniform measure, however, is not a limitation: in fact, if the risk R can be minimized up to any accuracy (as is the case for most neural network architectures), consistency of learning for the risk computed with the Lebesgue measure implies consistency for the risk computed with any other measure which is absolutely continuous with respect to the uniform one. The following subsection is devoted to this issue. Note that, since X is bounded, the Lebesgue measure can be seen as the probability density of a uniformly distributed random variable.

Since we know g only in correspondence of the points of the sequence $x^L$, we employ a learning algorithm $A_L : (X^L \times Y^L) \mapsto \Lambda$ aimed at minimizing R on the basis of the available data. We define $\alpha_L = A_L(x^L, y^L)$, where $y^L = \{y_0, \ldots, y_{L-1}\}$, by which we obtain the corresponding risk $R(\alpha_L)$. The term $r(\alpha_L)$ will denote the difference between the actual value of the risk $R(\alpha_L)$ and the best achievable risk $R(\alpha^*)$:
$$r(\alpha_L) = R(\alpha_L) - R(\alpha^*)$$

Definition 1: We say that the learning procedure is deterministically consistent if
$$\lim_{L \to \infty} r(\alpha_L) = 0$$

For a fixed family of neural networks, the rate of convergence of $r(\alpha_L)$ can be interpreted as a measure of efficiency of the deterministic input selector Ψ.

A. General consistency of the uniform estimation

Define the risk computed on the basis of a testing probability distribution Q(x) as
$$R_Q(\alpha_L) = \int_X \ell(g(x), \psi(x, \alpha_L))\, dQ(x), \quad Q \in \mathcal{Q}$$

where $\mathcal{Q}$ is a class of probability distributions and $\alpha_L$ is the parameter vector obtained from $x^L$ after the training process.

Theorem 1: Suppose the following two hypotheses are verified:
1) The learning algorithm is deterministically consistent and the risk can be minimized up to any desired accuracy (zero-error hypothesis), i.e., $\lim_{L \to \infty} R(\alpha_L) = 0$.
2) $\mathcal{Q}$ is a class of probability distributions endowed with density, i.e., for all $Q \in \mathcal{Q}$ there exists $q \in \Omega$ such that $dQ(x) = q(x)\, dx$. Furthermore, the set Ω of probability density functions is uniformly bounded, i.e., there exists $M \in \mathbb{R}$ such that
$$\sup_{q \in \Omega} \|q\|_\infty \le M$$

Then we have
$$\lim_{L \to \infty} \sup_{q \in \Omega} R_Q(\alpha_L) = 0$$

B. Empirical Risk Minimization

According to the classical SLT literature, we consider the minimization of R on the basis of the empirical risk given L observation samples
$$R_{emp}(\alpha, x^L) = \frac{1}{L} \sum_{l=0}^{L-1} \ell(y_l, \psi(x_l, \alpha))$$

In this case the learning algorithm $A_L$ is an optimization technique searching for the minimum of $R_{emp}(\alpha, x^L)$; denote by $\alpha_L^*$ the parameter vector obtained after L samples have been extracted and the minimization has been performed:
$$\alpha_L^* = \arg\min_{\alpha \in \Lambda} R_{emp}(\alpha, x^L)$$

Since optimization algorithms are generally iterative, we define $A_L^{(m)}$ as the learning algorithm obtained by taking the first m iterations of $A_L$, and consider $\{A_L^{(m)}\}_{m=1}^{\infty}$ as the full training algorithm. Accordingly, denote by $\alpha_L^{(m)} = A_L^{(m)}(x^L, y^L)$ the parameter vector produced by $A_L^{(m)}$.

Definition 2: We say that the learning algorithm $\{A_L^{(m)}\}_{m=1}^{\infty}$ is deterministically convergent if, for all ε > 0 and all L, there exists an index $\bar{m}$ such that, for $m \ge \bar{m}$, it is possible to obtain
$$R_{emp}(\alpha_L^{(m)}, x^L) - R_{emp}(\alpha_L^*, x^L) < \varepsilon \quad (2)$$
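To make the ERM step concrete, the following sketch trains a one-hidden-layer tanh network by plain gradient descent on the empirical quadratic risk. It is only an illustration of an iterative algorithm $A_L^{(m)}$: the network size, step size and iteration budget are arbitrary assumptions, not the setting of the experiments in Section V, which use Levenberg-Marquardt.

```python
import numpy as np

rng = np.random.default_rng(0)

def psi(x, W, b, c, c0):
    # One-hidden-layer network: psi(x) = sum_n c_n tanh(w_n . x + b_n) + c0
    return np.tanh(x @ W.T + b) @ c + c0

def train_erm(x, y, nu=20, lr=0.05, iters=5000):
    """Gradient descent on R_emp(alpha, x^L) = mean((y_l - psi(x_l))^2)."""
    L, d = x.shape
    W = rng.normal(0.0, 1.0, (nu, d))
    b = rng.normal(0.0, 1.0, nu)
    c = rng.normal(0.0, 0.1, nu)
    c0 = 0.0
    for _ in range(iters):
        h = np.tanh(x @ W.T + b)              # hidden activations, (L, nu)
        err = (h @ c + c0) - y                # residuals, (L,)
        gh = np.outer(err, c) * (1.0 - h**2)  # backprop through tanh
        # gradient steps for each parameter block (factor 1/2 absorbed in lr)
        W -= lr * (gh.T @ x) / L
        b -= lr * gh.mean(axis=0)
        c -= lr * (h.T @ err) / L
        c0 -= lr * err.mean()
    return W, b, c, c0
```

Stopping the loop once the decrease of $R_{emp}$ between successive iterations falls below a tolerance ε realizes the deterministic convergence of Definition 2.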

The next theorem gives sufficient conditions for a learning procedure to be deterministically consistent. It can be viewed as the deterministic-learning counterpart of the key theorem of classical SLT [10].

Theorem 2: Suppose the following two conditions are verified:
1) The deterministic sequence Ψ(L) of samples is such that
$$\lim_{L \to \infty} \sup_{\alpha \in \Lambda} |R_{emp}(\alpha, \Psi(L)) - R(\alpha)| = 0 \quad (3)$$
2) The learning algorithm $\{A_L^{(m)}\}_{m=1}^{\infty}$ is deterministically convergent.
Then the learning procedure is deterministically consistent.

C. Discrepancy-based learning rates

Since we are using the Lebesgue measure, which corresponds to a uniform distribution, to weight the loss function on the input space X, we must ensure a good spread of the points of the deterministic sequence $x^L = \Psi(L)$ over X. This is a basic requirement for the fulfillment of condition (3), which can be seen as the equivalent, for deterministic sequences, of the fundamental uniform convergence of empirical means property of SLT [23].

A measure of the spread of the points of $x^L$ is given by the discrepancy, widely employed in numerical analysis [21], [24] and probability [25]. Consider a sample $x^L \in X^L$. If $c_B$ is the characteristic function of a subset $B \subset X$ (i.e., $c_B(x) = 1$ if $x \in B$, $c_B(x) = 0$ otherwise), we define
$$C(B, x^L) = \sum_{l=0}^{L-1} c_B(x_l)$$
$C(B, x^L)$ indicates the number of points of $x^L$ which belong to B.

Definition 3: If β is the family of all the closed subintervals of X of the form $\prod_{i=1}^{d} [a_i, b_i]$, the discrepancy $D_L(x^L)$ is defined as
$$D_L(x^L) = \sup_{B \in \beta} \left| \frac{C(B, x^L)}{L} - \lambda(B) \right| \quad (4)$$

Definition 4: The star discrepancy $D_L^*(x^L)$ of the sequence $x^L$ is defined by (4), when β is the family of all the closed subintervals of X of the form $\prod_{i=1}^{d} [0, b_i]$.

A classic result [26] states that the following three properties are equivalent:
1) Ψ(L) is uniformly distributed in X, i.e., $\lim_{L \to \infty} \frac{1}{L} \sum_{l=0}^{L-1} c_B(\Psi_l(L)) = \lambda(B)$ for all the subintervals B of X;
2) $\lim_{L \to \infty} D_L(\Psi(L)) = 0$;
3) $\lim_{L \to \infty} D_L^*(\Psi(L)) = 0$.
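For small samples the star discrepancy can be estimated directly from Definition 4 by brute force, examining boxes $[0, b)$ whose upper corners are built from the sample coordinates. The sketch below does exactly this; it is a simple approximation (the handling of boundary ties between closed and half-open boxes is simplified, and the number of candidate corners grows as $L^d$, so it is only practical for small L and d).

```python
import numpy as np
from itertools import product

def star_discrepancy_bruteforce(x):
    """Approximate D*_L of the sample x (shape (L, d), points in [0,1)^d)
    by checking boxes [0, b) whose corners b take coordinate values from
    the sample itself (plus 1.0), where the supremum is attained."""
    L, d = x.shape
    cands = [np.unique(np.concatenate([x[:, i], [1.0]])) for i in range(d)]
    best = 0.0
    for corner in product(*cands):
        b = np.array(corner)
        vol = np.prod(b)                       # Lebesgue measure of [0, b)
        count = np.sum(np.all(x < b, axis=1))  # points falling in [0, b)
        best = max(best, abs(count / L - vol))
    return best
```

For a uniform i.i.d. sample this quantity decays like $L^{-1/2}$, while the low-discrepancy constructions of Subsection II-E below decay almost linearly in 1/L.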

The smaller the discrepancy or the star discrepancy, therefore, the more uniformly distributed is the sequence of points in the input domain.

For each vertex of a given subinterval $B = \prod_{i=1}^{d} [a_i, b_i]$ of X, we can define a binary label by assigning '0' to every $a_i$ and '1' to every $b_i$. For every function $\varphi : X \mapsto \mathbb{R}$ we define $\Delta(\varphi, B)$ as the alternating sum of φ computed at the vertices of B, i.e.,
$$\Delta(\varphi, B) = \sum_{x \in e_B} \varphi(x) - \sum_{x \in o_B} \varphi(x)$$
where $e_B$ is the set of vertices with an even number of '1's in their label, and $o_B$ is the set of vertices with an odd number of '1's.

Definition 5: The variation of φ on X in the sense of Vitali is defined by [24]
$$V^{(d)}(\varphi) = \sup_{\wp} \sum_{B \in \wp} |\Delta(\varphi, B)| \quad (5)$$
where ℘ is any partition of X into subintervals.

If the partial derivatives of φ are continuous on X, it is possible to write $V^{(d)}(\varphi)$ in an easier way as [24]
$$V^{(d)}(\varphi) = \int_0^1 \cdots \int_0^1 \left| \frac{\partial^d \varphi}{\partial x_1 \cdots \partial x_d} \right| dx_1 \cdots dx_d \quad (6)$$
where $x_i$ is the $i$th component of x.

The equivalence between (5) and (6) can be readily seen when the function φ is monotone increasing on $[0, 1]^d$. In this case, the supremum in (5) is reached when the partition ℘ contains only the whole interval $[0, 1]^d$. On the other hand, the $d$th derivative in (6) is always nonnegative, and a direct integration shows that the alternating sum $\Delta(\varphi, [0, 1]^d)$ results. A similar reasoning leads to the same conclusion when φ is monotone decreasing. For a general φ, the equivalence between (5) and (6) can be seen by partitioning the domain $[0, 1]^d$ into subintervals on each of which the restriction of φ is monotone.
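The alternating sum of Definition 5 is straightforward to evaluate numerically; a minimal sketch follows (the test function and the box are illustrative choices, not taken from the paper):

```python
import numpy as np
from itertools import product

def alternating_sum(phi, lo, hi):
    """Delta(phi, B) for the box B = prod_i [lo_i, hi_i]: sum of phi over
    vertices with an even number of upper endpoints ('1' labels) minus
    the sum over vertices with an odd number."""
    d = len(lo)
    total = 0.0
    for labels in product((0, 1), repeat=d):
        vertex = np.array([hi[i] if bit else lo[i]
                           for i, bit in enumerate(labels)])
        sign = 1.0 if sum(labels) % 2 == 0 else -1.0
        total += sign * phi(vertex)
    return total

# Example: phi(x) = x1 * x2 on [0,1]^2 gives 1*1 - 1*0 - 0*1 + 0*0 = 1
print(alternating_sum(lambda x: np.prod(x), [0.0, 0.0], [1.0, 1.0]))  # 1.0
```

Summing $|\Delta(\varphi, B)|$ over the boxes of a fine partition then gives a lower estimate of the Vitali variation (5).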

For $1 \le k \le d$ and $1 \le i_1 < i_2 < \cdots < i_k \le d$, let $V^{(k)}(\varphi; i_1, \ldots, i_k)$ be the variation in the sense of Vitali of the restriction of φ to the k-dimensional face $\{(x_1, \ldots, x_d) \in X : x_i = 1 \text{ for } i \ne i_1, \ldots, i_k\}$.

Definition 6: The variation of φ on X in the sense of Hardy and Krause is defined by [24]
$$V_{HK}(\varphi) = \sum_{k=1}^{d} \sum_{1 \le i_1 < \cdots < i_k \le d} V^{(k)}(\varphi; i_1, \ldots, i_k) \quad (7)$$

The link between discrepancy and variation is given by the following classical result.

Theorem 3 (Koksma-Hlawka inequality [24]): For every function φ with finite variation $V_{HK}(\varphi)$,
$$\left| \frac{1}{L} \sum_{l=0}^{L-1} \varphi(x_l) - \int_X \varphi(x)\, dx \right| \le V_{HK}(\varphi)\, D_L^*(x^L) \quad (8)$$

Consider then the following two assumptions.

Assumption 1: The deterministic input selector Ψ is such that
$$\lim_{L \to \infty} D_L^*(\Psi(L)) = 0$$

Assumption 2: The variation of the loss term is uniformly bounded over the parameter set, i.e.,
$$\bar{V} = \sup_{\alpha \in \Lambda} V_{HK}(\ell(g(\cdot), \psi(\cdot, \alpha))) < \infty$$

Theorem 4: If Assumptions 1 and 2 hold, then
$$\sup_{\alpha \in \Lambda} |R_{emp}(\alpha, \Psi(L)) - R(\alpha)| \le \bar{V} D_L^*(\Psi(L))$$
and condition (3) of Theorem 2 is satisfied.

Assumption 1 involves only the generation of the input sequence, and not the training procedure (as defined in Theorem 2); it corresponds to ensuring a sufficient spread of the points in Ψ(L). Furthermore, the rate of convergence of the estimation error is directly related to the rate of convergence of the star discrepancy of the sequence Ψ(L). It is interesting to note that, due to the structure of the KH inequality, this issue can be dealt with separately from the complexity of the neural network. This last issue, in fact, is ruled by the value of the variation of the loss function (which plays the role that the VC-dimension has in the SLT literature).

D. Regularity conditions

In order for Assumption 2 to be true, the loss function ℓ, the family of neural networks Γ and the unknown function g must satisfy suitable regularity conditions. For a generic function $\phi(z) : A \subset \mathbb{R}^p \mapsto \mathbb{R}$ we introduce, for $1 \le k \le p$ and $1 \le i_1 \le i_2 \le \ldots \le i_k \le p$, the notation
$$\partial_{i_1,\ldots,i_k}\phi \triangleq \frac{\partial^k \phi}{\partial z_{i_1} \cdots \partial z_{i_k}}$$
and denote by $W_M(A)$ the class of functions φ such that every $\partial_{i_1,\ldots,i_k}\phi$ is continuous and bounded on A.

Lemma 1: Suppose that
1) the functions $\mu_j : X \mapsto \mathbb{R}$, $j = 1, \ldots, s$, belong to $W_M(X)$;
2) the function $g : A \subset \mathbb{R}^s \mapsto \mathbb{R}$ belongs to $W_{M'}(A)$.
Then there exists $M'' \in \mathbb{R}$ such that the composite function $g(\mu_1(\cdot), \ldots, \mu_s(\cdot))$ belongs to $W_{M''}(X)$; in particular, its variation in the sense of Hardy and Krause is finite.

The direct computation of $V_{HK}$ is illustrated by the following examples.

Consider the function $\varphi_2(x) = \prod_{i=1}^{d} x_i$. Here every term $V^{(k)}(\varphi_2; i_1, \ldots, i_k)$ is equal to 1, so the variation is
$$V_{HK}(\varphi_2) = \sum_{k=1}^{d} \sum_{1 \le i_1 < \cdots < i_k \le d} 1 = \sum_{k=1}^{d} \binom{d}{k} = 2^d - 1$$

Consider now the function $\varphi_3(x) = \prod_{i=1}^{d} (1 - x_i)$. In this case the restriction of $\varphi_3$ to the k-dimensional face $\{(x_1, \ldots, x_d) \in X : x_i = 1 \text{ for } i \ne i_1, \ldots, i_k\}$ is equal to 0 for each k < d. Therefore only the term $V^{(d)}$ survives, and the variation is
$$V_{HK}(\varphi_3) = \int_0^1 \cdots \int_0^1 \left| \frac{\partial^d \varphi_3}{\partial x_1 \cdots \partial x_d} \right| dx_1 \cdots dx_d = 1$$

In general, computing the variation of a function can be a very difficult task (a detailed discussion of this kind of variation can be found in [28], [29], [30]). Furthermore, the two examples above show how similar functions (identical up to a reflection of the domain) can give completely different behaviors. However, in the case of "well-behaved" functions (having continuous derivatives), the computation of upper bounds for the variation can be much simpler. This is the case for most commonly used neural networks. As examples, we compute upper bounds for the variation of feedforward neural networks and radial basis function networks.

1) One-hidden-layer feedforward networks: these are nonlinear mappings of the form
$$\psi(x, \alpha) = \sum_{n=1}^{\nu} c_n\, h\!\left(\sum_{i=1}^{d} a_{ni} x_i + b_n\right) + c_0 \quad (9)$$
where h is the (sigmoidal) activation function and $\alpha = [a_1^\top, \ldots, a_\nu^\top, b_1, \ldots, b_\nu, c_0, \ldots, c_\nu]^\top$. The computation of $\partial_{i_1,\ldots,i_k}\psi$ yields
$$\partial_{i_1,\ldots,i_k}\psi = \sum_{n=1}^{\nu} c_n\, h^{(k)}\!\left(\sum_{i=1}^{d} a_{ni} x_i + b_n\right) \prod_{j=1}^{k} a_{n i_j}$$
where $h^{(k)}(z)$, $z \in \mathbb{R}$, is the $k$th derivative of h. If we define $\bar{h}^{(k)} \triangleq \sup_{z \in \mathbb{R}} |h^{(k)}(z)|$ (supposed to be finite), we obtain
$$|\partial_{i_1,\ldots,i_k}\psi| \le \bar{h}^{(k)} \sum_{n=1}^{\nu} |c_n| \prod_{j=1}^{k} |a_{n i_j}|$$
If the parameters $c_n$ and $a_{ni}$ are bounded for each $n = 1, \ldots, \nu$, the finiteness of $\sup_{\alpha \in \Lambda} V_{HK}(\psi)$ directly follows.

2) Radial basis function networks: consider Gaussian radial basis functions
$$\psi(x, \alpha) = \sum_{n=1}^{\nu} c_n \exp\!\left(-\frac{\sum_{i=1}^{d} (x_i - \tau_{ni})^2}{2 s_n^2}\right)$$
with $\alpha = [s_1, \ldots, s_\nu, c_1, \ldots, c_\nu, \tau_1^\top, \ldots, \tau_\nu^\top]^\top$. The computation of $\partial_{i_1,\ldots,i_k}\psi$ yields
$$\partial_{i_1,\ldots,i_k}\psi = \sum_{n=1}^{\nu} (-s_n^{-2})^k c_n \exp\!\left(-\frac{\sum_{i=1}^{d} (x_i - \tau_{ni})^2}{2 s_n^2}\right) \prod_{1 \le j \le k} (x_{i_j} - \tau_{n i_j})$$
Thus we have
$$|\partial_{i_1,\ldots,i_k}\psi| \le \sum_{n=1}^{\nu} |s_n^{-2k}| |c_n| \prod_{1 \le j \le k} |x_{i_j} - \tau_{n i_j}|$$
If the parameters $c_n$ and $\tau_{n i_j}$ are bounded for each $n = 1, \ldots, \nu$, the finiteness of $\sup_{\alpha \in \Lambda} V_{HK}(\psi)$ directly follows. If $X \equiv [0, 1)^d$, usually $\tau_{ni} \in [0, 1)$; in this case we have
$$|\partial_{i_1,\ldots,i_k}\psi| \le \sum_{n=1}^{\nu} |s_n^{-2k}| |c_n|$$
As far as $s_n$ is concerned, we can notice that the upper bound on the variation increases as $s_n$ gets smaller, i.e., as the basis functions "shrink". This seems consistent with classic learning theory, since it might be related to the well-known phenomenon of overfitting [31].
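Collecting the terms of (7), the feedforward bound can be organized through elementary symmetric polynomials, $V_{HK}(\psi) \le \sum_n |c_n| \sum_{k=1}^{d} \bar{h}^{(k)} e_k(|a_{n1}|, \ldots, |a_{nd}|)$, which avoids enumerating all $2^d - 1$ faces explicitly. A sketch of this computation follows; the array of derivative bounds $\bar{h}^{(k)}$ is assumed to be supplied by the user (e.g., $\bar{h}^{(1)} = 1$ for the hyperbolic tangent).

```python
import numpy as np

def elem_sym(vals):
    """Elementary symmetric polynomials e_1..e_d of vals, read off from
    the coefficients of prod_i (1 + vals[i] * t)."""
    coeffs = np.array([1.0])
    for v in vals:
        coeffs = np.convolve(coeffs, [1.0, v])
    return coeffs[1:]                 # e_1, ..., e_d (coeffs[0] = e_0 = 1)

def vhk_upper_bound(A, c, h_bar):
    """Upper bound on V_HK(psi) for psi = sum_n c_n h(a_n . x + b_n) + c0.
    A: (nu, d) input weights; c: (nu,) output weights;
    h_bar: (d,) array with h_bar[k-1] >= sup |h^(k)| (user-supplied)."""
    bound = 0.0
    for a_n, c_n in zip(np.abs(A), np.abs(c)):
        e = elem_sym(a_n)             # e_k(|a_n1|, ..., |a_nd|)
        bound += c_n * np.sum(h_bar * e)
    return bound
```

The sketch makes explicit how the variation bound, and hence the constant in the learning rate of Subsection II-E, scales with the magnitude of the network weights.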


E. Bounds on the rate of convergence of the learning procedure

Since the rate of convergence of the learning procedure can be controlled by the rate of convergence of the star discrepancy of the sequence Ψ(L), in this section we present a special family of deterministic sequences which turns out to yield an almost linear convergence of the estimation error. Such sequences are usually referred to as low-discrepancy sequences, and are commonly employed in quasi-random (or quasi-Monte Carlo) integration methods. A detailed discussion of these methods, together with the construction of low-discrepancy sequences, can be found in [24].

Definition 7: An elementary interval in base b (where b ≥ 2 is an integer) is a subinterval E of X of the form
$$E = \prod_{i=1}^{d} [a_i b^{-p_i}, (a_i + 1) b^{-p_i})$$
where $a_i, p_i \in \mathbb{Z}$, $p_i > 0$, $0 \le a_i < b^{p_i}$ for $1 \le i \le d$.

Definition 8: Let t, m be two integers satisfying $0 \le t \le m$. A (t, m, d)-net in base b is a set P of $b^m$ points in $X \subset \mathbb{R}^d$ such that $C(E; P) = b^t$ for every elementary interval E in base b with $\lambda(E) = b^{t-m}$.

If the sample $x^L$ is a (t, m, d)-net in base b, every elementary interval of measure $b^{t-m}$ in which we divide X must contain $b^t$ points of $x^L$. It is clear that this is a property of good "uniform spread" for our sample in X. Note that the cardinality of a (t, m, d)-net is constrained to be equal to $b^m$; in order to have a higher degree of freedom in choosing the sample size L, (t, d)-sequences are introduced.

Definition 9: Let t ≥ 0 be an integer. A sequence $\{x_0, \ldots, x_{L-1}\}$ of points in $X \subset \mathbb{R}^d$ is a (t, d)-sequence in base b if, for all integers k ≥ 0 and m ≥ t (with $(k+1) b^m \le L$), the point set $\{x_{k b^m}, \ldots, x_{(k+1) b^m - 1}\}$ is a (t, m, d)-net in base b.

The definitions of (t, m, d)-nets and (t, d)-sequences are due to Sobol' [32] in the case b = 2 and to Niederreiter [33] in the general case. It is possible to prove the following result [24].

Theorem 5: For every dimension d ≥ 1 and every prime power v (i.e., an integer power of a prime number), there exists a (T(v, d), d)-sequence in base v, where T(v, d) is a suitable constant that depends only on v and d.

Explicit non-asymptotic bounds for the star discrepancy of a (t, d)-sequence in base b can be given. Furthermore, it can be shown that the asymptotic behavior is better than that obtained with classic Monte Carlo methods [34]. In particular, suppose that b is a prime power v. By Theorem 5 there exists a (T(v, d), d)-sequence $x^L$, where T(v, d) can be determined from v and d. For this sequence we have [24]
$$L\, D_L^*(x^L) \le C(d, v)\, v^{T(v,d)} (\log L)^d + O((\log L)^{d-1})$$
We can then perform an optimization in the family of sequences with different v, by choosing
$$C_d = \min_{v} C(d, v)\, v^{T(v,d)} \quad (10)$$
where the minimum is taken over all the prime powers v. $C_d$ can be found as the minimum over a finite set [24], so it is possible to tabulate the values of $C_d$ for every dimension d, along with the corresponding optimal prime power v. It can be shown [24] that asymptotically we have
$$C_d < \frac{1}{d!} \left( \frac{d}{\log(2d)} \right)^d$$
This means that $C_d \to 0$ superexponentially as $d \to \infty$. In other words, we can say that our (T(v, d), d)-sequence $x^L$, obtained after the minimization in (10), satisfies
$$D_L^*(x^L) \le O\!\left(\frac{(\log L)^{d-1}}{L}\right) \quad (11)$$
Consequently, (t, d)-sequences in base b satisfy Assumption 1 with an almost linear rate of convergence. As far as the learning problem is concerned, if we use the "optimized" (T(v, d), d)-sequence, we obtain from Theorem 4 and equation (11) the following result.

Corollary 1: Let $x^L$ be a (T(v, d), d)-sequence optimized as in (10), and $\bar{V} = \sup_{\alpha \in \Lambda} V_{HK}(\ell) < \infty$. Then we have
$$\sup_{\alpha \in \Lambda} |R_{emp}(\alpha, x^L) - R(\alpha)| \le O\!\left(\frac{\bar{V} (\log L)^{d-1}}{L}\right) \quad (12)$$

If we compare the bound in (12) with classic results from SLT [10], we can see that, for a fixed dimension d, the use of deterministic low-discrepancy sequences permits a faster asymptotic convergence. Specifically, if we ignore logarithmic factors, we have a rate of O(1/L) for a (T(v, d), d)-sequence, and a rate of $O(1/L^{1/2})$ for a random extraction of points.
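The Niederreiter sequences used in Section V require generator matrices; as a simpler self-contained illustration of a low-discrepancy construction, the classical Halton sequence (radical inverse in pairwise coprime bases, also achieving $D_L^* = O((\log L)^d / L)$) can be generated as follows. The choice of prime bases and the range of usable dimensions are assumptions of this sketch.

```python
import numpy as np

def van_der_corput(n, base):
    """Radical inverse of the integer n in the given base:
    reflect the base-b digits of n about the radix point."""
    q, denom = 0.0, 1.0
    while n:
        n, rem = divmod(n, base)
        denom *= base
        q += rem / denom
    return q

def halton(L, d, primes=(2, 3, 5, 7, 11, 13)):
    """First L points of the d-dimensional Halton sequence in [0,1)^d."""
    assert d <= len(primes), "extend the list of primes for higher d"
    return np.array([[van_der_corput(n, primes[i]) for i in range(d)]
                     for n in range(1, L + 1)])

pts = halton(1000, 4)   # a low-discrepancy design for X = [0,1)^4
```

Replacing the Halton construction with an optimized (T(v, d), d)-sequence only changes the generator; the way the sample enters the ERM procedure is unaffected.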

III. THE DISTRIBUTION-DEPENDENT CONTEXT

Consider the problem of minimizing the risk functional when we have complete information about the probability measure Q by which the samples will be drawn in the application phase. In this case, the learning problem aims at finding
$$\min_{\alpha \in \Lambda} R_Q(\alpha) = \int_X \ell(g(x), \psi(x, \alpha))\, dQ(x)$$
where Q(x) is a known probability distribution. This problem can still be solved by employing deterministic sequences generated by a uniform probability measure, if the following method is adopted.


Consider the following assumptions.

Assumption 3: The probability distribution Q admits a continuous density q.

Assumption 4: q belongs to $W_M(X)$ and ℓ belongs to $W_M(Y^2)$.

Then we can write the risk $R_Q$ as
$$R_Q(\alpha) = \int_X \ell(g(x), \psi(x, \alpha))\, q(x)\, dx = \int_X \ell'(x, \alpha)\, dx$$
We can generate a deterministic low-discrepancy sequence $x^L = \{x_1, \ldots, x_L\}$ as described in the previous sections, and minimize the following empirical risk
$$R_{emp,Q}(\alpha) = \frac{1}{L} \sum_{l=1}^{L} \ell'(x_l, \alpha)$$
where $\ell'(x_l, \alpha) = \ell(g(x_l), \psi(x_l, \alpha))\, q(x_l)$. By Assumption 4 and Lemma 1, we can prove that there exists $M' \in \mathbb{R}$ such that ℓ' belongs to $W_{M'}(X)$. In this way, by minimizing $R_{emp,Q}(\alpha)$ we minimize $R_Q(\alpha)$, and the deterministic consistency of the learning problem is guaranteed.
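Operationally, the only change with respect to the distribution-free case is that each loss term is weighted by the known density at the sample point; a minimal sketch, in which psi and q are placeholder callables for the trained network and the test density:

```python
import numpy as np

def weighted_empirical_risk(x, y, psi, q):
    """R_emp,Q over a uniformly spread deterministic sample x:
    quadratic loss weighted by the known test density q, so that the
    average approximates the integral of l'(x, alpha) over [0,1)^d."""
    return np.mean((y - psi(x)) ** 2 * q(x))
```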

IV. CONSISTENCY IN CASE OF NOISY OUTPUT

Suppose that the value of the output y, for a given input x, is affected by a random noise $\eta \in E \subset \mathbb{R}$:
$$y = g(x) + \eta$$
Then, a random term $\eta_l$ is given in correspondence of any sample input $x_l$. We make the following hypotheses.

Assumption 5:
1) The vectors $\eta_l$ are i.i.d. according to a probability measure $P_\eta$ with density $p_\eta$ and have zero mean;
2) The random vectors $\eta_l$ are independent of $x_l$ for $l = 0, \ldots, L-1$;
3) The loss function ℓ is quadratic (a typical choice in regression and classification problems).

Then we can prove the following.

Corollary 2: Suppose Assumptions 5.1 - 5.3 hold. Then we have
$$\sup_{\alpha \in \Lambda} |R_{emp}(\alpha, x^L) - R(\alpha)| \le \sup_{\alpha \in \Lambda} \left| \frac{1}{L} \sum_{l=0}^{L-1} (g(x_l) - \psi(x_l, \alpha))^2 - \int_X (g(x) - \psi(x, \alpha))^2 dx \right| + 2 \left( \frac{1}{L} \sum_{l=0}^{L-1} |\eta_l| \sup_{\alpha \in \Lambda} |g(x_l) - \psi(x_l, \alpha)| \right) + \left| \frac{1}{L} \sum_{l=0}^{L-1} \eta_l^2 - \int_E \eta^2 p_\eta(\eta)\, d\eta \right| \quad (13)$$

The first summand in (13) converges, as described in the previous sections, with a rate of O(1/L). Then, Hoeffding's inequality [8], [13] yields a rate of convergence of order $O(1/\sqrt{L})$ for both the second and the third term, which do not depend on α.

Consequently, the consistency of the learning algorithm is preserved. However, the presence of the noise spoils the linear rate of estimation for the "deterministic" part of the output, resulting in a global quadratic rate of convergence (which is not worse than classic SLT rates). This is reasonable, since we can expect to fully exploit the advantageous properties of the quasi-random approach only in a purely deterministic context. Nevertheless, if the output error is small, by applying the Bernstein-Chernoff bounds [35] to the last two terms on the right-hand side of (13) we obtain again a rate of convergence that is almost linear in 1/L.

V. EXPERIMENTAL RESULTS

In order to present experimental results on the use of low-discrepancy sequences, three different test functions taken from [36] have been considered and approximated by means of one-hidden-layer feedforward networks with sigmoidal activation function, i.e., nonlinear mappings of the form (9), where h(·) is the hyperbolic tangent
$$h(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}$$

For each function, low-discrepancy sequences (LDS) and training sets formed by i.i.d. samples randomly extracted with uniform probability (URS) have been compared. The LDS are based on Niederreiter sequences with different prime power bases, which benefit from the almost linear convergence property described in Subsection II-E; a detailed description of such sequences can be found in [24], while the software implementation has been taken from [37].

The empirical risk $R_{emp}(\alpha, x^L)$, computed by using a quadratic loss function,
$$R_{emp}(\alpha, x^L) = \frac{1}{L} \sum_{l=0}^{L-1} (g(x_l) - \psi(x_l, \alpha))^2$$
has been minimized through the Levenberg-Marquardt algorithm [38], up to the same level of accuracy for each function. The generalization error has been estimated by computing the square root of the mean square error (RMS) over a set of points obtained by a uniform discretization of the components of the input space.

Function 1) (Highly oscillatory behavior)
$$g(x) = \sin(a \cdot b), \quad x \in [0, 1]^4$$
where $a = e^{2x_1} \sin(\pi x_4)$ and $b = e^{2x_2} \sin(\pi x_3)$.

Six training sets with length L = 3000, three of which random and three based on Niederreiter's low-discrepancy sequences, have been used to build a neural network with ν = 50 hidden units. In order to construct a learning curve showing the improvement achieved by increasing the number of training samples, the results obtained with subsets containing L = 500, 1000, 1500, 2000 and 2500 points of the basic sequences are also presented. The chosen level of accuracy for the minimization of the empirical risk is 0.14 for any size of the training set.

Table I contains the RMS for the various sequences, computed over a fixed uniform grid of 15^4 = 50625 points. In the table, "UR-n" means "uniform random sequence number n", while "LD-q" means "low-discrepancy sequence in base q". In Table II, the average values of the RMS for the two kinds of sampling are presented.

TABLE I
FUNCTION 1: RMS FOR THE 36 DISCRETIZATIONS.

                        Sample size L
Training Set    500     1000    1500    2000    2500    3000
UR-1            0.390   0.304   0.254   0.231   0.201   0.191
UR-2            0.338   0.287   0.223   0.208   0.195   0.194
UR-3            0.371   0.313   0.293   0.267   0.263   0.212
LD-3            0.354   0.313   0.266   0.238   0.211   0.195
LD-5            0.305   0.272   0.224   0.201   0.193   0.184
LD-7            0.310   0.288   0.262   0.244   0.222   0.200

TABLE II
FUNCTION 1: AVERAGE RMS FOR RANDOM AND LOW-DISCREPANCY SEQUENCES.

                        Sample size L
Training Set    500     1000    1500    2000    2500    3000
URS             0.366   0.302   0.256   0.235   0.220   0.199
LDS             0.323   0.291   0.251   0.228   0.209   0.193

Function 2) (Multiplicative)
$$g(x) = 4\left(x_1 - \tfrac{1}{2}\right)\left(x_4 - \tfrac{1}{2}\right) \sin\!\left(2\pi \sqrt{x_2^2 + x_3^2}\right), \quad x \in [0, 1]^4$$

For this function, the same network with ν = 50 hidden units and the same input vectors contained in the 36 training sets used for function 1 were employed. The empirical risk was minimized up to the accuracy level of 10^-4. Table III contains the RMS (in 10^-3 units) for the different sequences, computed over the same uniform grid of 15^4 = 50625 points used for function 1. Table IV contains the average values of the RMS (in 10^-3 units) for the two kinds of discretizations.

TABLE III
FUNCTION 2: RMS FOR THE 36 DISCRETIZATIONS.

                        Sample size L
Training Set    500     1000    1500    2000    2500    3000
UR-1            0.509   0.586   0.289   0.249   0.209   0.135
UR-2            0.716   0.435   0.258   0.190   0.156   0.141
UR-3            0.748   0.457   0.266   0.198   0.171   0.167
LD-3            0.397   0.340   0.258   0.145   0.131   0.116
LD-5            0.351   0.318   0.207   0.147   0.131   0.124
LD-7            0.457   0.349   0.198   0.180   0.147   0.131

TABLE IV
FUNCTION 2: AVERAGE RMS FOR RANDOM AND LOW-DISCREPANCY SEQUENCES.

                        Sample size L
Training Set    500     1000    1500    2000    2500    3000
URS             0.657   0.493   0.271   0.213   0.179   0.157
LDS             0.402   0.336   0.221   0.157   0.136   0.124

Function 3) (Additive)
$$g(x) = 10 \sin(\pi x_1 x_2) + 20\left(x_3 - \tfrac{1}{2}\right)^2 + 10 x_4 + 5 x_5 + x_6, \quad x \in [0, 1]^6$$

36 new training sets with L = 3000 points were employed for this six-dimensional function. Again, 18 of them are based on a random extraction with uniform probability and the other 18 are based on low-discrepancy sequences. For each set, the same network with ν = 40 hidden units was trained by minimizing the empirical risk up to the accuracy level of 10^-4. The RMS values are computed over a uniform grid of 6^6 = 46656 points. In Table V the RMS (in 10^-3 units) for the different sequences are presented, while Table VI contains the average values of the RMS (in 10^-3 units) for the two kinds of sampling.

TABLE V
FUNCTION 3: RMS FOR THE 36 DISCRETIZATIONS.

                        Sample size L
Training Set    500     1000    1500    2000    2500    3000
UR-1            0.156   0.114   0.094   0.087   0.086   0.085
UR-2            0.111   0.106   0.090   0.089   0.087   0.084
UR-3            0.155   0.104   0.094   0.084   0.084   0.081
LD-8            0.095   0.085   0.075   0.070   0.068   0.066
LD-9            0.147   0.131   0.096   0.085   0.081   0.077
LD-11           0.148   0.119   0.093   0.084   0.081   0.077

TABLE VI
FUNCTION 3: AVERAGE RMS FOR RANDOM AND LOW-DISCREPANCY SEQUENCES.

                        Sample size L
Training Set    500     1000    1500    2000    2500    3000
URS             0.140   0.108   0.093   0.087   0.086   0.083
LDS             0.130   0.112   0.088   0.080   0.077   0.073

The results obtained for the three functions, each having different behavior and complexity, show that LDS outperform URS. In fact, in all cases but one (function 3, L = 1000) the average RMS values given by LDS are smaller. Furthermore, the best single RMS is always given by an LDS. Finally, the advantage of LDS with respect to URS becomes more evident as the size L increases, confirming the good asymptotic properties of this particular kind of deterministic sequences.
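For reference, the three test functions and the grid-based RMS evaluation can be coded as follows. Note that the exact reading of Function 1 (the placement of the exponentials and sines) is an assumption forced by the source formatting, and psi stands for any trained network returning predictions for a batch of inputs.

```python
import numpy as np

def g1(x):
    # Function 1 (highly oscillatory): g = sin(a * b), with
    # a = e^{2 x1} sin(pi x4), b = e^{2 x2} sin(pi x3) -- assumed reading.
    a = np.exp(2 * x[:, 0]) * np.sin(np.pi * x[:, 3])
    b = np.exp(2 * x[:, 1]) * np.sin(np.pi * x[:, 2])
    return np.sin(a * b)

def g2(x):
    # Function 2 (multiplicative)
    return (4 * (x[:, 0] - 0.5) * (x[:, 3] - 0.5)
            * np.sin(2 * np.pi * np.sqrt(x[:, 1] ** 2 + x[:, 2] ** 2)))

def g3(x):
    # Function 3 (additive, Friedman-type)
    return (10 * np.sin(np.pi * x[:, 0] * x[:, 1])
            + 20 * (x[:, 2] - 0.5) ** 2
            + 10 * x[:, 3] + 5 * x[:, 4] + x[:, 5])

def rms_on_grid(g, psi, d, pts_per_dim):
    # RMS generalization error over a uniform grid, as in Tables I-VI
    # (pts_per_dim = 15 with d = 4 gives the 50625-point grid).
    axes = [np.linspace(0.0, 1.0, pts_per_dim)] * d
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), -1).reshape(-1, d)
    return np.sqrt(np.mean((g(grid) - psi(grid)) ** 2))
```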


VI. CONCLUSIONS

The context of deterministic learning has been introduced for the problem of estimating an unknown functional dependence when the training set is generated by a fixed deterministic algorithm. Sequences based on the notion of discrepancy have been proposed for the generation of input/output samples. The consistency of the learning process has been proved under mild regularity conditions on the functions involved. In particular, the concept of variation has been discussed, showing two case studies on widely used neural network structures. In the absence of noise on the output observations, the generalization error is proved to converge almost linearly in the size of the training set, suggesting that the method is particularly suitable for high-dimensional contexts. In the experimental tests presented, training sets based on low-discrepancy deterministic sequences outperform randomly generated ones, confirming in practice the good theoretical properties of the method.

APPENDIX

Proof of Theorem 1: If hypotheses 1) and 2) of Theorem 1 hold, we can choose, for a given ε > 0, an $\bar{L}$ such that, for all $L > \bar{L}$,
$$R(\alpha_L) \le \frac{\varepsilon}{M}.$$
By hypothesis 2) we can write $R_Q(\alpha_L)$ in the following form:
$$R_Q(\alpha_L) = \int_X \ell(g(x), \psi(x, \alpha_L))\, q(x)\, dx \le M \int_X \ell(g(x), \psi(x, \alpha_L))\, dx = M R(\alpha_L) \le M \frac{\varepsilon}{M} = \varepsilon$$

Proof of Theorem 2: If condition (3) is verified, we can choose an $\bar{L} = \bar{L}(\varepsilon)$ such that, for ε > 0 and for all $L \ge \bar{L}$,
$$R(\alpha_L^{(m)}) \le R_{emp}(\alpha_L^{(m)}, \Psi(L)) + \frac{\varepsilon}{3}$$
and
$$R_{emp}(\alpha^*, \Psi(L)) \le R(\alpha^*) + \frac{\varepsilon}{3}$$
Condition (2) states that, after sufficient training, we can obtain a parameter vector $\alpha_L^{(m)}$ such that
$$R_{emp}(\alpha_L^{(m)}, \Psi(L)) \le R_{emp}(\alpha_L^*, \Psi(L)) + \frac{\varepsilon}{3}$$
By definition of $\alpha_L^*$ we have
$$R_{emp}(\alpha_L^*, \Psi(L)) \le R_{emp}(\alpha^*, \Psi(L))$$
By combining the previous inequalities we obtain the result.

Proof of Theorem 4: By considering the definitions of $R_{emp}(\alpha, \Psi(L))$ and R(α), the KH inequality (8) can be written as
$$|R_{emp}(\alpha, \Psi(L)) - R(\alpha)| \le V_{HK}(\alpha)\, D_L^*(\Psi(L))$$
By Assumptions 1 and 2 we have $\lim_{L \to \infty} |R_{emp}(\alpha, \Psi(L)) - R(\alpha)| = 0$ for each $\alpha \in \Lambda$. For any fixed α, $V_{HK}(\alpha)$ is a constant that depends on the loss function, on the approximating model and on the function we want to approximate, but not on the number of samples L. Therefore, if we consider
$$\bar{V} = \sup_{\alpha \in \Lambda} V_{HK}(\alpha)$$
we can write
$$|R_{emp}(\alpha, \Psi(L)) - R(\alpha)| \le \bar{V} D_L^*(\Psi(L))$$
Then, for each ε > 0, there exists an $\hat{L}$, independent of α, such that $|R_{emp}(\alpha, \Psi(L)) - R(\alpha)| < \varepsilon$ for every $L > \hat{L}$ and every $\alpha \in \Lambda$.

Proof of Lemma 1: We recall that, for a generic function $\phi(z) : A \subset \mathbb{R}^p \mapsto \mathbb{R}$, $\partial_{i_1,\ldots,i_k}\phi$ is defined, for $1 \le k \le p$ and $1 \le i_1 \le i_2 \le \ldots \le i_k \le p$, as
$$\partial_{i_1,\ldots,i_k}\phi \triangleq \frac{\partial^k \phi}{\partial z_{i_1} \cdots \partial z_{i_k}}$$
By (7) it follows that the total variation $V_{HK}(g)$ is finite if every term $V^{(k)}(g; i_1, \ldots, i_k)$, for $1 \le k \le d$, is finite, where $\xi_i = 1$ for $i \ne i_1, \ldots, i_k$. To prove this fact we resort to equation (6) and write
$$\hat{V}^{(k)}(g; i_1, \ldots, i_k) = \int_0^1 \cdots \int_0^1 \left| \frac{\partial^k g(\mu_1(\xi_1, \ldots, \xi_d), \ldots, \mu_s(\xi_1, \ldots, \xi_d))}{\partial \xi_{i_1} \cdots \partial \xi_{i_k}} \right| d\xi_{i_1} \cdots d\xi_{i_k}$$
The term $\hat{V}^{(k)}(g; i_1, \ldots, i_k)$ corresponds to the actual variation in the sense of Vitali $V^{(k)}(g; i_1, \ldots, i_k)$, provided that the partial derivative under integration is continuous on $[0, 1)^d$. In particular, we have
$$\hat{V}^{(k)}(g; i_1, \ldots, i_k) \le \sup_{\xi_{i_1}, \ldots, \xi_{i_k}} \left| \frac{\partial^k g(\mu_1(\xi_1, \ldots, \xi_d), \ldots, \mu_s(\xi_1, \ldots, \xi_d))}{\partial \xi_{i_1} \cdots \partial \xi_{i_k}} \right|$$
For a generic $\xi_p$, $i_1 \le p \le i_k$, we can compute
$$\frac{\partial g}{\partial \xi_p} = \sum_{j=1}^{s} \frac{\partial g}{\partial \mu_j} \frac{\partial \mu_j}{\partial \xi_p}$$
The second partial derivative with respect to another generic $\xi_q$, $q \ne p$, yields
$$\frac{\partial^2 g}{\partial \xi_p \partial \xi_q} = \sum_{j=1}^{s} \left( \left[ \sum_{n=1}^{s} \frac{\partial^2 g}{\partial \mu_j \partial \mu_n} \frac{\partial \mu_n}{\partial \xi_q} \right] \frac{\partial \mu_j}{\partial \xi_p} + \frac{\partial g}{\partial \mu_j} \frac{\partial^2 \mu_j}{\partial \xi_p \partial \xi_q} \right)$$
In general, when differentiating $\frac{\partial^r g}{\partial \xi_{i_1} \ldots \partial \xi_{i_r}}$ for the computation of the higher-order derivatives $\frac{\partial^{r+1} g}{\partial \xi_{i_1} \ldots \partial \xi_{i_{r+1}}}$, $2 < r \le k - 1$, we have a combination through sums and products of several terms with the two following structures:
1) $\frac{\partial^j g}{\partial \mu_{i_1} \ldots \partial \mu_{i_j}}$, for $1 \le j \le r$. Each of these terms, once differentiated with respect to $\xi_{i_{r+1}}$, generates a term of the form
$$\sum_{m=1}^{s} \frac{\partial^{j+1} g}{\partial \mu_{i_1} \cdots \partial \mu_{i_j} \partial \mu_m} \frac{\partial \mu_m}{\partial \xi_{i_{r+1}}}$$
2) $\frac{\partial^j \mu_p}{\partial \xi_{i_1} \ldots \partial \xi_{i_j}}$, for $1 \le j \le r$ and $1 \le p \le s$. Each of these terms, once differentiated with respect to $\xi_{i_{r+1}}$, generates a term corresponding to
$$\frac{\partial^{j+1} \mu_p}{\partial \xi_{i_1} \ldots \partial \xi_{i_j} \partial \xi_{i_{r+1}}}$$
As 1) corresponds to $\partial_{i_1,\ldots,i_j} g$ and 2) to $\partial_{i_1,\ldots,i_j} \mu_p$, it follows easily from conditions 1 and 2 that
$$\frac{\partial^k g(\mu_1(\xi_1, \ldots, \xi_d), \ldots, \mu_s(\xi_1, \ldots, \xi_d))}{\partial \xi_{i_1} \cdots \partial \xi_{i_k}}$$
is continuous on $[0, 1)^d$. Furthermore, by recalling that, for every $a, b \in \mathbb{R}$, $|a + b| \le |a| + |b|$ and $|ab| = |a||b|$, we have
