Efficient Learning of Continuous Neural Networks

Pascal Koiran*
DIMACS, Rutgers University
New Brunswick, NJ 08901
e-mail: [email protected] .edu
Abstract
We describe an efficient algorithm for learning from examples a class of feedforward neural networks with real inputs and outputs in a real-valued generalization of the Probably Approximately Correct (PAC) model. These networks can approximate an arbitrary function with arbitrary precision. The learning algorithm can accommodate a fairly general worst-case noise model. The main improvement over previous work is that the running time of the algorithm grows only polynomially as the size of the target network increases (there is still an exponential dependence on the dimension of the input space, however). The main computational tool is an iterative "loading" algorithm which adds new hidden units to the hypothesis network sequentially. This avoids the difficult problem of optimizing the weights of all units simultaneously.

1 INTRODUCTION

The agnostic learning framework of [9] is very general, but this generality has a drawback: the hypothesis produced by a learning algorithm is no longer guaranteed to generalize correctly on future instances of a problem. It is only guaranteed to perform nearly as well as the best hypothesis in a certain "touchstone" class. Moreover, the efficient learning algorithm proposed in [9] applies only to neural nets of fixed size and fixed input dimension.

In this paper, we describe an efficient algorithm for learning in fixed dimension a class $C$ of two-layer feedforward neural networks. This class is precisely defined in section 2. The hypothesis produced by the algorithm is a neural net of the same class which is guaranteed to generalize correctly on future examples, as in the traditional PAC model. Our framework for learning is a natural generalization of the PAC model to the continuous setting. A learning algorithm is allowed to query examples from an oracle $O$. Each request is of the form $(X, Y) := O(p)$, where $p \in \mathbb{N}$ is a precision parameter. For each call, the oracle draws a random point $X'$ from a fixed but unknown probability distribution $D$ on $\mathbb{R}^d$. It returns in constant time an example $(X, Y)$ such that $|Y - f(X')| \le 2^{-p}$ and $\|X - X'\|_\infty = \max_{1 \le i \le d} |X_i - X'_i| \le 2^{-p}$. We also assume that $D$ has bounded support: there exists $T > 0$ such that $D(z) = 0$ for every $z$ with $\|z\|_\infty > 2^{T}$. The traditional PAC model does not make any assumption at all about $D$, but we believe that this restriction is of little practical significance.
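To make the oracle protocol concrete, here is a minimal sketch in Python; the names `ExampleOracle`, `sample_x` and `target` are hypothetical, and the uniform perturbations of size $2^{-p}$ are only one admissible way to realize the guarantee above.

```python
import random

class ExampleOracle:
    """Minimal sketch of the example oracle O described above (illustrative only)."""

    def __init__(self, sample_x, target):
        self.sample_x = sample_x   # draws X' from the unknown distribution D
        self.target = target       # the unknown target function f

    def query(self, p):
        eps = 2.0 ** (-p)                        # precision guaranteed for this call
        x_true = self.sample_x()                 # X' ~ D, never revealed to the learner
        # The learner only sees (X, Y), each within 2^-p of the true values.
        x_noisy = [xi + random.uniform(-eps, eps) for xi in x_true]
        y_noisy = self.target(x_true) + random.uniform(-eps, eps)
        return x_noisy, y_noisy

# Example: D uniform on [-1, 1]^2, target f(x) = x_1 - x_2, precision p = 10.
oracle = ExampleOracle(lambda: [random.uniform(-1.0, 1.0) for _ in range(2)],
                       lambda x: x[0] - x[1])
X, Y = oracle.query(10)
```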
The goal of the learner is to produce from these examples a hypothesis function $F \in C$ which is "close" to the target. (Except in section 5, the target function is also in $C$.) Closeness is measured by the expected error $d(f, F)$ such that
$$d(f, F)^2 = \int_{\mathbb{R}^d} L(f(x), F(x))\, dD(x).$$
The loss function $L(u, v)$ introduced in equation (1) is a measure of the relative difference between $u$ and $v$. More precisely, the learner tries to find $F$ such that $d(f, F) \le \varepsilon$, with probability at least $1 - \delta$.

Several of the tools we use for the specific case of neural networks are from Maass [9]. They are all reviewed in the next section. Some important computational tools are also due to Maass [10].
2 APPROXIMATION-THEORETIC AND STATISTICAL TOOLS

Theorem 1 provides a sequence of approximations $f_n$, $n = 1, 2, \ldots$; its proof rests on the following averaging argument. Write $f^* = \sum_{k=1}^{p} \gamma_k g_k$ with $\gamma_k \ge 0$, $\sum_{k=1}^{p} \gamma_k = 1$ and $g_k \in G$, for some sufficiently large $p$. The average value of the inner product $\langle F - f, g - f \rangle$ for $g \in \{g_1, \ldots, g_p\}$ is
$$\sum_{k=1}^{p} \gamma_k \langle F - f, g_k - f \rangle = \langle F - f, f^* - f \rangle \le (\eta + \delta)\,\|F - f\|.$$
The average value of $\|g - f\|^2$ is
$$\sum_{k=1}^{p} \gamma_k \|g_k - f\|^2 = \sum_{k=1}^{p} \gamma_k \|g_k - f^*\|^2 + \|f^* - f\|^2 = \sum_{k=1}^{p} \gamma_k \|g_k\|^2 - \|f^*\|^2 + \|f^* - f\|^2 \le B^2 + (\eta + \delta)^2.$$
As in the proof of Theorem 1, $\langle\cdot,\cdot\rangle$ denotes the inner product. Here are a few useful notations: $C_h$ denotes the class of all such networks with $h$ hidden units. Given two weight bounds $B, B' > 0$, $C_h(B, B')$ denotes the corresponding subclass of networks with weights bounded by $B$ and $B'$.

We consider only distributions on $\mathbb{R}^d \times \mathbb{R}$ that are induced by a distribution $D$ on $\mathbb{R}^d$ and an underlying target function $t : \mathbb{R}^d \to \mathbb{R}$, i.e., for any $S \subseteq \mathbb{R}^d \times \mathbb{R}$ the induced distribution assigns to $S$ the probability
$$D(\{x \in \mathbb{R}^d ;\ (x, t(x)) \in S\}).$$
For the induced distribution,
$$\mathbb{E}_{(x,y)}\, L(F(x), y) = \int_{\mathbb{R}^d} L(F(x), t(x))\, dD(x).$$
The loss function also satisfies $|L(z, y) - L(z', y)| \le c\,|z - z'|$; it can be shown that $c = 2$ is a suitable value, but this approach will not be further discussed here.
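For concreteness, a network in $C_h$ can be represented and evaluated as in the following sketch. The saturating piecewise-linear activation and the particular weight bounds are assumptions made for illustration, mirroring the hidden units $g(x) = a\,\phi(\langle w, x\rangle - \theta)$ used in section 3.

```python
def phi(t):
    # Saturating piecewise-linear activation (an illustrative assumption).
    return max(0.0, min(1.0, t))

def eval_net(units, x):
    """Evaluate F(x) = sum_k a_k * phi(<w_k, x> - theta_k) for a net in C_h.

    `units` is a list of h triples (a_k, w_k, theta_k), mirroring the hidden
    units g(x) = a * phi(<w, x> - theta) used in section 3.
    """
    return sum(a * phi(sum(wi * xi for wi, xi in zip(w, x)) - theta)
               for a, w, theta in units)

# A net in C_2 on R^2 with |a_k| <= B = 1 and ||w_k||_inf <= B' = 2.
F = [(1.0, [2.0, 0.0], 0.5), (-0.5, [0.0, 1.5], -0.2)]
value = eval_net(F, [0.3, 0.7])
```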
3 STATISTICAL ASPECTS

This section presents a learning algorithm for our class $C$ of neural networks. The claimed polynomial time bound follows from the existence of a similar bound for a certain loading problem. This problem is analyzed in the next section.

It is well known that successfully loading some examples on a network does not guarantee a correct generalization. The number of examples must be related to the power of the network architecture. In the boolean setting, the concept of Vapnik-Chervonenkis dimension makes a rigorous analysis of this phenomenon possible. The pseudo-dimension is a generalization of this concept to the continuous setting due to David Haussler. He proved the following key result ([5], Corollary 2): for a class $\mathcal{F}$ of functions of bounded pseudo-dimension¹, if samples $x_1, \ldots, x_m$ are independently drawn from $A$, then with probability at least $1 - \delta$, the empirical average of $f$ over the sample is close to its expectation $\mathbb{E} f$ for every $f \in \mathcal{F}$ ($\mathbb{E}$ denotes the mathematical expectation).

¹ $\mathcal{F}$ must satisfy a benign measure-theoretic condition called permissibility. This condition is satisfied for most natural classes of functions, and in particular for the classes considered in this paper.

In this paper, we take $X = \mathbb{R}^d \times \mathbb{R}$ and let $\mathcal{F}$ be the class of functions of the form $f(x, y) = L(F(x), y)$, where $F \in C_h$ and $L$ is the loss function (1). Although this case is not explicitly covered in [9], it is not difficult to see that the same bound holds: the pseudo-dimension of this class is $O(d^2 h^2)$.
Theorem 4 There exists an algorithm $L$ learning from examples all functions of the form (5). Assume that the target function $t$ is in $C(B_0, B'_0)$. When the dimension $d$ is fixed, the running time of $L$ is polynomial in $B_0$, $B'_0$ and in the parameters $1/\varepsilon$, $1/\delta$ and $T$ defined in section 1. Its sample complexity is polynomial in these parameters and also in $d$. This is also true for the precision $p$ required in the oracle calls; these quantities are actually independent of $T$.
Let $G$ be the set of functions of the form $g(x) = a\,\phi(\langle w, x\rangle - \theta)$ with $|a| \le B$ and $\|w\|_\infty \le B'$. Given $F \in C_h(B, B')$ and $\eta > 0$, we seek a new neural net $F'$ of the form $F' = \alpha F + \varepsilon g$, where $\alpha = (n-1)/n$ and $\varepsilon = 1 - \alpha = 1/n$, satisfying the following (approximate) optimality criterion: the empirical error $E(F')$ must be within $\eta$ of the smallest value achievable by nets of this form.

One problem here is that $L$ never gets to see the true random variables $X'_i$, but only noisy approximations $X_i$. Hence the algorithm will actually minimize
$$E(N_B) = \frac{1}{m} \sum_{i=1}^{m} (N_B(X_i) - Y_i)^2$$
with an algorithm which is described in the next section. This is good enough because of the bound
$$E'(N_B) \le E(N_B) + (K_B + 1)^2\, 2^{-2p} + 2 (K_B + 1)\, 2^{-p} \max_{1 \le i \le m} |N_B(X_i) - Y_i|,$$
where $K_B$ is the Lipschitz constant of $N_B$. Clearly, $K_B$ is at most of the order of $B d B'$, the terms $|N_B(X_i) - Y_i|$ are bounded, and obviously $E(N_B) \le E'(N_B)$.
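A minimal sketch of one such incremental step is given below, reusing `eval_net` from the earlier sketch. Choosing the new gate from a finite candidate pool is an assumption made for brevity; the actual algorithm optimizes over all gates through the piecewise-linear parametrization described next.

```python
def empirical_error(net, sample):
    """E(F) = (1/m) * sum_i (F(X_i) - Y_i)^2 on the (noisy) sample."""
    return sum((eval_net(net, x) - y) ** 2 for x, y in sample) / len(sample)

def add_unit(net, candidates, sample):
    """One loading step: F' = alpha*F + eps*g with alpha = (n-1)/n, eps = 1/n.

    The new gate g is picked from a finite candidate pool (a simplification);
    `net` and the candidates use the (a, w, theta) unit format of eval_net.
    """
    n = len(net) + 1
    alpha, eps = (n - 1) / n, 1.0 / n
    best = None
    for a, w, theta in candidates:
        # Scale the existing output weights by alpha and add the new gate scaled by eps.
        trial = [(alpha * ak, wk, tk) for ak, wk, tk in net] + [(eps * a, w, theta)]
        err = empirical_error(trial, sample)
        if best is None or err < best[0]:
            best = (err, trial)
    return best[1]
```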
For $a > 0$, let $\phi_a$ be defined by
$$\phi_a(x) = a\,\phi(x/a). \qquad (10)$$
The contribution of the new gate to $F'$ is $g'(x) = a'\phi(\langle w, x\rangle - \theta)$, with $a' = \varepsilon a$. For $a' > 0$, one can write $g'(x) = \phi_{a'}(\langle w', x\rangle - \theta')$ with $w' = a'w$ and $\theta' = a'\theta$. For $a' \le 0$, one can of course write $g'(x) = -\phi_{-a'}(\langle w', x\rangle - \theta')$, with $w' = -a'w$ and $\theta' = -a'\theta$. The advantage of this parametrization is that $g'$ is now a piecewise-linear function of the new parameters $a'$, $w'$ and $\theta'$.
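The sketch below spells this out, assuming the saturating piecewise-linear activation from the earlier sketch; on the region where the unit is not saturated, $g'$ reduces to the linear function $\langle w', x\rangle - \theta'$.

```python
def phi_a(a, t):
    # phi_a(t) = a * phi(t / a) for a > 0, as in (10), with phi as above.
    return a * max(0.0, min(1.0, t / a))

def new_gate(a_prime, w_prime, theta_prime, x):
    """g'(x) in the new parameters (a', w', theta'), with w' = a'w and theta' = a'theta.

    With a piecewise-linear phi, g' is piecewise linear in (a', w', theta'):
    where the unit is unsaturated, g'(x) is simply <w', x> - theta'.
    """
    s = sum(wi * xi for wi, xi in zip(w_prime, x)) - theta_prime
    if a_prime > 0:
        return phi_a(a_prime, s)
    if a_prime < 0:
        return -phi_a(-a_prime, s)
    return 0.0
```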
REFERENCES

[6] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991.

[8] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In Proc. 5th ACM Workshop on Computational Learning Theory, pages 341-352, 1992.

[9] W. Maass. Agnostic PAC-learning of functions on analog neural nets. In Proc. 7th Annual IEEE Conference on Neural Information Processing Systems, 1993.

[10] W. Maass. Bounds for the computational power and learning complexity of analog neural nets. In Proc. 25th STOC, pages 335-344, 1993.

[11] D. Pollard. Convergence of Stochastic Processes. Springer Verlag, 1984.

[12] A. Schrijver. Theory of Linear and Integer Programming. Wiley, New-York, 1986.

[13] L. G. Valiant. Learning disjunctions of conjunctions. In Proc. 9th International Joint Conference on Artificial Intelligence, pages 560-566, 1985.

7 APPENDIX: ENUMERATING THE STRICT DICHOTOMIES IN POLYNOMIAL TIME

In this section, we describe an algorithm for enumerating all strict dichotomies on a set $S = \{x^1, \ldots, x^n\}$ of $n$ points in $\mathbb{R}^d$. Recall that a dichotomy $S = S^- \cup S^0 \cup S^+$ is strict if $S^0 = \emptyset$. For each fixed $d$, the algorithm works in polynomial time in the real number model of computation. This fact is established in section 7.2, where we also show that it runs in polynomial time in the bit model of computation.

A dichotomy $D$ can be characterized by a vector $e(D) \in \{0, 1\}^n$: $e_j(D) = 0$ if $x^j \in S^-$ and $e_j(D) = 1$ if $x^j \in S^+$. A representative of this dichotomy is an affine function $l$ such that $l(x) < 0$ if $x \in S^-$ and $l(x) > 0$ if $x \in S^+$. The enumeration algorithm will determine one representative for each dichotomy.

In fact, we will only consider homogeneous dichotomies, i.e., dichotomies having a representative $l$ such that $l(0) = 0$. The original problem of enumerating all homogeneous or non-homogeneous dichotomies can be solved by going to $\mathbb{R}^{d+1}$ and adding the same non-zero component to the points of $S$. We can assume without loss of generality that the points $x^1, \ldots, x^n$ are all distinct in projective space, i.e., $x^i \ne \lambda x^j$ for every $i \ne j$ and every real $\lambda$.

Assume that $n > 1$ and that we have already computed the dichotomies on $S' = \{x^1, \ldots, x^{n-1}\}$. For each dichotomy $D'$ on $S'$, there are either one or two dichotomies on $S$ which extend $D'$. The main task of the algorithm is to determine exactly how each dichotomy can be extended. The following well-known fact is crucial.

Lemma 8 A dichotomy $D'$ on $S' = \{x^1, \ldots, x^{n-1}\}$ can be extended by two dichotomies on $S$ iff it has a representative $l$ such that $l(x^n) = 0$.

Proof. Assume that $D'$ has a representative $l$ such that $l(x^n) = 0$, $l(x) < 0$ for $x \in S'^-$ and $l(x) > 0$ for $x \in S'^+$. Let $x^n_{i_0}$ be a non-zero component of $x^n$, let $A = \max_{x \in S'} |x_{i_0}|$ and let $a = \min_{x \in S'} |l(x)|$. The two distinct dichotomies with representatives $l(x) \pm a\,x_{i_0}/(2A)$ are extensions of $D'$ on $S$.

Assume now that $D'$ has two extensions $D_1$ and $D_2$ with respective representatives $l_1$ and $l_2$ such that $l_1(x^n) > 0$ and $l_2(x^n) < 0$. The linear function
$$l(x) = l_1(x^n)\, l_2(x) - l_2(x^n)\, l_1(x)$$
is a representative of $D'$ and $l(x^n) = 0$. □

This characterization can be used as follows.

Lemma 9 Assume that $x^n_d$ is a non-zero component of $x^n$. Consider the subset $T = \{y^1, \ldots, y^{n-1}\}$ of $\mathbb{R}^{d-1}$ defined by
$$y^j_i = x^j_i - \frac{x^j_d}{x^n_d}\, x^n_i \qquad (13)$$
for $j = 1, \ldots, n-1$ and $i = 1, \ldots, d-1$. A dichotomy $D$ on $S'$ has a representative $l(x) = \sum_{i=1}^{d} a_i x_i$ satisfying $l(x^n) = 0$ iff there is a dichotomy $D'$ on $T$ such that $e(D') = e(D)$. If this is the case, a representative of $D'$ is $l'(y) = \sum_{i=1}^{d-1} a_i y_i$.

Proof. The condition $l(x^n) = 0$ holds iff $a_d = -\sum_{i=1}^{d-1} a_i x^n_i / x^n_d$. In this case, $l(x^j) = l'(y^j)$ for $j = 1, \ldots, n-1$. □

If $x^n_d = 0$, a similar transformation can be performed using any non-zero component of $x^n$. The condition $x^n_d \ne 0$ is therefore not restrictive. Note also that if $x^1, \ldots, x^n$ are distinct in projective space, then the $y^j$'s are distinct from $0$: if $y^j = 0$ then $x^j = (x^j_d / x^n_d)\, x^n$ would be equal to $x^n$ in projective space.
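Before turning to the algorithm itself, here is a small sketch of the two operations that Lemmas 8 and 9 provide: the dimension-reducing transformation (13) and the splitting of a representative $l$ with $l(x^n) = 0$ into representatives of the two extensions. Points and representatives are plain coefficient lists; the helper names are hypothetical.

```python
def reduce_points(points, x_n):
    """Transformation (13): map x^1..x^{n-1} into R^{d-1}, assuming x^n_d != 0."""
    d = len(x_n)
    return [[xj[i] - (xj[d - 1] / x_n[d - 1]) * x_n[i] for i in range(d - 1)]
            for xj in points]

def split_representative(l, points, i0):
    """Lemma 8: from a representative l (coefficient vector) with l(x^n) = 0 and
    a non-zero coordinate i0 of x^n, build representatives of the two extensions."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    A = max(abs(x[i0]) for x in points)        # A = max_{x in S'} |x_{i0}|
    a = min(abs(dot(l, x)) for x in points)    # a = min_{x in S'} |l(x)|
    shift = a / (2 * A) if A else 1.0          # if A = 0 the shift cannot hurt on S'
    plus = [li + (shift if i == i0 else 0.0) for i, li in enumerate(l)]
    minus = [li - (shift if i == i0 else 0.0) for i, li in enumerate(l)]
    return plus, minus                         # l(x) + a*x_{i0}/(2A) and l(x) - a*x_{i0}/(2A)
```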
We are now ready to describe the enumeration algorithm. Let $H(n, d)$ be a procedure enumerating the homogeneous dichotomies on $n$ points of $\mathbb{R}^d$. For $n > 1$ and $d > 1$, we start by enumerating the dichotomies on $\{x^1, \ldots, x^{n-1}\}$ by a recursive call to $H(n-1, d)$. The output is a list $L$ of dichotomies. We then determine which dichotomies in $L$ can be extended by two dichotomies on $\{x^1, \ldots, x^n\}$. According to Lemmas 8 and 9, this can be done with a call to $H(n-1, d-1)$ on the points $y^1, \ldots, y^{n-1}$ (we start by eliminating duplicates if these points are not distinct in projective space). The call to $H(n-1, d-1)$ produces a list $L'$ of dichotomies. For every dichotomy $D = (S^-, S^+)$ of $L$, we now look for a dichotomy $D'$ of $L'$ such that $e(D) = e(D')$.
If no such $D'$ is found, $D$ has only one extension. Any representative of $D$ will also be a representative of this extension. If such a $D'$ is found, $D$ must be split in two dichotomies $D_1$ and $D_2$ on $\{x^1, \ldots, x^n\}$. The representative of $D'$ in $L'$ provides a linear function $l'$ such that $l'(x) < 0$ for $x \in S^-$, $l'(x) > 0$ for $x \in S^+$ and $l'(x^n) = 0$. We have seen in the proof of Lemma 8 how $l'$ can be modified in order to obtain representatives of $D_1$ and $D_2$. In order to recover the optimal time complexity results of [4], the lists must be implemented by efficient data structures. This issue will not be addressed here. We only want to establish the existence of a polynomial time algorithm in each dimension.
7.2 ANALYSIS OF THE ALGORITHM
Let Z’(n, d) be the running time of ?i(n, d). Then Z’(n, d)= Z’(n-l, d)+ T(n-l, d-l)+ U(n-l, d-1), where 17(n – 1, d – 1) is (dominated by) the time it takes to update the list L after the recursive calls to Tt(n – 1, d) and ?t(n – 1, d – 1). Expanding this relation yields n-1
T(n, d) = ~[T(k, k=l
d–
1) + U(k, d – 1)] + T(1> d).
Since there are $O(n^{d-1})$ homogeneous dichotomies on $n$ points of $\mathbb{R}^d$, the updating algorithm works in time $O(n^a)$ for some constant $a > 0$. By the induction hypothesis, $T(k, d-1) = O(n^b)$ for some constant $b > 0$. It follows that $T(n, d) = O(n^c)$ for $c = \max(a, b) + 1$. In order to establish a similar polynomial time bound in the bit model, we just have to show that the bit size of the numbers manipulated by the algorithm is polynomial. In fact, we shall first see that the size of the points is polynomial in $n$ and in $d$. At first sight, it looks like we could get into trouble because of transformation (13). This is not the case because this transformation is just one step of Gaussian elimination on the matrix $(x^j_i)$. It is known that Gaussian elimination runs in polynomial time in the bit model [12]. In order to complete the proof, we also need to bound the size of the representatives constructed in the proofs of Lemma 8 and Lemma 9. Here we can only show that the bound is polynomial for each fixed $d$. This simple induction is left to the reader.