Efficient Learning of Continuous Neural Networks

Pascal Koiran*
DIMACS, Rutgers University
New Brunswick, NJ 08901
e-mail: [email protected] .edu
Abstract
We describe an efficient algorithm for learning from examples a class of feedforward neural networks with real inputs and outputs in a real-valued generalization of the Probably Approximately Correct (PAC) model. These networks can approximate an arbitrary function with arbitrary precision. The learning algorithm can accommodate a fairly general worst-case noise model. The main improvement over previous work is that the running time of the algorithm grows only polynomially as the size of the target network increases (there is still an exponential dependence on the dimension of the input space, however). The main computational tool is an iterative "loading" algorithm which adds new hidden units to the hypothesis network sequentially. This avoids the difficult problem of optimizing the weights of all units simultaneously.

1 INTRODUCTION

The agnostic learning framework of [9] is very general, but this generality has a drawback: the hypothesis produced by a learning algorithm is no longer guaranteed to generalize correctly on future instances of a problem. It is only guaranteed to perform nearly as well as the best hypothesis in a certain "touchstone" class. Moreover, the efficient learning algorithm proposed in [9] applies only to neural nets of fixed size and fixed input dimension.

In this paper, we describe an efficient algorithm for learning in fixed dimension a class $C$ of two-layer feedforward neural networks. This class is precisely defined in section 2. The hypothesis produced by the algorithm is a neural net of the same class which is guaranteed to generalize correctly on future examples, as in the traditional PAC model. Our framework for learning is a natural generalization of the PAC model to the continuous setting. A learning algorithm is allowed to query examples from an oracle $O$. Each request is of the form $(X, Y) := O(p)$, where $p \in \mathbb{N}$ is a precision parameter. For each call, the oracle draws a random point $X'$ from a fixed but unknown probability distribution $D$ on $\mathbb{R}^d$. It returns in constant time an example $(X, Y)$ such that $|Y - f(X')| \le 2^{-p}$ and $\|X - X'\|_\infty = \max_{1 \le i \le d} |X_i - X'_i| \le 2^{-p}$. We also assume that $D$ has bounded support: there exists $T > 0$ such that $D(z) = 0$ for every $z$ with $\|z\|_\infty > 2^{T}$. The traditional PAC model does not make any assumption at all about $D$, but we believe that this restriction is of little practical significance.
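To make the oracle protocol concrete, here is a minimal sketch in Python; the names `ExampleOracle`, `sample_x` and `target` are hypothetical, and the uniform perturbations of size $2^{-p}$ are only one admissible way to realize the guarantee above.

```python
import random

class ExampleOracle:
    """Minimal sketch of the example oracle O described above (illustrative only)."""

    def __init__(self, sample_x, target):
        self.sample_x = sample_x   # draws X' from the unknown distribution D
        self.target = target       # the unknown target function f

    def query(self, p):
        eps = 2.0 ** (-p)                        # precision guaranteed for this call
        x_true = self.sample_x()                 # X' ~ D, never revealed to the learner
        # The learner only sees (X, Y), each within 2^-p of the true values.
        x_noisy = [xi + random.uniform(-eps, eps) for xi in x_true]
        y_noisy = self.target(x_true) + random.uniform(-eps, eps)
        return x_noisy, y_noisy

# Example: D uniform on [-1, 1]^2, target f(x) = x_1 - x_2, precision p = 10.
oracle = ExampleOracle(lambda: [random.uniform(-1.0, 1.0) for _ in range(2)],
                       lambda x: x[0] - x[1])
X, Y = oracle.query(10)
```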
The goal of the learner is to produce from these examples a hypothesis function $F \in C$ which is "close" to the target. (Except in section 5, the target function is also in $C$.) Closeness is measured by the expected error $d(f, F)$ such that
$$d(f, F)^2 = \int_{\mathbb{R}^d} L(f(x), F(x))\, dD(x).$$
The loss function $L(u, v)$ introduced in equation (1) is a measure of the relative difference between $u$ and $v$. More precisely, the learner tries to find $F$ such that $d(f, F) \le \varepsilon$, with probability at least $1 - \delta$.

Several of the tools we use for the specific case of neural networks are from Maass [9]. They are all reviewed in the next section. Some important computational tools are also due to Maass [10].
2 APPROXIMATION-THEORETIC AND STATISTICAL TOOLS

Theorem 1 provides a sequence of approximations $f_n$, $n = 1, 2, \ldots$; its proof rests on the following averaging argument. Write $f^* = \sum_{k=1}^{p} \gamma_k g_k$ with $\gamma_k \ge 0$, $\sum_{k=1}^{p} \gamma_k = 1$ and $g_k \in G$, for some sufficiently large $p$. The average value of the inner product $\langle F - f, g - f \rangle$ for $g \in \{g_1, \ldots, g_p\}$ is
$$\sum_{k=1}^{p} \gamma_k \langle F - f, g_k - f \rangle = \langle F - f, f^* - f \rangle \le (\eta + \delta)\,\|F - f\|.$$
The average value of $\|g - f\|^2$ is
$$\sum_{k=1}^{p} \gamma_k \|g_k - f\|^2 = \sum_{k=1}^{p} \gamma_k \|g_k - f^*\|^2 + \|f^* - f\|^2 = \sum_{k=1}^{p} \gamma_k \|g_k\|^2 - \|f^*\|^2 + \|f^* - f\|^2 \le B^2 + (\eta + \delta)^2.$$
As in the proof of Theorem 1, $\langle\cdot,\cdot\rangle$ denotes the inner product. Here are a few useful notations: $C_h$ denotes the class of all such networks with $h$ hidden units. Given two weight bounds $B, B' > 0$, $C_h(B, B')$ denotes the corresponding subclass of networks with weights bounded by $B$ and $B'$.

We consider only distributions on $\mathbb{R}^d \times \mathbb{R}$ that are induced by a distribution $D$ on $\mathbb{R}^d$ and an underlying target function $t : \mathbb{R}^d \to \mathbb{R}$, i.e., for any $S \subseteq \mathbb{R}^d \times \mathbb{R}$ the induced distribution assigns to $S$ the probability
$$D(\{x \in \mathbb{R}^d ;\ (x, t(x)) \in S\}).$$
For the induced distribution,
$$\mathbb{E}_{(x,y)}\, L(F(x), y) = \int_{\mathbb{R}^d} L(F(x), t(x))\, dD(x).$$
The loss function also satisfies $|L(z, y) - L(z', y)| \le c\,|z - z'|$; it can be shown that $c = 2$ is a suitable value, but this approach will not be further discussed here.
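For concreteness, a network in $C_h$ can be represented and evaluated as in the following sketch. The saturating piecewise-linear activation and the particular weight bounds are assumptions made for illustration, mirroring the hidden units $g(x) = a\,\phi(\langle w, x\rangle - \theta)$ used in section 3.

```python
def phi(t):
    # Saturating piecewise-linear activation (an illustrative assumption).
    return max(0.0, min(1.0, t))

def eval_net(units, x):
    """Evaluate F(x) = sum_k a_k * phi(<w_k, x> - theta_k) for a net in C_h.

    `units` is a list of h triples (a_k, w_k, theta_k), mirroring the hidden
    units g(x) = a * phi(<w, x> - theta) used in section 3.
    """
    return sum(a * phi(sum(wi * xi for wi, xi in zip(w, x)) - theta)
               for a, w, theta in units)

# A net in C_2 on R^2 with |a_k| <= B = 1 and ||w_k||_inf <= B' = 2.
F = [(1.0, [2.0, 0.0], 0.5), (-0.5, [0.0, 1.5], -0.2)]
value = eval_net(F, [0.3, 0.7])
```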
3 STATISTICAL ASPECTS

This section presents a learning algorithm for our class $C$ of neural networks. The claimed polynomial time bound follows from the existence of a similar bound for a certain loading problem. This problem is analyzed in the next section.

It is well known that successfully loading some examples on a network does not guarantee a correct generalization. The number of examples must be related to the power of the network architecture. In the boolean setting, the concept of Vapnik-Chervonenkis dimension makes a rigorous analysis of this phenomenon possible. The pseudo-dimension is a generalization of this concept to the continuous setting due to David Haussler. He proved the following key result ([5], Corollary 2): for a class $\mathcal{F}$ of functions of bounded pseudo-dimension¹, if samples $x_1, \ldots, x_m$ are independently drawn from $A$, then with probability at least $1 - \delta$, the empirical average of $f$ over the sample is close to its expectation $\mathbb{E} f$ for every $f \in \mathcal{F}$ ($\mathbb{E}$ denotes the mathematical expectation).

¹ $\mathcal{F}$ must satisfy a benign measure-theoretic condition called permissibility. This condition is satisfied for most natural classes of functions, and in particular for the classes considered in this paper.

In this paper, we take $X = \mathbb{R}^d \times \mathbb{R}$ and let $\mathcal{F}$ be the class of functions of the form $f(x, y) = L(F(x), y)$, where $F \in C_h$ and $L$ is the loss function (1). Although this case is not explicitly covered in [9], it is not difficult to see that the same bound holds: the pseudo-dimension of this class is $O(d^2 h^2)$.
Theorem 4 There exists an algorithm $L$ learning from examples all functions of the form (5). Assume that the target function $t$ is in $C(B_0, B'_0)$. When the dimension $d$ is fixed, the running time of $L$ is polynomial in $B_0$, $B'_0$ and in the parameters $1/\varepsilon$, $1/\delta$ and $T$ defined in section 1. Its sample complexity is polynomial in these parameters and also in $d$. This is also true for the precision $p$ required in the oracle calls; these quantities are actually independent of $T$.
Let $G$ be the set of functions of the form $g(x) = a\,\phi(\langle w, x\rangle - \theta)$ with $|a| \le B$ and $\|w\|_\infty \le B'$. Given $F \in C_h(B, B')$ and $\eta > 0$, we seek a new neural net $F'$ of the form $F' = \alpha F + \varepsilon g$, where $\alpha = (n-1)/n$ and $\varepsilon = 1 - \alpha = 1/n$, satisfying the following (approximate) optimality criterion: the empirical error $E(F')$ must be within $\eta$ of the smallest value achievable by nets of this form.

One problem here is that $L$ never gets to see the true random variables $X'_i$, but only noisy approximations $X_i$. Hence the algorithm will actually minimize
$$E(N_B) = \frac{1}{m} \sum_{i=1}^{m} (N_B(X_i) - Y_i)^2$$
with an algorithm which is described in the next section. This is good enough because of the bound
$$E'(N_B) \le E(N_B) + (K_B + 1)^2\, 2^{-2p} + 2 (K_B + 1)\, 2^{-p} \max_{1 \le i \le m} |N_B(X_i) - Y_i|,$$
where $K_B$ is the Lipschitz constant of $N_B$. Clearly, $K_B$ is at most of the order of $B d B'$, the terms $|N_B(X_i) - Y_i|$ are bounded, and obviously $E(N_B) \le E'(N_B)$.
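A minimal sketch of one such incremental step is given below, reusing `eval_net` from the earlier sketch. Choosing the new gate from a finite candidate pool is an assumption made for brevity; the actual algorithm optimizes over all gates through the piecewise-linear parametrization described next.

```python
def empirical_error(net, sample):
    """E(F) = (1/m) * sum_i (F(X_i) - Y_i)^2 on the (noisy) sample."""
    return sum((eval_net(net, x) - y) ** 2 for x, y in sample) / len(sample)

def add_unit(net, candidates, sample):
    """One loading step: F' = alpha*F + eps*g with alpha = (n-1)/n, eps = 1/n.

    The new gate g is picked from a finite candidate pool (a simplification);
    `net` and the candidates use the (a, w, theta) unit format of eval_net.
    """
    n = len(net) + 1
    alpha, eps = (n - 1) / n, 1.0 / n
    best = None
    for a, w, theta in candidates:
        # Scale the existing output weights by alpha and add the new gate scaled by eps.
        trial = [(alpha * ak, wk, tk) for ak, wk, tk in net] + [(eps * a, w, theta)]
        err = empirical_error(trial, sample)
        if best is None or err < best[0]:
            best = (err, trial)
    return best[1]
```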
For $a > 0$, let $\phi_a$ be defined by
$$\phi_a(x) = a\,\phi(x/a). \qquad (10)$$
The contribution of the new gate to $F'$ is $g'(x) = a'\phi(\langle w, x\rangle - \theta)$, with $a' = \varepsilon a$. For $a' > 0$, one can write $g'(x) = \phi_{a'}(\langle w', x\rangle - \theta')$ with $w' = a'w$ and $\theta' = a'\theta$. For $a' \le 0$, one can of course write $g'(x) = -\phi_{-a'}(\langle w', x\rangle - \theta')$, with $w' = -a'w$ and $\theta' = -a'\theta$. The advantage of this parametrization is that $g'$ is now a piecewise-linear function of the new parameters $a'$, $w'$ and $\theta'$.
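The sketch below spells this out, assuming the saturating piecewise-linear activation from the earlier sketch; on the region where the unit is not saturated, $g'$ reduces to the linear function $\langle w', x\rangle - \theta'$.

```python
def phi_a(a, t):
    # phi_a(t) = a * phi(t / a) for a > 0, as in (10), with phi as above.
    return a * max(0.0, min(1.0, t / a))

def new_gate(a_prime, w_prime, theta_prime, x):
    """g'(x) in the new parameters (a', w', theta'), with w' = a'w and theta' = a'theta.

    With a piecewise-linear phi, g' is piecewise linear in (a', w', theta'):
    where the unit is unsaturated, g'(x) is simply <w', x> - theta'.
    """
    s = sum(wi * xi for wi, xi in zip(w_prime, x)) - theta_prime
    if a_prime > 0:
        return phi_a(a_prime, s)
    if a_prime < 0:
        return -phi_a(-a_prime, s)
    return 0.0
```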
REFERENCES

[6] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4:251-257, 1991.

[8] M. J. Kearns, R. E. Schapire, and L. M. Sellie. Toward efficient agnostic learning. In Proc. 5th ACM Workshop on Computational Learning Theory, pages 341-352, 1992.

[9] W. Maass. Agnostic PAC-learning of functions on analog neural nets. In Proc. 7th Annual IEEE Conference on Neural Information Processing Systems, 1993.

[10] W. Maass. Bounds for the computational power and learning complexity of analog neural nets. In Proc. 25th STOC, pages 335-344, 1993.

[11] D. Pollard. Convergence of Stochastic Processes. Springer Verlag, 1984.

[12] A. Schrijver. Theory of Linear and Integer Programming. Wiley, New-York, 1986.

[13] L. G. Valiant. Learning disjunctions of conjunctions. In Proc. 9th International Joint Conference on Artificial Intelligence, pages 560-566, 1985.

7 APPENDIX: ENUMERATING THE STRICT DICHOTOMIES IN POLYNOMIAL TIME

In this section, we describe an algorithm for enumerating all strict dichotomies on a set $S = \{x^1, \ldots, x^n\}$ of $n$ points in $\mathbb{R}^d$. Recall that a dichotomy $S = S^- \cup S^0 \cup S^+$ is strict if $S^0 = \emptyset$. For each fixed $d$, the algorithm works in polynomial time in the real number model of computation. This fact is established in section 7.2, where we also show that it runs in polynomial time in the bit model of computation.

A dichotomy $D$ can be characterized by a vector $e(D) \in \{0, 1\}^n$: $e_j(D) = 0$ if $x^j \in S^-$ and $e_j(D) = 1$ if $x^j \in S^+$. A representative of this dichotomy is an affine function $l$ such that $l(x) < 0$ if $x \in S^-$ and $l(x) > 0$ if $x \in S^+$. The enumeration algorithm will determine one representative for each dichotomy.

In fact, we will only consider homogeneous dichotomies, i.e., dichotomies having a representative $l$ such that $l(0) = 0$. The original problem of enumerating all homogeneous or non-homogeneous dichotomies can be solved by going to $\mathbb{R}^{d+1}$ and adding the same non-zero component to the points of $S$. We can assume without loss of generality that the points $x^1, \ldots, x^n$ are all distinct in projective space, i.e., $x^i \ne \lambda x^j$ for every $i \ne j$ and every real $\lambda$.

Assume that $n > 1$ and that we have already computed the dichotomies on $S' = \{x^1, \ldots, x^{n-1}\}$. For each dichotomy $D'$ on $S'$, there are either one or two dichotomies on $S$ which extend $D'$. The main task of the algorithm is to determine exactly how each dichotomy can be extended. The following well-known fact is crucial.

Lemma 8 A dichotomy $D'$ on $S' = \{x^1, \ldots, x^{n-1}\}$ can be extended by two dichotomies on $S$ iff it has a representative $l$ such that $l(x^n) = 0$.

Proof. Assume that $D'$ has a representative $l$ such that $l(x^n) = 0$, $l(x) < 0$ for $x \in S'^-$ and $l(x) > 0$ for $x \in S'^+$. Let $x^n_{i_0}$ be a non-zero component of $x^n$, let $A = \max_{x \in S'} |x_{i_0}|$ and let $a = \min_{x \in S'} |l(x)|$. The two distinct dichotomies with representatives $l(x) \pm a\,x_{i_0}/(2A)$ are extensions of $D'$ on $S$.

Assume now that $D'$ has two extensions $D_1$ and $D_2$ with respective representatives $l_1$ and $l_2$ such that $l_1(x^n) > 0$ and $l_2(x^n) < 0$. The linear function
$$l(x) = l_1(x^n)\, l_2(x) - l_2(x^n)\, l_1(x)$$
is a representative of $D'$ and $l(x^n) = 0$. □

This characterization can be used as follows.

Lemma 9 Assume that $x^n_d$ is a non-zero component of $x^n$. Consider the subset $T = \{y^1, \ldots, y^{n-1}\}$ of $\mathbb{R}^{d-1}$ defined by
$$y^j_i = x^j_i - \frac{x^j_d}{x^n_d}\, x^n_i \qquad (13)$$
for $j = 1, \ldots, n-1$ and $i = 1, \ldots, d-1$. A dichotomy $D$ on $S'$ has a representative $l(x) = \sum_{i=1}^{d} a_i x_i$ satisfying $l(x^n) = 0$ iff there is a dichotomy $D'$ on $T$ such that $e(D') = e(D)$. If this is the case, a representative of $D'$ is $l'(y) = \sum_{i=1}^{d-1} a_i y_i$.

Proof. The condition $l(x^n) = 0$ holds iff $a_d = -\sum_{i=1}^{d-1} a_i x^n_i / x^n_d$. In this case, $l(x^j) = l'(y^j)$ for $j = 1, \ldots, n-1$. □

If $x^n_d = 0$, a similar transformation can be performed using any non-zero component of $x^n$. The condition $x^n_d \ne 0$ is therefore not restrictive. Note also that if $x^1, \ldots, x^n$ are distinct in projective space, then the $y^j$'s are distinct from $0$: if $y^j = 0$ then $x^j = (x^j_d / x^n_d)\, x^n$ would be equal to $x^n$ in projective space.
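Before turning to the algorithm itself, here is a small sketch of the two operations that Lemmas 8 and 9 provide: the dimension-reducing transformation (13) and the splitting of a representative $l$ with $l(x^n) = 0$ into representatives of the two extensions. Points and representatives are plain coefficient lists; the helper names are hypothetical.

```python
def reduce_points(points, x_n):
    """Transformation (13): map x^1..x^{n-1} into R^{d-1}, assuming x^n_d != 0."""
    d = len(x_n)
    return [[xj[i] - (xj[d - 1] / x_n[d - 1]) * x_n[i] for i in range(d - 1)]
            for xj in points]

def split_representative(l, points, i0):
    """Lemma 8: from a representative l (coefficient vector) with l(x^n) = 0 and
    a non-zero coordinate i0 of x^n, build representatives of the two extensions."""
    dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    A = max(abs(x[i0]) for x in points)        # A = max_{x in S'} |x_{i0}|
    a = min(abs(dot(l, x)) for x in points)    # a = min_{x in S'} |l(x)|
    shift = a / (2 * A) if A else 1.0          # if A = 0 the shift cannot hurt on S'
    plus = [li + (shift if i == i0 else 0.0) for i, li in enumerate(l)]
    minus = [li - (shift if i == i0 else 0.0) for i, li in enumerate(l)]
    return plus, minus                         # l(x) + a*x_{i0}/(2A) and l(x) - a*x_{i0}/(2A)
```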
We are now ready to describe the enumeration algorithm. Let $H(n, d)$ be a procedure enumerating the homogeneous dichotomies on $n$ points of $\mathbb{R}^d$. For $n > 1$ and $d > 1$, we start by enumerating the dichotomies on $\{x^1, \ldots, x^{n-1}\}$ by a recursive call to $H(n-1, d)$. The output is a list $L$ of dichotomies. We then determine which dichotomies in $L$ can be extended by two dichotomies on $\{x^1, \ldots, x^n\}$. According to Lemmas 8 and 9, this can be done with a call to $H(n-1, d-1)$ on the points $y^1, \ldots, y^{n-1}$ (we start by eliminating duplicates if these points are not distinct in projective space). The call to $H(n-1, d-1)$ produces a list $L'$ of dichotomies. For every dichotomy $D = (S^-, S^+)$ of $L$, we now look for a dichotomy $D'$ of $L'$ such that $e(D) = e(D')$.
If no such $D'$ is found, $D$ has only one extension. Any representative of $D$ will also be a representative of this extension. If such a $D'$ is found, $D$ must be split in two dichotomies $D_1$ and $D_2$ on $\{x^1, \ldots, x^n\}$. The representative of $D'$ in $L'$ provides a linear function $l'$ such that $l'(x) < 0$ for $x \in S^-$, $l'(x) > 0$ for $x \in S^+$ and $l'(x^n) = 0$. We have seen in the proof of Lemma 8 how $l'$ can be modified in order to obtain representatives of $D_1$ and $D_2$. In order to recover the optimal time complexity results of [4], the lists must be implemented by efficient data structures. This issue will not be addressed here. We only want to establish the existence of a polynomial time algorithm in each dimension.
7.2 ANALYSIS OF THE ALGORITHM
Let Z’(n, d) be the running time of ?i(n, d). Then Z’(n, d)= Z’(n-l, d)+ T(n-l, d-l)+ U(n-l, d-1), where 17(n – 1, d – 1) is (dominated by) the time it takes to update the list L after the recursive calls to Tt(n – 1, d) and ?t(n – 1, d – 1). Expanding this relation yields n-1
T(n, d) = ~[T(k, k=l
d–
1) + U(k, d – 1)] + T(1> d).
Since there are $O(n^{d-1})$ homogeneous dichotomies on $n$ points of $\mathbb{R}^d$, the updating algorithm works in time $O(n^a)$ for some constant $a > 0$. By the induction hypothesis, $T(k, d-1) = O(n^b)$ for some constant $b > 0$. It follows that $T(n, d) = O(n^c)$ for $c = \max(a, b) + 1$. In order to establish a similar polynomial time bound in the bit model, we just have to show that the bit size of the numbers manipulated by the algorithm is polynomial. In fact, we shall first see that the size of the points is polynomial in $n$ and in $d$. At first sight, it looks like we could get into trouble because of transformation (13). This is not the case because this transformation is just one step of Gaussian elimination on the matrix $(x^j_i)$. It is known that Gaussian elimination runs in polynomial time in the bit model [12]. In order to complete the proof, we also need to bound the size of the representatives constructed in the proofs of Lemma 8 and Lemma 9. Here we can only show that the bound is polynomial for each fixed $d$. This simple induction is left to the reader.