Dynamics of Batch Learning in Multilayer Neural Networks -- the existence of overtraining --

Dynamics of Batch Learning in Multilayer Neural Networks
-- the existence of overtraining --

Kenji Fukumizu
The Institute of Physical and Chemical Research (RIKEN)
E-mail: [email protected]

April 28, 1998 (Ver.1.11)
April 20, 1998 (Ver.1.10, revised)
April 16, 1998 (Ver.1.01)
April 13, 1998 (Ver.1.00)

Abstract

This paper investigates the dynamics of batch learning in multilayer neural networks. First, we present experimental results on the behavior of steepest descent learning in multilayer perceptrons and linear neural networks. In both models, we observe that strong overtraining, an increase of the generalization error, occurs in overrealizable cases, where the target function is realized by a smaller number of hidden units than the model has. Next, in the asymptotic limit, we mathematically prove the existence of overtraining in overrealizable cases of linear neural networks. This theoretical analysis shows that overtraining is not a feature of the final stage of learning; rather, it occurs in an intermediate interval of time and forms the global shape of the learning curve.

1 Introduction

This paper discusses the dynamics of batch learning in multilayer neural networks. Multilayer networks such as multilayer perceptrons have been used extensively in many applications, in the hope that their nonlinear structure, inspired by biological neural networks, yields superior performance. Because of this nonlinearity, a gradient method such as steepest descent is usually used to find the optimal weight and bias parameters. Error back-propagation is a famous example of such gradient methods ([1]). The dynamical behavior of the parameters during training is of much interest from both the practical and the theoretical points of view. It is important to know how the empirical error, the objective function of the minimization, decreases,


Figure 1: Overtraining in batch learning. The empirical error decreases during iterative training, while the generalization error increases after some time point.

and how the generalization error, the difference from the target function, behaves during the training procedure. One interesting phenomenon is the plateau, a long flat interval in a learning curve; learning does not proceed well during such intervals. Amari ([2]) proposes a method of avoiding plateaus that uses the natural gradient for the minimization. Another interesting aspect of learning is overtraining ([3]). Although the final purpose of learning is to minimize the generalization error, in general only a finite number of input-output examples is available. Training therefore primarily aims at minimizing the empirical error defined by these samples. A gradient method iteratively updates the parameter of a network to decrease the empirical error and finally reach its minimum. If the number of training examples is very large, the empirical error is in general very close to the generalization error. However, owing to the small difference between them, a decrease of the empirical error during training does not necessarily ensure a decrease of the generalization error. There are many experimental results showing overtraining: at an early stage of training, the generalization error decreases along with the empirical error, but after some point the generalization error becomes an increasing function and then converges to its final value (Fig.1). This phenomenon is practically important, because an early stopping technique should be used if overtraining actually happens. There is a controversy about the existence of overtraining. Some practitioners assert that overtraining often happens and that an early stopping criterion is very important for obtaining better performance. Amari et al. ([3]) analyze overtraining theoretically and conclude that the effect of overtraining is much smaller than practitioners believe.
Their analysis is valid if the parameter approaches the maximum likelihood estimator as described by statistical asymptotic theory ([4]). However, the usual asymptotic theory cannot

be applied in overrealizable cases, where the target function is realized by a network with a smaller number of hidden units than the model has ([5],[6]). The existence of overtraining has remained an open problem in such cases. In this paper, we investigate the dynamics of steepest descent training in three-layer neural networks. We discuss two models: multilayer perceptrons with sigmoidal activation functions, and linear neural networks, which we introduce for simplicity of analysis. After presenting experimental results on these two models, we give the solution of the steepest descent training of linear neural networks. Based on this theoretical solution, we prove that overtraining always exists in overrealizable cases of linear neural networks.

2 Preliminaries

2.1 Framework of statistical learning

In this section, we present the basic definitions and terminology used in this paper. A feed-forward neural network model can be described as a parametric family of functions {f(x; θ)}, where x is an input vector and θ is a parameter vector. The function f(x; θ) maps R^L to R^M. A three-layer network with H hidden units is defined by

    f_i(x; θ) = Σ_{j=1}^H w_ij s( Σ_{k=1}^L u_jk x_k + ζ_j ) + η_i,    (1 ≤ i ≤ M)        (1)

where θ = (w_11, ..., w_MH, η_1, ..., η_M, u_11, ..., u_HL, ζ_1, ..., ζ_H). The function s(t) is called an activation function. In the case of a multilayer perceptron, the sigmoidal function

    s(t) = 1 / (1 + e^{-t})        (2)

is used. We use such a model for regression problems, assuming that an output of the target system is observed with measurement noise. A sample (x, y) from the target system satisfies

    y = f(x) + v,        (3)

where f(x) is the target function and v is a random vector whose distribution is N(0, σ^2 I_M), the normal distribution with mean 0 and variance-covariance matrix σ^2 I_M. We write I_M for the M-dimensional identity matrix. An input vector x is generated randomly according to its probability density function q(x), which is unknown to the learner. The training data {(x^(ν), y^(ν)) | ν = 1, ..., N} are independent samples from the joint distribution given by q(x) and eq.(3). The parameter θ is estimated so that the neural network gives a good estimate of the target function f(x).
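As a concrete illustration of eqs.(1)-(3), the sketch below builds a small three-layer network and draws noisy training samples from it. All sizes, scales, and the random seed are arbitrary choices for illustration, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def three_layer_net(x, w, eta, u, zeta):
    """Eq.(1): f_i(x) = sum_j w_ij * s(sum_k u_jk x_k + zeta_j) + eta_i."""
    hidden = sigmoid(u @ x + zeta)   # H hidden-unit activations
    return w @ hidden + eta          # M outputs

# Dimensions: L inputs, H hidden units, M outputs (small arbitrary example).
L, H, M, N = 2, 3, 1, 100
w    = rng.normal(size=(M, H))
eta  = rng.normal(size=M)
u    = rng.normal(size=(H, L))
zeta = rng.normal(size=H)

# Eq.(3): training samples y = f(x) + v, with v ~ N(0, sigma^2 I_M).
sigma = 0.1
X = rng.normal(size=(N, L))                                    # inputs from q(x)
Y = np.array([three_layer_net(x, w, eta, u, zeta) for x in X])
Y += sigma * rng.normal(size=(N, M))                           # measurement noise
```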


Figure 2: Illustration of an overrealizable case.

We assume that f(x) is perfectly realized by the prepared model; that is, there is a true parameter θ_0 such that f(x; θ_0) = f(x). If the target function f(x) is realized by a network with a smaller number of hidden units than the model, we call the case overrealizable (see Fig.2); otherwise, we call it regular. We focus on the difference in behavior between these two cases. We use the square error loss function for training. The purpose of training is to find the parameter that minimizes the following empirical error:

    E_emp = Σ_{ν=1}^N || y^(ν) − f(x^(ν); θ) ||^2.        (4)

If we assume a parametric statistical model

    p(y | x; θ) = (1 / (2πσ^2)^{M/2}) exp( − (1 / 2σ^2) || y − f(x; θ) ||^2 )        (5)

for the conditional probability of y given x, the parameter that minimizes E_emp coincides with the maximum likelihood estimator (MLE). The behavior of the MLE for a large number of training data is described by statistical asymptotic theory ([4]). Generally, the MLE cannot be calculated analytically if the model is nonlinear, and some numerical optimization method is needed to obtain an approximation of it. One widely used method is the steepest descent method, which iteratively updates the parameter vector θ in the steepest descent direction of the error surface defined by E_emp(θ). The learning rule is given by

    θ(t+1) = θ(t) − α ∂E_emp/∂θ (θ(t)),        (6)

where α is a learning constant. In this paper, we discuss only batch learning, in which the gradient is calculated using all of the fixed training data. There is also much research on on-line learning, in which the parameter is updated for each newly generated datum ([7]). The performance of a network is often evaluated by the generalization error, which represents the average error of the estimate f(x; θ) over statistically generated inputs x. More precisely, it is defined by

    E_gen ≡ ∫ || f(x; θ) − f(x) ||^2 q(x) dx.        (7)

It is easy to see that

    ∫∫ || y − f(x; θ) ||^2 p(y | x; θ_0) q(x) dy dx = M σ^2 + E_gen.        (8)

Since the empirical error divided by N approaches the left-hand side of eq.(8) as N goes to infinity, minimizing the empirical error roughly approximates minimizing the generalization error. However, the two are not exactly the same, and a decrease of the training error during the learning process does not ensure a decrease of the generalization error. It is therefore extremely important to clarify the dynamical behavior of the generalization error during learning. A curve showing the empirical or generalization error as a function of time is called a learning curve.
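The batch rule (6), together with the empirical error (4) and a Monte Carlo estimate of the generalization error (7), can be sketched on a toy one-dimensional problem. The model, the target, and all constants below are illustrative assumptions only (tanh stands in for the sigmoid, and a finite-difference gradient stands in for the analytic one):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy scalar model f(x; theta) = theta[0] * tanh(theta[1] * x).
def f(x, theta):
    return theta[0] * np.tanh(theta[1] * x)

f_target = lambda x: 0.5 * np.tanh(2.0 * x)   # a realizable target, for illustration
sigma, N = 0.1, 50
X = rng.normal(size=N)
Y = f_target(X) + sigma * rng.normal(size=N)

def E_emp(theta):                              # empirical error, eq.(4)
    return np.sum((Y - f(X, theta)) ** 2)

X_test = rng.normal(size=10000)                # Monte Carlo estimate of eq.(7)
def E_gen(theta):
    return np.mean((f(X_test, theta) - f_target(X_test)) ** 2)

def num_grad(fun, theta, h=1e-6):              # finite-difference gradient
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta); e[i] = h
        g[i] = (fun(theta + e) - fun(theta - e)) / (2 * h)
    return g

theta = np.array([0.1, 0.1])
alpha = 1e-3                                   # learning constant of eq.(6)
for t in range(2000):
    theta = theta - alpha * num_grad(E_emp, theta)   # steepest descent, eq.(6)
```

Tracking E_emp(theta) and E_gen(theta) along this loop produces exactly the kind of learning curve discussed in the text.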

2.2 Linear neural networks

We must be very careful in discussing experimental results on multilayer perceptrons, especially in overrealizable cases. If the target function is overrealizable, there exists an almost flat subvariety around the global minima ([6]), and the gradient is very small around this subvariety. The convergence of learning is then much slower than in regular cases. In addition, learning with a gradient method of course suffers from local minima, a problem common to all nonlinear models. We cannot exclude these effects, and this often makes conclusions drawn from experiments obscure. Therefore, we introduce three-layer linear neural networks as a model for which theoretical analysis is possible. A linear neural network has the identity function as its activation. In this paper, we omit the bias term in linear neural networks for simplicity. Thus, a linear neural network is defined by

    f(x; A, B) = BAx,        (9)

where A is an H × L matrix and B is an M × H matrix. We use A and B for the parameters of a linear neural network to distinguish it from a perceptron. We assume H ≤ L and H ≤ M throughout this paper. Although the function f(x; A, B) is a linear function of x, the parameterization is quadratic and therefore nonlinear. Note that the above model is not equivalent to the usual linear model

    f(x; C) = Cx,

where C is an M × L matrix, because the rank of the coefficient matrix BA in eq.(9) is restricted to at most H; that is, the parameter space is the set of M × L matrices whose rank is no greater than H. Therefore, the MLE and the dynamics of learning in model eq.(9) differ from those of the usual two-layer linear model.
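The rank restriction that distinguishes model (9) from the ordinary linear model can be checked directly; a minimal sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)

# A linear neural network f(x; A, B) = B A x with H < min(L, M).
L, H, M = 4, 2, 4
A = rng.normal(size=(H, L))
B = rng.normal(size=(M, H))

C = B @ A   # the effective M x L coefficient matrix
# Its rank can never exceed H, unlike that of a general M x L matrix:
assert np.linalg.matrix_rank(C) <= H
```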

3 Learning curves -- an experimental study --

In this section, we experimentally investigate the generalization error of multilayer perceptrons and linear neural networks to see their actual behavior, with emphasis on overtraining. Our main purpose is to show that overtraining is commonly observed in both models in overrealizable cases.

3.1 Experiments on multilayer perceptrons

The steepest descent method applied to a multilayer perceptron leads to the well-known error back-propagation rule ([1]), which is given by

    w_ij(t+1) = w_ij(t) + α Σ_{ν=1}^N δ_i^(ν)(t) s_j^(ν)(t),
    η_i(t+1) = η_i(t) + α Σ_{ν=1}^N δ_i^(ν)(t),
    u_jk(t+1) = u_jk(t) + α Σ_{ν=1}^N Σ_{i=1}^M w_ij δ_i^(ν)(t) s_j^(ν)(t) (1 − s_j^(ν)(t)) x_k^(ν),
    ζ_j(t+1) = ζ_j(t) + α Σ_{ν=1}^N Σ_{i=1}^M w_ij δ_i^(ν)(t) s_j^(ν)(t) (1 − s_j^(ν)(t)),        (10)

where

    δ_i^(ν)(t) = y_i^(ν) − f_i(x^(ν); θ(t)),    s_j^(ν)(t) = s( Σ_{k=1}^L u_jk(t) x_k^(ν) + ζ_j(t) ).        (11)
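A vectorized sketch of the update rule (10)-(11), in a configuration like the experiments of this section (1 input, 2 hidden units, 1 output, near-zero target); the scales, the seed, and the learning constant are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

def backprop_step(w, eta, u, zeta, X, Y, alpha):
    """One batch update of eqs.(10)-(11).

    Shapes: w (M,H), eta (M,), u (H,L), zeta (H,); X (N,L), Y (N,M)."""
    S = sigmoid(X @ u.T + zeta)              # s_j^{(nu)}, shape (N, H)   -- eq.(11)
    out = S @ w.T + eta                      # network outputs, shape (N, M)
    delta = Y - out                          # delta_i^{(nu)}             -- eq.(11)
    back = (delta @ w) * S * (1.0 - S)       # sum_i w_ij delta_i s_j (1 - s_j)
    return (w + alpha * delta.T @ S,         # eq.(10), line by line
            eta + alpha * delta.sum(axis=0),
            u + alpha * back.T @ X,
            zeta + alpha * back.sum(axis=0))

# Smoke run: 1 input, 2 hidden, 1 output.
L, H, M, N = 1, 2, 1, 100
X = rng.normal(scale=2.0, size=(N, L))
Y = 1e-2 * rng.normal(size=(N, M))           # near-zero (overrealizable-like) target
params = [rng.normal(scale=0.1, size=s) for s in [(M, H), (M,), (H, L), (H,)]]
for _ in range(500):
    params = backprop_step(*params, X, Y, alpha=1e-3)
```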

To avoid the problems discussed in Section 2.2 as much as possible, we adopt the following configuration in our experiments. The model in use has 1 input unit, 2 hidden units, and 1 output unit. For a fixed set of training examples, 30 different values are tried for the initial parameter vector, and we select the trial that gives the least training error at the end. Figure 3 shows the generalization error averaged over 30 different simulations, each with a different set of training data. It shows clear overtraining in the overrealizable case. On the other hand, the learning curve in the regular case shows no meaningful overtraining.

[Figure 3: two panels of mean square error versus the number of iterations on logarithmic axes; (a) the regular case and (b) the overrealizable case, each showing the generalization error during training.]

Figure 3: Average learning curves of MLPs. The input samples are independent samples from N(0, 4), and the output samples include observation noise subject to N(0, 10^{-4}). The total number of training data is 100. The constant zero function is used as the overrealizable target function, and f(x) = s(x + 1) + s(x − 1) is used as the regular target. The dotted line shows the theoretical expectation of the generalization error of the MLE in the case that the target function is regular and the usual asymptotic theory is applicable.

3.2 Experiments on linear neural networks

It is known that the MLE of the linear neural network model can be obtained analytically ([8]). Although it is of course practically pointless to use the steepest descent method when the MLE is available in closed form, our interest here is not the MLE itself but the dynamics of learning. Therefore, we investigate the steepest descent learning of linear neural networks. For the training data of a linear neural network, we use the notation

    X = ( x^(1)T )        Y = ( y^(1)T )
        (  ...   ),           (  ...   ).        (12)
        ( x^(N)T )            ( y^(N)T )

Then the empirical error can be written as

    E_emp = Tr[ (Y − X A^T B^T)^T (Y − X A^T B^T) ].        (13)

The batch learning rule of a linear neural network is given by

    A(t+1) = A(t) + α ( B^T Y^T X − B^T B A X^T X ),
    B(t+1) = B(t) + α ( Y^T X A^T − B A X^T X A^T ).        (14)

It is also known that steepest descent learning of a linear neural network is free from local minima ([8]).
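A direct implementation of the batch rule (14) in an overrealizable setting like the experiments below (2 inputs, 1 hidden unit, 2 outputs, zero target); the learning constant, scales, and seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)

# Batch steepest descent, eq.(14), for a linear neural network f = B A x.
L, H, M, N = 2, 1, 2, 100
X = rng.normal(size=(N, L))                       # rows are x^{(nu)T}, eq.(12)
B0A0 = np.zeros((M, L))                           # overrealizable target (rank 0)
Y = X @ B0A0.T + 0.1 * rng.normal(size=(N, M))    # noisy outputs

A = rng.normal(scale=0.1, size=(H, L))
B = rng.normal(scale=0.1, size=(M, H))
alpha = 1e-4                                      # learning constant

def E_emp(A, B):                                  # empirical error, eq.(13)
    R = Y - X @ A.T @ B.T
    return np.trace(R.T @ R)

E0 = E_emp(A, B)
for _ in range(2000):
    A_new = A + alpha * (B.T @ Y.T @ X - B.T @ B @ A @ X.T @ X)
    B_new = B + alpha * (Y.T @ X @ A.T - B @ A @ X.T @ X @ A.T)
    A, B = A_new, B_new
```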

[Figure 4: two panels of average square error (100 trials) versus the number of iterations on logarithmic axes; (a) the regular case, showing the generalization error during training and the generalization error of the MLE, and (b) the overrealizable case, showing the generalization error during training and the generalization errors of the MLE in the overrealizable and regular cases.]

Figure 4: Average learning curves of linear neural networks. The input samples are independent samples from N(0, I_2), and the output samples include observation noise subject to N(0, 10^{-2}). The total number of training data is 100. The constant zero function is used as the overrealizable target function, and f(x) = (1 0)^T (1 0) x is used as the regular target. The dashed and dotted lines show the theoretical expectation of the generalization error of the MLE in the regular and overrealizable cases respectively.

Figure 4 shows learning curves averaged over 100 simulations with different sets of training data drawn from the same distribution. The numbers of input units, hidden units, and output units are 2, 1, and 2 respectively. The two curves show totally different behavior: the overrealizable case shows prominent overtraining in the middle of training, while the regular case shows no meaningful overtraining. From these results, we conjecture that there is an essential difference between the dynamics of learning in regular and overrealizable cases, and that overtraining is a universal property of the latter. If we use a good stopping criterion, multilayer networks may have an advantage over conventional linear models, in that the degradation of the generalization error caused by redundant parameters is not as critical as in conventional linear models. The result in the overrealizable case has not been explained theoretically before. In the next section, we give a theoretical explanation of the results on linear neural networks by obtaining the solution of the steepest descent learning equation.

4 Solvable dynamics of learning in linear neural networks

We now give a theoretical verification of the existence of overtraining. We analyze the learning dynamics of a linear neural network in overrealizable cases, and derive an approximate solution of the continuous-time differential equation of the steepest descent method.

4.1 Solution of learning dynamics

In the rest of the paper, we make the following assumptions:

(a) H < L = M,
(b) f(x) = B_0 A_0 x,
(c) E[x x^T] = σ_x^2 I_L,
(d) A(0) A(0)^T = B(0)^T B(0),
(e) the rank of A(0) (or, equivalently under (d), the rank of B(0)) is H.

Under these assumptions, A^T and B are L × H matrices. Assumption (d) is not restrictive, since the parameterization has an H^2-dimensional redundancy and (d) removes it. Assumption (e) is essential in the following discussion; if the rank of A(0) (or B(0)) is less than H, the final state of the differential equation changes, but we do not discuss this point in this paper. We discuss the continuous-time differential equation instead of the discrete-time update rule. If we decompose the output matrix as

    Y = X A_0^T B_0^T + V,        (15)

the differential equation of steepest descent learning is given by

    dA/dt = α ( B^T B_0 A_0 X^T X + B^T V^T X − B^T B A X^T X ),
    dB/dt = α ( B_0 A_0 X^T X A^T + V^T X A^T − B A X^T X A^T ).        (16)

The behavior of eq.(16) does not exactly coincide with that of the discrete-time learning rule (14). However, if the learning coefficient α in eq.(14) is chosen sufficiently small, the solution of the discrete-time rule is approximately equal to that of the continuous-time equation. Let Z_O := (1/σ) V^T X (X^T X)^{-1/2}. It is easy to see that Z_O is independent of X, and that all the elements of Z_O are mutually independent and subject to N(0, 1). To derive a solvable equation, we approximate X^T X by

    X^T X = σ_x^2 N I_L + σ_x^2 √N Z_I,        (17)

where, as N goes to infinity, the off-diagonal elements of Z_I are subject to N(0, 1) and the diagonal elements are subject to N(0, 2). Let

    F = B_0 A_0 + (1/√N) ( B_0 A_0 Z_I + (σ/σ_x) Z_O ).        (18)

We approximate eq.(16) by

    dA/dt = α σ_x^2 N B^T F − α σ_x^2 N B^T B A,
    dB/dt = α σ_x^2 N F A^T − α σ_x^2 N B A A^T.        (19)

The original equation (16) can be considered as a perturbation of eq.(19), and the latter is a good approximation if N is very large. We put one further assumption:

(f) The rank of B(0) + F A(0)^T is H.

This technical assumption is used in the proof of overtraining. It is almost always satisfied if the initial condition is chosen randomly, independently of F, which is generally unknown. We have A A^T = B^T B for all t, because d(A A^T)/dt = d(B^T B)/dt under eq.(19) and assumption (d) holds at t = 0. If we introduce the 2L × H matrix

    R = ( A^T )
        ( B   ),        (20)

then R satisfies the differential equation

    dR/dt = α σ_x^2 N S R − (α σ_x^2 N / 2) R R^T R,    where    S = ( 0    F^T )
                                                                    ( F    0   ).        (21)

This is very similar to Oja's learning equation ([9],[10]), or to the generalized Rayleigh quotient flow ([11]), which is known to be a solvable nonlinear differential equation ([10]). The key to solving eq.(21) is to derive the differential equation satisfied by R R^T:

    d(R R^T)/dt = α σ_x^2 N S R R^T + α σ_x^2 N R R^T S − α σ_x^2 N (R R^T)^2.        (22)

This is a matrix Riccati differential equation, which is well known to be solvable. We have the following

Theorem 1. Assume that the rank of R(0) is full. Then the Riccati differential equation (22) has a unique solution for all t ≥ 0, given by

    R(t) R(t)^T = e^{α σ_x^2 N S t} R(0) [ I_H + (1/2) R(0)^T ( e^{α σ_x^2 N S t} S^{-1} e^{α σ_x^2 N S t} − S^{-1} ) R(0) ]^{-1} R(0)^T e^{α σ_x^2 N S t}.        (23)

The detailed derivation is given in the Appendix. Using this theorem, we easily obtain the following theorem in the same manner as Theorem 2.1 of [10].

Theorem 2. Assume that the rank of R(0) is full. Then the differential equation (21) has a unique solution for all t, which is expressed as

    R(t) = e^{α σ_x^2 N S t} R(0) [ I_H + (1/2) R(0)^T ( e^{α σ_x^2 N S t} S^{-1} e^{α σ_x^2 N S t} − S^{-1} ) R(0) ]^{-1/2} U(t),        (24)

where U(t) is an H × H orthogonal matrix.
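Theorem 1 can be checked numerically by integrating the Riccati equation (22) with a Runge-Kutta scheme and comparing the result against the closed form (23). This sketch is not from the paper; it sets the constant α σ_x^2 N to 1 and uses arbitrary small dimensions and a fixed seed:

```python
import numpy as np

rng = np.random.default_rng(5)

# Integrate dK/dt = S K + K S - K^2 with K = R R^T (eq.(22), alpha*sigma_x^2*N = 1)
# and compare with the closed-form solution of eq.(23).
L, H = 2, 1
F = rng.normal(size=(L, L)) + 2.0 * np.eye(L)    # invertible with high probability
S = np.block([[np.zeros((L, L)), F.T],
              [F, np.zeros((L, L))]])            # the 2L x 2L matrix of eq.(21)
R0 = rng.normal(size=(2 * L, H))                 # full-rank initial R(0)

lam, V = np.linalg.eigh(S)                       # S is symmetric

def expmS(t):                                    # e^{S t} via eigendecomposition
    return (V * np.exp(lam * t)) @ V.T

def closed_form(t):                              # right-hand side of eq.(23)
    E = expmS(t)
    Si = np.linalg.inv(S)
    mid = np.eye(H) + 0.5 * R0.T @ (E @ Si @ E - Si) @ R0
    return E @ R0 @ np.linalg.inv(mid) @ R0.T @ E

def rhs(K):                                      # right-hand side of eq.(22)
    return S @ K + K @ S - K @ K

K = R0 @ R0.T
T, n = 0.3, 3000
dt = T / n
for _ in range(n):                               # classical fourth-order Runge-Kutta
    k1 = rhs(K)
    k2 = rhs(K + 0.5 * dt * k1)
    k3 = rhs(K + 0.5 * dt * k2)
    k4 = rhs(K + dt * k3)
    K = K + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

err = np.max(np.abs(K - closed_form(T)))         # agreement between (22) and (23)
```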

4.2 Dynamics of generalization error

4.2.1 The existence of overtraining

In this subsection, we write M(l × n; R) for the set of real l × n matrices. Using assumption (c), the generalization error is written as

    E_gen = σ_x^2 Tr[ (B A − B_0 A_0) (B A − B_0 A_0)^T ].        (25)

It is easy to see that the derivative of E_gen along the flow (19) is

    d E_gen / dt = 2 α σ_x^4 N ( Tr[ (F − B A) A^T A (B A − B_0 A_0)^T ] + Tr[ B B^T (F − B A) (B A − B_0 A_0)^T ] ).        (26)

Note that transforming the input and output vectors by constant orthogonal matrices does not change the generalization error. Therefore, using the singular value decomposition of B_0 A_0, we can assume without loss of generality that B_0 A_0 is diagonal; that is,

    B_0 A_0 = diag( γ_1^(0), ..., γ_r^(0), 0, ..., 0 ),        (27)

where r is the rank of B_0 A_0 and γ_1^(0) ≥ ... ≥ γ_r^(0) > 0 are its nonzero singular values. The rank r is equal to or less than H, and the target function is overrealizable if and only if r < H. We employ the singular value decomposition of F:

    F = W Γ U^T.        (28)

In this equation, U and W are L-dimensional orthogonal matrices, and

    Γ = diag( γ_1, ..., γ_L ),        (29)

where γ_1 ≥ γ_2 ≥ ... ≥ γ_L ≥ 0 are the singular values of F. We assume that γ_1 > γ_2 > ... > γ_L > 0; this is satisfied almost surely, since F includes noise. We further assume that the singular values of B_0 A_0 are sufficiently larger than the second and third terms of F in eq.(18); this holds with high probability if N is very large and σ_x ≫ σ. For simplicity, we write F as

    F = B_0 A_0 + ε Z,        (30)

where ε = 1/√N and Z is of the same order as B_0 A_0. We call the order of the singular values of B_0 A_0 the constant order, which is much larger than ε.

It is known that a small perturbation of a matrix causes a perturbation of the same order in its singular values (see, for example, [12] Section III). The diagonal matrix Γ is decomposed as

    Γ = ( Γ_1     0       0     )
        ( 0     ε Γ̃_2    0     )
        ( 0       0     ε Γ̃_3  ),        (31)

where

    Γ_1 = diag( γ_1, ..., γ_r ) = diag( γ_1^(0) + O(ε), ..., γ_r^(0) + O(ε) ),        (32)
    ε Γ̃_2 = diag( γ_{r+1}, ..., γ_H ) = diag( ε γ̃_{r+1}, ..., ε γ̃_H ),        (33)
    ε Γ̃_3 = diag( γ_{H+1}, ..., γ_L ) = diag( ε γ̃_{H+1}, ..., ε γ̃_L ).        (34)

Note that the normalized singular values γ̃_j (r+1 ≤ j ≤ L) are of the constant order. The purpose of this subsection is to show that

    d E_gen / dt > 0        (35)

if r < H (the overrealizable case) and the time t satisfies

    1 / ( α σ_x^2 √N (γ̃_H − γ̃_{H+1}) )  ≪  t  ≪  log N / ( 2 α σ_x^2 √N (γ̃_H − γ̃_{H+1}) ).        (36)

If this is proved, it shows the existence of overtraining in overrealizable cases. We utilize singular value decompositions to simplify eq.(26). If we introduce the 2L × 2L orthogonal matrix

    𝒰 = (1/√2) ( U     U  )
                ( W    −W ),        (37)

we have

    S = 𝒰 ( Γ     0  ) 𝒰^T.
          ( 0    −Γ )        (38)

Since we assume A(0) A(0)^T = B(0)^T B(0), the singular values and the right singular vectors of A(0)^T and B(0) coincide. Thus the decomposition of R(0) has the form

    R(0) = Θ J Λ G^T,        (39)

where

    J = ( I_H )
        ( 0   )
        ( I_H )
        ( 0   ),        Θ = ( P    0 )
                            ( 0    Q ),        Λ = diag( λ_1, ..., λ_H ).        (40)

The matrices P and Q are L-dimensional orthogonal matrices, and G is an H-dimensional orthogonal matrix. Because of assumption (e), Λ is invertible. Using these decompositions, we obtain the following expression for R R^T:

    R R^T = 𝒰 ( e^{α σ_x^2 N Γ t}          0          ) 𝒰^T Θ J [ Λ^{-2}
              (        0           e^{−α σ_x^2 N Γ t} )
            + (1/2) J^T Θ^T 𝒰 ( Γ^{-1} e^{2 α σ_x^2 N Γ t}              0              ) 𝒰^T Θ J
                              (            0              −Γ^{-1} e^{−2 α σ_x^2 N Γ t} )
            − (1/2) J^T Θ^T 𝒰 ( Γ^{-1}      0    ) 𝒰^T Θ J ]^{-1} J^T Θ^T 𝒰 ( e^{α σ_x^2 N Γ t}          0          ) 𝒰^T.        (41)
                              ( 0      −Γ^{-1} )                            (        0           e^{−α σ_x^2 N Γ t} )

Let C = U^T P + W^T Q and D = U^T P − W^T Q, and decompose the first H columns of these two matrices as

    C = ( C_H ),        D = ( D_H ),        (42)
        ( C_3 )             ( D_3 )

where C_H, D_H ∈ M(H × H; R) and C_3, D_3 ∈ M((L−H) × H; R). Then we have

    ( e^{α σ_x^2 N Γ t}          0          ) 𝒰^T Θ J = (1/√2) ( e^{α σ_x^2 N Γ_H t} C_H       )
    (        0           e^{−α σ_x^2 N Γ t} )                   ( e^{α σ_x^2 √N Γ̃_3 t} C_3    )
                                                                ( e^{−α σ_x^2 N Γ_H t} D_H     )
                                                                ( e^{−α σ_x^2 √N Γ̃_3 t} D_3  ),        (43)

where Γ_H = diag( Γ_1, ε Γ̃_2 ). Since B(0) + F A(0)^T = ( Q + W Γ U^T P ) ( Λ over 0 ) G^T, assumption (f) ensures that C_H is invertible. Then, if we introduce the matrix

    K = ( I_H                                                                                  )
        ( Γ̃_3^{-1/2} e^{α σ_x^2 √N Γ̃_3 t} C_3 C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}          )
        ( √(−1) Γ_H^{-1/2} e^{−α σ_x^2 N Γ_H t} D_H C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}     )
        ( √(−1) Γ̃_3^{-1/2} e^{−α σ_x^2 √N Γ̃_3 t} D_3 C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}  ),        (44)

the solution can be rewritten as

    R R^T = (1/2) 𝒰 ( Γ^{1/2}        0         ) K Ξ^{-1} K^T ( Γ^{1/2}        0         )^T 𝒰^T,        (45)
                    ( 0       √(−1) Γ^{1/2} )                 ( 0       √(−1) Γ^{1/2} )

where

    Ξ = K^T K + Γ_H^{1/2} e^{−α σ_x^2 N Γ_H t} C_H^{-T} Λ^{-2} C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}        (46)
        + Γ_H^{1/2} e^{−α σ_x^2 N Γ_H t} C_H^{-T} C_3^T Γ̃_3^{-1} C_3 C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}        (47)
        − Γ_H^{1/2} e^{−α σ_x^2 N Γ_H t} C_H^{-T} D_H^T Γ_H^{-1} D_H C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}        (48)
        − Γ_H^{1/2} e^{−α σ_x^2 N Γ_H t} C_H^{-T} D_3^T Γ̃_3^{-1} D_3 C_H^{-1} e^{−α σ_x^2 N Γ_H t} Γ_H^{1/2}.        (49)

Note that K Ξ^{-1} K^T is exponentially close to the orthogonal projection onto the subspace spanned by the first H components. In the matrix K, the (H+1, H)-component has the slowest order of convergence, namely exp{ −α σ_x^2 √N (γ̃_H − γ̃_{H+1}) t }. Hereafter, we assume that there is a time interval in which the value of this component is very small but still sufficiently larger than ε; that is,

    ε ≪ exp{ −α σ_x^2 √N (γ̃_H − γ̃_{H+1}) t } ≪ 1.        (50)

This is equivalent to the condition of eq.(36). We consider the main part of K Ξ^{-1} K^T, neglecting terms of faster convergence. It is easy to see that K can be further decomposed as

    K = ( I_H       )
        ( K_2       ),        K_2 = ( K_21  K_22 ),    K_3 = ( K_31  K_32 ),    K_4 = ( K_41  K_42 ),        (51)
        ( √(−1) K_3 )
        ( √(−1) K_4 )

where K_21, K_41 ∈ M((L−H) × r; R), K_22, K_42 ∈ M((L−H) × (H−r); R), K_31 ∈ M(H × r; R), and K_32 ∈ M(H × (H−r); R). The convergence of K_21, K_3, and K_41 is equal to or faster than the order of exp{ −α σ_x^2 N (γ_r − ε γ̃_{r+1}) t }. We further assume that, in the time interval of eq.(50),

    exp{ −α σ_x^2 N (γ_r − ε γ̃_{r+1}) t } ≪ ε^{3/2} exp{ −2 α σ_x^2 √N (γ̃_H − γ̃_{H+1}) t }        (52)

holds. This assumption is naturally satisfied under the condition of eq.(36) if N is sufficiently large, since it is equivalent to

    t ≫ (3/4) log N / { α σ_x^2 N [ γ_r − (1/√N) ( γ̃_{r+1} + 2 γ̃_H − 2 γ̃_{H+1} ) ] }.        (53)

Hereafter, we use ≈ for an approximation that keeps the leading order of each component of a matrix. Then the solution can be rewritten as

    R R^T ≈ (1/2) 𝒰 ( Γ^{1/2}        0         ) K ( I_H − K_2^T K_2 ) K^T ( Γ^{1/2}        0         )^T 𝒰^T
                    ( 0       √(−1) Γ^{1/2} )                              ( 0       √(−1) Γ^{1/2} )

          ≈ (1/2) 𝒰 ( Γ^{1/2}        0         ) ( I_H − K_2^T K_2    K_2^T        −K_3^T        −K_4^T      ) ( Γ^{1/2}        0         )^T 𝒰^T.        (54)
                    ( 0       √(−1) Γ^{1/2} )    ( K_2                K_2 K_2^T    −K_2 K_3^T    −K_2 K_4^T  ) ( 0       √(−1) Γ^{1/2} )
                                                 ( −K_3               −K_3 K_2^T   K_3 K_3^T     K_3 K_4^T   )
                                                 ( −K_4               −K_4 K_2^T   K_4 K_3^T     K_4 K_4^T   )

Therefore,

    A^T A ≈ U Γ^{1/2} ( I_H − K_2^T K_2    K_2^T − K_4^T ) Γ^{1/2} U^T,
                      ( K_2 − K_4          K_2 K_2^T     )

    B A ≈ W Γ^{1/2} ( I_H − K_2^T K_2    K_2^T + K_4^T ) Γ^{1/2} U^T,
                    ( K_2 + K_4          K_2 K_2^T     )

    B B^T ≈ W Γ^{1/2} ( I_H − K_2^T K_2    K_2^T − K_4^T ) Γ^{1/2} W^T.        (55)
                      ( K_2 + K_4          K_2 K_2^T     )

Now we are in a position to prove the overtraining. The first term on the right-hand side of eq.(26) is written as

    Tr[ (F − B A) A^T A (B A − B_0 A_0)^T ]
      ≈ Tr[ Γ^{1/2} ( K_2^T K_2      −K_2^T − K_4^T      ) Γ ( I_H − K_2^T K_2    K_2^T − K_4^T ) Γ^{1/2}
                    ( −K_2 + K_4     I_{L−H} − K_2 K_4^T )   ( K_2 − K_4          K_2 K_2^T     )
            × ( Γ^{1/2} ( I_H − K_2^T K_2    K_2^T + K_4^T ) Γ^{1/2} − W^T B_0 A_0 U )^T ].        (56)
                        ( K_2 + K_4          K_2 K_2^T     )

Lemma 1 and the further decomposition (51) of K_2 and K_4 allow us to expand eq.(56) into a 3 × 3 block array, with blocks indexed by the column groups of sizes r, H − r, and L − H, keeping only the leading order in ε of each component; we refer to the resulting block expression as eq.(57).        (57)

The leading order of eq.(57) is ε^3 exp{ −2 α σ_x^2 √N (γ̃_H − γ̃_{H+1}) t }. Note that the order O(ε^4) exp{ −α σ_x^2 √N (γ̃_H − γ̃_{H+1}) t }, which appears in the (3,2) × (2,3) part of eq.(57), is smaller under the condition of eq.(50), and the order O(ε^{3/2}) K_21, which appears in the (3,1) × (1,3) part, is also smaller under the assumption of eq.(52). Thus, the main part of the trace is contained in

    Tr[ (F − B A) A^T A (B A − B_0 A_0)^T ]
      ≈ Tr[ ε^3 Γ̃_2^{1/2} K_22^T K_22 Γ̃_2^{5/2} − ε^3 Γ̃_2^{1/2} K_22^T Γ̃_3 K_22 Γ̃_2^{3/2}
            − ε^3 Γ̃_3^{1/2} K_22 Γ̃_2^2 K_22^T Γ̃_3^{1/2} + ε^3 Γ̃_3^{1/2} K_22 Γ̃_2 K_22^T Γ̃_3^{3/2} ]
      = ε^3 Σ_{i=1}^{L−H} Σ_{j=1}^{H−r} γ̃_{r+j} ( γ̃_{r+j} − γ̃_{H+i} )^2 { (K_22)_{ij} }^2.        (58)

All the terms in the final sum are positive, which proves that the trace is strictly positive. It can be proved in almost the same way that the second term on the right-hand side of eq.(26) is also strictly positive. Thus we obtain

    d E_gen / dt > 0        (59)

when the time t is in the interval of eq.(36).

4.2.2 Effect of overtraining

It is easy to see that eq.(26) is negative if the difference between BA and B_0 A_0 is much larger than the order of ε. This means that the generalization error is, in general, a decreasing function at the beginning of training. As we have shown, the increase of the generalization error appears in an intermediate interval of training. In the previous subsection we assumed that the small constant ε is much smaller than the slowest decaying term. In the last stage of training, however, all the decaying terms are of course smaller than ε. The final leading term in eq.(57), which appears in the (3,2) × (2,3) component and is of order O(ε^4) exp{ −α σ_x^2 √N (γ̃_H − γ̃_{H+1}) t }, can be positive or negative depending on the noise. In this subsection, we roughly estimate the amount of the increase caused by the overtraining in the intermediate interval, and the effect of this final variance.

Let t_2 = log N / ( 2 α σ_x^2 √N (γ̃_H − γ̃_{H+1}) ), the upper end of the interval of eq.(36), and let t_1 be a time in the interval of eq.(36) that is sufficiently smaller than t_2. We approximate the increase of the generalization error caused by the dominant term in eq.(58) by

    I_1 = ∫_{t_1}^{t_2} ε^3 e^{−2λt} dt,        (60)

where we write λ = α σ_x^2 √N (γ̃_H − γ̃_{H+1}) for simplicity. We approximate the amount of variation caused by the final leading term by

    I_2 = ∫_{t_1}^{∞} ε^4 e^{−λt} dt.        (61)

We have

    I_1 = (ε^3 / 2λ) ( e^{−2λ t_1} − e^{−2λ t_2} )    and    I_2 = (ε^4 / λ) e^{−λ t_1}.        (62)

Since t_1 ≪ t_2, we have I_1 ≈ (ε^3 / 2λ) e^{−2λ t_1} ≫ (ε^4 / λ) e^{−λ t_1} = I_2, where the last inequality follows from condition (50). This shows that the effect of the final variance is much smaller than that of the intermediate overtraining. We can conclude that the shape of the graphs shown in Section 3 depicts the intermediate overtraining phenomenon proved in Section 4.2.1.
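The comparison of I_1 and I_2 in eq.(62) can be illustrated numerically; all numbers below are arbitrary choices that merely satisfy condition (50):

```python
import numpy as np

# Illustration of eq.(62): with eps << exp(-lam * t1) (condition (50)),
# the intermediate-overtraining term I1 dominates the final-variance term I2.
N = 1e6
eps = 1.0 / np.sqrt(N)            # eps = 1/sqrt(N) = 1e-3
lam = 1.0                         # lam = alpha * sigma_x^2 * sqrt(N) * (gamma_H - gamma_{H+1})
t1 = 2.0                          # a time inside the interval (36)
t2 = 0.5 * np.log(N) / lam        # the upper end of the interval (36)

I1 = eps**3 / (2 * lam) * (np.exp(-2 * lam * t1) - np.exp(-2 * lam * t2))
I2 = eps**4 / lam * np.exp(-lam * t1)
```

With these values I1 exceeds I2 by more than an order of magnitude, matching the conclusion of the text.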

5 Conclusion

We showed that strong overtraining is observed in batch learning of multilayer neural networks when the target is overrealizable. According to the experimental results on multilayer perceptrons and linear neural networks, this overtraining seems to be a universal property of multilayer models. We also gave a theoretical analysis of the batch training of linear neural networks, showed the existence of overtraining in an intermediate time interval, and explained that this intermediate overtraining determines the general shape of the learning curves. Although the theoretical analysis in this paper covers only the linear neural network model, it is very suggestive for the phenomenon of overtraining, which is observed in many applications of multilayer neural networks. It will be extremely important to clarify this overtraining phenomenon theoretically in other models as well.

References

[1] D.E. Rumelhart, G.E. Hinton, and R.J. Williams. Learning internal representations by error propagation. In D.E. Rumelhart, J.L. McClelland, and the PDP Research Group, editors, Parallel Distributed Processing, Vol.1, pages 318-362. MIT Press, Cambridge, MA, 1986.
[2] S. Amari. Natural gradient works efficiently in learning. Neural Computation, 10:251-276, 1998.
[3] S. Amari, N. Murata, and K.R. Muller. Statistical theory of overtraining -- is cross-validation asymptotically effective? In D.S. Touretzky, M.C. Mozer, and M.E. Hasselmo, editors, Advances in Neural Information Processing Systems 8, pages 176-182. MIT Press, 1996.
[4] H. Cramer. Mathematical Methods of Statistics, pages 497-506. Princeton University Press, Princeton, NJ, 1946.
[5] K. Fukumizu. A regularity condition of the information matrix of a multilayer perceptron network. Neural Networks, 9(5):871-879, 1996.
[6] K. Fukumizu. Special statistical properties of neural network learning. In Proceedings of 1997 International Symposium on Nonlinear Theory and Its Applications, pages 747-750, 1997.
[7] Y. Kabashima. Perfect loss of generalization due to noise in K = 2 parity machines. J. Phys. A: Math. Gen., 27:1917, 1994.
[8] P.F. Baldi and K. Hornik. Learning in linear neural networks: a survey. IEEE Transactions on Neural Networks, 6(4):837-858, 1995.
[9] E. Oja. A simplified neuron model as a principal component analyzer. J. Math. Biol., 15:267-273, 1982.
[10] W. Yan, U. Helmke, and J.B. Moore. Global analysis of Oja's flow for neural networks. IEEE Transactions on Neural Networks, 5(5):674-683, 1994.
[11] U. Helmke and J.B. Moore. Optimization and Dynamical Systems. Springer-Verlag, London, 1994.
[12] R. Bhatia. Matrix Analysis. Springer-Verlag, New York, 1997.
[13] T. Sasagawa. On the finite escape phenomena for matrix Riccati equations. IEEE Transactions on Automatic Control, 27(4):977-979, 1982.

Acknowledgment

The author would like to thank Dr. Shun-ichi Amari, the director of the Brain-Style Information Systems Research Group in BSI, RIKEN, Dr. Yoshimasa Nakamura, professor of Osaka University, and Dr. Noboru Murata in BSI, RIKEN, for their valuable comments.

Appendix

A Solution of a general Riccati differential equation system

We summarize the well-known results on the Riccati differential equation system. In this appendix, we use notations different from the main body of the paper. Let C, G, and F be constant n × n matrices, and let K be an n × n matrix. We do not need any conditions on C, G, and F here. Let us consider the following system:

\frac{dK}{dt} = CK + KC^T - KGK + F.   (63)

The associated linear differential equation,

\frac{d}{dt} \begin{pmatrix} D \\ E \end{pmatrix} = \begin{pmatrix} -C^T & G \\ F & C \end{pmatrix} \begin{pmatrix} D \\ E \end{pmatrix}, \qquad \begin{pmatrix} D(0) \\ E(0) \end{pmatrix} = \begin{pmatrix} I_n \\ K(0) \end{pmatrix},   (64)

plays an important role in analyzing the Riccati equation. The following is a well-known theorem (see, for example, [13]).

Theorem 3. The matrix Riccati differential equation (63) has a solution in the interval [0, t_1] if and only if D(t) in eq. (64) is invertible for all t \in [0, t_1]. The solution is unique and is given by

K(t) = E(t) D(t)^{-1}.   (65)
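As an illustrative aside (not part of the original argument), Theorem 3 can be checked numerically: for randomly chosen C, G, and F (the sizes, seed, and initial condition below are arbitrary choices), K(t) = E(t)D(t)^{-1} built from the linear system (64) should satisfy the Riccati equation (63) up to finite-difference error.

```python
# Numerical check of Theorem 3 (illustrative; C, G, F, K(0) are arbitrary choices).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
n = 3
C = rng.standard_normal((n, n))
A = rng.standard_normal((n, n))
G = A @ A.T                       # symmetric positive semidefinite G
B = rng.standard_normal((n, n))
F = B + B.T                       # symmetric F
K0 = np.eye(n)                    # K(0)

# Coefficient matrix of the associated linear system (64).
M = np.block([[-C.T, G], [F, C]])

def K(t):
    """K(t) = E(t) D(t)^{-1}, with (D; E) from the matrix exponential of (64)."""
    Phi = expm(M * t) @ np.vstack([np.eye(n), K0])
    D, E = Phi[:n], Phi[n:]
    return E @ np.linalg.inv(D)

# Compare a central finite difference of K with the Riccati right-hand side.
t, h = 0.02, 1e-5
dK = (K(t + h) - K(t - h)) / (2 * h)
Kt = K(t)
residual = np.linalg.norm(dK - (C @ Kt + Kt @ C.T - Kt @ G @ Kt + F))
print(residual)                   # small (finite-difference error only)
```

The check uses only the identity dK/dt = E'D^{-1} - ED^{-1}D'D^{-1}, which holds for any C, G, F as long as D(t) is invertible.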

Since the differential equation (64) is linear, the solution is explicitly obtained and given by



D(t) = e^{-\tilde{C}^T t} \Big[ I_n + \int_0^t e^{\tilde{C}^T s} G e^{\tilde{C} s} \, ds \, (K(0) - \bar{K}) \Big],

E(t) = \bar{K} e^{-\tilde{C}^T t} + \Big[ \bar{K} e^{-\tilde{C}^T t} \int_0^t e^{\tilde{C}^T s} G e^{\tilde{C} s} \, ds + e^{\tilde{C} t} \Big] (K(0) - \bar{K}),   (66)

where \bar{K} is a solution of the algebraic Riccati equation

C \bar{K} + \bar{K} C^T - \bar{K} G \bar{K} + F = 0,   (67)

and \tilde{C} = C - \bar{K} G.

We now derive Theorem 1. In the Riccati differential equation (22), C = NS, G = N I_L, and F = 0. Then \bar{K} = O and \tilde{C} = C = NS. Therefore, the solution is given by eq. (23). Note that

D(t) = e^{-NSt} \Big[ I_L + \tfrac{1}{2} \big( e^{NSt} S^{-1} e^{NSt} - S^{-1} \big) R(0) R(0)^T \Big]   (68)

in this case. Since the matrix S is positive definite and R(0) is of full rank, the matrix I_H + \tfrac{1}{2} R(0)^T [ e^{NSt} S^{-1} e^{NSt} - S^{-1} ] R(0) is invertible for all t \ge 0. Thus, D(t) is also invertible for all t \ge 0, since the inverse of e^{NSt} D(t) is given by
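This special case can also be confirmed numerically. The following sketch (the sizes, the value of N, and t are arbitrary illustrative choices, not values from the paper) compares the closed form of D(t) in eq. (68) with D(t) computed directly from the linear system (64) with C = NS, G = N I_L, F = 0, and K(0) = R(0)R(0)^T.

```python
# Check of eq. (68): closed-form D(t) vs. the linear system (64) (illustrative sizes).
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
L, H, N, t = 4, 2, 5.0, 0.3
Q, _ = np.linalg.qr(rng.standard_normal((L, L)))
S = Q @ np.diag([1.0, 0.5, 0.2, 0.1]) @ Q.T    # positive definite S
R0 = rng.standard_normal((L, H))               # full-rank R(0)
K0 = R0 @ R0.T                                 # K(0) = R(0) R(0)^T

# D(t) from the linear system (64) with C = NS, G = N I_L, F = 0.
M = np.block([[-N * S, N * np.eye(L)], [np.zeros((L, L)), N * S]])
D_lin = (expm(M * t) @ np.vstack([np.eye(L), K0]))[:L]

# Closed form (68).
Si = np.linalg.inv(S)
Et = expm(N * S * t)
D_closed = expm(-N * S * t) @ (np.eye(L) + 0.5 * (Et @ Si @ Et - Si) @ K0)

print(np.allclose(D_lin, D_closed))            # True
```

The agreement reflects the integral \int_0^t e^{NSs}(N I_L)e^{NSs} ds = \tfrac{1}{2}(e^{NSt}S^{-1}e^{NSt} - S^{-1}), which holds because S commutes with its own exponential.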



I_L - \tfrac{1}{2} \big\{ e^{NSt} S^{-1} e^{NSt} - S^{-1} \big\} R(0) \Big[ I_H + \tfrac{1}{2} R(0)^T \big\{ e^{NSt} S^{-1} e^{NSt} - S^{-1} \big\} R(0) \Big]^{-1} R(0)^T.   (69)
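The form of this inverse is an instance of the Woodbury identity (I + UV)^{-1} = I - U(I + VU)^{-1}V, applied with U = \tfrac{1}{2}MR(0) and V = R(0)^T, where M abbreviates e^{NSt}S^{-1}e^{NSt} - S^{-1}. A quick numerical check of the identity (with arbitrary stand-in matrices, not the specific ones of the paper):

```python
# Woodbury-type identity behind eq. (69), checked on arbitrary matrices.
import numpy as np

rng = np.random.default_rng(2)
L, H = 5, 2
M = rng.standard_normal((L, L))   # stands in for e^{NSt} S^{-1} e^{NSt} - S^{-1}
R = rng.standard_normal((L, H))   # stands in for R(0)

lhs = np.linalg.inv(np.eye(L) + 0.5 * M @ R @ R.T)
rhs = np.eye(L) - 0.5 * M @ R @ np.linalg.inv(np.eye(H) + 0.5 * R.T @ M @ R) @ R.T
print(np.allclose(lhs, rhs))
```

The identity reduces inverting an L × L matrix to inverting an H × H one, which is why the invertibility of the small matrix I_H + \tfrac{1}{2}R(0)^T M R(0) settles that of D(t).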

B Perturbation of singular value decomposition

Here, we prove the following lemma.

Lemma 1. Let Z be an L × L matrix, and let C be an L × L matrix of the form

C = \begin{pmatrix} C^{(0)} & 0 \\ 0 & 0 \end{pmatrix},   (70)

where C^{(0)} is an r × r matrix. For a small positive number \varepsilon, we define C_\varepsilon := C + \varepsilon Z. Let

C_\varepsilon = W \Sigma_\varepsilon U^T   (71)

be the singular value decomposition of C_\varepsilon, where W and U are orthogonal matrices and \Sigma_\varepsilon is a diagonal matrix which has the singular values of C_\varepsilon. Then, the matrix W^T C U has the following form:



W^T C U = \begin{pmatrix} C^{(0)} + O(\varepsilon) & O(\varepsilon) \\ O(\varepsilon) & O(\varepsilon^2) \end{pmatrix}.   (72)

Proof. It is known ([12], Section III) that \Sigma_\varepsilon has the form

\Sigma_\varepsilon = \begin{pmatrix} \Sigma^{(0)} + O(\varepsilon) & 0 \\ 0 & O(\varepsilon) \end{pmatrix},   (73)

where \Sigma^{(0)} is the diagonal matrix of the singular values of C^{(0)}. If we decompose W and U into blocks as

W = \begin{pmatrix} W_{11} & W_{21} \\ W_{12} & W_{22} \end{pmatrix} \quad \text{and} \quad U = \begin{pmatrix} U_{11} & U_{21} \\ U_{12} & U_{22} \end{pmatrix},   (74)

we have

\begin{pmatrix} W_{11} & W_{21} \\ W_{12} & W_{22} \end{pmatrix} \begin{pmatrix} \Sigma^{(0)} + O(\varepsilon) & 0 \\ 0 & O(\varepsilon) \end{pmatrix} \begin{pmatrix} U_{11}^T & U_{12}^T \\ U_{21}^T & U_{22}^T \end{pmatrix} = \begin{pmatrix} C^{(0)} + \varepsilon Z_{11} & \varepsilon Z_{12} \\ \varepsilon Z_{21} & \varepsilon Z_{22} \end{pmatrix}.   (75)

Then, the equation of the upper-left block,

W_{11} (\Sigma^{(0)} + O(\varepsilon)) U_{11}^T + W_{21} O(\varepsilon) U_{21}^T = C^{(0)} + \varepsilon Z_{11},   (76)

shows that W_{11} and U_{11} are of constant order and

W_{11} \Sigma^{(0)} U_{11}^T = C^{(0)} + O(\varepsilon).   (77)

We also know that W_{11} and U_{11} are invertible for small \varepsilon. From the equation of the upper-right block,

W_{11} (\Sigma^{(0)} + O(\varepsilon)) U_{12}^T + W_{21} O(\varepsilon) U_{22}^T = \varepsilon Z_{12},   (78)

we obtain that U_{12} is of order O(\varepsilon). Likewise, W_{12} = O(\varepsilon). The assertion is then straightforward.
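The order structure of the lemma can be illustrated numerically. In the sketch below (the sizes, the shift 5 I_r that keeps C^{(0)} well conditioned, and the values of \varepsilon are all arbitrary choices), the off-diagonal blocks of W^T C U shrink like \varepsilon and the lower-right block like \varepsilon^2:

```python
# Numerical illustration of the block orders in Lemma 1 (arbitrary test matrices).
import numpy as np

rng = np.random.default_rng(3)
Ldim, r = 6, 3
C0 = rng.standard_normal((r, r)) + 5 * np.eye(r)   # keeps C^{(0)} well conditioned
C = np.zeros((Ldim, Ldim))
C[:r, :r] = C0
Z = rng.standard_normal((Ldim, Ldim))

for eps in (1e-2, 1e-3):
    W, s, Ut = np.linalg.svd(C + eps * Z)          # C_eps = W @ diag(s) @ Ut
    B = W.T @ C @ Ut.T                             # W^T C U with U = Ut.T
    off = np.linalg.norm(B[:r, r:]) + np.linalg.norm(B[r:, :r])
    corner = np.linalg.norm(B[r:, r:])
    print(eps, off, corner)                        # off ~ O(eps), corner ~ O(eps^2)
```

Only the decay rates of the blocks are asserted here; the upper-left block stays of constant order, as in the lemma.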

