BSIS Technical Report No.99-1.
Generalization error of linear neural networks in unidentifiable cases

Kenji Fukumizu
Lab for Information Synthesis, RIKEN Brain Science Institute
Hirosawa 2-1, Wako, Saitama 351-0198, JAPAN
E-mail: [email protected]

Ver. 2.00, July 22, 1999
Abstract
Statistical asymptotic theory underlies many results in computational and statistical learning theory. It describes the limiting distribution of the maximum likelihood estimator as a normal distribution. In layered models such as neural networks, however, the regularity conditions of the asymptotic theory are not necessarily satisfied. If the true function is realized by a network smaller than the model, the set of true parameters is a union of high-dimensional submanifolds, and the true parameter is not identifiable. In such cases, the maximum likelihood estimator is not subject to the usual asymptotic theory, and little has been known about its behavior in neural networks. In this paper, we analyze the expectation of the generalization error of three-layer linear neural networks in asymptotic situations, and elucidate its peculiar behavior in unidentifiable cases. We show that the expectation of the generalization error in unidentifiable cases is larger than what the usual asymptotic theory gives. From this result, we conclude that many statistical discussions must be reconsidered for layered models.
1 Introduction

This paper discusses a non-regular property of multilayer network models that is caused by their structural characteristics. It is well known that learning in neural networks can be described as parametric estimation from the viewpoint of statistics. The least squares estimator is equal to the maximum likelihood estimator (MLE), whose asymptotic behavior is known in detail. Therefore, many researchers have believed that the statistical behavior of neural networks is perfectly described within the framework of the well-known theory, and have applied theoretical tools such as MDL and AIC.

Recently, it has been clarified that the usual statistical asymptotic theory does not necessarily hold in neural networks ([Fuk97]). This always happens if we consider the model selection problem in neural networks. Assume that we have a neural network model with H hidden units as a hypothesis space, and that the target function can be realized by a network with a smaller number of hidden units than H. In this case, as we explain in Section 2, the true parameter in the hypothesis class, which realizes the target function, is not identifiable, and the MLE is not subject to the ordinary asymptotic theory. We cannot apply any methods based on that theory.

In this paper, we discuss the generalization error of linear neural networks, as the simplest multilayer model. Also in this simple model, the true parameter loses identifiability if and only if the target is realized by a network with a smaller number of hidden units than the model. We directly calculate the expectation of the generalization error of this model in asymptotic situations, and derive an approximate formula for large-scale networks. From these results, we see that the generalization error in unidentifiable cases is larger than what is derived from the usual asymptotic theory.
2 Neural Networks and Identifiability

2.1 Neural networks and identifiability of the parameter
A neural network model can be described as a parametric family of functions {f(·; θ) : R^L → R^M}, where θ is a parameter vector. A three-layer neural network with H hidden units is defined by
f_i(x; θ) = ∑_{j=1}^{H} w_{ij} φ( ∑_{k=1}^{L} u_{jk} x_k + ζ_j ) + η_i,   (1 ≤ i ≤ M)    (1)
where θ = (w_{ij}, η_i, u_{jk}, ζ_j). The function φ(t) is called an activation function. In the case of a multilayer perceptron, tanh(t) is often used.
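As a concrete illustration of eq. (1), the following minimal sketch implements the forward map of a three-layer perceptron with φ = tanh. The variable names (W, U, zeta, eta) and the example sizes are illustrative choices, not taken from the text.

import numpy as np

def three_layer_forward(x, W, U, zeta, eta):
    """Eq. (1): f_i(x) = sum_j W[i, j] * phi(sum_k U[j, k] * x[k] + zeta[j]) + eta[i]."""
    hidden = np.tanh(U @ x + zeta)     # phi = tanh, H hidden-unit activations
    return W @ hidden + eta            # M outputs

# Example with L = 3 inputs, H = 2 hidden units, M = 2 outputs
rng = np.random.default_rng(0)
L, H, M = 3, 2, 2
x = rng.normal(size=L)
U, zeta = rng.normal(size=(H, L)), rng.normal(size=H)
W, eta = rng.normal(size=(M, H)), rng.normal(size=M)
print(three_layer_forward(x, W, U, zeta, eta))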
We consider regression problems, assuming that an output of the target system is observed with noise. An observed sample (x, y) satisfies

y = f(x) + z,    (2)

where f(x) is the target function, which is unknown to the learner, and z ~ N(0, σ²I_M) is a random vector. Here N(μ, Σ) denotes the normal distribution with mean μ and variance-covariance matrix Σ, and I_M is the M × M unit matrix. An input vector x is generated randomly according to the probability q(x)dx. The training data {(x^{(ν)}, y^{(ν)})}_{ν=1}^{N} are an independent sample from the joint distribution defined by q(x) and the conditional p(y|x) given by eq. (2).

We discuss the maximum likelihood estimator (MLE), denoted by θ̂, assuming the model p(y|x; θ) = N(f(x; θ), σ²I_M), which has the same noise model as the target. In this case, the MLE is equal to the least squares estimator, which minimizes the empirical error

E_emp = ∑_{ν=1}^{N} || y^{(ν)} − f(x^{(ν)}; θ) ||².    (3)

We evaluate the accuracy of the estimation by the expectation of the generalization error (we sometimes call it simply the generalization error when no confusion arises):

E_gen ≡ E_{{(x^{(ν)}, y^{(ν)})}} [ ∫ || f(x; θ̂) − f(x) ||² q(x) dx ].    (4)

It is easy to see that the expected log likelihood is directly related to E_gen through E_{{(x^{(ν)}, y^{(ν)})}} [ ∫∫ p(y|x) q(x) ( − log p(y|x; θ̂) ) dy dx ] = (1/(2σ²)) E_gen + const.

One of the special properties of neural networks is that, if the target can be realized by a network with a smaller number of hidden units than the model, the set of true parameters realizing the target function is not a point but a union of high-dimensional manifolds (Fig. 1). Indeed, the target can be realized on the parameter set where w_{i1} = 0 (for all i) holds and u_{1k} takes an arbitrary value, and also on the set where u_{1k} = 0 (for all k) and w_{i1} takes an arbitrary value, provided φ(0) = 0. We say that the target is unidentifiable if the set of true parameters is high-dimensional. The usual asymptotic theory cannot be applied in such cases, because the Fisher information matrix is singular. In the presence of noise in the output data, the MLE is located a little apart from this high-dimensional set.
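The regression setup of eqs. (2)-(4) can be sketched numerically as follows. For a linear target and Gaussian inputs, eq. (4) reduces to a trace formula; the target, input distribution, and noise level below are illustrative choices, not quantities fixed by the text.

import numpy as np

rng = np.random.default_rng(1)
L, M, N, sigma = 4, 3, 200, 0.1
C_true = rng.normal(size=(M, L))        # an illustrative linear target f(x) = C_true x

X = rng.normal(size=(N, L))             # inputs drawn from q(x) = N(0, I_L)
Z = sigma * rng.normal(size=(N, M))     # noise z ~ N(0, sigma^2 I_M)
Y = X @ C_true.T + Z                    # eq. (2): y = f(x) + z

def empirical_error(C, X, Y):
    """Eq. (3): squared residuals summed over the training sample."""
    return np.sum((Y - X @ C.T) ** 2)

def generalization_error(C, C_true, Sigma):
    """Eq. (4) specialized to linear maps with input covariance Sigma:
    E_x ||(C - C_true) x||^2 = Tr[(C - C_true) Sigma (C - C_true)^T]."""
    D = C - C_true
    return np.trace(D @ Sigma @ D.T)

print(empirical_error(C_true, X, Y), generalization_error(C_true, C_true, np.eye(L)))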
2.2 Linear neural networks (LNN)
We focus on linear neural networks hereafter, as the simplest multilayer model. A linear neural network (LNN) with H hidden units is defined by

f(x; A, B) = BAx,    (5)

where A is an H × L matrix and B is an M × H matrix. We assume H ≤ M ≤ L throughout this paper.

[Figure 1: Unidentifiable cases in neural networks.]

Although f(x; A, B) is just a linear map, the model is not the set of all linear maps from R^L to R^M, but the set of linear maps of rank not greater than H. Thus, the model is not equivalent to the linear model {Cx | C an M × L matrix}. In this sense, the three-layer linear neural network is the simplest multilayer model.

The parameterization in eq. (5) has a trivial redundancy: the transform (A, B) → (GA, BG^{-1}) does not change the map for any non-singular H × H matrix G. Given a linear map of rank H, the set of parameters that realize the map is an H × H dimensional manifold. However, we can easily eliminate this redundancy by restricting the parameterization so that the first H columns of A form the H × H unit matrix. Therefore, the essential number of parameters is H(L + M − H); in other words, we can regard BA as a point in an H(L + M − H) dimensional space.

A more essential redundancy arises when the rank of the map is less than H. Even under the above restriction, the set of parameters that realize such a map is still high-dimensional. Thus, in an LNN, the parameter BA is identifiable if and only if the rank of the target is equal to H. If the rank of the target is H, the usual asymptotic theory holds, and E_gen is given by

E_gen = (σ²/N) H(L + M − H) + O(N^{-3/2}).    (6)
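The rank constraint in eq. (5) and the regular-case formula (6) are easy to illustrate in code; the helper names below are mine, and the sizes are arbitrary.

import numpy as np

def lnn_map(A, B):
    """The LNN of eq. (5): x -> B A x, a linear map of rank at most H."""
    return B @ A

def egen_regular(L, M, H, sigma2, N):
    """Eq. (6): leading-order generalization error when the target has rank H."""
    return sigma2 * H * (L + M - H) / N

rng = np.random.default_rng(2)
L, M, H = 5, 4, 2
A, B = rng.normal(size=(H, L)), rng.normal(size=(M, H))
C = lnn_map(A, B)
print(np.linalg.matrix_rank(C))                  # at most H = 2: not a generic M x L map
print(egen_regular(L, M, H, sigma2=1.0, N=1000)) # regular-case prediction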
3 Generalization Error of LNN

3.1 Exact results
It is known that the MLE of an LNN can be solved exactly. We introduce the following notation:
X = (x^{(1)}, …, x^{(N)})^T,   Y = (y^{(1)}, …, y^{(N)})^T,   Z = (z^{(1)}, …, z^{(N)})^T.    (7)

Proposition 1 ([BH95]). Let V_H be the M × H matrix whose i-th column is the eigenvector corresponding to the i-th largest eigenvalue of Y^T X (X^T X)^{-1} X^T Y. Then the MLE of a linear neural network is given by

B̂Â = V_H V_H^T Y^T X (X^T X)^{-1}.    (8)
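Proposition 1 translates into a few lines of linear algebra. The sketch below assumes X^T X is invertible and uses a symmetric eigendecomposition of the M × M matrix Y^T X (X^T X)^{-1} X^T Y; the function name is mine.

import numpy as np

def lnn_mle(X, Y, H):
    """MLE of a linear neural network (Proposition 1, eq. (8)).

    X : N x L input matrix, Y : N x M output matrix (eq. (7)).
    Returns the M x L matrix B^ A^ = V_H V_H^T Y^T X (X^T X)^{-1}.
    """
    G = np.linalg.solve(X.T @ X, X.T @ Y).T   # Y^T X (X^T X)^{-1}: the unconstrained LS solution
    T = G @ X.T @ Y                           # Y^T X (X^T X)^{-1} X^T Y  (M x M, symmetric)
    _, evecs = np.linalg.eigh(T)              # eigenvalues in ascending order
    V_H = evecs[:, -H:]                       # eigenvectors of the H largest eigenvalues
    return V_H @ V_H.T @ G

# usage: C_hat = lnn_mle(X, Y, H) for data matrices X, Y as in eq. (7)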
Note that the MLE is unique almost surely even when the target is not identifiable, because the data contain noise; the estimator is distributed around the high-dimensional set of true parameters. The expectation of the generalization error is given by the following theorem.

Theorem 1. Assume that the rank of the target is r (≤ H) and that the variance-covariance matrix of the input x is positive definite. Then the expectation of the generalization error of a linear neural network is

E_gen = (σ²/N) { r(L + M − r) + ψ(M − r, L − r, H − r) } + O(N^{-3/2}),    (9)

where ψ(p, n, q) is the expectation of the sum of the q largest eigenvalues of a random matrix subject to the Wishart distribution W_p(n, I_p). (The proof is given in the Appendix.)

The joint density of the eigenvalues λ_1 ≥ … ≥ λ_p ≥ 0 of W_p(n, I_p) is known to be

(1/Z_n) ∏_{i=1}^{p} λ_i^{(n−p−1)/2} exp( −(1/2) ∑_{i=1}^{p} λ_i ) ∏_{1≤i<j≤p} (λ_i − λ_j),    (10)

where Z_n is a normalizing constant. However, an explicit formula for ψ(p, n, q) is not known in general. In the following, we derive an exact formula in a simple case and an approximation for large-scale networks.
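Since no general closed form of ψ(p, n, q) is available, it can be estimated by simulation; the sketch below draws W_p(n, I_p) samples as G G^T with G a p × n standard Gaussian matrix. The function name and trial count are arbitrary.

import numpy as np

def psi_mc(p, n, q, trials=20000, seed=0):
    """Monte Carlo estimate of psi(p, n, q): the expected sum of the q largest
    eigenvalues of a Wishart W_p(n, I_p) matrix."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(trials):
        G = rng.normal(size=(p, n))
        eig = np.linalg.eigvalsh(G @ G.T)   # eigenvalues in ascending order
        total += eig[-q:].sum()             # sum of the q largest
    return total / trials

# e.g. psi(2, 5, 1); compare with the exact value derived in Appendix B
print(psi_mc(2, 5, 1))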
By an exact calculation of the integral, we obtain ψ(2, n, 1) = n + √π Γ((n+1)/2) / Γ(n/2), where Γ(·) is the gamma function. From this fact, we can calculate the expectation of the generalization error in a special case.

Theorem 2. Assume M = 2, H = 1, and L ≥ 2. Then we have

E_gen = (σ²/N) ( L + √π Γ((L+1)/2) / Γ(L/2) )   if r = 0 (unidentifiable),
E_gen = (σ²/N) ( L + 1 )                        if r = H = 1 (identifiable).    (11)
(For the proof, see the Appendix.) The interesting point is that the generalization error changes depending on the identifiability of the target. Since √π Γ((L+1)/2) / Γ(L/2) > 1 for L ≥ 2, E_gen for an unidentifiable target is larger than that for an identifiable one. If the number of input units is very large, Stirling's formula gives E_gen ≈ (σ²/N) ( L + √(πL/2) ) for the constant-zero target. This reveals a much worse generalization in unidentifiable cases.
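The two branches of eq. (11) and the Stirling approximation are easy to evaluate numerically; the function name and the example sizes below are mine.

import numpy as np
from scipy.special import gammaln

def egen_theorem2(L, sigma2, N, identifiable):
    """Eq. (11): expected generalization error for M = 2, H = 1, L >= 2."""
    if identifiable:                          # r = H = 1
        return sigma2 * (L + 1) / N
    # r = 0 (unidentifiable): L + sqrt(pi) * Gamma((L+1)/2) / Gamma(L/2)
    ratio = np.exp(gammaln((L + 1) / 2) - gammaln(L / 2))
    return sigma2 * (L + np.sqrt(np.pi) * ratio) / N

L, sigma2, N = 50, 1.0, 1000
print(egen_theorem2(L, sigma2, N, identifiable=True))
print(egen_theorem2(L, sigma2, N, identifiable=False))
print(sigma2 * (L + np.sqrt(np.pi * L / 2)) / N)   # Stirling approximation for large L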
3.2 Generalization error of large-scale networks

We analyze the generalization error of a large-scale network in the limit where L, M, and H go to infinity in the same order. Let S ~ W_p(n, I_p) be a random matrix, and let μ_1 ≥ μ_2 ≥ … ≥ μ_p ≥ 0 be the eigenvalues of n^{-1}S. The empirical eigenvalue distribution of n^{-1}S is defined by

P_n := (1/p) ( δ(μ_1) + δ(μ_2) + ⋯ + δ(μ_p) ),    (12)

where δ(μ) is the Dirac measure at μ. The strong limit of P_n is given by the following proposition.

Proposition 2 ([Wat78]). Let 0 < α ≤ 1. If n → ∞, p → ∞, and p/n → α, then P_n converges almost everywhere to the distribution ρ_α(u) du with density

ρ_α(u) = ( 1 / (2παu) ) √( (u − u_m)(u_M − u) ) χ_α(u),    (13)

where u_m = (√α − 1)², u_M = (√α + 1)², and χ_α(u) denotes the characteristic function of the interval [u_m, u_M].
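Proposition 2 can be checked numerically by comparing a histogram of the eigenvalues of n^{-1}S with ρ_α; this is only a small verification sketch, with p and n chosen arbitrarily.

import numpy as np

def mp_density(u, alpha):
    """rho_alpha(u) of eq. (13) on [u_m, u_M], zero elsewhere (u is an array)."""
    u_m, u_M = (1 - np.sqrt(alpha)) ** 2, (1 + np.sqrt(alpha)) ** 2
    out = np.zeros_like(u)
    inside = (u > u_m) & (u < u_M)
    out[inside] = np.sqrt((u[inside] - u_m) * (u_M - u[inside])) / (2 * np.pi * alpha * u[inside])
    return out

rng = np.random.default_rng(3)
p, n = 200, 400                               # alpha = p/n = 0.5
G = rng.normal(size=(p, n))
evals = np.linalg.eigvalsh(G @ G.T) / n       # spectrum of n^{-1} S
hist, edges = np.histogram(evals, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - mp_density(centers, p / n))))   # small for large p and n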
[Figure 2: Density of eigenvalues ρ_{0.5}(u).]

Figure 2 shows the graph of ρ_α(u) for α = 0.5. We define u_β as the β-percentile point of ρ_α(u), that is, ∫_{u_β}^{u_M} ρ_α(u) du = β. If we transform the variable as t = ( u − (u_m + u_M)/2 ) / (2√α), the density of t is

ρ̃_α(t) = (2/π) √(1 − t²) / ( 1 + α + 2√α t ),    (14)

and the β-percentile point t_β is given by ∫_{t_β}^{1} ρ̃_α(t) dt = β. Then we can calculate

lim_{n,p→∞, p/n→α} (1/(np)) ψ(p, n, βp) = ∫_{u_β}^{u_M} u ρ_α(u) du = (1/π) { cos^{-1}(t_β) − t_β √(1 − t_β²) }.    (15)

From Theorem 1, we obtain the following.

Theorem 3. Let r (r ≤ H) be the rank of the target. Then we have

E_gen ≃ (σ²/N) [ r(L + M − r) + (L − r)(M − r) (1/π) { cos^{-1}(t_β) − t_β √(1 − t_β²) } ] + O(N^{-3/2}),    (16)

when L, M, H, r → ∞ with (M − r)/(L − r) → α and (H − r)/(M − r) → β.

From elementary calculus, we can prove (1/π) { cos^{-1}(t_β) − t_β √(1 − t_β²) } ≥ β ( 1 + α(1 − β) ) for 0 ≤ α ≤ 1 and 0 ≤ β ≤ 1. Therefore, in unidentifiable cases (i.e., r < H), E_gen is greater than (σ²/N) H(L + M − H). Moreover, in these results E_gen depends on the target. This is in clear contrast to the usual regular case, in which E_gen does not even depend on the model but only on the number of parameters (eq. (6)).
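The percentile t_β has no closed form, but it can be obtained numerically from ∫_{t_β}^{1} ρ̃_α(t) dt = β, after which the approximation (16) is immediate. The sketch below does exactly that; the function names are mine.

import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def t_beta(alpha, beta):
    """Solve int_t^1 (2/pi) sqrt(1 - s^2) / (1 + alpha + 2 sqrt(alpha) s) ds = beta for t."""
    dens = lambda s: (2 / np.pi) * np.sqrt(max(1 - s * s, 0.0)) / (1 + alpha + 2 * np.sqrt(alpha) * s)
    tail = lambda t: quad(dens, t, 1)[0] - beta
    return brentq(tail, -1.0, 1.0)

def egen_theorem3(L, M, H, r, sigma2, N):
    """Large-network approximation of eq. (16)."""
    alpha, beta = (M - r) / (L - r), (H - r) / (M - r)
    t = t_beta(alpha, beta)
    psi_approx = (L - r) * (M - r) * (np.arccos(t) - t * np.sqrt(1 - t * t)) / np.pi
    return sigma2 * (r * (L + M - r) + psi_approx) / N

# a rank-0 (singular) target versus the regular-case formula (6)
L, M, H, sigma2, N = 50, 10, 5, 1.0, 10000
print(egen_theorem3(L, M, H, r=0, sigma2=sigma2, N=N))
print(sigma2 * H * (L + M - H) / N)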
[Figure 3: Experimental results. Left: average squared error over 100 trials versus the number of hidden units (1-10) for a singular (unidentifiable) target and a regular (identifiable) target, together with the theoretical predictions (IN = 50, OUT = 10, #DATA = 10000). Right: average squared error over 100 trials versus ε (0.01-1) for 1000 training data, compared with the asymptotic theory in the regular case and the error in the singular case.]
3.3 Numerical simulations
First, we perform experiments using an LNN with 50 input and 10 output units. We prepare two target functions: the constant-zero function for an unidentifiable case, and a linear map of rank H for an identifiable case. The generalization error of the MLE computed from 10000 training data is evaluated. The left graph of Fig. 3 shows the average of the generalization errors over 100 data sets together with the theoretical result of Theorem 3. The experimental results agree very well with the theoretical predictions.

Next, we investigate the generalization error for an almost unidentifiable target, which is identifiable but has a very small singular value. We prepare an LNN with 2 input, 1 hidden, and 2 output units. The true function is f(x) = diag(ε, 0) x, where ε is a small positive number; the target is identifiable for non-zero ε. The right graph of Fig. 3 shows the average of the generalization errors for 1000 training data. Surprisingly, even with 1000 training data for only 3 parameters, the generalization errors for small ε are much larger than what is given by the usual asymptotic theory. They are rather close to the E_gen of the unidentifiable case marked in the figure.
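The left-panel experiment can be reproduced along the following lines for the constant-zero target; the sizes follow the text, while the noise level, seed, and function name are illustrative assumptions.

import numpy as np

def zero_target_error(L=50, M=10, H=5, N=10000, sigma=1.0, trials=100, seed=0):
    """Average generalization error of the LNN MLE for the constant-zero (singular) target."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(N, L))               # inputs with covariance I_L
        Y = sigma * rng.normal(size=(N, M))       # zero target: the outputs are pure noise
        G = np.linalg.solve(X.T @ X, X.T @ Y).T   # Y^T X (X^T X)^{-1}
        _, evecs = np.linalg.eigh(G @ X.T @ Y)
        V_H = evecs[:, -H:]                       # top-H eigenvectors (Proposition 1)
        C_hat = V_H @ V_H.T @ G
        errs.append(np.trace(C_hat @ C_hat.T))    # Tr[(C_hat - 0) I_L (C_hat - 0)^T]
    return np.mean(errs)

# compare with the unidentifiable-case prediction of Theorem 3
# and with the regular-case value sigma^2 H (L + M - H) / N of eq. (6)
print(zero_target_error())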
4 Concluding remarks

This paper discussed the behavior of the MLE in unidentifiable cases of multilayer neural networks. The ordinary methods based on the asymptotic theory cannot be applied to neural networks if the target is realized by a smaller number of hidden units than the model. As a first step toward clarifying the correct behavior of multilayer models, we derived a theoretical expression for the expectation of the generalization error of three-layer linear neural networks, and showed that the generalization error in unidentifiable cases is larger than that in identifiable cases. This analysis theoretically clarifies a special property of multilayer network models.
A Proof of Theorem 1

Let C_0 = B_0 A_0 be the coefficient matrix of the true function, and let Σ be the variance-covariance matrix of the input vector x. From the assumption of the theorem, Σ is positive definite. The expectation of the generalization error is given by

E_gen = E_{X,Y} [ Tr[ (B̂Â − C_0) Σ (B̂Â − C_0)^T ] ].    (17)

We define an M × L random matrix W by

W = Z^T X (X^T X)^{-1/2}.    (18)

Note that all the elements of W are subject to the normal distribution N(0, σ²) and are independent of X. From Proposition 1, we have

B̂Â − C_0 = (V_H V_H^T − I_M) C_0 + V_H V_H^T W (X^T X)^{-1/2}.    (19)

This leads to the decomposition

E_gen = E_{X,W} [ Tr[ C_0 Σ C_0^T (I_M − V_H V_H^T) ] ] + E_{X,W} [ Tr[ V_H V_H^T W (X^T X)^{-1/2} Σ (X^T X)^{-1/2} W^T ] ].    (20)
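The decomposition in eq. (19) is an exact matrix identity, so it can be verified numerically before going on; this is only a sanity-check sketch with arbitrary sizes.

import numpy as np

rng = np.random.default_rng(4)
L, M, H, N, sigma = 6, 4, 2, 500, 0.3
C0 = rng.normal(size=(M, H)) @ rng.normal(size=(H, L))   # a true map of rank H
X = rng.normal(size=(N, L))
Z = sigma * rng.normal(size=(N, M))
Y = X @ C0.T + Z

G = np.linalg.solve(X.T @ X, X.T @ Y).T                  # Y^T X (X^T X)^{-1}
_, evecs = np.linalg.eigh(G @ X.T @ Y)
V_H = evecs[:, -H:]
BA = V_H @ V_H.T @ G                                     # the MLE of Proposition 1

s, U = np.linalg.eigh(X.T @ X)
inv_half = U @ np.diag(1.0 / np.sqrt(s)) @ U.T           # (X^T X)^{-1/2}
W = Z.T @ X @ inv_half                                   # eq. (18)
lhs = BA - C0
rhs = (V_H @ V_H.T - np.eye(M)) @ C0 + V_H @ V_H.T @ W @ inv_half
print(np.allclose(lhs, rhs))                             # eq. (19) holds exactly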
We expand (X^T X)^{1/2} and X^T X as

(X^T X)^{1/2} = √N Σ^{1/2} + F,    X^T X = N Σ + √N K.    (21)

Then the matrices F and K are of order O(1) as N goes to infinity. We write ε = 1/√N hereafter for notational simplicity, and obtain the expansion of (1/N) Y^T X (X^T X)^{-1} X^T Y as

T(ε) ≡ (1/N) Y^T X (X^T X)^{-1} X^T Y = T^{(0)} + ε T^{(1)} + ε² T^{(2)},    (22)

where

T^{(0)} = C_0 Σ C_0^T,
T^{(1)} = C_0 K C_0^T + C_0 Σ^{1/2} W^T + W Σ^{1/2} C_0^T,
T^{(2)} = W W^T + W F C_0^T + C_0 F W^T.    (23)
Since the column vectors of V_H are the eigenvectors of Y^T X (X^T X)^{-1} X^T Y, they are obtained by perturbation of the eigenvectors of T^{(0)} = C_0 Σ C_0^T. Following the method of Kato ([Kat76], Section II), we calculate the projection P_j(ε) onto the eigenspace corresponding to the eigenvalue λ_j(ε) of T(ε); we call P_j(ε) an eigenprojection.

Let λ_1 ≥ … ≥ λ_r > 0 be the positive eigenvalues of T^{(0)} = C_0 Σ C_0^T, let P_i (1 ≤ i ≤ r) be the corresponding eigenprojections, and let P_0 be the eigenprojection corresponding to the eigenvalue 0 of T^{(0)}. Then, from the singular value decomposition of C_0 Σ^{1/2}, we see that there exist projections Q_i (1 ≤ i ≤ r) of R^L whose images are mutually orthogonal one-dimensional subspaces and for which the equalities

Σ^{1/2} C_0^T P_i C_0 Σ^{1/2} = λ_i Q_i    (24)

are satisfied. We define the total projection Q̃ by

Q̃ = ∑_{i=1}^{r} Q_i.    (25)

First, let λ_i(ε) (1 ≤ i ≤ r) be the eigenvalue obtained by the perturbation of λ_i, and let P_i(ε) be the eigenprojection corresponding to λ_i(ε). Clearly, the equality

P_i(ε) = P_i + O(ε)    (26)

holds.
Next, we consider the perturbation of P_0. In general, under the perturbation in eq. (22), the eigenvalue 0 of T^{(0)} splits into several eigenvalues. Since the perturbation is caused by a positive definite random matrix, these eigenvalues are positive and mutually distinct almost surely. Let λ_{r+1}(ε) > ⋯ > λ_M(ε) > 0 be these eigenvalues, and let P_{r+j}(ε) be the corresponding eigenprojections. We define the total projection for the eigenvalues λ_{r+j} by

P_0(ε) = ∑_{j=1}^{M−r} P_{r+j}(ε).    (27)

We expand T(ε) P_0(ε) as

T(ε) P_0(ε) = ∑_{n=1}^{∞} ε^n T̃^{(n)}    (28)

to obtain the expansion of λ_{r+j}(ε). Then, from Kato ([Kat76], (2.20)), the coefficient matrices of eq. (28) are given by

T̃^{(1)} = P_0 T^{(1)} P_0,
T̃^{(2)} = P_0 T^{(2)} P_0 − P_0 T^{(1)} P_0 T^{(1)} S − P_0 T^{(1)} S T^{(1)} P_0 − S T^{(1)} P_0 T^{(1)} P_0,
T̃^{(3)} = − P_0 T^{(1)} P_0 T^{(2)} S − P_0 T^{(2)} P_0 T^{(1)} S − P_0 T^{(1)} S T^{(2)} P_0 − P_0 T^{(2)} S T^{(1)} P_0 − S T^{(1)} P_0 T^{(2)} P_0 − S T^{(2)} P_0 T^{(1)} P_0
        + P_0 T^{(1)} P_0 T^{(1)} S T^{(1)} S + P_0 T^{(1)} S T^{(1)} P_0 T^{(1)} S + P_0 T^{(1)} S T^{(1)} S T^{(1)} P_0 + S T^{(1)} P_0 T^{(1)} P_0 T^{(1)} S + S T^{(1)} P_0 T^{(1)} S T^{(1)} P_0 + S T^{(1)} S T^{(1)} P_0 T^{(1)} P_0
        − P_0 T^{(1)} P_0 T^{(1)} P_0 T^{(1)} S² − P_0 T^{(1)} P_0 T^{(1)} S² T^{(1)} P_0 − P_0 T^{(1)} S² T^{(1)} P_0 T^{(1)} P_0 − S² T^{(1)} P_0 T^{(1)} P_0 T^{(1)} P_0,    (29)

where S is defined by

S = ∑_{i=1}^{r} (1/λ_i) P_i,    (30)

which is the inverse of T^{(0)} on the image of I − P_0. Note that from eqs. (24), (25), and (30), the equality

Σ^{1/2} C_0^T S C_0 Σ^{1/2} = Q̃    (31)
holds. From the fact T^{(0)} P_0 = 0, we have

P_0 C_0 = 0.    (32)

Using eq. (31) and eq. (32), we can derive

T̃^{(1)} = 0,    (33)
T̃^{(2)} = P_0 ( W W^T − W Q̃ W^T ) P_0.    (34)

In particular, P_{r+j}(ε) is the eigenprojection of

(1/ε²) T(ε) P_0(ε) = T̃^{(2)} + ε T̃^{(3)} + ε² T̃^{(4)} + ⋯.    (35)

In the leading term T̃^{(2)}, P_0 W (I_L − Q̃) is the orthogonal projection of W onto an (M − r)-dimensional subspace in the range and onto an (L − r)-dimensional subspace in the domain, respectively. Thus, the distribution of the random matrix T̃^{(2)} is the Wishart distribution W_{M−r}(L − r, σ² I_{M−r}). We expand P_{r+j}(ε) as

P_{r+j}(ε) = P_{r+j} + ε P^{(1)}_{r+j} + ε² P^{(2)}_{r+j} + O(ε³).    (36)
From Kato ([Kat76], (2.14)), the coefficients are given by

P^{(1)}_{r+j} = − P_{r+j} T̃^{(3)} S_j − S_j T̃^{(3)} P_{r+j},
P^{(2)}_{r+j} = − P_{r+j} T̃^{(4)} S_j − S_j T̃^{(4)} P_{r+j} + P_{r+j} T̃^{(3)} S_j T̃^{(3)} S_j + S_j T̃^{(3)} P_{r+j} T̃^{(3)} S_j + S_j T̃^{(3)} S_j T̃^{(3)} P_{r+j}
            − P_{r+j} T̃^{(3)} P_{r+j} T̃^{(3)} S_j² − P_{r+j} T̃^{(3)} S_j² T̃^{(3)} P_{r+j} − S_j² T̃^{(3)} P_{r+j} T̃^{(3)} P_{r+j},    (37)

where S_j is defined by

S_j = ∑_{1 ≤ k ≤ M−r, k ≠ j} ( 1/(μ_k − μ_j) ) P_{r+k} − (1/μ_j) (I − P_0).    (38)

If we write μ_1 ≥ … ≥ μ_{M−r} for the non-zero eigenvalues of T̃^{(2)}, the matrix S_j is the inverse of T̃^{(2)} − μ_j I_M on the image of I − P_{r+j}.
The first term of eq. (20) can be rewritten as

∑_{j=H−r+1}^{M−r} E_{X,W} [ Tr[ C_0 Σ C_0^T P_{r+j}(ε) ] ].    (39)

Keeping P_{r+j} C_0 = 0 in mind, we obtain

Tr[ C_0 Σ C_0^T P^{(1)}_{r+j} ] = 0,
Tr[ C_0 Σ C_0^T P^{(2)}_{r+j} ] = (1/μ_j²) Tr[ C_0 Σ C_0^T (I − P_0) T̃^{(3)} P_{r+j} T̃^{(3)} (I − P_0) ].    (40)

Using eq. (32), eq. (33), and the fact that C_0 Σ C_0^T (I − P_0) S = I − P_0, we can derive from eq. (29)

Tr[ C_0 Σ C_0^T P^{(2)}_{r+j} ] = (1/μ_j²) Tr[ ( T^{(1)} P_0 T^{(2)} − T^{(1)} P_0 T^{(1)} S T^{(1)} ) P_{r+j} ( T^{(2)} P_0 T^{(1)} − T^{(1)} S T^{(1)} P_0 T^{(1)} ) S ].    (41)

Furthermore, eq. (23) leads to

T^{(1)} P_0 T^{(2)} P_{r+j} − T^{(1)} P_0 T^{(1)} S T^{(1)} P_{r+j}
    = C_0 Σ^{1/2} W^T P_0 W W^T P_{r+j} − C_0 Σ^{1/2} W^T P_0 W Q̃ W^T P_{r+j}
    = C_0 Σ^{1/2} W^T P_0 ( W W^T − W Q̃ W^T ) P_{r+j}
    = μ_j C_0 Σ^{1/2} W^T P_{r+j}.    (42)

Finally, we obtain

Tr[ C_0 Σ C_0^T P^{(2)}_{r+j} ] = Tr[ C_0 Σ^{1/2} W^T P_{r+j} W Σ^{1/2} C_0^T S ] = Tr[ P_{r+j} W Q̃ W^T ].    (43)

The random matrices P_{r+j} and W Q̃ W^T are independent, because W Q̃ and W (I_L − Q̃) are independent. Therefore,

E_{X,W} [ Tr[ C_0 Σ C_0^T (I_M − V_H V_H^T) ] ] = ε² ∑_{j=H−r+1}^{M−r} E_{X,W} [ Tr[ P_{r+j} W Q̃ W^T ] ] = ε² σ² r (M − H) + O(ε³)    (44)

is obtained for the first term of eq. (20).

Using eqs. (21) and (26), the second term of eq. (20) is rewritten as

E_{X,W} [ Tr[ V_H V_H^T W (X^T X)^{-1/2} Σ (X^T X)^{-1/2} W^T ] ] = ε² E_{X,W} [ ∑_{i=1}^{r} Tr[ P_i W W^T ] + ∑_{j=1}^{H−r} Tr[ P_{r+j} W W^T ] ] + O(ε³).    (45)

Because ∑_{i=1}^{r} P_i is a non-random orthogonal projection onto an r-dimensional subspace and each element of W is subject to N(0, σ²), we have

E_{X,W} [ Tr[ ∑_{i=1}^{r} P_i W W^T ] ] = σ² r L.    (46)

We can calculate the second part of the right-hand side of eq. (45) as

Tr[ P_{r+j} W W^T ] = Tr[ P_{r+j} W Q̃ W^T ] + Tr[ P_{r+j} ( W W^T − W Q̃ W^T ) ] = Tr[ P_{r+j} W Q̃ W^T ] + μ_j.    (47)

Because μ_j is the j-th largest eigenvalue of a random matrix from the Wishart distribution W_{M−r}(L − r, σ² I_{M−r}), we obtain

E_{X,W} [ ∑_{j=1}^{H−r} Tr[ P_{r+j} W W^T ] ] = σ² { r(H − r) + ψ(M − r, L − r, H − r) }.    (48)

From eqs. (44), (46), and (48), the theorem is proved.
B Exact calculation of ψ(2, n, 1)

The probability density function of the ordered eigenvalues λ_1 > λ_2 > 0 of W_2(n, I_2) is

( 1 / (4 Γ(n−1)) ) (λ_1 λ_2)^{(n−3)/2} exp( −(λ_1 + λ_2)/2 ) (λ_1 − λ_2).    (49)

Since the expectation of the trace of a matrix from the distribution W_2(n, I_2) is equal to 2n, we have

E[λ_1 + λ_2] = 2n.    (50)

Thus, we have only to calculate E[λ_1 − λ_2]. We transform the variables as

r = (λ_1 + λ_2)/2,   ω = cos^{-1}( (λ_1 − λ_2) / (λ_1 + λ_2) ).    (51)

Noting that

dλ_1 dλ_2 = 2r sin ω dr dω,    (52)

we can easily derive

E[λ_1 − λ_2] = ( 1 / (4 Γ(n−1)) ) ∫_0^∞ ∫_0^{π/2} e^{−r} r^{n−3} sin^{n−3}ω (2r cos ω)² 2r sin ω dω dr
             = ( 2 / Γ(n−1) ) ∫_0^∞ e^{−r} r^{n} dr ∫_0^{π/2} sin^{n−2}ω cos²ω dω
             = ( 2 Γ(n+1) / Γ(n−1) ) · Γ(3/2) Γ((n−1)/2) / ( 2 Γ((n+2)/2) )
             = 2 √π Γ((n+1)/2) / Γ(n/2).    (53)

Therefore, we obtain

E[λ_1] = n + √π Γ((n+1)/2) / Γ(n/2).    (54)
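The closed form (54), and hence the value of ψ(2, n, 1) used in Theorem 2, can be cross-checked by simulation; the sketch below compares the Monte Carlo mean of the largest eigenvalue of W_2(n, I_2) with n + √π Γ((n+1)/2)/Γ(n/2). The function name and trial count are arbitrary.

import numpy as np
from scipy.special import gammaln

def largest_eig_mean_mc(n, trials=200000, seed=0):
    """Monte Carlo mean of the largest eigenvalue of W_2(n, I_2)."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(trials, 2, n))
    S = G @ np.transpose(G, (0, 2, 1))             # a batch of W_2(n, I_2) samples
    return np.linalg.eigvalsh(S)[:, -1].mean()     # eigvalsh returns ascending eigenvalues

n = 5
exact = n + np.sqrt(np.pi) * np.exp(gammaln((n + 1) / 2) - gammaln(n / 2))
print(largest_eig_mean_mc(n), exact)               # the two values should agree closely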
References

[BH95] P. F. Baldi and K. Hornik. Learning in linear neural networks: a survey. IEEE Transactions on Neural Networks, 6(4):837-858, 1995.

[Fuk97] K. Fukumizu. Special statistical properties of neural network learning. In Proceedings of the 1997 International Symposium on Nonlinear Theory and its Applications (NOLTA'97), pages 747-750, 1997.

[Kat76] T. Kato. Perturbation Theory for Linear Operators (2nd ed.). Springer-Verlag, New York, NY, 1976.

[Wat78] K. W. Wachter. The strong limits of random matrix spectra for sample matrices of independent elements. The Annals of Probability, 6(1):1-18, 1978.