Improving Generalization with Symmetry Constraints

by

N. Scott Cardell, Wayne H. Joerding, and Ying Li
Washington State University, Pullman, WA 99164

This research was partially funded by National Science Foundation Grant No. SES-9022773.
Abstract
This paper presents research on the benefits of using a priori information about the symmetry of cross-partial derivatives to improve generalization. We show how to impose the symmetry constraint on a global training algorithm and demonstrate its efficacious use with a problem in economics.
1. Introduction
This paper presents preliminary results from our research into imposing a priori information on feedforward neural networks. We take as an example the imposition of symmetry constraints suitable when using a feedforward network to approximate a system of nonlinear equations derived as the gradient of some known or unknown function. This problem can arise in many fields. For example, in geology detection of magnetic anomalies depends on the gradient of a gravitational potential function. In economics, the condition for profit maximization sets the gradient of the production function equal to the real input prices. In electrical engineering, the non-linear behavior of a MOSFET device depends on the gradient of the device response function with respect to the drain-source and gate-source voltages. In each of these cases observations on the gradient of a non-linear response function can represent important, or even the only, information about the phenomena of interest, such as in the magnetic anomaly example. The universal approximation capabilities of feedforward networks make them good candidates for a semi-nonparametric approach to modeling non-linear functions, but traditional implementations ignore a priori information about the problem implied by the symmetry of cross-partial derivatives. In this paper, we show how to impose symmetry constraints and demonstrate their usefulness in an example taken from economics.
2. Symmetry in gradient vector equations
Let $\psi : \mathbb{R}^{k_0} \to \mathbb{R}$ represent a twice differentiable function of $k_0$ inputs, and $\phi(x) \equiv \nabla\psi(x)$ its $k_0$-dimensional gradient vector. If we were to observe a sample $(o_n, x_n)$, where $o_n = \psi(x_n) + e_n$, $n = 1, \dots, N$, $e_n$ a mean zero noise term, then we could use the data to train a network to approximate the unknown function $\psi$ and its derivatives on a compact set (see, for example, Hornik, Stinchcombe, and White (1989, 1990)). Sometimes, however, we do not observe a number $o_n$, but instead observe a vector $y_n = \phi(x_n) + e_n$, where $e_n$ represents a vector of mean zero noise terms. In other cases we observe both $o_n$ and $y_n$.
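For concreteness, the following is a minimal sketch of the two observation schemes, assuming a particular $\psi$ of our own choosing and an arbitrary noise scale (neither is specified above); it is an illustration, not part of the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    def psi(x):
        # An assumed twice-differentiable response function with k0 = 2 inputs
        # (illustrative only; the paper leaves psi unspecified here).
        return x[..., 0] ** 2 * x[..., 1] + np.sin(x[..., 1])

    def grad_psi(x):
        # phi(x) = gradient of psi(x), the k0-dimensional vector observed as y_n.
        g1 = 2.0 * x[..., 0] * x[..., 1]
        g2 = x[..., 0] ** 2 + np.cos(x[..., 1])
        return np.stack([g1, g2], axis=-1)

    N, k0 = 100, 2
    X = rng.uniform(-1.0, 1.0, size=(N, k0))
    o = psi(X) + 0.1 * rng.standard_normal(N)             # scalar observations o_n
    y = grad_psi(X) + 0.1 * rng.standard_normal((N, k0))  # gradient observations y_n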
In either of these cases, approximating the unknown response functions $\psi$ and/or $\phi$ can benefit from using the a priori information that the Hessian matrix for $\psi(x)$, defined by the $k_0 \times k_0$ Jacobian $\nabla\phi(x)$, must be a $k_0 \times k_0$ symmetric matrix. In other words, the symmetry of cross-partial derivatives defines a property that a network approximation of $\phi(x)$ should also satisfy. In this paper we consider using a single hidden layer feedforward network to approximate $\phi$ when one only observes $y_n$. (We defer the more general problem in which one observes both $o_n$ and $y_n$ to future work.) Let $\Phi(x)$ represent a feedforward network with $k_0$ inputs, $k_2 = k_0$ outputs, and connection weights $W$. We seek to approximate $\phi : X \to \mathbb{R}^{k_0}$ for some set $X \subset \mathbb{R}^{k_0}$ using the network $\Phi$. The above reasoning demonstrates that we should require $\nabla\Phi(x)$ to satisfy symmetry for all $x \in X$. We define $\Phi(x) \equiv W_1 F(W_0 x)$ as a single hidden layer network with $k_1$ hidden units and connection weights $W = (W_0, W_1)$, where the $W_k$ represent $k_{k+1} \times k_k$ weight matrices, $F(W_0 x) \equiv (f(w_{0,1} x), \dots, f(w_{0,k_1} x))^\top$ a vector of activation functions, and $w_{0,h}$ represents the $h$th row of $W_0$. Let $w_{k,\ell,i}$ represent the $(\ell, i)$ element of $W_k$, $f'(z) \equiv \partial f(z)/\partial z$, and $f'_i \equiv f'(w_{0,i} x)$ for some $x \in X$. Then

\nabla\Phi(x) = F'(W_0 x)\, W_1^\top, \quad \text{where} \quad F' \equiv \frac{\partial F}{\partial x} = \bigl( f'_1 w_{0,1}^\top \;\; f'_2 w_{0,2}^\top \;\; \cdots \;\; f'_{k_1} w_{0,k_1}^\top \bigr) = W_0^\top \begin{pmatrix} f'_1 & 0 & \cdots & 0 \\ 0 & f'_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & f'_{k_1} \end{pmatrix}, \qquad (1)

so that $\nabla\Phi(x) = W_0^\top \mathrm{diag}(f'_1, \dots, f'_{k_1})\, W_1^\top$. Let $\Phi_{ij}(x)$ define the $(i, j)$ term of $\nabla\Phi(x)$. Then

\Phi_{ij}(x) = \sum_{\ell=1}^{k_1} w_{0,\ell,i}\, f'(w_{0,\ell} x)\, w_{1,j,\ell}. \qquad (2)

Therefore symmetry requires that

\Phi_{ji}(x) - \Phi_{ij}(x) = \sum_{\ell=1}^{k_1} f'(w_{0,\ell} x)\, \bigl[\, w_{0,\ell,j}\, w_{1,i,\ell} - w_{0,\ell,i}\, w_{1,j,\ell} \,\bigr] = 0 \qquad (3)

for $i = 1, \dots, k_0 - 1$ and $j = i + 1, \dots, k_0$. We can express the $k_0(k_0 - 1)/2$ constraints defined by (3) more compactly by

\omega_{i,j}^\top F'(W_0 x) \equiv 0 \quad \text{for all } x \in X, \qquad i = 1, \dots, k_0 - 1, \quad j = i + 1, \dots, k_0, \qquad (4)

where $F'(x) \equiv (f'(w_{0,1} x), \dots, f'(w_{0,k_1} x))^\top$ and $\omega_{i,j}$ represents a $k_1 \times 1$ vector of the $[\, w_{0,\ell,j}\, w_{1,i,\ell} - w_{0,\ell,i}\, w_{1,j,\ell} \,]$ terms from (3). Cardell, Joerding, and Li (1993) show that under fairly general conditions (4) can be satisfied if and only if $\omega_{i,j} \equiv 0$. From the definition of $\omega_{i,j}$ we see that this requires

w_{0,\ell,j}\, w_{1,i,\ell} = w_{0,\ell,i}\, w_{1,j,\ell} \quad \text{for} \quad \ell = 1, \dots, k_1, \quad i = 1, \dots, k_0 - 1, \quad j = i + 1, \dots, k_0. \qquad (5)

Or, alternatively,

w_{1,i,\ell} = \lambda_\ell\, w_{0,\ell,i} \quad \text{for} \quad \ell = 1, \dots, k_1, \quad i = 1, \dots, k_0, \qquad (6)

for some $k_1 \times 1$ vector of constants $\lambda = (\lambda_1, \dots, \lambda_{k_1})^\top$.

The universal approximation capabilities of feedforward networks, as described in Carroll and Dickinson (1989), Cybenko (1989), Hornik et al. (1989, 1990), and Ito (1991, 1992), explain much of their usefulness. Thus, we do not want to lose these capabilities when imposing the symmetry constraints described in (6). The symmetry constraints require satisfying equality conditions, and so pose somewhat more danger of reducing the universal approximation capability than do inequality constraints. (See Gallant (1982), p. 307, for an example using inequality constraints.) This derives from the reduced dimension of the function space that satisfies the symmetry constraint. That is, because functions satisfying equality constraints occupy a lower-dimensional subspace of the unconstrained function space, there may not exist a network that satisfies the symmetry constraint and comes arbitrarily close to any function with a symmetric Hessian matrix. Fortunately, it turns out that networks satisfying the kind of constraints defined in (6) possess the same type of approximation capability described in Hornik et al. (1989, 1990); see Cardell et al. (1993). Finally, we note that one can use results in Cardell et al. (1993) with the results in White (1990) to show that a constrained network that minimizes the sum of squared errors converges consistently to the gradient system $\nabla\psi(x)$. Thus, there exist appropriate growth rates for the number of hidden units to insure that trained networks converge almost surely to the true gradient system.
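As a quick numerical check of the algebra above (our own illustration, not part of the paper), tying the output weights to the input weights as in (6) makes the Jacobian in (1) symmetric at every input, here for an arbitrary smooth activation f = tanh and random weights:

    import numpy as np

    rng = np.random.default_rng(1)
    k0, k1 = 3, 5                       # inputs/outputs (k2 = k0) and hidden units

    W0 = rng.standard_normal((k1, k0))  # hidden-layer weight matrix
    lam = rng.standard_normal(k1)       # the k1 constants lambda_l from (6)
    W1 = (lam[:, None] * W0).T          # constraint (6): w_{1,i,l} = lambda_l * w_{0,l,i}

    fprime = lambda z: 1.0 - np.tanh(z) ** 2  # f = tanh, so f'(z) = 1 - tanh(z)^2

    def network_jacobian(x):
        # Equation (1): grad Phi(x) = W0^T diag(f'_1, ..., f'_{k1}) W1^T.
        return W0.T @ np.diag(fprime(W0 @ x)) @ W1.T

    x = rng.standard_normal(k0)
    J = network_jacobian(x)
    assert np.allclose(J, J.T)          # symmetry of cross-partials holds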
3. Training
Training seeks values for the weights that minimize the sum-of-squared errors $SS = \sum_{n=1}^{N} (y_n - \Phi(x_n))^\top (y_n - \Phi(x_n))$ subject to the constraints (6). The constraints in (6) provide a straightforward extension for many training methods, but especially so for hybrid methods such as those described in Li, Joerding, and Genz (1993) or Webb and Lowe (1988). At each iteration these hybrid methods update the $W_0$ matrix and then solve $k_2$ systems of overidentified linear equations, $\{\, y_{i,n} = w_{1,i} F(W_0 x_n), \; n = 1, \dots, N \,\}$, $i = 1, \dots, k_2$, to compute the $W_1$ weights given $W_0$. We can take the same approach to the constrained problem by altering the nature of the linear least squares sub-problem. Specifically, we solve a system of linear equations with the typical equation

y_{i,n} = (\lambda^\top, \beta^\top)\,\bigl[\, w_{0,1,i}\, f(w_{0,1} x_n + \alpha_1), \; \dots, \; w_{0,k_1,i}\, f(w_{0,k_1} x_n + \alpha_{k_1}), \; I_1(i), \; \dots, \; I_{k_2}(i) \,\bigr]^\top, \qquad (7)

where $n = 1, \dots, N$, $i = 1, \dots, k_2$, $\alpha_\ell$ represents a bias parameter for the $\ell$th hidden unit, $\beta = (\beta_1, \dots, \beta_{k_2})^\top$ represents bias parameters for each of the output units, and $I_1(i), \dots, I_{k_2}(i)$ form a vector of indicator variables such that $I_h(i) = 1$ if $i = h$ and $0$ otherwise. Thus, instead of having $k_2$ systems of equations with $N$ equations each and a total of $k_1(k_0 + 1) + k_2(k_1 + 1)$ parameters, the constrained sub-problem has a single system of $k_2 N$ equations and $k_1(k_0 + 1) + k_1 + k_2$ parameters. Since computation time in the sub-problem increases
as the square of the number of parameters, each iteration of the constrained algorithm takes more time than the unconstrained algorithm.
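The constrained sub-problem can be sketched as follows; this is a hypothetical implementation under our own naming, with the symbols alpha, lambda, and beta following equation (7) and the hidden-layer quantities held fixed within the sub-step.

    import numpy as np

    def constrained_w1_step(X, Y, W0, alpha, f=np.tanh):
        """One hybrid-method sub-step: solve for (lambda, beta) given W0 and alpha.

        X: (N, k0) inputs, Y: (N, k2) gradient observations,
        W0: (k1, k0) hidden weights, alpha: (k1,) hidden biases.
        Returns the W1 implied by constraint (6) and the output biases beta.
        """
        N, k0 = X.shape
        k1 = W0.shape[0]
        k2 = Y.shape[1]
        H = f(X @ W0.T + alpha)             # (N, k1) hidden-unit activations
        rows, targets = [], []
        for i in range(k2):                 # stack the k2 * N equations of (7)
            for n in range(N):
                # [w_{0,l,i} f(w_{0,l} x_n + alpha_l), ..., I_1(i), ..., I_{k2}(i)]
                rows.append(np.concatenate([W0[:, i] * H[n], np.eye(k2)[i]]))
                targets.append(Y[n, i])
        A = np.asarray(rows)                # (k2*N, k1 + k2) design matrix
        b = np.asarray(targets)
        coef, *_ = np.linalg.lstsq(A, b, rcond=None)
        lam, beta = coef[:k1], coef[k1:]
        W1 = (lam[:, None] * W0).T          # constraint (6) recovers W1 from lambda and W0
        return W1, beta

In a full hybrid scheme, W0 and the hidden biases would be updated by the global search (for example, the simulated-annealing step) between calls to this sub-step.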
4. Example
Presumably, the use of a priori information can improve the ability of a network to generalize out of sample. (See Joerding and Meador (1991) for more discussion.) To demonstrate this effect we take an example from our own field of economics. A well-known result in economic theory concludes that a profit-maximizing firm sets the gradient of the production function with respect to factor inputs (such as capital and labor) equal to the real input prices. Sometimes economists do not observe output levels of a firm but do observe input levels and real factor prices. From these data economists can recover some characteristics of the unobserved production process by relating input prices to factor input levels, in other words, by approximating a relationship of the form $y_n = \phi(x_n) + e_n$. To make the best use of expensive data and to improve generalization, the network approximator to $\phi$ should satisfy the symmetry constraints described above. Of course, the a priori information must be correct for it to benefit generalization. We also expect a priori information to have the most value for small sample sizes. Thus, for our demonstration we generate a modest amount of data from a known data-generating process (DGP) and then seek to approximate that process with various single hidden layer feedforward networks. Specifically, we generate 10 different samples of 50 observations each from the gradient of $\psi(x) \equiv x_1^{.2} x_2^{.8}$, where $x_1$ represents capital and $x_2$ represents labor inputs. The input data come from random selection of points in the square $[1, 20] \times [1, 20]$. We then use these data to train networks with 2, 4, 6, ..., 28 hidden units, measuring the approximation error (AE) of the resulting networks by summing the absolute deviation of the network from the true value at each point on a mesh covering the domain of the input data. Lines on the mesh have a .5 spacing. As noted above, the number of free weights grows more quickly in the unconstrained networks than in the constrained. Thus, we limit training of unconstrained networks to 26 hidden units. This results in the number of free weights varying from 12 to 132 for the unconstrained networks and from 10 to 88 for the constrained. We train the networks using a hybrid algorithm based on simulated annealing to find a global minimum of the sum-of-squares function; see Li et al. (1993). Taken together we have 130 observations on the approximation error for unconstrained networks and 140 observations on constrained networks. We then fit these AE values to quadratic and cubic equations in the number of hidden units, H, and the number of free parameters, K. Plots of these fitted polynomials are displayed in Figures 1 and 2. Note, although this is not clear in the picture, the N line lies everywhere above the N_c line in the right side panel of Figure 2. Also, the cubic lines decline for very low numbers of hidden units and free parameters, an anomaly that does not appear in a quartic polynomial. The plots show that the symmetry constraint lowers the approximation error almost uniformly and postpones the onset of degraded approximation as the number of hidden units increases. The postponement effect shows up most strongly when plotted against the number of hidden units (left side of the figures). Because the number of hidden units is the only complexity control parameter in a feedforward network, this represents an important advantage for constrained networks.
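A sketch of the data-generating process and the approximation-error measure described above follows; the fitted network phi_hat is left abstract (any trained two-input, two-output network), and we draw noiseless gradient observations since no noise specification is given here.

    import numpy as np

    rng = np.random.default_rng(2)

    def grad_psi(x):
        # Gradient of the DGP psi(x) = x1^0.2 * x2^0.8 (x1 capital, x2 labor).
        x1, x2 = x[..., 0], x[..., 1]
        return np.stack([0.2 * x1 ** -0.8 * x2 ** 0.8,
                         0.8 * x1 ** 0.2 * x2 ** -0.2], axis=-1)

    # One sample: 50 input points drawn uniformly from the square [1, 20] x [1, 20].
    X_train = rng.uniform(1.0, 20.0, size=(50, 2))
    Y_train = grad_psi(X_train)

    # Mesh with 0.5 spacing covering the input domain, used to score trained networks.
    g = np.arange(1.0, 20.0 + 0.5, 0.5)
    mesh = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)

    def approximation_error(phi_hat):
        # AE: sum of absolute deviations of the network from the true gradient on the mesh.
        return np.abs(phi_hat(mesh) - grad_psi(mesh)).sum()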
Figure 1: Quadratic approximation of the approximation error, AE, for unconstrained, N, and constrained, N_c, networks as a function of the number of hidden units, H, and the number of free parameters, K.
References

Cardell, N. S., Joerding, W. H., & Li, Y. (1993). Symmetry constraints for feedforward network models of gradient systems. Tech. rep., Washington State University, Department of Economics, Pullman, WA 99164.

Carroll, B. W., & Dickinson, B. D. (1989). Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, Vol. I, pp. 607-611.

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2, 303-314.

Gallant, A. (1982). Unbiased determination of production technologies. Journal of Econometrics, 20, 285-323.

Hornik, K., Stinchcombe, M., & White, H. (1989). Multilayer feedforward networks are universal approximators. Neural Networks, 2, 359-366.

Hornik, K., Stinchcombe, M., & White, H. (1990). Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5), 551-560.

Ito, Y. (1991). Approximation of functions on a compact set by finite sums of a sigmoid function without scaling. Neural Networks, 4(6), 817-826.

Ito, Y. (1992). Approximation of continuous functions on R^d by linear combinations of shifted rotations of a sigmoid function with and without scaling. Neural Networks, 5(1), 105-116.

Joerding, W. H., & Meador, J. L. (1991). Encoding a priori information in feedforward networks. Neural Networks, 4(6), 847-856.

Li, Y., Joerding, W. H., & Genz, A. (1993). Global estimation of feedforward neural networks with hybrid LLS/simulated annealing. In Proceedings of the World Congress on Neural Networks, pp. 443-447, New York. IEEE Press.

Webb, A., & Lowe, D. (1988). A hybrid optimization strategy for adaptive feed-forward layered networks. Tech. rep. 4193, Royal Signals and Radar Establishment Memorandum, Ministry of Defence, Malvern, UK.

White, H. (1990). Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3(5), 535-549.
Figure 2: Cubic approximation of the approximation error, AE, for unconstrained, N, and constrained, N_c, networks as a function of the number of hidden units, H, and the number of free parameters, K.