Symmetry constraints for feedforward network models of gradient systems*

N. Scott Cardell, Wayne H. Joerding, and Ying Li
Department of Economics, Washington State University

January 1994

* This research was partially funded by National Science Foundation Grant No. SES9022773. Send comments to Wayne Joerding, Department of Economics, Washington State University, Pullman, WA 99164.
Abstract

This paper concerns the use of a priori information on the symmetry of cross differentials available for problems that seek to approximate the gradient of a differentiable function. We derive the appropriate network constraints to incorporate the symmetry information, show that the constraints do not reduce the universal approximation capabilities of feedforward networks, and demonstrate how the constraints can improve generalization.
Keywords: A priori information, constrained training, feedforward networks.
1. Introduction

Across a variety of fields researchers need to model systems of nonlinear differential equations, say $\Phi(\cdot)$, derived as the gradient of some unknown nonlinear function, $\phi(\cdot)$. For example, geological detection of mass anomalies depends on the gradient of a gravitational potential function. In this case one can only observe the gradient data, i.e. the data from $\Phi(\cdot)$, from which the gravitational potential function, $\phi(\cdot)$, must be constructed. In other cases, researchers may observe data generated by both a nonlinear function and its gradient, i.e. both $\phi(\cdot)$ and $\Phi(\cdot)$. For example, in economics, profit maximization implies that the gradient of the production function equals the real input prices, where real input prices are often easier to measure than output. In yet other situations the gradient may not normally be observed, but could be measured if useful. For example, in studying a MOSFET device one normally only measures the drain-source current, but could measure the gradient of the drain-source current with respect to the drain- and gate-source voltages if desired.

The universal approximation capabilities of feedforward networks make them good candidates for semi-nonparametric models of $\phi(\cdot)$ and/or $\Phi(\cdot)$. One approach would use a feedforward network $O(\cdot)$ to approximate $\phi(\cdot)$ and then rely on the results in Hornik et al. [7] and White [14] to argue that $\nabla O(\cdot)$ converges to $\Phi(\cdot)$ as the number of observations increases to infinity. This paper concerns a second approach in which one simply approximates $\Phi(\cdot)$ with a multi-output feedforward network, say $\Omega(\cdot)$. Standard implementations of this approach ignore a priori information about the problem implied by the symmetry of cross-partial derivatives, and thus one should expect $\Omega(\cdot)$ to violate the symmetry condition. Violation of the symmetry condition does not happen in the first method because $\nabla O(\cdot)$ automatically has the symmetry characteristic. However,
in cases where only the gradient data can be observed, one cannot implement the first method, and in cases where one measures both the nonlinear function and its gradient, including the gradient data can represent an important source of information about the phenomena. Using such information can improve generalization by attenuating the trade-off between bias and variance [4]. White [14] shows that feedforward networks have the consistency property, which means that the trained network converges toward the true function being estimated as the number of samples in the training set increases toward infinity. Thus, very large samples should produce trained networks that closely satisfy the symmetry constraints present in $\Phi(\cdot)$. One interpretation of this result holds that the consistency property means that the region of the input space $X$ for which the network violates the symmetry constraint degenerates to the null set as the number of samples increases. However, only in the limit can one be certain that the network satisfies symmetry over the entire $X$. Thus, for any finite sample one can find regions of $X$ over which the network violates the symmetry constraint, resulting in reduced accuracy and reliability when using the network over the violation region. Therefore, the sample size at which one can safely ignore a priori information depends on the accuracy requirements of the problem solution and on the complexity of the unknown gradient system.

In this paper, we identify the a priori information on symmetry available for approximating any gradient vector system of equations, derive the appropriate network constraints to incorporate the information, show that the constraints do not reduce the universal approximation capabilities of feedforward networks, and demonstrate a training strategy. We specialize our work to the case in which one only observes or uses data generated by the gradient process, and leave to future research the extension of this method to the case where one observes both $\phi(\cdot)$ and $\Phi(\cdot)$.
2. Symmetry in gradient vector equations

Consider a twice differentiable function of $k_0$ inputs, $\phi: \mathbb{R}^{k_0} \to \mathbb{R}$, and its $k_0$-dimensional gradient vector $\Phi(x) \equiv \nabla\phi(x)$. If we were to observe a sample $(o_i, x_i)$, where $o_i = \phi(x_i) + e_i$, $i = 1, \ldots, N$, with $e_i$ a mean-zero noise term, then we could use the data to train a network to approximate the unknown function and its derivatives on a compact set (see Hornik et al. [6, 7]). Sometimes, however, we do not observe $\phi(x_i) + e_i$, but instead observe a vector $y_i = \Phi(x_i) + \varepsilon_i$. Thus, the data $(y_i, x_i)$ derive from a system of $k_0$ differential equations $y_i = \Phi(x_i) + \varepsilon_i$. In this case we seek to approximate $\Phi(x)$. A multi-output feedforward network would again constitute a good choice for a semi-nonparametric model of $\Phi(x)$. However, traditional training methods would ignore important a priori information about the problem that could improve generalization and accuracy for any given number of training samples. In particular, we know that for differentiable $\Phi$, the $k_0 \times k_0$ Jacobian $\nabla\Phi(x)$ defines the Hessian matrix of $\phi(x)$, and thus represents a $k_0 \times k_0$ symmetric matrix (in general, $\Phi_i = \frac{\partial}{\partial x_i}\int \Phi_j(x)\,dx_j$). This symmetry of cross-partial derivatives defines a property that a network approximation of $\Phi(x)$ should also satisfy, but almost surely will not for finite samples.

Let $\Omega(x)$ represent a feedforward network with $k_0$ inputs, $k_2 = k_0$ outputs, and connection weights $W$. Later, we generalize this situation to $k_2 < k_0$ gradient equations in Corollary 1. We seek to approximate $\Phi: X \to \mathbb{R}^{k_2}$ for some set $X \subset \mathbb{R}^{k_0}$ using the network $\Omega$. The above reasoning demonstrates that we should require $\nabla\Omega(x)$ to satisfy symmetry for all $x \in X$. While in principle this reasoning applies to multilayer networks, practical considerations of computational demands and training algorithms require us to restrict attention to the single hidden layer network. We
define $\Omega(x) \equiv W_1 F(W_0 x)$ as a single hidden layer network with $k_1$ hidden units and connection weights $W = (W_0, W_1)$, where the $W_i$'s represent $k_{i+1} \times k_i$ weight matrices, and $F(W_0 x) \equiv (f(w_{0,1} x), \ldots, f(w_{0,k_1} x))^\top$, a vector of activation functions. ($w_{0,h}$ represents the $h$th row of $W_0$.) Let $w_{k,\ell,i}$ represent the $(\ell, i)$ element of $W_k$, and $f' \equiv \partial f(z)/\partial z$. Then
\[
\nabla\Omega(x) = \frac{\partial F(W_0 x)}{\partial x}\, W_1^\top
= \big( f'(w_{0,1} x)\, w_{0,1}^\top,\; f'(w_{0,2} x)\, w_{0,2}^\top,\; \ldots,\; f'(w_{0,k_1} x)\, w_{0,k_1}^\top \big)\, W_1^\top
= W_0^\top\, \mathrm{Diag}[F'(x)]\, W_1^\top, \tag{1}
\]
where $F'(x) \equiv (f'(w_{0,1} x), \ldots, f'(w_{0,k_1} x))^\top$. Let $\Omega_{ij}(x)$ denote the $(i, j)$ element of $\nabla\Omega(x)$. Then
\[
\Omega_{ij}(x) = \sum_{\ell=1}^{k_1} w_{0,\ell,i}\, f'(w_{0,\ell}\, x)\, w_{1,j,\ell}. \tag{2}
\]
Therefore symmetry requires that
\[
\Omega_{ji}(x) - \Omega_{ij}(x) = \sum_{\ell=1}^{k_1} f'(w_{0,\ell}\, x)\,\big[ w_{0,\ell,j}\, w_{1,i,\ell} - w_{0,\ell,i}\, w_{1,j,\ell} \big] = 0 \tag{3}
\]
for $i = 1, \ldots, k_2 - 1$ and $j = i + 1, \ldots, k_2$. We can express the $k_2(k_2 - 1)/2$ constraints defined by (3) more compactly by
\[
\omega_{i,j}^\top F'(W_0 x) \equiv 0 \quad \text{for all } x \in X, \qquad i = 1, \ldots, k_2 - 1, \quad j = i + 1, \ldots, k_2, \tag{4}
\]
where $\omega_{i,j}$ represents a $k_1 \times 1$ vector of the $\big[ w_{0,\ell,j}\, w_{1,i,\ell} - w_{0,\ell,i}\, w_{1,j,\ell} \big]$ terms from (3).
Proposition 1. Assume $F(W_0 x)$ is a vector of activation functions with nonlinear first derivatives and the range space of $F'(W_0 x)$, say $R$, has dimension $k_1$. Then $\omega_{i,j}^\top F'(W_0 x) \equiv 0$ for all $x \in X$ if and only if $\omega_{i,j} = 0$.

Proof: Sufficiency is obvious. To prove necessity we show that for any $x_0$ and $\omega_{i,j} \neq 0$ such that $\omega_{i,j}^\top F'(W_0 x_0) = 0$, there exists another $x_1$ such that $\omega_{i,j}^\top F'(W_0 x_1) \neq 0$. Suppose $\omega_{i,j}^\top F'(W_0 x_0) = 0$ and $\omega_{i,j} \neq 0$. The orthogonal space of $\omega_{i,j}$ has dimension $k_1 - 1$, so non-trivial portions of $R$ lie outside the orthogonal space of $\omega_{i,j}$. Consequently, there exists $x_1 \in X$ such that $F'(W_0 x_1)$ lies outside that orthogonal space (and, in particular, is not collinear with $F'(W_0 x_0)$), for which $\omega_{i,j}^\top F'(W_0 x_1) \neq 0 \equiv \omega_{i,j}^\top F'(W_0 x_0)$.
The assumption that the range space of $F'(W_0 x)$ has dimension $k_1$ rules out certain perverse cases. For example, suppose $f$ represents the logistic activation function, $k_0 = k_2 = 1$, and $k_1 = 3$. Then $W_0 = (1, 0, -1)^\top$ generates a range space with dimension $k_1 - 1$. However, such examples seem rare and unlikely to affect networks estimated with noisy data. Furthermore, it can be shown that such cases cannot be present in a $W$ that corresponds to a locally unique minimum of $\|y - \Omega\|$ for any norm $\|\cdot\|$. From the definition of $\omega_{i,j}$ we see that we must impose
\[
w_{0,\ell,j}\, w_{1,i,\ell} = w_{0,\ell,i}\, w_{1,j,\ell}
\quad \text{for} \quad
\ell = 1, \ldots, k_1, \quad i = 1, \ldots, k_2 - 1, \quad j = i + 1, \ldots, k_2. \tag{5}
\]
Finally, note that one can impose the constraints in (5) by setting
\[
w_{1,i\ell} = \lambda_\ell\, w_{0,\ell i}
\quad \text{for} \quad
\ell = 1, \ldots, k_1, \quad i = 1, \ldots, k_2, \tag{6}
\]
for some $k_1 \times 1$ vector of constants $\lambda = (\lambda_1, \ldots, \lambda_{k_1})$. To see this, simply use (6) to substitute for $w_{1,i,\ell}$ and $w_{1,j,\ell}$ in (5) to produce $w_{0,\ell,j}\, \lambda_\ell\, w_{0,\ell,i} = w_{0,\ell,i}\, \lambda_\ell\, w_{0,\ell,j}$, an identity. This means that the free parameters of the network are the $k_1 k_0$ weights of the $W_0$ matrix and the $k_1$ elements of the vector $\lambda$. The number of constraints implied by (6), $k_1(k_2 - 1)$, may seem smaller than the number implied by (5), $k_1 k_2 (k_2 - 1)/2$, but this ignores the redundancy contained in (5), since some constraints are implied by others.
For example, given arbitrary $\ell$, multiply corresponding sides of the constraint in (5) for $(i = 1, j = 2)$, $w_{0,\ell,2}\, w_{1,1,\ell} = w_{0,\ell,1}\, w_{1,2,\ell}$, and the constraint for $(i = 1, j = 3)$ written as $w_{0,\ell,1}\, w_{1,3,\ell} = w_{0,\ell,3}\, w_{1,1,\ell}$, to obtain $w_{0,\ell,2}\, w_{1,1,\ell}\, w_{0,\ell,1}\, w_{1,3,\ell} = w_{0,\ell,1}\, w_{1,2,\ell}\, w_{0,\ell,3}\, w_{1,1,\ell}$; cancelling the common factor $w_{0,\ell,1}\, w_{1,1,\ell}$ (when it is nonzero) gives $w_{0,\ell,3}\, w_{1,2,\ell} = w_{0,\ell,2}\, w_{1,3,\ell}$, the constraint for $(i = 2, j = 3)$. We use this result in the next two sections.
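For concreteness, the following short sketch (ours, not part of the original implementation; the tanh activation and names such as `W0` and `lam` are chosen only for illustration) builds a constrained network with $W_1$ generated from $W_0$ and $\lambda$ via (6) and checks numerically that the Jacobian in (1) is symmetric:

    import numpy as np

    def tanh_prime(z):
        return 1.0 - np.tanh(z) ** 2

    def constrained_W1(W0, lam):
        # equation (6): w_{1,i,l} = lambda_l * w_{0,l,i}, i.e. W1^T = Diag[lam] W0
        return (lam[:, None] * W0).T

    def network(x, W0, W1):
        # Omega(x) = W1 F(W0 x), with f = tanh
        return W1 @ np.tanh(W0 @ x)

    def jacobian(x, W0, W1):
        # equation (1): grad Omega(x) = W0^T Diag[F'(x)] W1^T
        return W0.T @ np.diag(tanh_prime(W0 @ x)) @ W1.T

    rng = np.random.default_rng(0)
    k0, k1 = 3, 5                      # k0 inputs (= k2 outputs) and k1 hidden units
    W0 = rng.normal(size=(k1, k0))     # rows of W0 are the w_{0,l}
    lam = rng.normal(size=k1)
    W1 = constrained_W1(W0, lam)       # W1 is k2 x k1

    x = rng.normal(size=k0)
    J = jacobian(x, W0, W1)
    print(np.max(np.abs(J - J.T)))     # ~1e-16: the Jacobian is symmetric

The point of the reparameterization is visible in the last line: with (6), $W_1^\top = \mathrm{Diag}[\lambda]\, W_0$, so $\nabla\Omega(x) = W_0^\top\, \mathrm{Diag}[\lambda_\ell f'(w_{0,\ell} x)]\, W_0$, a symmetric matrix for every $x$.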
3. Universal Approximation

The universal approximation capabilities of feedforward networks, as described in [1, 2, 6, 7, 8, 9], explain much of their usefulness. Thus, we do not want to lose this capability when imposing the symmetry constraints described in (6). The symmetry constraint requires satisfying equality conditions, and so poses somewhat more danger of reducing the universal approximation capability than do inequality constraints. (See Gallant [3], p. 307, for an example using inequality constraints.) The danger derives from the reduced dimension of the function space satisfying the symmetry constraint. That is, because functions satisfying equality constraints occupy a lower-dimensional subspace of the unconstrained function space, there may not exist a network that satisfies the symmetry constraint and comes arbitrarily close to any function with a symmetric Hessian matrix. On the other hand, around any function that strictly satisfies an inequality constraint, one can find an open ball containing only functions that also satisfy the constraint. Therefore, inequality constraints tend to generate subsets of functions that have the same dimension as the unconstrained function space.

Consider an example from the Euclidean vector space $\mathbb{R}^2$. We know the set $Q^2 = \{(x, y) \mid x \text{ rational},\ y \text{ rational}\}$ is dense in $\mathbb{R}^2$. Now suppose we seek to approximate the subspace $\mathbb{R}^2_c = \{(x, y) \in \mathbb{R}^2 \mid y = \sqrt{3}\, x\}$ with elements of $Q^2_c = \{(x, y) \in Q^2 \mid y = \sqrt{3}\, x\}$. Since only $(0, 0)$ satisfies the constraint, we can find points in $\mathbb{R}^2_c$ arbitrarily far from $Q^2_c$. Clearly, $Q^2_c$ is not dense in $\mathbb{R}^2_c$.
To show that the constrained networks retain the universal approximation capability, we use the results in [7], and so follow some of their notation. Let $\alpha$ denote a multi-index, $\partial^\alpha$ a distributional derivative (see [7]), $U$ an open subset of $\mathbb{R}^r$, $r \in \mathbb{N}$, $\int$ the Lebesgue integral, and $\mu$ a Lebesgue measure. Define $\|g\|_{p,U,\mu}$ as the usual $L^p$ norm, generalized to work for function vectors $G(x) = (g_1(x), \ldots, g_h(x))$ by setting
\[
\|G\|_{p,U,\mu} =
\begin{cases}
\Big( \sum_{i=1}^{h} \|g_i\|_{p,U,\mu}^{p} \Big)^{1/p}, & \text{if } G \text{ is a vector-valued function}, \\[4pt]
\Big( \int_U |G|^p \, d\mu \Big)^{1/p}, & \text{if } G \text{ is a scalar-valued function},
\end{cases} \tag{7}
\]
for $1 \leq p < \infty$. Let $S$ represent a Sobolev space with norm
\[
\|g\|_{m,p,U,\mu} \equiv \Big( \sum_{|\alpha| \leq m} \|\partial^\alpha g\|_{p,U,\mu}^{p} \Big)^{1/p} < \infty \quad \text{for all } g \in S. \tag{8}
\]
In the case of continuous functions, the distributional derivative yields the classical derivative. In other words, $S$ represents a space of piecewise differentiable functions with a norm (metric) that depends not only on the function but also on its derivatives. Let $S_H \subset S$ represent the space of functions with a symmetric Hessian matrix. Notice that, for $g: \mathbb{R}^r \to \mathbb{R}$,
\[
\|\nabla g\|_{p,U,\mu} \leq \|g\|_{m,p,U,\mu} = \Big( \|g\|_{p,U,\mu}^p + \|\nabla g\|_{p,U,\mu}^p + \sum_{2 \leq |\alpha| \leq m} \|\partial^\alpha g\|_{p,U,\mu}^p \Big)^{1/p}.
\]
Expanding our previous notation for $\Omega$ to specify the number of hidden units in the network, we define a network with $k_1 = h$ hidden units as $\Omega_h(x) \equiv \sum_{\ell=1}^{h} w_{1,\cdot,\ell}\, f(w_{0,\ell}\, x)$, where $w_{1,\cdot,\ell}$ represents the $\ell$th column of $W_1$. Also define the sets of functions
\[
\Sigma^r(f) \equiv \Big\{ \Omega_h(x) = \sum_{\ell=1}^{h} w_{1,\cdot,\ell}\, f(w_{0,\ell}\, x) \;\Big|\; k_0 = r,\ k_1 = h \in \mathbb{N},\ k_2 = r,\ f\ \ell\text{-finite} \Big\} \tag{9}
\]
and
\[
\Sigma_P^r(f) \equiv \big\{ \Omega \in \Sigma^r(f) \;\big|\; w_{1,i\ell} = \lambda_\ell\, w_{0,\ell i},\ \ell = 1, \ldots, h,\ i = 1, \ldots, k_2 \big\}. \tag{10}
\]
For $\ell \in \{0\} \cup \mathbb{N}$, a function $G(x) \in C^\ell(\mathbb{R})$ is $\ell$-finite if and only if $0 < \int \big| \frac{\partial^\ell G(x)}{\partial x^\ell} \big| \, dx < \infty$.
Proposition 2. If $S$ is such that $f$ being $\ell$-finite implies $\Sigma^r(f)$ is dense in $S$ using the metric induced by the norm $\|\cdot\|_{m,p,U,\mu}$, then $\Sigma_P^r(f)$ is dense in $\nabla S_H \equiv \{\nabla\phi : \phi \in S_H\}$ using the metric induced by $\|\cdot\|_{p,U,\mu}$.
Proof: For activation function $f$, let $g = \int f$ and $\phi_h(x) = \sum_{i=1}^{h} \lambda_i\, g(w_{0,i}\, x)$. Then
\[
\nabla\phi_h(x) = \sum_{\ell=1}^{h} \lambda_\ell\, w_{0,\ell}^\top\, f(w_{0,\ell}\, x). \tag{11}
\]
Since $f$ $\ell$-finite implies $g$ is $(\ell + 1)$-finite, $\Sigma^r(g)$ is dense in $S \supset S_H$. Therefore, for every $\phi \in S_H$ and any $\epsilon > 0$, there exist an $h$ and a $\phi_h \in \Sigma^r(g)$ such that $\|\phi - \phi_h\|_{m,p,U,\mu} < \epsilon$. From the definition of $\|\cdot\|_{m,p,U,\mu}$, we see that $\|\nabla\phi - \nabla\phi_h\|_{p,U,\mu} \leq \|\phi - \phi_h\|_{m,p,U,\mu}$. By definition $\Phi = \nabla\phi$, and for every $\nabla\phi_h$, setting $w_{1,i\ell} = \lambda_\ell\, w_{0,\ell i}$, $\ell = 1, \ldots, h$, $i = 1, \ldots, k_2$, makes $\Omega_h = \nabla\phi_h$. Thus, there exists $\Omega_h \in \Sigma_P^r(f)$ such that $\|\Phi - \Omega_h\|_{p,U,\mu} < \epsilon$.
Corollary 1. Let $x$ contain $k_0 > k_2$ elements, the first $k_2$ of which form a $k_2$-element gradient system $\nabla\phi(x)$, where the gradient is with respect to the first $k_2$ elements of $x$ only. Then there exists a single hidden layer feedforward network arbitrarily close to $\nabla\phi(x)$.

Proof: Clearly, the argument of Proposition 2 still applies when one ignores the presence of the $k_0 - k_2$ elements of $x$ not relevant to the gradient system $\nabla\phi(x)$.
Corollary 2. $\Phi$ may include the gradient of an arbitrary $\ell$-finite function $Q(x)$.

Proof: Note that $\|(\nabla\phi + \nabla Q) - \nabla\phi_h\| = \|\nabla\phi - (\nabla\phi_h - \nabla Q)\|$.

For an example using Corollary 2, consider letting $Q(x) = \theta^\top x$ with $\theta$ a $k_2 \times 1$ vector. Then $\theta$ represents a bias vector of the feedforward network, as in
\[
\nabla\phi_h(x) = \theta + \sum_{\ell=1}^{h} \lambda_\ell\, w_{0,\ell}^\top\, f(w_{0,\ell}\, x). \tag{12}
\]
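To illustrate the construction in the proof of Proposition 2 with a familiar activation (our example, not drawn from the paper): if $f$ is the logistic function, then $g = \int f$ is the softplus function $\log(1 + e^z)$, so a constrained network with logistic hidden units and bias $\theta$, as in (12), is exactly the gradient of the scalar potential network $\phi_h(x) = \theta^\top x + \sum_\ell \lambda_\ell\, g(w_{0,\ell}\, x)$. The sketch below (variable names ours) checks this against a finite-difference gradient:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def softplus(z):
        return np.log1p(np.exp(z))               # g = antiderivative of the logistic f

    def potential(x, W0, lam, theta):
        # phi_h(x) = theta^T x + sum_l lam_l * g(w_{0,l} x)
        return theta @ x + lam @ softplus(W0 @ x)

    def constrained_net(x, W0, lam, theta):
        # equation (12): grad phi_h(x) = theta + sum_l lam_l f(w_{0,l} x) w_{0,l}^T
        return theta + W0.T @ (lam * logistic(W0 @ x))

    rng = np.random.default_rng(1)
    k0, h = 4, 6
    W0 = rng.normal(size=(h, k0))
    lam = rng.normal(size=h)
    theta = rng.normal(size=k0)
    x = rng.normal(size=k0)

    eps = 1e-6
    fd = np.array([(potential(x + eps * np.eye(k0)[i], W0, lam, theta)
                    - potential(x - eps * np.eye(k0)[i], W0, lam, theta)) / (2 * eps)
                   for i in range(k0)])
    print(np.max(np.abs(fd - constrained_net(x, W0, lam, theta))))   # ~1e-9

A side benefit of the construction: once a constrained network has been trained, the underlying potential $\phi_h$ is available in closed form whenever $\int f$ is, which is exactly what the gravitational example of Section 5.2 requires.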
4. Training

Training seeks values for the weights that minimize the sum-of-squared errors $SS = \sum_{n=1}^{N} (y_n - \Omega(x_n))^\top (y_n - \Omega(x_n))$ subject to the constraints in (6). The constraints in (6) provide a straightforward extension for many training methods, but especially so for hybrid methods such as those described in [12] or [13]. At each iteration these hybrid methods update the $W_0$ matrix, to $\widehat{W}_0$ for example, and then solve $k_2$ systems of overidentified linear equations $\{y_{i,n} = w_{1,i} Z_n,\ n = 1, \ldots, N,\ i = 1, \ldots, k_2\}$, where $Z_n = F(\widehat{W}_0 x_n)$, for the $\widehat{W}_1 = [\widehat{w}_{1,i}]$ weights given $\widehat{W}_0$. We use the same approach for the constrained problem by altering the nature of the linear least squares sub-problem. Specifically, we solve a single system of linear equations with typical equation
\[
y_{i,n} = (\lambda^\top, \theta^\top)\,\big[ w_{0,1,i}\, f(w_{0,1}\, x_n + \gamma_1),\ \ldots,\ w_{0,k_1,i}\, f(w_{0,k_1}\, x_n + \gamma_{k_1}),\ I_1(i),\ \ldots,\ I_{k_2}(i) \big]^\top, \tag{13}
\]
where $n = 1, \ldots, N$, $i = 1, \ldots, k_2$, $\gamma_\ell$ represents a bias parameter for the $\ell$th hidden unit, $\theta = (\theta_1, \ldots, \theta_{k_2})^\top$ represents bias parameters for each of the output units, and $I_1(i), \ldots, I_{k_2}(i)$ form a vector of indicator variables such that $I_h(i) = 1$ if $i = h$ and $0$ otherwise. This specification allows the bias terms for the $k_2$ output units to differ from each other while imposing the constraints represented by (6). Thus, instead of having $k_2$ systems of equations with $N$ equations each, for a total of $k_1(k_0 + 1) + k_2(k_1 + 1)$ parameters, the constrained sub-problem has a single system of $k_2 N$ equations for $k_1(k_0 + 1) + k_1 + k_2$ parameters. Since computation time in the sub-problem increases as the square of the number of parameters, each iteration of the constrained algorithm takes more time than the unconstrained algorithm.

Finally, we note that one can use Proposition 2 with the results in White [14] to show that a constrained network that minimizes the sum of squared errors converges
consistently to the gradient system $\nabla\phi(x)$. Thus, there exist appropriate growth rates for the number of hidden units to ensure that trained networks converge almost surely to the true gradient system.
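A minimal sketch of the constrained least-squares sub-problem built from (13) (our illustration; the simulated-annealing update of $W_0$ from [12] is omitted, and the tanh activation and names such as `gamma` are assumptions made for the example): given $W_0$ and the hidden biases, stack the $k_2 N$ equations and solve for $(\lambda, \theta)$ by ordinary least squares.

    import numpy as np

    def solve_lambda_theta(X, Y, W0, gamma):
        """Constrained sub-problem (13): given W0 (k1 x k0) and hidden biases gamma (k1,),
        regress the stacked outputs on [w_{0,l,i} f(w_{0,l} x_n + gamma_l)] and on output
        indicators, returning lambda (k1,) and theta (k2,)."""
        N, k0 = X.shape
        k2 = Y.shape[1]
        k1 = W0.shape[0]
        F = np.tanh(X @ W0.T + gamma)            # (N, k1); f = tanh for concreteness
        rows, targets = [], []
        for n in range(N):
            for i in range(k2):
                reg = np.concatenate([W0[:, i] * F[n], np.eye(k2)[i]])   # (k1 + k2,)
                rows.append(reg)
                targets.append(Y[n, i])
        coef, *_ = np.linalg.lstsq(np.array(rows), np.array(targets), rcond=None)
        return coef[:k1], coef[k1:]              # lambda, theta

    # toy usage: X, Y would hold the observed inputs and gradient data (y_n, x_n)
    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 2))
    Y = rng.normal(size=(50, 2))
    W0 = rng.normal(size=(6, 2))
    gamma = np.zeros(6)
    lam, theta = solve_lambda_theta(X, Y, W0, gamma)

Each iteration of the hybrid algorithm would alternate this linear step with an update of $W_0$ (and the hidden biases), accepted or rejected by the simulated-annealing rule.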
5. Examples
A priori information should improve the ability of a network to generalize out of sample by attenuating the trade-off between bias and variance [4, 10]. To demonstrate this effect we present two examples, one from economics and the other from geophysics. We have chosen Monte Carlo examples for two reasons. First, fitting models to actual data frequently suffers from gross misspecification, due, for example, to not knowing all the relevant input variables. Such gross misspecification errors can easily dominate possible advantages from imposing the symmetry constraint, making any such advantages difficult to detect. Because this is especially a problem in economics, economists typically have more interest in using constrained estimation to test hypotheses than to improve generalization. Second, Monte Carlo examples fit with our future research on how imposing a priori information affects the choice of the optimal number of hidden units. Intuitively, it seems that imposing a priori information should increase the optimal number of hidden units for a particular data set, or, alternatively, reduce the degradation in generalization that results from having too many hidden units. We can only examine these possibilities carefully with data generated in a way known to satisfy the a priori information.

In each of the following examples we generate 10 different samples of 50 observations each from the gradient of a known data-generating process (DGP). The 10 samples differ by adding different noise to the dependent variable. A priori constraints should have the most value when applied to small or moderate-sized samples. We then approximate the DGP with feedforward networks having a single hidden layer and various numbers of hidden units. We measure the approximation error ($AE$) of the resulting networks by summing the absolute deviation of the network from the true value of the DGP at each point on a mesh covering the domain of the input data. Lines on the mesh have a 0.5 spacing. The resulting $AE$'s are regressed on a quadratic function of the number of hidden units to summarize the results. (See [5] for methods of analyzing Monte Carlo results.) We train the networks using a hybrid algorithm based on simulated annealing to find a global minimum of the sum-of-squares function (see [12] and [11]).
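A sketch of the $AE$ measure just described (ours; the mesh bounds are placeholders, since the relevant domain is whatever region the input sample covers):

    import numpy as np

    def approximation_error(net, dgp_gradient, lo, hi, spacing=0.5):
        """Sum of absolute deviations between the fitted network and the true DGP
        gradient over a mesh with the given spacing (the AE measure of Section 5)."""
        grid = [np.arange(l, h + spacing, spacing) for l, h in zip(lo, hi)]
        mesh = np.stack(np.meshgrid(*grid, indexing="ij"), axis=-1).reshape(-1, len(lo))
        return sum(np.abs(net(x) - dgp_gradient(x)).sum() for x in mesh)

Here `net` is the trained (constrained or unconstrained) network evaluated at a point, `dgp_gradient` the analytic gradient of the DGP (e.g. the CES gradient of Section 5.1), and `lo`, `hi` the bounds of the input data.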
5.1. Estimating a Production Relation

Economists are often interested in modeling the relationships between prices and inputs. Economic theory assumes that firms maximize profits subject to some production relation, $q = \phi(x)$. Here, $q$ represents output and $x = (x_1, x_2)$ represents a vector of inputs, say capital and labor. Profit maximization requires that the vector of real prices for the inputs, $w = (w_1, w_2)$, equal the gradient of the production function, i.e.
\[
w = \Phi(x) \equiv \nabla\phi(x). \tag{14}
\]
For our simulated data we set $\phi(\cdot)$ equal to a CES production function,
\[
\phi(x) = \big[\, \beta_1 x_1^{\gamma} + \beta_2 x_2^{\gamma} \,\big]^{1/\gamma}, \tag{15}
\]
with parameter values $\beta_1 = .2$, $\beta_2 = .8$, and $\gamma = 2$. We approximated the functional form using each of 10 data sets with networks having $H = 2, 4, 6, \ldots, H^{\max}$ hidden units, where $H^{\max} = 26$ for constrained networks and $H^{\max} = 28$ for unconstrained networks. Noise added to the dependent variable had a signal-to-noise ratio of 4. We computed an $AE$ for each network and data set combination to obtain 130 $AE$'s for the constrained networks and 140 for the unconstrained networks.

Figure 1 shows a plot of the quadratic function obtained from regressing $AE$ on the number of hidden units, $H$. Two observations occur: (1) the unconstrained networks appear to attain a smaller minimum $AE$, and (2) the $AE$'s for constrained networks show much less sensitivity to the number of hidden units, especially for over-parameterization. The actual $AE$ data substantiate the second observation. However, the first observation turns out to be an artifact of quadratic functions, since at $H = 10$ the average $AE$ for unconstrained networks, 1393.47, exceeds the average $AE$ for constrained networks, 1364.79. Additionally, the mean squared residual in the estimated quadratic for the unconstrained networks, 9099.7, exceeds that for the constrained networks, 283.3. Thus, closer analysis reveals that constrained networks perform better and more reliably, on average, than unconstrained networks, and degrade less when over-parameterized.
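For reference, a sketch of the data-generating process for this example (ours; the symbols $\beta_1$, $\beta_2$, $\gamma$ are our labels for the three reported parameter values, and the input range and the standard-deviation convention for the signal-to-noise ratio are assumptions):

    import numpy as np

    def ces_gradient(x, b1=0.2, b2=0.8, g=2.0):
        """Gradient of phi(x) = (b1*x1^g + b2*x2^g)^(1/g) from (15):
        d(phi)/dx_i = b_i * x_i^(g-1) * (b1*x1^g + b2*x2^g)^(1/g - 1)."""
        x1, x2 = x[..., 0], x[..., 1]
        core = (b1 * x1 ** g + b2 * x2 ** g) ** (1.0 / g - 1.0)
        return np.stack([b1 * x1 ** (g - 1) * core,
                         b2 * x2 ** (g - 1) * core], axis=-1)

    rng = np.random.default_rng(3)
    X = rng.uniform(0.5, 5.0, size=(50, 2))      # input range is a placeholder
    W_true = ces_gradient(X)                     # real input prices implied by (14)
    noise = rng.normal(size=W_true.shape)
    noise *= W_true.std() / (4 * noise.std())    # signal-to-noise ratio of 4 (sd convention)
    Y = W_true + noise                           # training targets for the network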
5.2. Detection of Mass Anomalies

Analysis for the geophysical example closely parallels the economic example. Geophysicists often search for anomalous concentrations of high (or low) density matter beneath the earth's surface. The effect of a high-mass anomaly is to change the gravitational potential function by "moving" a very small fraction of the planetary mass from the center of the planet to a point near the surface. Geophysicists can locate the anomaly, and possibly construct its shape, by using aircraft or satellites to very accurately measure deviations from the expected force of gravity, which equals the gradient of the change in the gravitational potential function, over the possible location of the anomaly. If the noise in the data can be limited enough, the potential difference function can be constructed from the data and the equipotential contours drawn to show the approximate shape of the anomaly. Since a constrained neural network produces an estimate of the potential difference function as well as of the gravitational forces, it is a natural way to approach this problem.

The geophysicist must subtract the standard gravitational force from the measurements. For our purposes we need not worry about the complexities of that calculation (such as the expansion in spherical harmonics and the elimination of the centrifugal effect) except to note that the procedure is well known and precise. In general, the difference in gravitational potential with and without the anomaly equals
\[
\phi(x, y, z) \equiv V(m; x, y, z) - m \big/ \sqrt{x^2 + y^2 + z^2}, \tag{16}
\]
where $m$ equals the mass of the anomaly, and the planet is centered at $(0, 0, 0)$. The form of $V(\cdot)$ depends on the unknown shape and location of the anomaly. The effect of the anomaly on the gravitational force vector, $\Phi(\cdot)$, equals the gradient of $\phi(\cdot)$. For our simulation we chose a spherical anomaly located at $(x_0, y_0, z_0)$, in which case
\[
V(m; x, y, z) = \frac{m}{\sqrt{(x - x_0)^2 + (y - y_0)^2 + (z - z_0)^2}}. \tag{17}
\]
We set $m = 10^{-8}$ (in units of the mass of the planet), $x_0 = y_0 = 0$, $z_0 = 3980$, and assumed the measurements took place at a distance from the center of the planet of 4000 ($x^2 + y^2 + z^2 = 4000^2$). The region scanned was 100 by 100. The noise added to the measurement had a signal-to-noise ratio of 4. For each of the ten data sets we approximated the functional form using networks with $H = 2, 4, 6, \ldots, 30$ hidden units in a single hidden layer. We computed the $AE$ for each network/data set combination to obtain 150 $AE$'s for each of the constrained and unconstrained networks. Here, as in the previous example, the unconstrained networks appear to attain the smallest $AE$, and the constrained networks show much less sensitivity to the number of hidden units, especially for over-parameterization. Again, the first observation represents an artifact of the quadratic function, since the average $AE$ for unconstrained networks with 8 and 10 hidden units exceeds the average $AE$ for the respective constrained networks. As before, the mean squared residual in the estimated quadratic for the unconstrained networks, 4.0645368E-8, exceeds that for the constrained networks, 2.092607E-10. Thus, the results for this example confirm the results from the previous example.
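A sketch of this simulation's data-generating process (ours; the placement of the measurement grid on the sphere and the sign conventions follow our reading of (16)-(17)):

    import numpy as np

    M, X0 = 1e-8, np.array([0.0, 0.0, 3980.0])   # anomaly mass and location, as in (17)

    def potential_difference(p):
        """phi in (16): anomaly potential minus the same mass placed at the center."""
        return M / np.linalg.norm(p - X0, axis=-1) - M / np.linalg.norm(p, axis=-1)

    def force_difference(p):
        """Gradient of (16): the anomaly's perturbation of the gravitational force."""
        d, r = p - X0, p
        return (-M * d / np.linalg.norm(d, axis=-1, keepdims=True) ** 3
                + M * r / np.linalg.norm(r, axis=-1, keepdims=True) ** 3)

    # measurement points on a 100-by-100 patch of the sphere of radius 4000,
    # directly above the anomaly (our reading of the scanned region)
    u = np.linspace(-50.0, 50.0, 21)
    xx, yy = np.meshgrid(u, u)
    zz = np.sqrt(4000.0 ** 2 - xx ** 2 - yy ** 2)
    P = np.stack([xx, yy, zz], axis=-1).reshape(-1, 3)
    Y = force_difference(P)                      # targets for the gradient network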
6. Conclusion

We have examined the possibility of incorporating a priori information about the symmetry of gradient equation systems in feedforward networks. We derived the weight restrictions required of such networks and showed that networks constrained by the required conditions do not lose their universal approximation capability. We examined the usefulness of this technique with two Monte Carlo examples. Both examples confirmed several intuitively appealing hypotheses about the efficacy of imposing a priori information. First, on average, constrained networks perform equal to or better than unconstrained networks with the same number of hidden units. Second, constraints dramatically reduce the degradation in approximation that results from choosing too many hidden units. Third, constraints reduce the variation in approximation error resulting from sampling noise in the data.

The ability to approximate arbitrary functions within the range of the observed data is critical to the value of neural network approaches. However, extrapolation outside the range of the data is also often attempted. We leave the effect of imposing constraints on the accuracy of out-of-sample extrapolations as a topic for future research. Our Monte Carlo research tested the approach in small samples. We also leave as topics for future research the question of how constraints affect the rate of convergence and other asymptotic properties, and the question of whether the trade-off between the number of independent observations and the signal-to-noise ratio differs for constrained and unconstrained networks.
References

[1] B. W. Carroll and B. D. Dickinson. Construction of neural nets using the Radon transform. In Proceedings of the International Joint Conference on Neural Networks, volume I, pages 607-611, 1989.
[2] G. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2:303-314, 1989.
[3] A. R. Gallant. Unbiased determination of production technologies. Journal of Econometrics, 20:285-323, 1982.
[4] S. Geman, E. Bienenstock, and R. Doursat. Neural networks and the bias/variance dilemma. Neural Computation, 4:1-58, 1992.
[5] David F. Hendry. Monte Carlo experimentation in econometrics. In Z. Griliches and M. D. Intriligator, editors, Handbook of Econometrics, volume 2, pages 939-976. Elsevier Science Publishers, 1984.
[6] K. Hornik, M. Stinchcombe, and H. White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359-366, 1989.
[7] K. Hornik, M. Stinchcombe, and H. White. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Networks, 3(5):551-560, 1990.
[8] Y. Ito. Approximation of functions on a compact set by finite sums of a sigmoid function without scaling. Neural Networks, 4(6):817-826, 1991.
[9] Y. Ito. Approximation of continuous functions on R^d by linear combinations of shifted rotations of a sigmoid function with and without scaling. Neural Networks, 5(1):105-116, 1992.
[10] W. H. Joerding and J. L. Meador. Encoding a priori information in feedforward networks. Neural Networks, 4(6):847-856, 1991.
[11] W. H. Joerding and Y. Li. Global training with a priori constraints. Technical report, Washington State University, Department of Economics, Pullman, WA 99164, 1992.
[12] Y. Li, W. H. Joerding, and A. Genz. Global estimation of feedforward neural networks with hybrid LLS/simulated annealing. In Proceedings of the World Congress on Neural Networks, pages 443-447, New York, 1993. IEEE Press.
[13] A. Webb and D. Lowe. A hybrid optimization strategy for adaptive feed-forward layered networks. Technical Report 4193, Royal Signals and Radar Establishment Memorandum, Ministry of Defence, Malvern, UK, 1988.
[14] H. White. Connectionist nonparametric regression: Multilayer feedforward networks can learn arbitrary mappings. Neural Networks, 3(5):535-549, 1990.
[Figure 1 here: fitted quadratic curves of AE (vertical axis, approximately 1350 to 1500) against H (horizontal axis, 5 to 25) for the unconstrained and constrained networks.]

Figure 1: Approximation error, AE, for CES as a function of the number of hidden units, H.
[Figure 2 here: fitted quadratic curves of AE (vertical axis, approximately 0.0002 to 0.001) against H (horizontal axis, 5 to 25) for the unconstrained and constrained networks.]

Figure 2: Approximation error, AE, for gravity potential as a function of the number of hidden units, H.