Approximating Networks and Extended Ritz Method for the Solution of ...

4 downloads 0 Views 291KB Size Report
functional with respect to functions belonging to a set of feasible solutions. .... the approximate solution of functional optimization problems. Somehow, this ...
JOURNAL OF OPTIMIZATION THEORY AND APPLICATIONS: Vol. 112, No. 2, pp. 403–439, February 2002 (2002)

Approximating Networks and Extended Ritz Method for the Solution of Functional Optimization Problems1 R. ZOPPOLI,2 M. SANGUINETI,3

AND

T. PARISINI4

Communicated by Y. C. Ho

Abstract. Functional optimization problems can be solved analytically only if special assumptions are verified; otherwise, approximations are needed. The approximate method that we propose is based on two steps. First, the decision functions are constrained to take on the structure of linear combinations of basis functions containing free parameters to be optimized (hence, this step can be considered as an extension to the Ritz method, for which fixed basis functions are used). Then, the functional optimization problem can be approximated by nonlinear programming problems. Linear combinations of basis functions are called approximating networks when they benefit from suitable density properties. We term such networks nonlinear (linear) approximating networks if their basis functions contain (do not contain) free parameters. For certain classes of d-variable functions to be approximated, nonlinear approximating networks may require a number of parameters increasing moderately with d, whereas linear approximating networks may be ruled out by the curse of dimensionality. Since the cost functions of the resulting nonlinear programming problems include complex averaging operations, we minimize such functions by stochastic approximation algorithms. As important special cases, we consider stochastic optimal control and estimation problems. Numerical examples show the effectiveness of the method in solving optimization problems stated in 1

This work was supported in part by the MURST Project on Identification and Control of Industrial Systems. The authors are indebted to Angelo Alessandri, Angela Di Febbraro, and Simona Sacone for the assistance in developing the simulation examples. They thank P. C. Kainen and V. Ku˚rkova´ for helpful discussions. 2 Professor, Department of Communications, Computer, and System Sciences, University of Genova, Genova, Italy. 3 Research Associate, Department of Communications, Computer, and System Sciences, University of Genova, Genova, Italy. 4 Professor, Department of Electrical, Electronic Engineering and Computer Engineering, DEEI-University of Trieste, Trieste, Italy.

403 0022-3239兾02兾0200-0403兾0  2002 Plenum Publishing Corporation

404

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

high-dimensional settings, involving for instance several tens of state variables. Key Words. Functional optimization, neural approximating networks, nonlinear approximating networks, stochastic approximation, Ritz method, curse of dimensionality, optimal control and estimation.

1. Introduction In functional optimization problems, one has to minimize a given cost functional with respect to functions belonging to a set of feasible solutions. Such problems can be stated as follows: Problem P. Find inf

{γ 1 ,..., γ M } ∈SM

F(γ 1 , . . . , γ M),

(1)

where SM ⊆ H1 B· · ·BHM, for any iG1, . . . , M, Hi is a real linear space of functions γ i : B i > ⺢ni2, B i ⊆ ⺢ni1, and F: SM > ⺢ is a cost functional. 䊐 The approximate method presented in this paper is valid for any M¤ 1. However, to avoid burdening the notation, in the first, methodological part (Sections 2–5), we simply consider the case MG1 without loss of generality. Accordingly, instead of (1), we simply write inf F(γ ),

γ ∈S

where γ .γ 1 , S.S1 ,etc. In the second part (Sections 6–7), where we shall consider two applications of the method, we shall have MH1. The method aims at solving Problem P in its very general form; but we shall focus on stochastic optimization problems, in particular, on stochastic optimal control and state estimation problems. Typically, such problems arise whenever an optimal control law in closed-loop form is sought or a mapping generating the optimal estimate of a dynamic system state has to be derived as a function of the observed data. Hence, the cost functionals that we are going to deal with are expected values of the form F(γ )GE {J[γ (x), z]}, z

where J(·, ·) is a given cost function, x is the random argument of γ (x), and z is a random vector; x is a known function of z.

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

405

As is well known, solving analytically Problem P is feasible in very few cases, typically under the classical linear-quadratic-Gaussian (LQG) assumptions. In some case, such assumptions are not sufficient. For example, this may occur in team optimal control problems with nonclassical information structures (Ref. 1). For non-LQG problems, several methods for solving Problem P approximately have been proposed in the literature. The technique that we propose relies on the following basic approximation. The function γ is constrained to take on a given structure, in which a certain finite number of free parameters have to be determined so as to minimize the cost functional F. This enables us to approximate the original functional optimization problem by a finite-dimensional nonlinear programming problem. Optimizing some parameters for a closed-loop control law of preassigned form dates back to the so-called specific optimal control; see for instance Ref. 2 and the references cited therein. Similarly, in LQG optimal control problems, the designer may wish to simplify the control law by constraining it to be a linear function of the output vector instead of a reconstructed state vector. Then, he has to derive the matrix gain of a suboptimal controller. Analogous simplifications may be sought in determining fixed-structure low-order dynamic controllers and in solving decentralized optimal control problems. A survey of the computational methods for deriving various parametric controllers (e.g., the Levine– Athans and Anderson-Moore methods) is reported in Ref. 3. In spite of the analogy to such parametric approaches in control theory, the purpose of the method proposed in this paper is substantially different. We assign a given structure to the function γ , not to obtain a simplified suboptimal solution, but just because we are not able to derive the optimal solution in analytical form. Then, intuitively, we have to choose the preassigned structure sufficiently rich in free parameters so as to approximate to any desired degree of accuracy the optimal function γ ° that solves Problem P. Hence, our method turns out to be closer to the classical Ritz method for the calculus of variations than to the above-mentioned parametric schemes. In the Ritz method, the preassigned structure has the form of a linear combination of fixed basis functions, and the parameters to be optimized are given by the coefficients of the linear combination. One of the novelties of our approach, as compared with the Ritz method, consists in the use of linear combinations of basis functions, which in turn contain free parameters to be optimized as well. We have termed it the extended Ritz method, and we have successfully applied it already to a variety of complex control and estimation problems; see for instance Refs. 4–7. A linear combination of basis functions, when provided with suitable density properties in the set of admissible decision functions, will be termed an approximating

406

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

network (AN). In turn, ANs will be called nonlinear (linear) if their basis functions contain (do not contain) free parameters. Examples of linear ANs are algebraic and trigonometric polynomials. Examples of nonlinear ANs are feedforward neural networks with single hidden layers and linear output activation functions, radial basis function networks, linear combinations of trigonometric basis functions with adaptable frequencies, linear combinations of hinge functions, etc. In contrast to the Ritz method, our approach reveals itself to be determinant in overcoming one of the major drawbacks inherent in the use of fixed basis functions. We are referring to the complexity issue, i.e., the number of free parameters necessary to obtain a good approximation for the optimal function γ ° and the growth of such a number with the increasing dimension of the argument of γ °. In this respect, even the most recent applications of the Ritz–Galerkin procedure for nonlinear optimal control (see for instance Ref. 8) are limited strongly by the curse of dimensionality; i.e., there is an exponential increase in the number of parameters with the number of variables of the function to be approximated. Instead, we provide a method of approximate solution that proves effective for high-dimensionality related problems. Summing up, the main contributions of this paper are the following: (i)

The Ritz method is extended by exploiting the approximation properties of nonlinear ANs. Such networks are defined in Section 2 and used in Section 3 to solve Problem P approximately. (ii) A comparison of the bounds on the approximation rates obtained by linear and nonlinear ANs is made in Section 4 by introducing a comprehensive formalism, which encompasses both classical and very recent results. To the best of our knowledge, the use of such results on nonlinear ANs has never been proposed before for the approximate solution of functional optimization problems. Somehow, this explains why the methodology presented in the paper has been overlooked until now. (iii) The capabilities of nonlinear ANs are combined with those of stochastic approximation methods. This enables us to determine optimal parametrized solutions to Problem P, while avoiding practically impossible averaging computations; see Section 5. (iv) Specializations of the method are presented to solve non-LQG stochastic optimal control and state estimation problems; see Sections 6–7. For these problems, we provide algorithms easy to use in order to compute the gradients needed in applying stochastic approximation. In both kinds of problems, the method is validated by numerical examples.

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

407

2. Approximating Networks In designing approximate methods for solving Problem P, a possible approach is to replace the original problem with a sequence of approximating problems, which are supposed to be easier to solve than Problem P. To state such problems, let us consider functions γ (x)∈S such that they are constrained to take on the form γ ν (x, wν), where γ ν (·, ·) has a fixed structure, ν ∈⺪ +, and wν is a vector of parameters to be determined. We assume that the jth component γ ν j of the function γ ν has the structure ν

γ ν j (x, wν)G ∑ cij ϕi (x, κ i),

cij ∈⺢, κ i ∈⺢ k, jG1, . . . , n2 ,

(2)

i G1

where the ϕi (·, ·) are given parametrized basis functions, and the cij and the components of κ i are the parameters to be determined, i.e., wν .col(cij , κ i : iG1, . . . , ν and jG1, . . . , n2). Then, wν ∈⺢N(ν),

where N(ν)Gν(kCn2).

Let us define as Aν ⊂S the set of all functions whose components are of the form (2), that is, Aν .{γ ν (x, wν): wν ∈Wν ⊆ ⺢N(ν)}, where Wν is the set defined by the constraints that defined originally S. Clearly, the sequence {Aν }νSG1 has the infinite nested structure A1 ⊂A2 ⊂· · ·⊂Aν ⊂· · ·.

(3)

As we have to address the problem of approximating the functions γ (x)∈S by the functions γ ν (x, wν), we equip the linear space H with a norm 兩兩·兩兩. Then, in the following, we consider a complete normed linear space (i.e., a Banach space) H .(H, 兩兩·兩兩). Accordingly, S is replaced with S ⊆ H , so we have Aν ⊂ S ; we do not change the notation for Aν . Now, we introduce the following assumption: (A1)

H -Density Property. +S

set *ν G1 Aν is dense in

The functions γ ν (x, wν) are such that the H.

We call H -approximating networks the functions γ ν that have components of the form (2) and are equipped with the H -density property; for such networks, the sequence {Aν }νSG1 is called H -approximation scheme. We also define as linear H -approximating networks the H -approximating networks whose basis functions do not contain the parameter vectors κ i ;

408

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

i.e., they are simply functions ϕi (x). Instead, if the vectors κ i are present in the basis functions, the networks are termed nonlinear H -approximating networks. In this case, we drop the subscript i in ϕi (·, ·), as each basis function is specified by κ i . Accordingly, we have linear and nonlinear H approximation schemes. As the dimension N(ν) of wν is linear with ν, this integer is considered a measure of the complexity of the network. We borrow the term network from the parlance of neural networks, and we justify the use of this term by recalling that functions with components of the form (2) have the same structure as feedforward neural networks with one hidden layer and linear output activation functions. Two spaces H are particularly important in optimal control and estimation theory: (i)

The space C(K, ⺢n2) of continuous functions γ (x): K > ⺢n2; K⊂⺢n1 is a given compact set, equipped with the supremum norm 兩兩γ 兩兩S Gmax 兩兩γ (x)兩兩; x ∈K

we let

C (K, ⺢n ).(C(K, ⺢n ), 兩兩·兩兩S ); 2

(ii)

2

The space L2 (K, ⺢n2) of measurable, square integrable functions γ (x): K > ⺢n2, equipped with the L 2 norm 兩兩γ 兩兩2 G

冤冮

兩兩γ (x)兩兩2 dx

K



1兾2

;

we let

L 2 (K, ⺢n ).(L2 (K, ⺢n ),兩兩·兩兩2). 2

2

Accordingly, we have C -approximating networks and L 2-approximating networks. The space C (K, ⺢n2) with the supremum norm is important whenever we need that the error 兩兩γ Aγ ν 兩兩S be bounded by a given threshold (. This occurs when guarantees are required to ensure asymptotic properties, like stability for approximate controllers or convergence of estimation errors for approximate state estimators. If such guarantees are not required, as often happens in discrete-time optimal control and state estimation problems stated over a finite number of temporal stages, the weaker L 2-density property may be sufficient. Classical linear approximators such as algebraic and trigonometric polynomials are examples of linear C -ANs and L 2-ANs. As to nonlinear H -ANs, following Ref. 9 we describe three widely used methods for constructing the functions ϕ (x, κ i).

409

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

(i)

Tensor Product Construction. Here, we build ϕ (x, κ i) as

ϕ (x, κ i)Gh1 (x1 , τ i1) · · · hn1 (xn1 , τ in1), where x1 , . . . , xn1 are the components x∈⺢n1, τ ij ∈⺢lij is a parameter vector, and

of

the

vector

κ i .col (τ ij : jG1, . . . , n1). This is the usual procedure to build polynomial and spline bases for grid-based approximations. (ii)

Ridge Construction. Here, we shrink the n1-dimensional vector x into a one-dimensional variable by the inner product (ridge functions are defined as mappings that are constant for all x∈⺢n1 such that xTα Gconst, for some α ∈⺢n1); i.e.,

ϕ (x, κ i)Gh(xTα i Cβ i), where

κ i .col(α i , β i). Neural networks with one hidden layer and linear output activation functions are examples of ridge-construction-based ANs. Their jth components are expressed as ν

γ ν j (x, wν)G ∑ cij σ (xTα i Cβ i),

(4)

i G1

where σ (·) is usually a sigmoidal function, i.e., a bounded measurable function on the real line such that lim σ (z)G1,

z → CS

lim σ (z)G0.

z → AS

(iii) Radial Construction. Here, x is shrunk into a scalar variable by some norm, that is,

ϕ (x, κ i)Gh(兩兩xAτ i 兩兩2Γi ), where 兩兩x兩兩2Γi. xTΓi x,

Γi GΓiT ,

Γi H0,

κ i Gcol(τ i , distinct elements of Γi ). An example of basis functions derived from radial construction is given by the Gaussian basis functions, i.e., ν

γ ν j (x, wν)G ∑ cij exp(−兩兩xAτ i 兩兩Γ2 i). i G1

(5)

410

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

The proofs of the fact that the functions γ ν j (x, wν) given by (4)–(5) are provided with the density property in C (K, ⺢n2) and L 2 (K, ⺢n2) can be found, for example, in Ref. 10 and references therein for feedforward neural networks and in Refs. 11–12 for radial basis functions. Tensor-product gridbased multivariable approximations such as polynomial and spline approximations are seldom used in high-dimensional settings, as in general the construction of multivariable basis functions requires a number of points growing exponentially with the dimension of x. In this paper, we focus on ridge and radial basis functions, for they do not exhibit such a drawback. In the following, we shall omit the prefix H and its particularizations to the specific spaces addressed, unless they are required by the context. Then, we shall simply write density property, approximating networks, and approximation schemes.

3. Approximate Solutions to Problem P In this section, we shall use ANs to state a sequence of problems that approximate better and better Problem P. First, the following assumption is introduced. (A2) An optimal solution γ ° to Problem P exists; i.e., the infimum is a minimum and is attained for γ °. Next, we substitute γ ν into F(γ ) and perform the operations required by F(γ ) itself, such as differentiation, summation, integration, etc. Clearly, the functional F(γ ) becomes a function of the vector wν . We denote this function by Fν (wν). To sum up, for ν G1, 2, . . . , we have to solve a sequence of approximating nonlinear (in general) programming problems. Each of them can be stated as follows. Problem Pν . Find inf Fν (wν).

wν ∈Wν



We point out a basic difference between linear and nonlinear ANs. Remark 3.1. If the ANs are linear, then wν .col(cij : iG1, . . . , ν and jG1, . . . , n2) is the vector of the coefficients of the linear combinations (2), and the procedure that leads to Problems Pν turns out to be the classical Ritz method

411

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

for the calculus of variations; see for example Ref. 13. Applications of this method can be found in the literature to determine suboptimal control laws; see for example Ref. 14 and references cited therein. As to the solutions to Problems Pν , we introduce the following assumption. (A3) For any ν ∈⺪ +, an optimal solution w°ν to Problem Pν exists; i.e., the infimum is a minimum and is attained for w°ν . To simplify the notation, let us define

γ °ν .γ ν (x, w°ν ), F°.min F(γ )GF(γ °), γ

F°ν .min Fν (wν)GF(γ °ν ). wν

Once Problem P and the sequence of approximating Problems Pν have been stated, it is necessary to establish what is the meaning of the term ‘‘approximating.’’ Clearly, the concept of approximation is related to (i) the convergence of the sequence of minima {F°ν }iSG1 to F° and (ii) the convergence of the sequence of ANs {γ °ν }iSG1 to γ °. In the following, for the sake of brevity, we shall simply write {zi} instead of {zi }iSG1 . It is important to note that, in general, the two types of convergence do not imply each other. The concept that is best suited to the convergence of the sequence of approximating Problems Pν , ν G1, 2, . . . , is that of epiconvergence; see for example Ref. 15. If the epigraphs associated with Problems Pν converge to the epigraph associated with Problem P, then the sequences {F°ν } and {γ °ν } converge to F° and γ °, respectively. Without resorting to epiconvergence, the density property ensures only that an optimal solution γ o to Problem P is an accumulation point for some sequence {γ ν}, but not necessarily for {γ °ν }. Proving that γ ν° → γ ° is a difficult task also in most applications of the Ritz ν→S method. As to the convergence of {F°ν }, it is related to the regularity of F. In this respect, let the following assumption be verified: (A4) The functional F(γ ) is continuous in the norm of

H.

Then, proceeding as in Ref. 13, p. 196, the following fact can be established. Fact 3.1. If Assumptions A1–A4 are verified, then lim F°ν GF°.

ν→S

(6)

412

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

4. Rates of Approximation and Comparisons among Approximating Networks From the point of view of the density property stated in Assumption A1, which can be regarded as a necessary condition that functions with the structure (2) have to satisfy, linear and nonlinear ANs are equivalent. The difference between them is the rate of approximation, i.e., the relationship between the approximation accuracy achievable through a given approximation scheme and the complexity of this scheme, measured by the number ν of basis functions; we recall that N(ν) is linear with ν. Such a relationship shows how rapidly the approximation error goes to zero when ν tends to infinity. In terms of the approximation rates, both experimental data and recent theoretical results that we are going to discuss point out the advantageous behavior of the nonlinear approximation schemes considered in this paper. 4.1. For the sake of simplicity and without loss of generality, let the elements of the Banach space H be scalar functions γ : ⺢d > ⺢; i.e., let n1 Gd and n2 G1 in Problem P. Given the set S ⊆ H of feasible solutions to Problem P, to be approximated by the ANs belonging to Aν , we measure the worst-case error of approximation of the functions in S by networks belonging to Aν by the deviation of S from the set Aν ⊂ H , defined as

δ (S , Aν).sup inf 兩兩γ Ag兩兩, γ ∈S g ∈Aν

where 兩兩·兩兩 is the norm of the space H . Note that, if the approximation scheme {Aν} is linear, if the ν basis functions are linearly independent and if S GH , then each Aν is a ν-dimensional subspace of H . In general, this is not the case, as Aν is obtained by restricting the elements of a ν-dimensional subspace H ν to those that satisfy the constraints defining the set S of feasible solutions. Then,

δ (S , Aν)¤ δ (S , H ν),

as Aν ⊆ H ν .

Hence, to obtain a lower bound on the rate of the best linear approximation scheme in approximating functions of S , we shall use the Kolmogorov νwidth (or ν-width for short) of S in H , defined as dν (S )Ginf δ (S , H ν), H

ν

where the infimum is taken over all ν-dimensional subspaces H ν of H ; see Ref. 16. Concerning the performance of the nonlinear approximation schemes {Aν} with different structures, we resort to the deviation of S from

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

413

the various parametrized sets Aν . The definition of ν-width requires that one more infimum be performed than in the definition of deviation. Indeed we are interested in comparing the best linear scheme with one of all possible nonlinear schemes. Summing up, the comparison between the approximation rates of two different nonlinear approximation schemes {Aν} and {A′ν } is in terms of δ (S , Aν) and δ (S , A′ν). In the case of a nonlinear approximation scheme {Aν} and the best linear approximation scheme, the comparison is between δ (S , Aν ) and dν (S ). Although we expect that, for a fixed complexity ν, δ (S , Aν) and dν (S ) will grow with the dimension d of x, we would like to avoid schemes in which such a growth is too fast, in the sense that it requires an unacceptably large number ν of fixed or parametrized basis functions for large values of d. Rates of approximation with a favorable behavior with respect to d are those for which an upper bound on dν (S ) or δ (S , Aν) of order O(1兾νq ) exists, where q∈⺢+. Instead, if the best available upper bound is of order O(1兾ν1兾d ), approximating functions of complexity O(1兾( d ) may be required to achieve a worst-case error (. Such an exponential dependence on the dimension d gives rise to the curse of dimensionality. Any approximation scheme affected by it is inapplicable in high-dimensional settings. 4.2. When comparing the approximation rates of different ANs, a first source of problems lies in the fact that, to the best of our knowledge, the available estimates of approximation rates are of the form C(1兾ν)q, q∈⺢+, or C ′(1兾ν)1兾d, where C and C′ are constant with respect to ν but may depend on d. For instance, when they exhibit a fast increase with d, the upper bound becomes less and less meaningful with increasing the dimension. Examples of classes of functions (to be approximated) with different behaviors of such quantities are reported in Ref. 17. In Ref. 17, the following class of functions is considered:



Gc . γ : ⺢d > ⺢ such that



⺢d



兩ω 兩兩Γ(ω )兩dω ⁄c ,

(7)

where Γ(ω ) is the Fourier transform of γ , 兩ω 兩G(ω Tω )1兾2, and c is a finite positive scalar. In Ref. 17, the approximation scheme {Aν} represented by sigmoidal neural ANs [see (4)] is considered, and it is proved that, in L 2 (B 1d , ⺢), where B 1d denotes the closed ball of radius 1 centered in the origin of ⺢d ,

δ (Gc , Aν)⁄C (1兾ν)1兾2,

where CG2c.

It follows that the number of parameters required by a sigmoidal neural AN to achieve an L 2 approximation error of order O(1兾1ν) is O(d), that

414

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

is, it grows only linearly with d. In Ref. 17, a comparison for the class Gc between neural ANs and linear ANs is also presented. For such linear ANs, it is demonstrated that, in L 2 ([0, 1]d, ⺢), one has dν (Gc)¤ b(ν, c, d ).(kc兾d )(1兾ν)1兾d,

where k¤1兾(8π e π A1 ).

The interpretation of such a lower bound on dû (Gc ) is not an easy task; indeed, as noted in Ref. 17, c may depend on d in different ways for functions belonging to different subsets of Gc . For example, let us suppose that c increases linearly (then moderately) with d. It can be seen that, for a given value of the lower bound on dν (Gc), the integer ν grows exponentially with d, thus giving rise to the curse of dimensionality. Instead, neural ANs behave much better: for a given value of the upper bound 2c(1兾ν)1兾2 on δ (Gc , Aν), ν grows only as d 2. It should be pointed out also that, if c is constant with respect to d, in b(ν, c, d) the factor 1兾d determines the decrease to zero of the lower bound, which becomes less meaningful for high values of d; note the unfortunate misprint in the abstract and in Section II of Ref. 17, where the factor 1兾d has been omitted. In Fig. 1, to clarify pictorially what we said above, we show qualitatively the behaviors of the bounds 2c(1兾ν)1兾2 and b(ν, c, d) as functions of ν for a given value of d. In the same figure, for the case cGα d, ( α a positive constant), the diagrams of the upper bound on ν [i.e., (2α 兾δ )2d 2] and the lower bound on ν [i.e., (kα 兾dν)d ] are also plotted. We set

δ (Gc , Aν)G2c(1兾ν)1兾2 Gdν (Gc)Gb(ν, c, d ).

Fig. 1. Comparison between sigmoidal neural approximating networks and linear approximating networks for the sets Gc .

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

415

Note that the lower bound b(ν, c, d) on the worst-case error in linear ANs does not hold for functions characterized by a higher degree of smoothness, like functions with square-integrable partial derivatives of order up to s (hence, belonging to Sobolev spaces), provided that s¤d兾2∫C2. Denote such a space by W (s) 2 . It can be shown (Ref. 17) that these functions belong to Gc , for some c∈⺢+. Then, if c is such that Gc ⊃W (s) 2 , neural ANs should behave better than linear ANs in the difference sets Gc \W (s) 2 . Summing up, due to the dependence of c on d, the nice results reported in Ref. 17 do not yet fully explain the advantageous behavior of neural ANs in high-dimensional settings, although they represent a first attempt at comparing linear and nonlinear ANs for the approximation of a wide class of functions. 4.3. A second source of difficulty in comparing the rates of approximation achievable by different ANs is the fact that each network has been derived to approximate functions that verify different a priori regularity assumptions; see the table in Ref. 18, p. 255. In Ref. 19, various linear and nonlinear approximation schemes are described, and the sets of functions they can approximate with rates of order O(1兾1ν) are compared. It turns out that in general there are nonvoid intersections among such sets, but no inclusions. This means that, among the commonly used approximators considered in Ref. 19, there is no one outperforming all the others, i.e., achieving a rate of order O(1兾1ν) in a set of functions larger than all the sets in which such rates are obtained by some other approximator. 4.4. We mention some recent lines of research that will hopefully shed some light on the problem of a general comparison between the approximation rates of linear and nonlinear ANs. In Ref. 20, it is proved that, given a nonlinear scheme {Aν}, under some hypotheses there exists a set B of dvariable functions, dependent on the particular scheme {Aν} and on d (generally, B becomes smaller with increasing dimension), such that: (a) {Aν} achieves in B a rate of order O(1兾1ν); see also Ref. 21; (b) the best linear approximation scheme has in B a rate bounded from below by a value that may depend exponentially on d. More precisely,

δ (B, A)⁄C(1兾1ν), where C is constant with respect to both ν and d, whereas dν (B)¤ C ′(1兾ν)1兾d,

416

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

where C′ is independent of ν, but may depend on d. In the cases in which C′ does not go to 0 with d (for example, when it does not depend on the dimension), this implies that in B every linear approximation scheme is ruled out by the curse of dimensionality, in striking contrast with the very favorable approximation rate supplied by {Aν} for the same set. Moreover, it has been proved recently (Ref. 22) that the hypothesis of a continuous dependence of the optimal parameters on the function to be approximated (see Ref. 23) does not hold for nonlinear approximation schemes. This makes some classical lower bounds on the rates of linear approximation schemes, obtained by using such a hypothesis, a priori inapplicable to nonlinear ANs, whose rates might consequently be better. 4.5. To conclude, the use of nonlinear schemes for the approximate solution of Problem P is justified: (i) from a practical point of view, by their highly advantageous performances, as shown by a variety of numerical results; see, for example, Refs. 4–7; (ii) from a theoretical perspective, by the possibility of achieving favorable rates of convergence not only in the sets of functions where linear approximators obtain such rates, but also in sets where the use of linear schemes is ruled out by the curse of dimensionality. However, it should be stressed that further research is needed to transfer the powerful properties of nonlinear ANs from the field of approximation of functions to the field of approximation of optimization problems, in which the interest is in the convergence of both the suboptimal values of the cost functionals and the suboptimal solutions.

5. Stochastic Optimization Problems As said in Section 1, we are interested particularly in solving stochastic optimization problems. The cost functionals of such problems can be written in the form F(γ ).E {J[γ (x), z]}. z

By substituting the ANs γ ν into F(γ ), as explained in Section 3, we obtain the cost function Fν (wν).E [Jν (wν , z)]. z

Then, for each ν, we can state the following particular version of Problem Pν , which we call P ν ; the slight excess of notations is aimed only at recalling that we are minimizing a cost that includes averaging operations.

417

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

Problem

P ν.

Find 䊐

inf E [Jν (wν , z)].

wν ∈Wν z

We now make an assumption similar to A3. (A3′) For any û∈⺪ +, an optimal solution w°ν to Problem

P ν exists.

In the following, we shall suppose that Assumptions A1, A2, A3′, A4 are verified. Hence, Fact 3.1 holds true for the sequence {γ °ν } solving Problem P ν . Remark 5.1. It is not always possible or appropriate to write the cost functional in the form F(γ )GE {J[γ (x), z]} z

and accordingly state Problem P ν . This requires (i) the knowledge of the density function p(z) and (ii) the explicit purpose of using the expectation of the cost as a design criterion. If requirements (i) and兾or (ii) are not fulfilled, a min–max approach may be sought. To this end, one has to define the cost functional as Fν (wν).sup Jν (wν , z), z ∈Z

where Z is a known set from which the random vector z takes its values, and hence to restate Problem P ν . To solve Problem P ν , we focus our attention on gradient algorithms mainly for their simplicity. This leads us to introduce the concept of stochastic approximation in a straightforward way. Just to avoid notational difficulties, in the following we shall opt for penalty functions techniques, that is, for a soft interpretation of the various constraints on wν that define the set Wν . This is reasonable in many practical situations and does not involve conceptual losses in our treatment. More specifically, we shall suppose that Wν G⺢N(ν), thus reducing Problem P ν to an unconstrained nonlinear programming problem. If constraints on wν are present and must be fulfilled exactly, stochastic approximation techniques exist that can handle them as required; see for instance Ref. 24. Clearly, the use of gradient algorithms needs the following assumption: (A5) Jν (wν , z) is a

C 1 function with respect to wν for all z.

If Assumption A5 is verified, then under some additional regularity hypotheses, Fν (wν) is also a C 1 function. However, due to the very general

418

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

framework in which Problem P ν has been stated, we are unable to compute analytically the gradient ∇wν Ez [Jν (wν , z)]. Indeed, at any iteration step, we should calculate a multiple integral and its gradient. Then, to overcome such computational difficulties, we shall resort to a stochastic approximation technique. The main idea of this technique is to use only the gradient ∇wνJν (wν , z), instead of ∇wνEz [Jν (wν , z)], thus avoiding the computation of integrals. On the contrary, as will be seen later on, ∇wνJν (wν , z) can be computed easily. To sum up, we use the updating algorithm wν (kC1)Gwν (k)Aα k ∇wνJν [wν (k), z(k)],

kG0, 1, . . . ,

(8)

where the sequence {z(k)} is generated randomly according to the known probability density function of z and α k is a suitably decreasing positive stepsize. The algorithm (8) is one of the simplest and widely used stochastic approximation algorithms. Sufficient conditions for the convergence (with probability one) of the algorithm (8) can be found for instance in Refs. 24– 25. Some of such conditions are related to the behavior of α k. Usually, the following conditions are given:

α k H0,

S

∑ α k GS,

k G0

S

2 ∑ α k FS.

(9)

k G0

In the examples presented in the following sections, we take the stepsize

α k Gc1 兾(c2 Ck),

c1 , c2 H0,

which satisfies the conditions (9). The other conditions are related to the shape of the cost surface Fν (w) and are very difficult to assess, due to the high complexity of such a surface. It is deemed that some techniques accelerating the convergence of general stochastic approximation algorithms can be applied to modify the algorithm (8) usefully; probably, they would allow a faster convergence in the examples given later on.

6. Application of Approximating Networks to a Stochastic Optimal Control Problem If the LQG assumptions are not verified, solving a T-stage stochastic optimal control problem is an important but difficult task. Let us consider the discrete-time dynamic system xtC1 Gf (xt , ut , ξ t),

tG0, 1, . . . , TA1,

(10)

yt Gg(xt , η t),

tG0, 1, . . . , TA1,

(11)

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

419

where xt ∈⺢n is the state vector, ut ∈⺢ m is the control vector, yt ∈⺢p is the observation vector, ξ t ∈⺢q is a random process noise, and η t ∈⺢r is a random measurement noise. The cost functional is given by TA1

JG ∑ h(xt , ut)ChT (xT).

(12)

t G0

To verify Assumption A5, f, g, h, hT are assumed to be C 1 functions. The random vectors x0 , ξ 0 , . . . , ξ TA1 , η 0 , . . . , η TA1 are characterized by known probability density functions (such random vectors do not necessarily have to be mutually independent). Let us assume that the controller has perfect memory. This implies that it can make its decisions on the basis of the information vector It .col(y0 , . . . , yt , u0 , . . . , utA1),

tG0, 1, . . . , TA1,

and that it uses a control of the feedback form ut Gγ t (It). Finally, we introduce the sets St ⊆ Ht of admissible control functions γ t : ⺢dt > ⺢ m for the tth stage, where dt . dim(It) and Ht is a suitable normed linear space. We can now state the following problem.

Problem C. Find the optimal control law {u°t Gγ °t (It): γ °t ∈M t , tG0, 1, . . . , TA1} that minimizes the expected value of the cost (12). 䊐 Of course, this problem is of the form of Problem P with MGT. It is well-known that, at least in principle, Problem C can be solved by dynamic programming; in such a case, the mutual independence among the random vectors is needed. This requires preliminarily the recursive computation of the conditional probability density functions p(x0 兩I0), p(x1 兩I1), . . . , p(xTA1 兩ITA1). However, unless the classical LQG assumptions are satisfied or the dimensions of the vectors involved and the number of temporal stages are very small (thus allowing a numerical approach), dynamic programming is of little use.

420

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

Following the approximation technique described in Section 3, we assume that each control function γ t (It) has the structure specified by (2) and that the density property A1 holds in each Ht ; then, it is an AN and takes on the form νt

ut Gγˆ t,νt (It , wt,νt )Gcol[ ∑ cijt ϕ (It , κ it ): jG1, . . . , m], i G1

κ it ∈⺢ kt,

tG0, 1, . . . , TA1,

(13)

where wt,νt .col(cijt , κ it : iG1, . . . , νt and jG1, . . . , m). Next, we define the vector

νq .col(νt : tG0, 1, . . . , TA1). Note that the information vector It in (13) differs from the information vector in γt(It) as the control functions are different. However, we do not change the notation for the sake of simplicity; see Ref. 4. Clearly, Aνq .{γˆ t,νt (It , wt,νt): tG0, 1, . . . , TA1, wt,νt ∈⺢νt(ktCm)}. Then, if one wants the families Aνq of ANs to have the same infinite nestedness as in (3), one has to organize the growth of the structure of the ANs in such a way that

νq 1 Fνq 2 F· · ·Fνq i F· · ·, where νq i Fνq iC1 means that the vector νq iC1 is obtained from νq i by adding 1 to one and only one of the components of νq i . Equivalently, Aνq iC1 is obtained from Aνq i by adding one basis function to one and only one of the T control networks (13) belonging to Aνq i . Such a growth may be achieved easily by choosing suitable nondecreasing functions

νt Gjt (ν),

where

jt : Z + > Z +,

tG0, 1, . . . , TA1,

tending to infinity when ν→ S and such that TA1

TA1

t G0

t G0

∑ jt (νC1)G ∑ jt (ν)C1.

Clearly, the integer ν characterizes the complexity of the entire chain of the T ANs. In the following, we shall assume that the above growth procedure is used to attain the infinite nested structure Aνq 1 ⊂Aνq 2⊂· · ·⊂Aνq i ⊂· · ·.

(14)

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

421

Also note that there is no risk of ambiguity in writing Aν° ⊂Aν°C1 ⊂· · ·⊂Aν°Ci ⊂· · ·

(15)

instead of (14). ν ° is the number of basis functions contained in the smallest chain of T control networks. As such a chain is composed of networks containing only one basis function, we have ν °GT. Of course, (15) is similar to (3), with

ν Gν °Ci,

iG0, 1, . . . ,

and what has been stated in the preceding sections for the case MG1 applies immediately to the case MGT. Now, if we substitute (13) into (10) and (13), (10) into (12), and if we express all the vectors making up It as functions of the primitive random vectors, the cost function (12) takes on the form Jν GJν (wν , x0 , ξ , η ), where wν . col(wt,νt : tG0, 1, . . . , TA1),

ξ . col(ξ 0 , . . . , ξ TA1), η . (η 0 , . . . , η TA1). Then, for a given ν ¤ ν °, the cost function Fν (wν) is given by Fν (wν) . E

x0 ,ξ ,η

[Jν (wν , x0 , ξ , η )].

It follows that the functional optimization Problem C has been reduced to a nonlinear programming problem of the type of Problem Pν ; let z . col(x0 , ξ , η ). Then, for a given ν, the approximate form of Problem C can be stated as follows. Problem

Cν .

inf E

wν x0 ,ξ ,η

Find [Jν (wν , x0 , ξ , η )].



To avoid too heavy notations, in the following we shall drop the subscript νt in both the functions γˆ t,νt and the vectors wt,νt . We shall also drop ν in the cost Jν . In order to solve Problem C ν by the algorithm (8), at each iteration step of (8) we have to compute the components of the gradient ∇ w J[w(k), x0 (k), ξ (k), η (k)], i.e., the partial derivatives (∂兾∂wit) J[w(k), x0 (k), ξ (k), η (k)],

tG0, 1, . . . , TA1 and iG1, . . . , N(ν),

422

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

where wit is the ith component of the vector wt. We can write ∂J兾∂wit G(∂J兾∂ut) [∂γˆ t (It , wt)兾∂wit ].

(16)

By using some algebra, we can see easily that ∂J兾∂ut is given by T ∂J兾∂ut G(∂兾∂ut)h(xt , ut)Cλ tC1 (∂兾∂ut)f (xt , ut , ξ t) TA1

C ∑ ∂J兾∂I uj t ,

tG0, 1, . . . , TA1,

(17)

j GtC1

where

λ tT .∂J兾∂xt and I uj t is the input to the jth AN corresponding to ut ; i.e., I uj t Gut is a subvector of Ij . Similarly, let us denote by I yj t the input to the jth AN corresponding to yt . Then, λ t can be computed as follows: T λ tT G(∂兾∂xt)h(xt , ut)Cλ tC1 (∂兾∂xt)f (xt , ut , ξ t)

C



TA1



y ∑ ∂J兾∂I j t (∂兾∂xt)g(xt , η t),

j GtC1

λ TT G(∂兾∂xT)hT (xT).

tG0, 1, . . . , TA1, (18a) (18b)

Remark 6.1. It is worth noting that (18) is the classical adjoint equation for T-stage optimal control theory, with the addition of one term (the third in (18a)) to take into account the introduction of the parametrized feedback control law. Also note that the structure of (18) does not depend on the type of AN used to implement the control law. Of course, the special type of AN appears when we render the column vector ∂γˆ t (It , wt)兾∂wit explicit in (16) and when we do the same for the row vectors TA1

TA1

j GtC1

j GtC1

u ∑ (∂J兾∂I j t) and

y ∑ (∂J兾∂I j t)

in (17) and (18), respectively. Remark 6.2. If the optimal control functions γ °0 (I0), . . . , γ °TA1 (ITA1) belong to a suitable class, we can approximate any function γ °t (It) by using a properly chosen parsimonious nonlinear AN γˆ t (It , wt) containing a small number of parameters; for the case of neural networks, see Ref. 4 for details. For example, if each of the m components of the optimal control functions belongs to the space Gc [see (7)] and if c is constant with respect to the dimension of the information vector, the number of parameters required to achieve an L 2 approximation error of order O(1兾1νt ) is O(dt ); i.e., the

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

423

number of parameters grows only linearly with dt (see Section 4). Similarly, if c increases moderately with dt (see the cases considered in Ref. 17), also the growth of the number of parameters is moderate. Remark 6.3. Even though the technique proposed in the paper enables one to approximate arbitrarily well the optimal control functions γ °t (It) without computing the conditional probability density functions p(xt 兩It), we do not escape the need for storing in the memory the information vector It , whose dimension increases over time, and for utilizing a chain composed of an increasing number of ANs. These needs become unacceptable when the number T of stages becomes large and possibly goes to infinity. Such a severe drawback can be circumvented in an approximate way (i) by limiting the storing capacity of the control law, which is constrained to retain, in its limited memory, only the observed variables and the controls related to the recent past (this approximation will be used in Example 6.1), and (ii) by stating Problem C in a receding-horizon context, which allows one to utilize only a fixed number of ANs; see Ref. 6 for details. Example 6.1. Freeway Traffic Optimal Control. 5 We consider a problem of freeway traffic optimal control. This example is justified by its intrinsic engineering importance, by the fact that it deals with a high-order strongly nonlinear dynamic system, and by the possibility of comparing the ANs approach with a technique presented in the literature. We refer to the following Payne model (Ref. 26): ûi,tC1Gûit C(δ T 兾τ ){Vf bit [1A( ρit 兾ρmax) m(3A2bit)]lAûit } C(δ T 兾∆ i)ûit (ûiA1,tAûit)Cνδ T ( ρiC1,tAρit)兾( τ ∆ i ( ρit C χ )) Aδ on (δ T 兾∆ i)ûit rit 兾( ρit C χ ), tG0, 1, . . . , TA1 and iG1, . . . , D,

(19)

ρi,tC1Gρit C(δ T 兾∆ i) [α (1Aγ i )ρiA1,tûiA1,t C(1A2α Cγ i α Aγ i)ρit ûit A(1Aα )ρiC1,tûiC1,tCrit ], tG0, 1, . . . , TA1 and iG1, . . . , D, li,tC1Glit Cδ T (dit Arit),

(20)

tG0, 1, . . . , TA1 and i∈I r,

where DG30 is the number of sections of length ∆ i G1 km , δ T G15 s is the sample time interval, TG60, ûit is the mean traffic speed on section i at time tδ T , ρit is the traffic density, and lit is the number of vehicles queueing on 5

This example has been drawn from Ref. 4.

424

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

the on-ramp of section i∈I r ; the indexes of the sections with on兾off ramps make up the set I r. The on-ramp traffic volume rit is monitored via traffic lights. bit denotes the speed limits set by variable message signs (bit G1 means that there is no speed limit). Then, we let xt . col(ûit , iG1, . . . , D; ρit , iG1, . . . , D; lit , i∈I r ) and ut . col(rit , i∈I r ; bit , iG1, . . . , D). dit is the stochastic demand flow to the on-ramp. The parameters in (19)– (20) take on the values given in Table 3.1 of Ref. 26. We assume 0⁄ûit ⁄200 km兾h, 0⁄ ρit ⁄200 veh兾km, 0⁄lit ⁄200 veh, 0.7⁄bit ⁄1, and max[ri min , dit A(li maxAlit)兾δ T ]⁄rit ⁄min(ri max , dit Clit 兾δ T), where ri min G0 veh兾h,

ri max G20,000 veh兾h,

li max G200 veh,

for any i∈I r.

All the state variables are measured by the noisy channel yt Gxt Cη t . We suppose dit and η it to be mutually independent and uniformly distributed over suitable intervals (see Ref. 4). x0 is assumed to be uniformly distributed over the set of states describing all possible initial congestion situations. As in Problem P3 stated in Ref. 26, the process cost is given by the total time spent on the freeway and in the on-ramps, that is, T

JGδ T ∑

t G1

冤 ∑ ρ ∆ C ∑ l 冥. D

it

i G1

i

it

i ∈I

r

Sections 3, 9, 15, 21, 27 have on-ramps and off-ramps. The variable message signs are placed in sections 1, 7, 13, 19, 25, and each of them acts on 6 successive sections. It follows that

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

dim(xt)G65,

425

dim(ut)G10.

Let us first compare our method with the control scheme proposed in Ref. 27, where all the state variables are assumed to be perfectly measurable. Such a scheme consists in a receding-horizon optimization, performed online periodically. This means that, when the system is in the state xt at stage t, an optimal control problem is solved over a number T′ of stages by using a nonlinear programming technique, thus deriving a sequence of optimal controls u°t , . . . , u°tCT ′A1. Then, the first control u°t of this sequence is applied. The dynamic system moves to the state xtC1 and the online procedure is repeated. As u°t depends on the current state xt , the control law takes on a feedback structure. At each online optimization, the random demands dit are replaced with their expected values, thus obtaining a certainty equivalent open-loop feedback (CEOLF) control. We set T′G5. For both our method and the CEOLF method, we chose to handle the constraints by a penalty functions method, hence using the algorithm (8). Since xt is perfectly measurable, in the functions (13) It is simply replaced with xt ; we assume that all the networks have the same structure. The control functions are implemented by neural networks containing two hidden layers composed of 30 and 15 units. We set c1 G1,

c2 G108.

A severe congestion on section 11 is simulated. Figures 2a and 2b show the evolutions of the variables ρit and ûit under the action of the optimal neural controller. When the CEOLF controller is acting, the surfaces of such variables (as functions of stage t and section i) are nearly the same, so we do not specify them. The neural controller involves an increase in the cost of about 0.4%. Similar behaviors can be observed for different traffic congestions at stage tG0. Then, let us consider the case in which all the state variables are measured by the noisy channel yt Gxt Cη t ; in Ref. 27, this case has not been considered. Due to the large number of stages, we use an approximation for the control law (13); instead of storing the entire information vector It , we use a limited-memory control function of the form ut Gγˆ 1t (yt , µtA1 , w1t ).

µt is a ‘‘contracted estimate’’ of It (see Ref. 4) generated by the recursive neural mapping µt Gγˆ 2t (yt , ut , µtA1 , w2t ),

dim( µt)G75.

426

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

Fig. 2. Evolutions of the traffic density ρit and mean speed ûit in clearing a severe congestion on section 11 under the action of the optimal neural control law: (a, b) random disturbances on the dynamic system and perfect measurements on the state vector; (c, d) random disturbances on the dynamic system and noisy measurements on the state vector.

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

Fig. 2. Continued.

427

428

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

γˆ 1t ( γˆ 2t ) are implemented by networks containing two hidden layers with 30 and 15 units (with 30 and 20 units). In Figs. 2c and 2d, we give the evolutions of ρit and ûit for the same sequence of disturbances dit as in the previous example. It is worth noting that, in both examples and under a variety of initial conditions, the optimal neural control laws drive the mean traffic speed to a neighborhood of the so-called ‘‘free speed’’ Vf (in our case, Vf G123 km兾h), corresponding to very low traffic densities (Ref. 26). In a sense, such a result may be interpreted as a stabilizing property of the neural controller; this property has been considered in Ref. 28. The algorithm (8) converged after about 3·105 iterations in the first example and after about 4·105 iterations in the second example. Remark 6.4. We point out an important advantage of our method with respect to the control scheme proposed in Ref. 27 (as well as in other receding-horizon control mechanisms described in the literature). In Ref. 27, the computational effort is performed online, in that a nonlinear programming problem has to be solved at any stage t to generate u°t . Clearly, such a technique is acceptable only if the dynamics of the plant is sufficiently slow, as compared with the speed of the controller computing system. Instead, we propose to compute the closed-loop optimal control law offline, thus enabling the controller to generate almost instantaneously the optimal control vector for any xt belonging to the set of admissible states. Remark 6.5. We point out explicitly that the example involves a state vector with 65 components! Obviously, even if the state vector is perfectly measurable, this rules out dynamic programming techniques based on state space discretization. We believe that just the use of the nonlinear ANs has enabled us to escape the curse of dimensionality. It is worth noting that the applicability of numerical methods that handle efficiently closed-loop (hence functional) optimization problems (e.g., control of reservoir systems, inventory problems, etc.) is limited to rather small dimensions. Recently, an interesting dynamic programming approach using adaptive regression splines has allowed the solution of an inventory forecasting problem with up to nine state variables (Ref. 29), which represent the largest dimension of the state vector addressed successfully until now in this problem. The first results on the application of our method to such a problem lead us to hope to exceed this dimension (Ref. 30). 7. Application of Approximating Networks to an Optimal Estimation Problem Let us consider the dynamic system (10)–(11). Under the same assumptions on the functions f and g and on the random variables as in Problem

429

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

C, we want to estimate the state and random vectors as functions of the batch of data I T0 . col(y0 , . . . , yT). I T0 too is called information vector. Let dT be its dimension. For notational simplicity, we have dropped the control vectors. Now, we introduce the sets S at ⊆ H at, S bt ⊆ H bt, S tc ⊆ H tc, of admissible control functions xˆ t Gat (I T0 ): ⺢dT > ⺢n,

tG0, 1, . . . , TA1,

ξˆ t Gbt (I T0 ): ⺢dT > ⺢q,

tG0, 1, . . . , TA1,

ηˆ t Gct (I ): ⺢ > ⺢ ,

tG0, 1, . . . , T.

T 0

dT

r

where H at, H bt, and H tc are suitable normed linear spaces. Let us also define the estimation cost T

TA1

T

t G0

t G0

t G0

JG兩兩xˆ 0 Ax¯ 0 兩兩2VxC ∑ 兩兩yt Ag(xˆ t , ηˆ t)兩兩2VyC ∑ 兩兩ξˆ t 兩兩2Vξ C ∑ 兩兩ηˆ t 兩兩2Vη,

(21)

where x¯ 0 .E(x0) and xˆ t , ξˆ t , ηˆ t are the estimates of xt , ξ t , η t . Vx , Vy , Vξ , Vη are positive-definite symmetric matrices; more general error functions might be considered in (21). Then, we state the following problem. Problem E. Find the optimal estimation functions {xˆ t° Ga°t (I T0 ):a°t ∈S at , tG0, 1, . . . , TA1}, {ξˆ t° Gb°t (I T0 ): b°t ∈S bt , tG0, 1, . . . , TA1}, {ηˆ t° Gc°t (I T0 ): c°t ∈S tc , tG0, 1, . . . , T} that minimize the expected value of the cost (21) under the constraints xˆ tC1 Gf (xˆ t , ξˆ t),

tG0, 1, . . . , TA1.

(22) 䊐

We now have 3TC1 unknown functions; then Problem E is of the form of Problem P, with MG3TC1. As observed previously for Problem C, if the LQG assumptions are not verified, solving Problem E is a very hard, almost impossible task. Let us simplify the estimation functions a little, in particular, the argument they

430

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

depend on, i.e., the information vector I T0 ; as we shall see later on, such simplification will enable us to use smaller ANs. It can be shown (Ref. 31) that a°t (I T0 ), b°t (I T0 ), c°t (I T0 ) can be replaced equivalently by xˆ °0 Ga°0 (I T0 ), ξˆ t° Gb˜ °t (xˆ °t , yTtC1),

tG0, 1, . . . , TA1,

(23b)

ηˆ t° Gc˜ °t (xˆ °t , y ),

tG0, 1, . . . , T,

(23c)

T t

(23a)

where y ji . col(yi , . . . , y j). The optimal estimates are linked via the equation xˆ °tC1 Gf (xˆ °t , ξˆ t°),

tG0, 1, . . . , TA1.

Under suitable assumptions, the equivalence of the functions (23) to the optimal estimation functions can be demonstrated by using the global implicit function theorem. Assuming that the density property A1 holds in each H at, H bt, and H tc we now introduce ANs, using the simplified notations adopted in Section 6, namely, xˆ 0 Gaˆ 0 (I T0 , w1), ξˆ t Gbˆ t (xˆ t , yTtC1 , w2t ),

tG0, 1, . . . , TA1,

(24b)

ηˆ t Gcˆ t (xˆ t , y , w ),

tG0, 1, . . . , T,

(24c)

T t

(24a)

3 t

where w1, w2t , w3t are the vectors of parameters to be determined. Following a procedure similar to the one presented in Section 6, let us define the vector

ν¡ . col(ν1 ; ν2t , tG0, 1, . . . , TA1; û3t , tG0, 1, . . . , T). The family Aνq of approximating networks is given by Aνq .{aˆ 0 (I T0 , w1), w1 ∈⺢ν (k Cn) ; 1

1

2 2 bˆ t (xˆ t , yTtC1 , w2t ), w2t ∈⺢νt (kt Cq), tG0, 1, . . . , TA1;

cˆ t (xˆ t , yTt , w3t ), w3t ∈⺢νt (kt Cr), tG0, 1, . . . , T}. 3

3

By using the same growth mechanism as for the families Aνq described in Section 6, we obtain the infinite nested structure (14) or (15), with

ν°G2TC2. Then, we can extend to the case MG2TC2 what was stated for Problem P in the case MG1. By substituting (24) into (21) and by using the state equation and constraints (22), we can write the estimation cost (21) in the

JOTA: VOL. 112, NO. 2, FEBRUARY 2002

431

form Jν (w, x0 , ξ , η ). Then, the cost function Fν (wν) is given by Fν (wν). E [Jν (wν , x0 , ξ , η )]. x0 ,ξ ,η

Again, for a given ν ¤ ν °, we have reduced the functional optimization Problem E to the following unconstrained nonlinear programming problem of the type of Problem P ν . Problem

Eν .

Find

inf E [Jν (wν , x0 , ξ , η )].



wν x0 ,ξ ,η

It is worth noting that Problem Eν is formally identical with Problem C ν . In order to use the stochastic approximation algorithm (8), we need to

compute the partial derivatives (∂兾∂w j) J[w(k), x0 (k), ξ (k), η (k)], where wj is the jth component of the vector wν . Considering each AN (24), we can write iG1, . . . , N(ν1), ∂J兾∂w1i G(∂J兾∂xˆ 0) [∂aˆ (I T0 , w1)兾∂w1i ], ∂J兾∂w2ti G(∂J兾∂ξˆ t) [∂bˆ t (xˆ t , yTtC1 , w2t )兾∂w2ti ], tG0, 1, . . . , TA1 and iG1, . . . , N(ν 2t ), ∂J兾∂w3ti G(∂J兾∂ηˆ t) [∂cˆ t (xˆ t , yTt , w3t )兾∂w3ti ], tG0, 1, . . . , T and iG1, . . . , N(ν3t ),

where w1i , w2ti , w3ti are the ith components of the vectors w1, w2t , w3t . To simplify further the notation, we denote by ϕ (xˆ 0), χ (yt , xˆ t , ηˆ t), ψ (ξˆ t), ω (ηˆ t) the four quadratic forms appearing in the cost (21). We also denote by xˆ bt , tG0, 1, . . . , TA1, and xˆ tc , tG0, 1,. . .T, the vectors xˆ t when they are considered as input vectors to the ANs bˆ t (xˆ t , yTtC1 , w2t ) and cˆ t (xˆ t , yTt , w3t ) respectively. Then, the following recursive equations can be derived: ∂J兾∂xˆ 0 G(∂兾∂xˆ 0) ϕ (xˆ 0)C(∂兾∂xˆ 0) χ (y0 , xˆ 0 , ηˆ 0) C∂J兾∂xˆ b0 C∂J兾∂xˆ 0c Cλ 1T (∂兾∂xˆ 0)f (xˆ 0 , ξˆ 0), T ∂J兾∂ξˆ t G(∂兾∂ξˆ t) ψ (ξˆ t)Cλ tC1 (∂兾∂ξˆ t)f (xˆ t , ξˆ t), tG0, 1, . . . , TA1, ∂J兾∂ηˆ t G(∂兾∂ηˆ t) χ (yt , xˆ t , ηˆ t)C(∂兾∂ηˆ t) ω (ηˆ t), tG0, 1, . . . , T,


where the row vectors λ^T_t := ∂J/∂x̂_t can be computed as follows:

λ^T_t = (∂/∂x̂_t)χ(y_t, x̂_t, η̂_t) + ∂J/∂x̂^b_t + ∂J/∂x̂^c_t + λ^T_{t+1} (∂/∂x̂_t)f(x̂_t, ξ̂_t),   t = 1, ..., T−1,
λ^T_T = (∂/∂x̂_T)χ(y_T, x̂_T, η̂_T) + ∂J/∂x̂^c_T.

As to the ability of the optimal parametric estimation functions to approximate the optimal estimation functions arbitrarily well using a small number of parameters, considerations similar to those in Remark 6.2 hold.

Remark 7.1. As in Problem C (see Remark 6.3), in Problem E too the number T of stages may become too large to allow one to associate an AN with each stage. Again, we can discard the old observations and retain the more recent ones. This approach leads to the design of so-called limited-memory state estimators and becomes unavoidable when T tends to infinity. More specifically, at each stage t, we store the vectors y_{t−T̄}, ..., y_t in the memory and then use the estimation cost (21) defined on a sliding window starting from stage t−T̄ and ending at stage t; T̄ denotes the width of the window. The ANs may be optimized in part offline and in part online (Ref. 31). In the offline optimization, the algorithm (8) is driven by stochastic variables generated on the basis of the known probability densities; in the online optimization, (8) is driven stage after stage by the vectors y_{t−T̄}, ..., y_t. The online optimization also allows the ANs to adapt their parameters if the random variables ξ_t and η_t change their statistical characteristics over time. The limited-memory estimator is used in the following example.
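A minimal sketch of the sliding-window bookkeeping just described may help; the window width, the dimensions, and the stand-in estimator are our illustrative assumptions, not the paper's implementation.

```python
# Limited-memory scheme of Remark 7.1: only y_{t-T_bar}, ..., y_t are
# kept, and the estimator (and its online parameter updates) act on
# that window alone. The estimator here is a stub.
from collections import deque
import numpy as np

T_bar, p = 8, 1
window = deque(maxlen=T_bar + 1)          # holds y_{t-T_bar}, ..., y_t

def estimate(ys):
    return np.mean(ys, axis=0)            # stand-in for the AN estimator

rng = np.random.default_rng(2)
for t in range(100):                      # stream of incoming measurements
    window.append(rng.standard_normal(p))
    x_hat = estimate(np.asarray(window))
    # algorithm (8) would update the AN parameters here, driven stage
    # after stage by the windowed measurements (online optimization)
```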

Example 7.1. Target Motion Analysis. Let us assume that a target Q is moving at the constant velocity v_Q, while an observer P, moving at the velocity v^t_P, is trying to track it by using only noisy observations of the line-of-sight angle θ_t (see Fig. 3). Let us denote by r_t the relative position of Q with respect to P. This example falls into the class of nonlinear passive tracking problems known as Bearings-Only Measurement Problems. The following state equation for the system can be found in Ref. 32:

x_{t+1} = A x_t + B u_t + ξ_t,   t = 0, 1, ...,

where the vector x_t is given by the components of the relative position and velocity, that is, x_t := col(r^x_t, v^x_t, r^y_t, v^y_t), ξ_t is the process disturbance vector, and u_t := col(u^x_t, u^y_t) is the acceleration vector. As u_t is generated by the observer P, it plays the role of a control vector.

Fig. 3. (a) Geometric configuration for bearings-only target motion analysis; (b) example of a multileg maneuver of the observer.

The matrices A and B are given by



A = [ 1    δ_T   0     0
      0    1     0     0
      0    0     1     δ_T
      0    0     0     1   ],

B = [ δ_T^2/2    0
      δ_T        0
      0          δ_T^2/2
      0          δ_T      ].
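For concreteness, A and B can be transcribed directly (our transcription, using the value δ_T = 0.1 s reported below); the x and y channels are decoupled double integrators driven by the acceleration u_t.

```python
# Construction of the matrices A and B above (our transcription).
import numpy as np

dT = 0.1
A = np.array([[1.0, dT,  0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, dT ],
              [0.0, 0.0, 0.0, 1.0]])
B = np.array([[dT**2 / 2, 0.0      ],
              [dT,        0.0      ],
              [0.0,       dT**2 / 2],
              [0.0,       dT       ]])
```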

At each stage t, noisy observations of the bearing angle θ_t are provided. Then, the nonlinear observation channel is modeled as follows:

y_t = θ_t + η_t = arctan(r^y_t / r^x_t) + η_t,

where

x_0 ∼ N((30, 5.0, 30, −5.0)^T, Σ_{x_0}),   ξ_t ∼ N(0, Σ_ξ),   η_t ∼ N(0, σ^2_η),   t = 0, 1, ...,

with Σ_{x_0} = diag(0.1, 0.01, 0.1, 0.01), Σ_ξ = diag(0.01, 0.01), σ^2_η = 0.01, and δ_T = 0.1 s. All random vectors are assumed to be mutually independent.

To compare the performances of linear and nonlinear ANs, we have considered the following three cases:

(LRBF) Linear radial basis function network. This linear AN is given by (5), where only the coefficients c_ij are optimized, whereas the centers τ_i and the matrices Γ_i are kept fixed.

(RBF) Radial basis function network. This nonlinear AN is also given by (5), but the centers τ_i and the matrices Γ_i are also included in the parameter vectors κ_i to be optimized.

(NN) One-hidden-layer feedforward neural network. This nonlinear AN is given by (4), where the parameter vector κ_i is made up of the coefficients c_ij, the vectors α_i, and the biases β_i.

The total numbers of parameters to be optimized in the above networks are denoted by N_LRBF, N_RBF, N_NN. Such networks have been chosen for their wide diffusion. Moreover, by using radial basis function networks, we can easily replace linear LRBFs with nonlinear RBFs. LRBFs, RBFs, and NNs have been trained both offline and online (see Remark 7.1) by using the algorithm (8), which has been run in the offline optimizations for 6000, 10,000, and 60,000 iterations for LRBFs, RBFs, and NNs, respectively. The temporal behaviors of the online optimizations are shown in Figs. 4a and 4b. We have set c_1 = 10^{-4}, c_2 = 10^7, and T̄ = 8.
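A compact sketch of the resulting data-generating model may be useful before the numerical comparisons. In the Python fragment below (ours), the observer acceleration profile is a placeholder for the multileg maneuver of Fig. 3b; the text gives Σ_ξ as diag(0.01, 0.01), which we spread over the four state components as a simplifying assumption, and arctan2 is used to resolve the quadrant of arctan(r^y/r^x).

```python
# Data-generating model of Example 7.1 (our transcription): linear
# relative dynamics plus noisy bearing observations.
import numpy as np

rng = np.random.default_rng(3)
dT = 0.1
A = np.array([[1, dT, 0, 0], [0, 1, 0, 0], [0, 0, 1, dT], [0, 0, 0, 1]], float)
B = np.array([[dT**2 / 2, 0], [dT, 0], [0, dT**2 / 2], [0, dT]], float)

x = rng.multivariate_normal([30.0, 5.0, 30.0, -5.0],
                            np.diag([0.1, 0.01, 0.1, 0.01]))
sigma_eta = 0.1                       # so that sigma_eta**2 = 0.01
bearings = []
for t in range(200):
    u = np.array([0.01, -0.01]) * (-1) ** (t // 50)   # placeholder maneuver
    bearings.append(np.arctan2(x[2], x[0]) + sigma_eta * rng.standard_normal())
    x = A @ x + B @ u + rng.normal(0.0, 0.1, size=4)  # process disturbance
```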

The matrices in the cost (21) are given by V_x = diag(0.8, 0.8, 0.8, 0.8), V_y = 0.001, V_ξ = diag(6·10^4, 6·10^4, 6·10^4, 6·10^4), and V_η = 6·10^3. It is assumed that the target is traveling at the constant speed of 0.8 m s^{-1} in the 45° direction, while the observer is traversing a multileg maneuver consisting of periodic changes in the components of u_t (see Fig. 3b). The following setups have been considered to derive the numerical results.

(i) Comparison of LRBFs, RBFs, and NNs with the Same Number of Parameters. We have compared the performances of LRBFs with a given number of parameters with the performances of RBFs and NNs with the same number of parameters. NNs have been implemented by using 10 hidden sigmoidal functions. Clearly, in order to have N_LRBF ≃ N_RBF ≃ N_NN, the number of fixed Gaussian functions in LRBFs has to be much larger than the number of variable Gaussian functions in RBFs and the number of sigmoidal functions in NNs. As can be seen from Fig. 4a, where the RMS errors on r^x_t are plotted, both RBFs and NNs outperform LRBFs with the same number of parameters. It is worth noting that, in Ref. 31, the reported RMS error on r^x_t, estimated by the extended Kalman filter, is much larger than the errors characterizing all the above-described ANs. Moreover, by using such a filter, a divergent estimation behavior on r^x_t was observed; the possibility of this phenomenon was pointed out in Ref. 33.

(ii) Different Influences of Linear and Nonlinear Parameters on the RMS Error. To evaluate the different influences of the linear parameters (i.e., the weighting coefficients of the linear combination of basis functions) and the nonlinear parameters (i.e., those parametrizing the basis functions) on the estimation accuracy, we have considered the following test. A simulation concerning LRBFs with ν = 56 fixed Gaussians and N_LRBF parameters has been run. After 1650 steps, two more simulations have been started. In one simulation, RBFs have been used with the same number ν of parametrized Gaussians as the number of fixed Gaussians in the LRBFs; then, N_RBF > N_LRBF. In other words, the centers τ_i and the matrices Γ_i [see (5)] have become parameters to be optimized. In the other simulation, LRBFs have been used that had been obtained by adding to the previous LRBFs a number of fixed Gaussians such that the total number of parameters was N′_LRBF ≃ N_RBF; then, if ν′ was the new number of Gaussians, we had ν′ > ν. The behaviors of the RMS errors in the three simulations are shown in Fig. 4b. As can be seen, the freedom allowed to the Gaussians of LRBFs yields better performances than those obtained by increasing the number of fixed Gaussians in such networks. Note that, in all the simulations, the matrices Γ_i have been chosen diagonal, hence not exploiting all the potentialities of the parametrized basis functions.

Fig. 4. (a) RMS estimation errors on r^x_t for LRBFs, RBFs, and NNs with the same number of parameters; (b) RMS estimation errors on r^x_t for LRBFs with N_LRBF parameters, LRBFs with N′_LRBF parameters, and RBFs with N_RBF parameters, where N′_LRBF ≃ N_RBF > N_LRBF.
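A rough, back-of-the-envelope accounting (our own, with illustrative input and output dimensions) makes explicit why matching parameter counts forces many more fixed Gaussians, assuming diagonal matrices Γ_i as in the simulations:

```python
# Hypothetical parameter counts for the three ANs, for nu basis
# functions, input dimension d, output dimension m, diagonal Gamma_i.
def n_lrbf(nu, d, m):
    return nu * m                    # only the coefficients c_ij are free

def n_rbf(nu, d, m):
    return nu * (m + 2 * d)          # c_ij, centers tau_i, diagonal Gamma_i

def n_nn(nu, d, m):
    return nu * (m + d + 1)          # c_ij, vectors alpha_i, biases beta_i

d, m = 6, 1                          # illustrative dimensions only
for nu in (10, 56):
    print(nu, n_lrbf(nu, d, m), n_rbf(nu, d, m), n_nn(nu, d, m))
```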

8. Conclusions

Functional optimization problems can be divided into two classes: analytically solvable problems and analytically unsolvable ones. Typically, the former class includes optimization problems for which the standard LQG assumptions are verified and classical information structures are available. For the latter class, the search for new powerful tools of approximate solution is mandatory.

The approximate method described in this paper exhibits the following novel aspects. It extends the classical Ritz method of the calculus of variations: instead of linear combinations of fixed basis functions, the proposed method uses linear combinations of basis functions containing free parameters to be optimized. When such linear combinations are provided with density properties in the set of feasible solutions to the given optimization problem, we call them nonlinear approximating networks. By using nonlinear approximating networks that benefit from rates of approximation with a favorable behavior with respect to the dimension, we have the possibility of escaping from the curse of dimensionality, which may affect the Ritz method. In stochastic frameworks, once the original functional optimization problem has been reduced to a finite-dimensional nonlinear programming problem, we propose to solve it by stochastic approximation algorithms, which allow one to avoid practically impossible averaging operations.

The method has been specialized for the solution of optimal control and state estimation problems. Numerical examples confirm its excellent capabilities for solving functional optimization problems stated in high-dimensional settings. Further research is required both to relate the smoothness assumptions on the feasible solutions and the regularity properties of the cost functional to the choice of the structure of the nonlinear approximating network, and to derive better characterizations of the bounds on the approximation rates. From a qualitative point of view, the structure of a nonlinear approximating network may be associated with a given functional optimization problem; such a structure makes an approximate solution to the problem feasible, that is, an approximate solution can be obtained up to any desired degree of accuracy by using a moderate number of free parameters.


References

1. HO, Y. C., and CHU, K. C., Team Decision Theory and Information Structures in Optimal Control Problems, IEEE Transactions on Automatic Control, Vol. 17, pp. 15–28, 1972.
2. SAGE, A. P., Optimum Systems Control, Prentice-Hall, New York, NY, 1968.
3. MÄKILÄ, P. M., and TOIVONEN, H. T., Computational Methods for Parametric LQ Problems: A Survey, IEEE Transactions on Automatic Control, Vol. 32, pp. 658–671, 1987.
4. PARISINI, T., and ZOPPOLI, R., Neural Approximations for Multistage Optimal Control of Nonlinear Stochastic Systems, IEEE Transactions on Automatic Control, Vol. 41, pp. 889–895, 1996.
5. PARISINI, T., SANGUINETI, M., and ZOPPOLI, R., Nonlinear Stabilization by Receding-Horizon Neural Regulators, International Journal of Control, Vol. 70, pp. 341–362, 1998.
6. PARISINI, T., and ZOPPOLI, R., Neural Approximations for Infinite-Horizon Optimal Control of Nonlinear Stochastic Systems, IEEE Transactions on Neural Networks, Vol. 9, pp. 1388–1408, 1998.
7. ALESSANDRI, A., BAGLIETTO, M., PARISINI, T., and ZOPPOLI, R., A Neural State Estimator with Bounded Errors for Nonlinear Systems, IEEE Transactions on Automatic Control, Vol. 44, pp. 2028–2042, 1999.
8. BEARD, R. W., and MCLAIN, T. W., Successive Galerkin Approximation Algorithms for Nonlinear Optimal and Robust Control, International Journal of Control, Vol. 71, pp. 717–743, 1998.
9. SJÖBERG, J., ZHANG, Q., LJUNG, L., BENVENISTE, A., GLORENNEC, P. Y., DELYON, B., HJALMARSSON, H., and JUDITSKY, A., Nonlinear Black-Box Modeling in System Identification: A Unified Overview, Automatica, Vol. 31, pp. 1691–1724, 1995.
10. LESHNO, M., LIN, V. YA., PINKUS, A., and SCHOCKEN, S., Multilayer Feedforward Networks with a Nonpolynomial Activation Function Can Approximate Any Function, Neural Networks, Vol. 6, pp. 861–867, 1993.
11. GIROSI, F., Regularization Theory, Radial Basis Functions, and Networks, From Statistics to Neural Networks: Theory and Pattern Recognition Applications, Edited by J. H. Friedman, V. Cherkassky, and H. Wechsler, Computer and Systems Sciences Series, Springer Verlag, Berlin, Germany, pp. 166–187, 1993.
12. PARK, J., and SANDBERG, I. W., Universal Approximation Using Radial-Basis-Function Networks, Neural Computation, Vol. 3, pp. 246–257, 1991.
13. GELFAND, I. M., and FOMIN, S. V., Calculus of Variations, Prentice-Hall, Englewood Cliffs, New Jersey, 1963.
14. SIRISENA, H. R., and CHOU, F. S., Convergence of the Control Parametrization Ritz Method for Nonlinear Optimal Control Problems, Journal of Optimization Theory and Applications, Vol. 29, pp. 369–382, 1979.
15. ATTOUCH, H., Variational Convergence for Functions and Operators, Pitman Publishing, London, England, 1984.
16. PINKUS, A., n-Widths in Approximation Theory, Springer Verlag, Berlin-Heidelberg, Germany, 1985.


17. BARRON, A. R., Universal Approximation Bounds for Superpositions of a Sigmoidal Function, IEEE Transactions on Information Theory, Vol. 39, pp. 930–945, 1993.
18. GIROSI, F., JONES, M., and POGGIO, T., Regularization Theory and Neural Networks Architectures, Neural Computation, Vol. 7, pp. 219–269, 1995.
19. GIULINI, S., and SANGUINETI, M., On Dimension-Independent Approximation by Neural Networks and Linear Approximators, Proceedings of the International Joint Conference on Neural Networks, Como, Italy, pp. 283–288, 2000.
20. KŮRKOVÁ, V., and SANGUINETI, M., Comparison of Worst-Case Errors in Linear and Neural Network Approximation, IEEE Transactions on Information Theory, Vol. 48, 2002.
21. KŮRKOVÁ, V., and SANGUINETI, M., Bounds on Rates of Variable-Basis and Neural-Network Approximation, IEEE Transactions on Information Theory, Vol. 47, pp. 2659–2665, 2001.
22. KAINEN, P. C., KŮRKOVÁ, V., and VOGT, A., Approximation by Neural Networks Is Not Continuous, Neurocomputing, Vol. 29, pp. 47–56, 1999.
23. DEVORE, R., HOWARD, R., and MICCHELLI, C., Optimal Nonlinear Approximation, Manuscripta Mathematica, Vol. 63, pp. 469–478, 1989.
24. KUSHNER, H. J., and YIN, G. G., Stochastic Approximation Algorithms and Applications, Springer Verlag, New York, NY, 1997.
25. ERMOLIEV, YU., and WETS, J. B., Editors, Numerical Techniques for Stochastic Optimization, Springer Verlag, Heidelberg, Germany, 1988.
26. PAPAGEORGIOU, M., Applications of Automatic Control Concepts to Traffic Flow Modeling and Control, Lecture Notes in Control and Information Sciences, Springer Verlag, New York, NY, 1983.
27. MESSNER, A., and PAPAGEORGIOU, M., Motorway Network Control via Nonlinear Optimization, Proceedings of the 1st Meeting of the EURO Working Group on Urban Traffic and Transportation, Landshut, Germany, pp. 1–24, 1992.
28. PARISINI, T., and ZOPPOLI, R., A Receding-Horizon Regulator for Nonlinear Systems and a Neural Approximation, Automatica, Vol. 31, pp. 1443–1451, 1995.
29. CHEN, V. C. P., RUPPERT, D., and SHOEMAKER, C. A., Applying Experimental Design and Regression Splines to High-Dimensional Continuous-State Stochastic Dynamic Programming, Operations Research, Vol. 47, pp. 38–53, 1999.
30. BAGLIETTO, M., CERVELLERA, C., PARISINI, T., SANGUINETI, M., and ZOPPOLI, R., Approximating Networks, Dynamic Programming, and Stochastic Approximation, Proceedings of the American Control Conference, Chicago, Illinois, pp. 3304–3308, 2000.
31. ALESSANDRI, A., PARISINI, T., and ZOPPOLI, R., Neural Approximations for Nonlinear Finite-Memory State Estimation, International Journal of Control, Vol. 67, pp. 275–302, 1997.
32. CHAN, Y. T., HU, A. G. C., and PLANT, J. B., A Kalman Filter Based Tracking Scheme with Input Estimation, IEEE Transactions on Aerospace and Electronic Systems, Vol. 15, pp. 237–244, 1978.


33. AIDALA, V. J., Kalman Filter Behavior in Bearings-Only Tracking Applications, IEEE Transactions on Aerospace and Electronic Systems, Vol. 14, pp. 29–39, 1979.
