
POLYNOMIALLY-COMPLEX APPROXIMATING NETWORKS FOR OPTIMAL CONTROL OF FREEWAY TRAFFIC

Marcello Sanguineti
Department of Communications, Computer and System Sciences (DIST)
University of Genoa
Via Opera Pia 13, 16145 Genova, Italy
E-mail: [email protected]


Outline

• A model of freeway traffic
• Statement of a Freeway Traffic Optimal Control (FTOC) problem
• Main features of Problem FTOC:
  −→ functional nature (infinite dimension)
  −→ large number of variables
• A methodology for the approximate solution of functional optimization problems: the Extended Ritz Method (ERIM)
• Solution of Problem FTOC by the ERIM
• Numerical results


Modelling freeway traffic

• Microscopic models represent the behavior of each single vehicle in the traffic stream.
• Macroscopic models represent the behavior of the entire traffic stream in terms of aggregate variables (e.g., density and mean speed). They use continuous variables, in the tradition of fluid-mechanics models.
• We refer to the following model:

v_{i,t+1} = v_{it} + (δT/τ) { V_f b_{it} [1 − (ρ_{it}/ρ_{max})^l]^{m(3−2b_{it})} − v_{it} } + (δT/∆_i) v_{it} (v_{i−1,t} − v_{it})
            − (µ δT)/(τ ∆_i) · (ρ_{i+1,t} − ρ_{it})/(ρ_{it} + χ) − δ_on (δT/∆_i) · (r_{it} v_{it})/(ρ_{it} + χ)

ρ_{i,t+1} = ρ_{it} + (δT/∆_i) [ α (1 − γ_i) ρ_{i−1,t} v_{i−1,t} + (1 − 2α + γ_i α − γ_i) ρ_{it} v_{it} − (1 − α) ρ_{i+1,t} v_{i+1,t} + r_{it} ],

t = 0, 1, . . . , T − 1,   i = 1, . . . , D.


[Figure: a generic freeway section i, with its speed limit b_i, traffic density ρ_i, mean speed v_i, on-ramp queue l_i, on-ramp volume r_i, and demand d_i.]

• D = 30 sections of length ∆_i = 1 km; i: index of the section
• δT = 15 s: sample time interval; T = 60; control horizon: 15 min
• v_{it}: mean traffic speed on section i at time tδT; ρ_{it}: traffic density; r_{it}: on-ramp traffic volume
• l_{it}: queue on the on-ramp of section i ∈ I_r (I_r: set of the indexes of the sections with on/off ramps)
• d_{it}: stochastic demand flow on the on-ramp of section i ∈ I_r
• Dynamics of the queues on the on-ramps:
  l_{i,t+1} = l_{it} + δT (d_{it} − r_{it}),   t = 0, 1, . . . , T − 1,   i ∈ I_r.


• b_{it}: speed limits set by variable message signs; r_{it}: traffic lights metering the on-ramp traffic volume.
• Demands d_{it}: mutually independent and uniformly distributed over the interval D̄ = [r_{i max}/2 − δ, r_{i max}/2 + δ], with δ = 500 veh/h.
• State variables: mean traffic speed, traffic density, queues on the on-ramps
  x_t = col ( v_{it}, i = 1, . . . , D; ρ_{it}, i = 1, . . . , D; l_{it}, i ∈ I_r ).
• Control variables: traffic lights on the on-ramps, speed limits
  u_t = col ( r_{it}, i ∈ I_r; b_{it}, i = 1, . . . , D ).
• =⇒ the model can be described by a discrete-time nonlinear state equation
  x_{t+1} = f (x_t, u_t, d_t),   t = 0, 1, . . . , T − 1.
• All the state variables are measurable by means of detectors affected by noise:
  y_t = x_t + η_t,   t = 0, 1, . . . , T − 1.
• Each component of η_t is uniformly distributed over an interval centered at the mean value of the corresponding state component; the length of each interval is 40% of the range defined by the constraints on that state component.
• Initial state x_0: uniformly distributed over the set X_0 of states describing all possible initial congestion situations.


• Components of the stochastic vectors x_0, d_0, . . . , d_{T−1}, η_0, . . . , η_{T−1}: mutually independent.
• Constraints:

  0 ≤ ρ_{it} ≤ ρ_{i max},   t = 0, 1, . . . , T − 1,   i = 1, . . . , D   (1)
  0 ≤ v_{it} ≤ v_{i max},   t = 0, 1, . . . , T − 1,   i = 1, . . . , D   (2)
  0 ≤ l_{it} ≤ l_{i max},   t = 0, . . . , T − 1,   i ∈ I_r   (3)
  0.7 ≤ b_{it} ≤ 1,   t = 0, 1, . . . , T − 1,   i = 1, . . . , D   (4)
  r_{i min t} ≤ r_{it} ≤ r_{i max t},   t = 0, 1, . . . , T − 1,   i ∈ I_r   (5)

  where
  r_{i min t} = max { r_{i min}, d_{it} − (1/T)(l_{i max} − l_{it}) },   r_{i max t} = min { r_{i max}, d_{it} + (1/T) l_{it} }.

  r_{i min} and r_{i max}: fixed parameters dependent on the road characteristics.

• Cost functional: total time spent on the freeway and in the queues

  J = δT Σ_{t=0}^{T−1} [ Σ_{i=1}^{D} ρ_{it} ∆_i + Σ_{i∈I_r} l_{it} ]

• Sections 3, 9, 15, 21, and 27 have on-ramps and off-ramps =⇒ five control variables r_{it}, i ∈ I_r. The variable message signs are placed in sections 1, 7, 13, 19, and 25, and each of them acts on 6 successive sections =⇒ five more control variables b_{it} for the speed limits.
• =⇒ dim (x_t) = 65 and dim (u_t) = 10.


[Figure: control-system layout — the traffic flow crosses sections 1, . . . , i, . . . , D; measuring detectors on the sections feed the Control System, which sets the variable message signs and the on-ramp traffic lights.]


Parameter values of the model

• α = 0.9
• V_f = 123 km/h
• l = 4
• m = 1.4
• τ = 0.01 h
• µ = 21.6 km²/h
• χ = 20 veh/km
• δ_on = 0.1
• ρ_{i max} = ρ_{max} = 200 veh/km, v_{i max} = 200 km/h, i = 1, . . . , D
• l_{i max} = 200 veh, r_{i min} = 0 veh/h, r_{i max} = 20000 veh/h, for each on-ramp
• γ_i = 0.1 for each off-ramp
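The dynamics above are easy to sanity-check numerically. Below is a minimal Python sketch of one simulation step of the model with the parameter values listed here; the function name and the array layout are our own, the ramp arrays are assumed to be zero on sections without ramps, and the boundary conditions for sections 1 and D (handled here with a simple wrap-around) are omitted from the original and would need the model's actual upstream/downstream conditions.

```python
import numpy as np

# Parameter values from the slides
D = 30
dT = 15.0 / 3600.0                 # sample time 15 s, in hours (speeds are in km/h)
Delta = np.full(D, 1.0)            # section lengths [km]
alpha, Vf, l, m = 0.9, 123.0, 4.0, 1.4
tau, mu, chi, d_on = 0.01, 21.6, 20.0, 0.1
rho_max = 200.0

def step(v, rho, lq, r, b, d, gam):
    """One step of the macroscopic freeway model.
    v, rho: mean speed / density per section; lq: on-ramp queues;
    r: on-ramp volumes (zero where there is no on-ramp);
    b: speed limits in [0.7, 1]; d: ramp demands;
    gam: off-ramp rates (zero where there is no off-ramp)."""
    vm, vp = np.roll(v, 1), np.roll(v, -1)        # v_{i-1}, v_{i+1} (interior sections)
    rhom, rhop = np.roll(rho, 1), np.roll(rho, -1)
    # Desired speed, shaped by the variable speed limit b
    V = Vf * b * (1.0 - (rho / rho_max) ** l) ** (m * (3.0 - 2.0 * b))
    v_new = (v + dT / tau * (V - v)
             + dT / Delta * v * (vm - v)
             - mu * dT / (tau * Delta) * (rhop - rho) / (rho + chi)
             - d_on * dT / Delta * r * v / (rho + chi))
    rho_new = rho + dT / Delta * (alpha * (1 - gam) * rhom * vm
                                  + (1 - 2 * alpha + gam * alpha - gam) * rho * v
                                  - (1 - alpha) * rhop * vp + r)
    lq_new = lq + dT * (d - r)                    # on-ramp queue dynamics
    return v_new, rho_new, lq_new
```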




• x_t = col ( v_{it}, i = 1, . . . , D; ρ_{it}, i = 1, . . . , D; l_{it}, i ∈ I_r ): mean traffic speed, traffic density, queues on the on-ramps
• u_t = col ( r_{it}, i ∈ I_r; b_{it}, i = 1, . . . , D ): traffic lights on the on-ramps, speed limits
• x_{t+1} = f (x_t, u_t, d_t),   y_t = x_t + η_t.

• Information vector: I_t = col (y_0, . . . , y_t, u_0, . . . , u_{t−1}),   t = 0, 1, . . . , T − 1
• Cost:

  J = δT Σ_{t=0}^{T−1} [ Σ_{i=1}^{D} ρ_{it} ∆_i + Σ_{i∈I_r} l_{it} ]

Problem FTOC. Find the optimal traffic control law u°_t = µ°_t (I_t), t = 0, 1, . . . , T − 1, which minimizes the expected value of the cost J, the initial traffic condition x_0 being a stochastic vector uniformly distributed on a set X_0 of initial conditions.

Two main features of Problem FTOC:
– the admissible solutions are functions, i.e., elements of an infinite-dimensional space =⇒ Problem FTOC is a functional optimization problem;
– the admissible functions depend on a large number of variables (dim (x_t) = 65).


General Stochastic Functional Optimization Problem

• z ∈ Ω ⊆ IR^q: stochastic vector with a known probability distribution; z describes the so-called “state of the world”.
• In general, the Decision Maker (DM) cannot observe z perfectly: it generates its decisions on the basis of an information vector x, which we assume to be a known function of z.
• H: infinite-dimensional real linear space of functions (“ambient space”) γ(·): B → IR^{n2}, where B ⊆ IR^{n1}.
• S ⊆ H: set of admissible solutions (decision functions).
• Cost functional F: S → IR,
  F(γ) = E_z {J[γ(x), z]},   where J: IR^{n2} × Ω → IR.

Problem P:   inf_{γ∈S} E_z {J[γ(x), z]}

• A functional optimization problem is a triple (H, S, F).


• When there is a plurality of decision functions, as in the case of the T optimal freeway traffic control functions, we generalize Problem P as follows:

Problem PM:   inf_{(γ_1, . . . , γ_M) ∈ S_M} E_z {J[γ_1(x_1), . . . , γ_M(x_M), z]}

• S_M ⊆ H_1 × . . . × H_M.
• H_i, i = 1, . . . , M: real spaces of functions γ_i: B_i → IR^{n2i}, B_i ⊆ IR^{n1i}.
• J: IR^{n21} × . . . × IR^{n2M} × Ω → IR: cost function.
• The properties of interest to us that hold true for Problem P can be extended directly to Problem PM =⇒ in the following we address Problem P.


What makes Problem P hard:
• the infinite dimension of the ambient space: the admissible solutions are functions, not vectors in IR^n =⇒ mathematical programming tools are inadequate =⇒ need for a suitable mathematical framework and new analytic tools;
• the large number of variables =⇒ risk that classical numerical approaches incur the “curse of dimensionality” =⇒ need for solution procedures yielding algorithms that remain efficient when the admissible solutions depend on a large number of variables.


Examples of stochastic functional optimization problems

• −→ Stochastic optimal control
• Optimal estimation of parameters and state vectors
• Cooperative stochastic games (team optimal control)
• Fault detection in presence of disturbances
• ...

Functional optimization problems in large-scale traffic networks (=⇒ large number of variables):

• Computer networks extending over large geographical areas
• −→ Freeway systems −→ Problem FTOC
• Store-and-forward packet-switching communication networks
• Reservoir networks in multi-reservoir systems
• Queueing networks in manufacturing systems
• ...


• In general, stochastic functional optimization problems can be solved analytically if
  i) they are LQG, and
  ii) in presence of multiple DMs, they have a partially nested information structure (i.e., every DM can reconstruct the information owned by the DMs whose actions influenced its own information).
• Problem FTOC does not satisfy these conditions.
  ⇓
  Need for a procedure of approximate solution that remains efficient for a very large dimension of the state vector.


Towards the approximate solution of Problem P

• We are interested in algorithms of approximate solution that are efficient even when the number n_1 of components of the vector x is very large.
• Basic idea:
  (i) Constrain the functions γ(x) ∈ S to take on the form γ_ν(x, w_ν), where γ_ν(·, ·) has a fixed structure, ν is a positive integer, and w_ν is a vector of parameters to be determined; ν “measures the complexity” of the fixed-structure functions γ_ν(x, w_ν), in the sense specified later on.
  (ii) Substitute γ_ν(x, w_ν) into the cost functional F(γ), thus obtaining the cost function
  F̂_ν(w_ν) = F[γ_ν(x, w_ν)].

=⇒ for each ν, we have a finite-dimensional nonlinear programming problem

Problem P_ν:   inf_{w_ν ∈ W_ν} F̂_ν(w_ν),   where W_ν = {w_ν: γ_ν(x, w_ν) ∈ S} is the set of admissible vectors w_ν.


• Problems P_ν, ν = 1, 2, . . ., can (hopefully) be solved by NLP algorithms.
• By a “smart” choice of the fixed-structure functions γ_ν(x, w_ν), Problem P can be replaced by a sequence of approximating problems P_ν that are “much easier” to solve.
• The idea has similarities with “Specific Optimal Control” and “Parametric Controllers” (e.g., the Levine-Athans and Anderson-Moore methods) and, particularly, with the classical Ritz method of the calculus of variations.


One-hidden-layer networks

• As to the choice of the functions γ_ν(x, w_ν), it is natural to consider fixed-structure functions given by linear combinations of fixed or parameterized basis functions, i.e.,

  γ_{νj}(x, w_ν) = Σ_{i=1}^{ν} c_{ij} φ_i(x),   c_{ij} ∈ IR,   j = 1, . . . , n_2   (6)

or

  γ_{νj}(x, w_ν) = Σ_{i=1}^{ν} c_{ij} φ(x, κ_i),   c_{ij} ∈ IR, κ_i ∈ IR^k,   j = 1, . . . , n_2.   (7)

• The components of w_ν are given by the coefficients c_{ij} and the components of κ_i.
• A_ν = { γ_ν(x, w_ν): w_ν ∈ IR^{N(ν)} }.

Def.: A sequence {A_ν}_{ν=1}^{∞}, where the functions γ_ν take on the structure (6) or (7), is called an OHL scheme. The functions of each A_ν are called one-hidden-layer networks (OHL networks): linear OHL networks if they take on the structure (6), nonlinear OHL networks if they take on the structure (7).

• “Networks”: the functions γ_ν(x, w_ν) have the same structure as feedforward neural networks with one hidden layer of ν computational units and linear output units.
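As an illustration, here is a small Python sketch of both structures: a linear OHL network (6) with fixed basis functions and a nonlinear one (7) with parameterized sigmoidal ridge units. The class layout and initialization are our own, not part of the method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class LinearOHL:
    """Structure (6): gamma_nu(x) = sum_i c_i * phi_i(x), with fixed basis phi_i."""
    def __init__(self, basis, n2):
        self.basis = basis                        # list of nu fixed functions IR^n1 -> IR
        self.C = np.zeros((len(basis), n2))       # coefficients c_ij (the only parameters)

    def __call__(self, x):
        Phi = np.array([phi(x) for phi in self.basis])   # (nu,)
        return Phi @ self.C                               # (n2,)

class NonlinearOHL:
    """Structure (7): gamma_nu(x) = sum_i c_i * phi(x, kappa_i),
    here with sigmoidal ridge units phi(x, kappa) = sigma(x'alpha + beta)."""
    def __init__(self, nu, n1, n2, rng=np.random.default_rng(0)):
        self.A = rng.normal(size=(nu, n1))        # inner weights alpha_i (part of kappa_i)
        self.beta = rng.normal(size=nu)           # biases beta_i (part of kappa_i)
        self.C = rng.normal(size=(nu, n2))        # outer coefficients c_ij

    def __call__(self, x):
        return sigmoid(self.A @ x + self.beta) @ self.C

# N(nu) for the nonlinear structure: nu*n2 outer + nu*(n1+1) inner parameters
net = NonlinearOHL(nu=5, n1=3, n2=2)
print(net(np.ones(3)))
```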


• w_ν ∈ IR^{N(ν)}, where N(ν) = dim(w_ν) is the number of parameters to determine.
• For the structure (6): N(ν) = ν · n_2.
• For the structure (7): N(ν) = ν · n_2 + ν · k, where k = dim(κ_i).
• Then, for a fixed dimension of x, the number ν of basis functions “measures the complexity” of the OHL networks.
• OHL networks include widespread nonlinear approximators commonly used in applications:
  – radial-basis-function networks
  – feedforward neural networks

    γ_{νj}(x, w_ν) = Σ_{i=1}^{ν} c_{ij} σ(xᵀα_i + β_i),

    σ(·) sigmoidal (bounded measurable function on the real line such that lim_{z→+∞} σ(z) = 1, lim_{z→−∞} σ(z) = 0)
  – trigonometric polynomials with free frequencies and phases
  – free-node splines, etc.


Examples of OHL networks

• Ridge construction. We “shrink” the n_1-dimensional vector x into a one-dimensional variable by an inner product:
  φ(x, κ_i) = h(xᵀα_i + β_i),   where κ_i = col(α_i, β_i).
  −→ sigmoidal neural networks with one hidden layer and linear output activation functions:
  φ(x, κ_i) = σ(xᵀα_i + β_i),
  σ(·) sigmoidal (bounded measurable function on the real line such that lim_{z→+∞} σ(z) = 1, lim_{z→−∞} σ(z) = 0).

• Radial construction. x is “shrunk” into a scalar variable by some norm:
  φ(x, κ_i) = h(‖x − τ_i‖²_{Γ_i}),
  where ‖x‖²_{Γ_i} = xᵀΓ_i x, Γ_i = Γ_iᵀ, Γ_i > 0, and κ_i = col(τ_i, distinct elements of Γ_i).
  −→ Gaussian basis function:
  φ(x, κ_i) = e^{−‖x−τ_i‖²_{Γ_i}}.
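The two constructions translate directly into code; the short sketch below (our notation) implements one ridge unit and one Gaussian radial unit with a weighted norm, assuming a symmetric positive definite Γ.

```python
import numpy as np

def ridge_unit(x, alpha, beta, h=np.tanh):
    """Ridge construction: the inner product shrinks x to a scalar."""
    return h(x @ alpha + beta)

def radial_unit(x, tau, Gamma):
    """Radial construction: a weighted norm shrinks x to a scalar;
    Gaussian outer function h(t) = exp(-t)."""
    diff = x - tau
    return np.exp(-(diff @ Gamma @ diff))   # exp(-||x - tau||^2_Gamma)

x = np.array([0.5, -1.0])
print(ridge_unit(x, np.array([1.0, 2.0]), 0.1))
print(radial_unit(x, np.zeros(2), np.eye(2)))
```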


The Extended Ritz Method (ERIM) in a nutshell

• It is a general methodology that we have developed for the solution of functional optimization problems −→ Zoppoli & Parisini ’92; Parisini & Zoppoli ’94; Parisini, Sanguineti & Zoppoli ’96; Zoppoli, Sanguineti & Parisini ’02; Sanguineti ’02; Kůrková & Sanguineti ’02b.
• It reduces a functional optimization problem to a sequence of nonlinear programming problems, by means of nonlinear OHL networks γ_ν(x, w_ν):

  inf_{γ(x)∈S} F(γ)   −→   inf_{γ_ν(x,w_ν)∈S∩A_ν} F(γ_ν(x, w_ν)) = inf_{w_ν∈W_ν} F̂_ν(w_ν),

  γ_{νj}(x, w_ν) = Σ_{i=1}^{ν} c_{ij} φ(x, κ_i),   c_{ij} ∈ IR, κ_i ∈ IR^k,   j = 1, . . . , n_2,

  W_ν = {w_ν: γ_ν(x, w_ν) ∈ S}: set of admissible parameter values.
• It makes it possible to obtain an approximate solution to Problem P within any desired degree of accuracy, by using OHL networks of “moderate” complexity −→ Zoppoli, Sanguineti & Parisini ’02; Kůrková & Sanguineti ’02b; Sanguineti ’02; Baglietto, Sanguineti & Zoppoli ’03.


• Linear OHL networks, as used by the classical Ritz method, often incur the “curse of dimensionality”: for a desired accuracy ε, the number of basis functions can be forced to grow exponentially with the number d of variables of the admissible solutions:

  ν ≥ (C/ε)^d.

• In the ERIM, the number of basis functions (hence the number of free parameters) grows at most polynomially with d:

  ν ≤ (C/ε)^{1/q} d^{p/q}

  =⇒ possibility of avoiding the curse of dimensionality.
• It exploits approximation properties of OHL networks with suitable basis functions −→ Jones ’92, Barron ’93, Breiman ’93, Girosi ’93, Mhaskar ’95, Kůrková & Sanguineti ’01a, Kůrková & Sanguineti ’01b, Kůrková & Sanguineti ’02.
• It is based on the introduction of suitable classes of OHL networks, having a “moderate” complexity (rate of growth at most polynomial in d):
  −→ S-polynomially-complex networks
  −→ polynomially-complex P-optimizing networks
  −→ Zoppoli, Sanguineti & Parisini ’02; Kůrková & Sanguineti ’02b; Sanguineti ’02; Baglietto, Sanguineti & Zoppoli ’03.
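To make the contrast between the two complexity bounds above concrete, the toy computation below evaluates them for some illustrative constants (C = 2, p = q = 1, accuracy ε = 0.1); the constants are ours, chosen only to show the gap in growth with d.

```python
# Illustrative comparison of the two complexity bounds (constants are arbitrary)
C, eps, p, q = 2.0, 0.1, 1.0, 1.0

for d in (2, 5, 10, 20):
    nu_linear = (C / eps) ** d                        # lower bound: exponential in d
    nu_erim = (C / eps) ** (1.0 / q) * d ** (p / q)   # upper bound: polynomial in d
    print(f"d={d:2d}  linear: nu >= {nu_linear:.3e}   ERIM: nu <= {nu_erim:.1f}")
```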


Classes of networks obtained from OHL networks:

  OHL networks
  −→ function approximation: H-dense networks =⇒ S-polynomially-complex networks
  −→ functional optimization: P-optimizing networks =⇒ polynomially-complex P-optimizing networks


Dense networks

• As we are interested in approximating the functions γ(x) ∈ S by the functions γ_ν(x, w_ν), we have to equip the linear space H with a norm ‖·‖.
• Then we shall consider the linear normed space H = (H, ‖·‖).
• Accordingly, the set S is replaced by S ⊆ H.
• H-density property: the functions γ_ν(x, w_ν) are such that the set ∪_{ν=1}^{+∞} A_ν is dense in H.

Def.: The OHL networks that are provided with the H-density property are defined as H-dense networks (H-DNs).

• Linear H-dense networks: the basis functions do not contain the parameter vectors κ_i (fixed basis functions).
• Nonlinear H-dense networks: the vectors κ_i are present in the basis functions (parameterized basis functions).


• Examples of commonly used spaces:
  −→ The space C(K, IR^{n2}) of continuous functions γ(x): K → IR^{n2}, K ⊂ IR^{n1} compact, equipped with the supremum norm ‖γ‖_∞ = max_{x∈K} ‖γ(x)‖, i.e.,
  C(K, IR^{n2}) = (C(K, IR^{n2}), ‖·‖_∞)
  −→ The space L_2(K, IR^{n2}) of measurable, square-integrable functions γ(x): K → IR^{n2}, equipped with the L_2 norm ‖γ‖_2 = ( ∫_K ‖γ(x)‖² dx )^{1/2}, i.e.,
  L_2(K, IR^{n2}) = (L_2(K, IR^{n2}), ‖·‖_2)


Examples of H-dense networks

• An OHL network is a C-dense network if the set ∪_{ν=1}^{+∞} A_ν is dense in C(K, IR^{n2}).
• An OHL network is an L_2-dense network if the set ∪_{ν=1}^{+∞} A_ν is dense in L_2(K, IR^{n2}).

−→ Algebraic and trigonometric polynomials are examples of C- and L_2-dense networks.
−→ Feedforward neural networks with one hidden layer of suitable activation functions and linear output units, RBF networks with suitable radial functions, trigonometric polynomials with variable frequencies and phases, and hinging hyperplanes are examples of nonlinear C- and L_2-dense networks.


From Problem P to a sequence of nonlinear programming problems

• Consider the sets A_ν ∩ S = { γ_ν(x, w_ν): w_ν ∈ W_ν ⊆ IR^{N(ν)} }.
• The sequence {A_ν}_{ν=1}^{∞} has the nested structure A_1 ⊂ A_2 ⊂ · · · ⊂ A_ν ⊂ · · ·, and similarly for the sequence {A_ν ∩ S}_{ν=1}^{∞}.


[Figure: the nested sets A_1 ⊂ A_2 ⊂ · · · ⊂ A_ν ⊂ · · · inside the space H, intersecting the set S of admissible solutions.]


• By substituting γ_ν ∈ S ⊆ H = (H, ‖·‖) into F(γ), the functional F(γ) reduces to a function
  F̂_ν(w_ν) = F[γ_ν(·, w_ν)].

=⇒ Then, for ν = 1, 2, . . ., we have a sequence of “approximating” nonlinear programming problems

Problem P_ν:   inf_{w_ν∈W_ν} F̂_ν(w_ν)

• “Extended Ritz method” (ERIM) (−→ Zoppoli & Parisini ’92; Parisini & Zoppoli ’94; Parisini, Sanguineti & Zoppoli ’96; Zoppoli, Sanguineti & Parisini ’02).
• If the approximating networks are linear, then w_ν is given by all the coefficients of the linear combinations of fixed basis functions, and the sequence of Problems P_ν coincides with the one obtained by the classical Ritz method.
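A minimal worked example of the reduction: take the toy functional F(γ) = E_{x,z}[(γ(x) − sin(πx) − z)²] with x uniform on [−1, 1] and z a small noise, restrict γ to a sigmoidal OHL network, and minimize F̂_ν(w_ν) by stochastic gradient descent. Everything here (target, noise level, step size, unit count) is an assumption made up for illustration, not part of the FTOC setup.

```python
import numpy as np

rng = np.random.default_rng(1)
nu = 10                                     # number of basis functions
A = rng.normal(size=nu); b = rng.normal(size=nu); c = np.zeros(nu)

def gamma_nu(x):                            # OHL network, scalar input/output
    return np.tanh(A * x + b) @ c

# SGD on F_nu(w) = E_{x,z}[(gamma_nu(x) - sin(pi x) - z)^2], w = (c, A, b)
lr = 0.05
for k in range(20000):
    x = rng.uniform(-1, 1); z = 0.1 * rng.normal()
    hidden = np.tanh(A * x + b)
    err = hidden @ c - np.sin(np.pi * x) - z
    gc = 2 * err * hidden                   # gradient w.r.t. outer coefficients c
    gh = 2 * err * c * (1 - hidden ** 2)    # backpropagated through tanh
    c -= lr * gc; A -= lr * gh * x; b -= lr * gh
print(gamma_nu(0.5), np.sin(np.pi * 0.5))   # the two values should be close
```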


• Suppose that Problem P and Problems P_ν admit solutions γ° and γ°_1, . . . , γ°_ν, . . . in S and A_1 ∩ S, . . . , A_ν ∩ S, . . ., respectively.
• Problem P_ν solves Problem P approximately if
  i) optimal solutions γ°_1, γ°_2, . . . can be determined such that
     lim_{ν→+∞} γ°_ν = γ°,   that is,   lim_{ν→+∞} ‖γ°_ν − γ°‖ = 0 in the norm of H,
  ii) F( lim_{ν→+∞} γ°_ν ) = F(γ°).
• If lim_{ν→∞} F(γ°_ν) = F(γ°), then {γ°_ν} is called a “minimizing sequence”.


Remark

• H-dense networks ensure only that the optimal solution γ° to Problem P is an accumulation point of some sequence {γ_ν} of OHL networks, but not necessarily of {γ°_ν}.
• Proving that γ°_ν −→ γ° as ν → ∞ is a difficult task, deeply related to the structural properties of the functional F.
• The concept best suited to the convergence of the sequence of approximating Problems P_ν, ν = 1, 2, . . ., is epi-convergence: if the epigraphs of Problems P_ν converge to the epigraph of Problem P, then both the sequences {F°_ν} and {γ°_ν} converge to F° and γ°, respectively.


[Figure: convergence of epigraphs — the epigraphs E_ν, E_{ν+1}, . . . of Problems P_ν converge to the epigraph E of Problem P; the minima F°_ν, F°_{ν+1} and the minimizers γ°_ν, γ°_{ν+1} (over A_ν ∩ S, A_{ν+1} ∩ S) approach F° and γ° in S.]


Property: If
• the set ∪_{ν=1}^{+∞} A_ν is dense in S,
• Problem P and Problems P_ν admit solutions γ° and γ°_1, . . . , γ°_ν, . . . in S and A_1 ∩ S, . . . , A_ν ∩ S, . . ., respectively, and
• the functional F(γ) is continuous (in the norm of H),
then
  lim_{ν→∞} F°_ν = F°.


Two significant versions of Problem P_ν

• The cost function F̂_ν is given by the expected value of a cost Jν(w_ν, z) with respect to a random vector z (stochastic functional optimization):

  Problem P_ν′:   inf_{w_ν∈W_ν} F̂_ν(w_ν) = inf_{w_ν∈W_ν} E_z [Jν(w_ν, z)].

• The cost function F̂_ν is given by the supremum of a cost Jν(w_ν, z) with respect to z:

  Problem P_ν″:   inf_{w_ν∈W_ν} F̂_ν(w_ν) = inf_{w_ν∈W_ν} sup_z [Jν(w_ν, z)].


Bounding the complexity of OHL networks

• The density property is only a sort of necessary condition to be satisfied by OHL networks.
• The rate of approximation is the relationship between the approximation accuracy achievable by a given OHL scheme and the “complexity” of the scheme, measured by the number ν of basis functions (recall that N(ν) = ν n_2 and N(ν) = ν(k + n_2) are linear in ν).


• Let the elements of the space H be scalar functions γ: IR^d → IR, i.e., let n_1 = d and n_2 = 1 in Problem P. (A multi-output network can always be thought of as the parallel of n_2 scalar networks.)
• The worst-case error of approximation of functions in S by networks belonging to A_ν can be measured by the deviation of S from the set A_ν ⊂ H:

  δ(S, A_ν) = sup_{γ∈S} inf_{γ_ν∈A_ν} ‖γ − γ_ν‖.

• For linear approximating networks: Kolmogorov ν-width of S in H,

  d_ν(S) = inf_{H_ν} δ(S, span{φ_1, . . . , φ_ν}),

  where the infimum is taken over all ν-dimensional subspaces H_ν of H spanned by ν linearly independent basis functions φ_1, . . . , φ_ν ∈ H.


S-polynomially-complex networks

• We want to avoid OHL schemes in which the growth of the complexity ν with d, for a given approximation accuracy ε, is too fast, i.e., schemes that require an unacceptably large number ν of fixed or parameterized basis functions for large values of d.
• Distinction based on the behavior of d_ν(S) and δ(S, A_ν):

1. An upper bound on d_ν(S) or δ(S, A_ν) of order O(d^p/ν^q) exists, where p, q ∈ IR^+. Then, for a fixed approximation accuracy ε (i.e., a worst-case error not exceeding ε), we have

  ν ≤ (C/ε)^{1/q} d^{p/q}

  =⇒ favorable behavior of the network with d: the number ν of basis functions necessary to guarantee a fixed approximation accuracy ε has to grow at most as a power of d.


2. There exists a lower bound of order O(1/ν^{1/d}). Then

  ν ≥ (C/ε)^d

  =⇒ the number ν of basis functions is required to grow exponentially with d in order to obtain an approximation accuracy ε =⇒ curse of dimensionality =⇒ unfeasibility in high-dimensional settings.

Def.: The OHL networks corresponding to item 1 are defined as S-polynomially-complex networks.

• Roughly speaking, S-polynomially-complex networks make the problem of approximating the elements of S “computationally feasible”.


Examples of S-polynomially-complex networks

• An OHL network “behaves well” when the functions of S to be approximated belong to a smoothness class that is specific for each network.
• Let us focus on nonlinear OHL networks with a rate of approximation of the order of O(1/√ν) in suitable admissible sets S.
• Examples:

1. Sigmoidal neural networks with one hidden layer and linear output activation functions:

  γ_ν(x, w_ν) = Σ_{i=1}^{ν} c_i σ(xᵀα_i + β_i)

  • γ such that ∫_{IR^d} |ω| |γ̃(ω)| dω ≤ c, with c growing at most polynomially with d.
  • L_2 error of the order of O(1/√ν), obtained by using O(c νd) “free” parameters (−→ Barron ’93).


2. Linear combinations of sinusoidal functions with “variable” frequencies:

  γ_ν(x, w) = Σ_{i=1}^{ν} c_i sin(xᵀα_i + β_i)

  • γ such that ∫_{IR^d} |γ̃(ω)| dω ≤ c, with c growing at most polynomially with d.
  • L_2 error of the order of O(1/√ν), obtained by using O(c νd) “free” parameters (−→ Jones ’92).

3. Radial basis functions:

  γ_ν(x, w) = Σ_{i=1}^{ν} c_i e^{−‖x−τ_i‖²/(2σ_i²)}

  • γ ∈ W₁^{2m}(IR^d) with 2m > d, i.e., such that its partial derivatives of order up to 2m are integrable.
  • Error of the order of O(1/√ν) in the supremum norm, obtained by using O(νd) “free” parameters (−→ Girosi ’93).

• =⇒ 1), 2) and 3) are S-polynomially-complex networks for sets S of admissible functions whose elements satisfy a specific smoothness condition.


• In general, it is important that “free” parameters appear also as arguments of the basis functions:

  γ_ν(x, w_ν) = Σ_{i=1}^{ν} c_i φ(x, κ_i)


Examples of S-polynomially-complex networks

  Functional space          Approximation scheme A_ν = { Σ_{i=1}^{ν} c_i φ(x, κ_i) }   Smoothness class of functions with rate O(1/√ν)
  1  L_2(K)                 φ(x, κ) = sin(xᵀα + β)                                     Θ = {f: ∫_{IR^d} |f̃(ω)| dω < +∞}
  2  L_2(K)                 φ(x, κ) = σ(xᵀα + β)                                       Γ = {f: ∫_{IR^d} |ω| |f̃(ω)| dω < +∞}
  3  L_2(K)                 φ(x, κ) = |xᵀα + β|_+                                      Λ = {f: ∫_{IR^d} |ω|² |f̃(ω)| dω < +∞}
  4  L_p(K), 1 ≤ p ≤ ∞      φ(x, κ) = ξ(xᵀα + β)                                       W_p^{d/2}(K)
  5  L_∞(IR^d)              φ(x, κ) = B_m(‖x − τ‖)                                     W_1^{2m}(IR^d), 2m > d
  6  L_2(IR^d)              φ(x, κ) = e^{−‖x−τ‖²/σ²}                                   W_1^{2m}(IR^d), 2m > d

• The function ξ: IR → IR in entry 4 has to satisfy a technical condition (Mhaskar ’95); for example, the squashing function, the generalized multiquadrics, the thin plate splines, and the Gaussian function are allowed.
• In entry 5, B_m is a Bessel potential, i.e., B̃_m(ω) = 1/(1 + |ω|²)^{m/2}, m > 0.
• W_p^s(K) is the Sobolev space of order s in L_p(K) norm, K is a compact subset of IR^d, and f̃ denotes the Fourier transform of f.


• Squashing function:
  ξ(t) = 1/(1 + e^{−t})
• Generalized multiquadrics:
  ξ(t) = (1 + t²)^α,   α ∈ Z^+
• Thin plate splines:
  ξ(t) = t^{2q−d} log t if d is even,   ξ(t) = t^{2q−d} if d is odd,
  where q ∈ Z^+ and q > d/2
• Gaussian function:
  ξ(t) = e^{−t²}
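For reference, the four admissible ξ above translate directly into code; the q and d arguments for the thin plate spline are left to the caller, exactly as in the definition.

```python
import numpy as np

def squashing(t):
    """Squashing function 1 / (1 + e^{-t})."""
    return 1.0 / (1.0 + np.exp(-t))

def generalized_multiquadric(t, alpha):
    """Generalized multiquadric (1 + t^2)^alpha."""
    return (1.0 + t ** 2) ** alpha

def thin_plate_spline(t, q, d):
    """Thin plate spline; requires q > d/2, log branch for even dimension d."""
    if d % 2 == 0:
        return t ** (2 * q - d) * np.log(t)
    return t ** (2 * q - d)

def gaussian(t):
    """Gaussian function e^{-t^2}."""
    return np.exp(-t ** 2)
```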


• OHL networks become S-polynomially-complex networks in smoothness classes S defined by regularity assumptions specific to each type of basis functions.
• Each pair (OHL network, normed linear space) determines a class of sets S ⊂ H for which approximation by that OHL network is “parsimonious”.
• Given a normed linear space H, in general there are non-void intersections but no inclusions among the different sets S ⊂ H whose functions can be approximated with the same rate by different OHL networks (−→ Giulini & Sanguineti ’00).


P-optimizing networks

• In approximating Problem P by Problems P_ν, it seems quite natural to require that
  1) the sequence {γ°_ν}_{ν=1}^{∞} is such that

     lim_{ν→∞} F(γ°_ν) = F°   (8)

     (i.e., {γ°_ν}_{ν=1}^{∞} is a minimizing sequence for F over S);
  2) the sequence {γ°_ν}_{ν=1}^{∞} has a limit function γ°, i.e.,

     lim_{ν→∞} ‖γ°_ν − γ°‖ = 0,   (9)

     where ‖·‖ is the norm of H.
• Then we define the following class of OHL networks:

Def.: Given an instance of Problem P, the OHL networks such that both limits (8) and (9) hold are defined as P-optimizing networks (P-ONs).


Polynomially-complex P-optimizing networks

• We require that:
  1) the sequence {γ°_ν}_{ν=1}^{∞} is such that

     F(γ°_ν) − F° ≤ O(d^{p′}/ν^{q′}),   (10)

     where p′, q′ ∈ IR^+;
  2) the sequence {γ°_ν}_{ν=1}^{∞} has a limit function γ° such that

     ‖γ°_ν − γ°‖ ≤ O(d^p/ν^q),   (11)

     where p, q ∈ IR^+ and ‖·‖ is the norm of H.

Def.: Given an instance of Problem P, the OHL networks such that both bounds (10) and (11) hold are defined as polynomially-complex P-optimizing networks (polynomially-complex P-ONs).


• Polynomially-complex P-ONs make the approximate solution of Problem P computationally feasible: it can be performed up to any desired degree of accuracy using a “moderate” number of basis functions =⇒ an OHL scheme {A_ν}_{ν=1}^{∞} of “moderate complexity”.
• To construct classes of polynomially-complex P-ONs one has to take into account not only the structure of H and S but also the properties of the functional F: S → IR, i.e., the whole triple (H, S, F).
• Preliminary results in this direction have recently been obtained in −→ Kainen, Kůrková & Sanguineti ’02a; Sanguineti ’02; and Kůrková & Sanguineti ’02b, where classes of OHL networks have been constructed that are polynomially-complex P-ONs for classes of functionals F: S → IR.


Choice of the OHL scheme – Example: sigmoidal neural networks

  γ_ν(x, w_ν) = Σ_{i=1}^{ν} c_i σ(xᵀα_i + β_i),

σ(·) sigmoidal (bounded measurable function on the real line such that lim_{z→+∞} σ(z) = 1, lim_{z→−∞} σ(z) = 0).

• N(ν) = ν(d + 2)


Rates of approximation of sigmoidal OHL neural networks

  G_c = { γ: IR^d → IR such that ∫_{IR^d} |ω| |γ̃(ω)| dω ≤ c },

where γ̃(ω) is the Fourier transform of γ, |ω| = (ωᵀω)^{1/2}, c > 0 (smoothness class by Barron ’93).

• In general, c increases with d; suppose at most polynomially (examples in Barron ’93).
• In L_2(B_1^d, IR), where B_1^d denotes the closed ball of radius 1 centered at the origin of IR^d, the deviation from A_ν is bounded from above by

  δ(G_c, A_ν) ≤ 2c (1/ν)^{1/2}.

• =⇒ As N(ν) = ν(d + 2), the number of parameters required to achieve an L_2 approximation error of the order of O(1/√ν) is O(c² ν d), hence it grows only polynomially with d (−→ Barron ’93).

Therefore, for d-variable functions γ ∈ G_c, the number of sigmoidal units necessary to obtain a given accuracy ε increases at most polynomially with d = dim(x) −→ no curse of dimensionality.
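Inverting the deviation bound gives the unit count needed for a prescribed L_2 accuracy: ν ≥ (2c/ε)², with N(ν) = ν(d + 2) parameters. The quick computation below, with an illustrative assumption that c grows linearly with d, shows the polynomial growth in d.

```python
import math

def units_for_accuracy(c, eps):
    """Smallest nu with 2c*sqrt(1/nu) <= eps, from delta(Gc, A_nu) <= 2c/sqrt(nu)."""
    return math.ceil((2 * c / eps) ** 2)

eps = 0.1
for d in (5, 20, 65):                  # 65 = dim(x_t) in Problem FTOC
    c = d                              # assume c grows linearly with d (illustrative)
    nu = units_for_accuracy(c, eps)
    print(f"d={d:3d}  nu={nu}  parameters N(nu)={nu * (d + 2)}")
```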


Comparison with linear OHL networks in G_c

• Instead, in L_2([0, 1]^d, IR), the ν-width of G_c is bounded from below by

  d_ν(G_c) ≥ (k c / d) (1/ν)^{1/d},

  where k ≥ 1/(8πe^{π−1}).
• This means that, for a given accuracy ε, the number ν of basis functions has to grow exponentially with d (−→ Barron ’93).

Therefore −→ curse of dimensionality.

A more general comparison between the rates of linear and nonlinear OHL networks is investigated in −→ Kůrková & Sanguineti ’01a, ’01b, and ’02a.


• Consider functions to be approximated belonging to Sobolev spaces W_2^{(s)} (class of functions with partial derivatives of order up to s belonging to L_2), with s ≥ d/2 + 2.
• In the space W_2^{(s)}, linear OHL networks do not exhibit the curse of dimensionality.
• As W_2^{(s)} ⊂ G_c for some c, neural OHL networks should behave better than linear ones in the difference set G_c \ W_2^{(s)}.

[Figure: the set W_2^{(s)} contained in the larger set G_c.]


[Figure: comparison between sigmoidal neural OHL networks and linear OHL networks for the sets G_c, in the case c = αd (α > 0): the upper bound 2c(1/ν)^{1/2} = 2αd(1/ν)^{1/2} on δ(G_c, A_ν) for neural approximating networks vs. the lower bound kα(1/ν)^{1/d} on d_ν(G_c) for linear approximating networks.]


ERIM: final comments

• We suggest the use of linear combinations of basis functions containing “free” parameters to be optimized (ERIM, Extended Ritz Method).
• We suggest the use of OHL approximating networks that benefit from rates of approximation with a favorable behavior with respect to the dimension.
• Then we have the possibility of avoiding the curse of dimensionality.
• Any functional optimization problem (H, S, F) “requires” its own nonlinear OHL network.
• A suitable choice of OHL networks makes the approximate solution of (H, S, F) feasible: any desired degree of accuracy can be obtained by using a “moderate” number of parameters.
• Further research is required to relate both 1) the smoothness assumptions on the admissible solutions and 2) the regularity properties of the cost functional to the choice of the structure of the nonlinear OHL network −→ preliminary results in Kainen, Kůrková & Sanguineti ’02a and ’02b; Sanguineti ’02; and Kůrková & Sanguineti ’02b.


Reduction of Problem FTOC to a nonlinear programming problem by the ERIM

  γ_t(I_t) =⇒ u_t = γ̂_{ν t}(I_t, w_t),   a nonlinear OHL network with w_t ∈ IR^{N(ν_t)}

• By substituting the state and measurement equations into the cost function, only the “primitive” random variables remain:

  x_0,   d = col (d_0, . . . , d_{T−1}),   η = col (η_0, . . . , η_{T−1}),

  together with the stacked parameter vector w = col (w_0, . . . , w_{T−1}).

• For any ν ∈ Z^+, Problem FTOC takes on the form:

  inf_w F̂_ν(w) = inf_w E_{x_0, d, η} [Jν(w, x_0, d, η)]


• To determine the free parameters we use a stochastic approximation algorithm:

  w_{k+1} = w_k − α_k ∇_w Jν[w_k, x_0(k), d(k), η(k)],   k = 0, 1, . . .

• Two steps alternate until convergence:
  1. random selection of the vectors x_0(k), d(k), η(k) and computation of the corresponding state trajectory x_t;
  2. in correspondence with this state trajectory, computation of the gradient ∇_w Jν[w_k, x_0(k), d(k), η(k)] and of the new parameter vector w_{k+1}.
• The gradient ∇_w Jν[w_k, x_0(k), d(k), η(k)] is computed by applying backpropagation to all T neural networks, via the adjoint recursion

  λ_tᵀ = ∂h(x_t, u_t)/∂x_t + λ_{t+1}ᵀ ∂f(x_t, u_t, d_t)/∂x_t + Σ_{j=t}^{T−1} (∂Jν/∂y_j) ∂g(x_t, η_t)/∂x_t,   t = 0, 1, . . . , T − 1,

  λ_Tᵀ = ∂h_T(x_T)/∂x_T,

  where λ_t = ∇_{x_t} Jν(w, x_0, d, η),   t = 0, 1, . . . , T.
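A compact sketch of the resulting training loop is given below. It assumes a scenario sampler `sample_scenario` and a gradient routine `grad_J` that simulates the trajectory under the OHL control law and implements the adjoint/backpropagation computation above; both names are placeholders for the actual model, and the diminishing step-size schedule is one common choice, not the one used in the experiments.

```python
import numpy as np

def train(w0, sample_scenario, grad_J, alpha0=0.01, iters=300_000):
    """Stochastic approximation: w_{k+1} = w_k - alpha_k * grad_w J_nu(w_k, x0, d, eta).

    sample_scenario(): draws (x0, d, eta) from their distributions.
    grad_J(w, x0, d, eta): simulates the state trajectory under the OHL control
    law and returns the gradient of the sampled cost (adjoint computation).
    """
    w = w0.copy()
    for k in range(iters):
        x0, d, eta = sample_scenario()          # step 1: random scenario + trajectory
        g = grad_J(w, x0, d, eta)               # step 2: gradient along the trajectory
        alpha_k = alpha0 / (1.0 + k / 1000.0)   # diminishing step size (one common choice)
        w -= alpha_k * g
    return w
```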


• Since we determine the free parameters by an unconstrained stochastic approximation algorithm, we handle the constraints by a penalty-function approach.
• For the state constraints (1), (2), and (3) =⇒ we add to the cost the terms

  K_ρ {[max (0, −ρ_{it})]² + [max (0, ρ_{it} − ρ_{i max})]²},
  K_v {[max (0, −v_{it})]² + [max (0, v_{it} − v_{i max})]²},
  K_l {[max (0, −l_{it})]² + [max (0, l_{it} − l_{i max})]²},

  where K_ρ, K_v, and K_l are positive scalars.
• Similar terms are added for the control constraints.
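In code, the added penalty terms are one line each; the default weights below are placeholders for the values tuned in the experiments.

```python
import numpy as np

def state_penalty(rho, v, lq, rho_max, v_max, l_max, K_rho=1.0, K_v=1.0, K_l=1.0):
    """Quadratic penalties for violated state constraints (1)-(3)."""
    def viol(x, lo, hi):
        # squared violation of the box constraint lo <= x <= hi
        return np.maximum(0.0, lo - x) ** 2 + np.maximum(0.0, x - hi) ** 2
    return (K_rho * viol(rho, 0.0, rho_max).sum()
            + K_v * viol(v, 0.0, v_max).sum()
            + K_l * viol(lq, 0.0, l_max).sum())
```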


Numerical results – Case 1: no disturbances on the measurements

• We have compared the solution by the ERIM with the so-called CEOLF (“certainty-equivalent open-loop feedback”) approach proposed in Messner & Papageorgiou ’92.
• CEOLF consists in a receding-horizon optimization, performed on line periodically:
  −→ given the state x_t at stage t, an optimal control problem is solved over a number T′ of stages by a nonlinear programming technique (we have set T′ = 5);
  −→ a sequence of traffic control functions u°_t, . . . , u°_{t+T′−1} is derived;
  −→ the first control u°_t of the sequence is applied;
  −→ the state evolves to x_{t+1} and the on-line procedure is repeated;
  −→ at each on-line optimization, the random demands d_{it} are replaced with their expected values, thus obtaining a certainty-equivalent open-loop feedback (CEOLF) traffic control strategy.


• For the ERIM we have used nonlinear OHL networks corresponding to feedforward neural networks with ν = 45 sigmoidal basis functions:

  γ_{νj}(x, w_ν) = Σ_{i=1}^{ν} c_{ij} σ(xᵀα_i + β_i),

  σ(·) sigmoidal (bounded measurable function on the real line such that lim_{z→+∞} σ(z) = 1, lim_{z→−∞} σ(z) = 0).
• Convergence of the algorithm for the optimization of the parameters of the OHL networks: ∼ 3 · 10⁵ iterations.
• The surfaces of ρ_{it} and v_{it} under the traffic control functions obtained by the ERIM are nearly the same as those obtained by the CEOLF strategy (increase in the cost: 0.4%).
• Basic difference between CEOLF and ERIM:
  −→ in CEOLF the computational effort is performed on line: a nonlinear programming problem has to be solved at each stage t to generate u°_t =⇒ the dynamics has to be sufficiently slow with respect to the speed of the computing system;
  −→ in ERIM the computational effort is performed off line =⇒ the decision maker generates “almost instantaneously” the optimal traffic control vector for any x_t belonging to the set of admissible states.


[Figure (a, b): evolutions of the traffic density ρ_{it} (veh/km) and of the mean speed v_{it} (km/h) over sections i = 1, . . . , 30 and time t = 0, . . . , 15 min, in clearing a severe congestion on section 11 under the action of the control law obtained by the ERIM with the neural OHL scheme. Random disturbances on the dynamic system and perfect measurements of the state vector.]


[Figure: comparison of the traffic density (veh/km) and the mean traffic speed (km/h) along sections k = 1, . . . , 30 at stage t = 30, under the actions of the CEOLF traffic control law (curves labeled “OLO”) and of the traffic control law obtained by the ERIM with the neural OHL scheme (curves labeled “RHON”).]


Numerical results – Case 2: disturbances on the measurements

• y_t = x_t + η_t, with mutually independent stochastic variables; each component of η_t is uniformly distributed over a suitable interval.
• Large number of stages =⇒ instead of storing the entire information vector I_t, we used a “limited-memory” traffic control function u_t = γ̂¹_t (y_t, µ_{t−1}, w¹_t), where µ_t is a “contracted estimate” of I_t generated by the recursive mapping µ_t = γ̂²_t (y_t, u_t, µ_{t−1}, w̃²_t) (with dim (µ_t) = 75).
• Nonlinear OHL networks corresponding to feedforward neural networks with ν = 50 sigmoidal basis functions.
• Convergence of the algorithm for the optimization of the parameters of the OHL networks: ∼ 4 · 10⁵ iterations.
• Under a variety of initial conditions, the OHL control law drives the mean traffic speed to a neighborhood of the so-called “free speed” V_f = 123 km/h, corresponding to very low traffic densities.
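The limited-memory scheme is a pair of coupled OHL networks evaluated recursively. The sketch below shows only the data flow, with `control_net` and `memory_net` standing in for the two trained networks γ̂¹ and γ̂² (their internals are the sigmoidal OHL structure described above).

```python
import numpy as np

def run_limited_memory(y_seq, mu0, control_net, memory_net):
    """u_t = control_net(y_t, mu_{t-1});  mu_t = memory_net(y_t, u_t, mu_{t-1}).

    mu_t is a fixed-size "contracted estimate" of the growing information
    vector I_t (dim(mu_t) = 75 in the experiments)."""
    mu, controls = mu0, []
    for y in y_seq:
        u = control_net(np.concatenate([y, mu]))        # gamma^1: control law
        mu = memory_net(np.concatenate([y, u, mu]))     # gamma^2: memory update
        controls.append(u)
    return controls
```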


[Figure (c, d): evolutions of the traffic density ρ_{it} (veh/km) and of the mean speed v_{it} (km/h) in clearing a severe congestion on section 11 under the action of the traffic control law obtained by the ERIM with the neural OHL scheme. Random disturbances on the dynamic system and noisy measurements of the state vector.]


[Figure: evolutions of the traffic density ρ_{kt} (veh/km) and of the mean speed v_{kt} (km/h) over sections k = 1, . . . , 30 and time 0–30 min, while a congestion affecting sections 11 to 21 is being cleared under the action of the traffic control law obtained by the ERIM with the neural OHL scheme.]


Final comments

• Problem FTOC is a functional optimization problem.
• The admissible functions depend on a large number of variables.
• In the example: a state vector with 65 components! =⇒ even if the state were perfectly measurable, this rules out dynamic programming techniques based on state-space discretization.
• The applicability of classical numerical approaches to functional optimization problems is limited to rather small dimensions (curse of dimensionality).
• Solution of FTOC by the ERIM:
  −→ use of nonlinear OHL schemes with a “moderate” complexity (growth at most polynomial in the number of variables)
  −→ possibility of avoiding the curse of dimensionality.

Recipe: functional optimization problem (H, S, F) =⇒ class of nonlinear OHL schemes that make its approximate solution computationally feasible.


...more about dynamic programming applied to the solution of functional optimization problems...

• Recently, an approach based on dynamic programming combined with splines has allowed the solution of an inventory-forecasting problem with up to nine state variables −→ Chen, Ruppert & Shoemaker ’99. This was the largest dimension of the state vector successfully faced in this problem until then.
• First results on the application of the ERIM to such a problem have allowed us to exceed that value −→ Baglietto, Cervellera, Parisini, Sanguineti & Zoppoli ’01.


References

• (Baglietto, Cervellera, Parisini, Sanguineti & Zoppoli ’01) M. Baglietto, C. Cervellera, T. Parisini, M. Sanguineti, R. Zoppoli, “Approximating Networks for the Solution of T-Stage Stochastic Optimal Control Problems”, Proc. IFAC Workshop on Adaptation and Learning in Control and Signal Processing, pp. 107-114, 2001.
• (Baglietto, Sanguineti & Zoppoli ’03) M. Baglietto, M. Sanguineti, R. Zoppoli, “Facing the Curse of Dimensionality by the Extended Ritz Method in Stochastic Functional Optimization: Dynamic Routing in Traffic Networks”, in Applied Optimization, G. Di Pillo, A. Murli (Eds.), Kluwer Academic Publishers, pp. 22-55, forthcoming.
• (Baglietto, Parisini & Zoppoli ’01) M. Baglietto, T. Parisini, R. Zoppoli, “Distributed-Information Neural Control: The Case of Dynamic Routing in Traffic Networks”, IEEE Trans. Neural Networks, vol. 12, pp. 485-502, 2001.
• (Barron ’93) A.R. Barron, “Universal Approximation Bounds for Superpositions of a Sigmoidal Function”, IEEE Trans. on Information Theory, vol. 39, pp. 930-945, 1993.


• (Breiman ’93) L. Breiman, “Hinging Hyperplanes for Regression, Classification, and Function Approximation”, IEEE Trans. on Information Theory, vol. 39, no. 3, pp. 993-1013, 1993.
• (Chen, Ruppert & Shoemaker ’99) V.C.P. Chen, D. Ruppert, C.A. Shoemaker, “Applying Experimental Design and Regression Splines to High-Dimensional Continuous-State Stochastic Dynamic Programming”, Operations Research, vol. 47, pp. 38-53, 1999.
• (Girosi ’93) F. Girosi, “Regularization Theory, Radial Basis Functions and Networks”, in From Statistics to Neural Networks. Theory and Pattern Recognition Applications, J.H. Friedman, V. Cherkassky, H. Wechsler (Eds.), Computer and Systems Sciences Series, Springer-Verlag, Berlin, Germany, pp. 166-187, 1993.
• (Giulini & Sanguineti ’00) S. Giulini, M. Sanguineti, “On Dimension-Independent Approximation by Neural Networks and Linear Approximators”, Proc. Int. Joint Conf. on Neural Networks, pp. I283-I288, 2000.
• (Jones ’92) L.K. Jones, “A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training”, Annals of Statistics, vol. 20, pp. 608-613, 1992.


• (Kainen, Kůrková & Sanguineti ’02a) P.C. Kainen, V. Kůrková, M. Sanguineti, “Minimization of Error Functionals Over Variable-Basis Functions”, SIAM Journal on Optimization, to appear.
• (Kainen, Kůrková & Sanguineti ’02b) P.C. Kainen, V. Kůrková, M. Sanguineti, “Rates of Approximate Minimization of Error Functionals Over Boolean Variable-Basis Functions”, submitted to IMA Journal of Management Mathematics, 2002.
• (Kůrková & Sanguineti ’01a) V. Kůrková, M. Sanguineti, “Bounds on Rates of Variable-Basis and Neural-Network Approximation”, IEEE Trans. Information Theory, vol. 47, pp. 2659-2665, 2001.
• (Kůrková & Sanguineti ’01b) V. Kůrková, M. Sanguineti, “Tight Bounds on Rates of Variable-Basis Approximation Via Estimates of Covering Numbers”, Research Report ICS-00-830, Institute of Computer Science, Academy of Sciences of the Czech Republic, 2000.
• (Kůrková & Sanguineti ’02a) V. Kůrková, M. Sanguineti, “Comparison of Worst-Case Errors in Linear and Neural Network Approximation”, IEEE Trans. Information Theory, vol. 48, pp. 264-275, 2002.


• (Kůrková & Sanguineti ’02b) V. Kůrková, M. Sanguineti, “Error Estimates for Approximate Optimization Over Variable-Basis Functions”, Research Report, Institute of Computer Science, Academy of Sciences of the Czech Republic, 2002.
• (Messner & Papageorgiou ’92) A. Messner, M. Papageorgiou, “Motorway Network Control via Nonlinear Optimization”, Proc. 1st Meeting of the EURO Working Group on Urban Traffic and Transportation, Landshut, Germany, pp. 1-24, 1992.
• (Mhaskar ’95) H.N. Mhaskar, “Neural Networks for Optimal Approximation of Smooth and Analytic Functions”, Neural Computation, vol. 8, pp. 164-177, 1996.
• (Payne ’71) H.J. Payne, “Models of Freeway Traffic and Control”, Simulation Council Proc., vol. 1, pp. 51-61, 1971.
• (Parisini & Zoppoli ’94) T. Parisini, R. Zoppoli, “Neural Networks for Feedback Feedforward Nonlinear Control Systems”, IEEE Trans. Neural Networks, vol. 5, pp. 436-449, 1994.
• (Sanguineti ’02) M. Sanguineti, “Error Estimates for Approximate Solution of Optimization Problems by Approximating Networks”, Workshop on Mathematical Diagnostics, Erice, Italy, 2002.


• (Zoppoli & Parisini ’92) R. Zoppoli, T. Parisini, “Learning Techniques and Neural Networks for the Solution of N-Stage Nonlinear Nonquadratic Optimal Control Problems”, in Systems, Models and Feedback: Theory and Applications, A. Isidori, T.J. Tarn (Eds.), Birkhäuser, Boston, pp. 193-210, 1992.
• (Parisini, Sanguineti & Zoppoli ’96) R. Zoppoli, T. Parisini, M. Sanguineti, “Neural Approximators for Functional Optimization”, Proc. 35th IEEE Conference on Decision and Control (CDC), pp. 3290-3293, 1996.
• (Zoppoli, Sanguineti & Parisini ’02) R. Zoppoli, M. Sanguineti, T. Parisini, “Approximating Networks and Extended Ritz Method for the Solution of Functional Optimization Problems”, Journal of Optimization Theory and Applications, vol. 112, pp. 403-440, 2002.
