OPTIMIZATION

Derivative-Based Optimization

João Miguel da Costa Sousa / Alexandra Moutinho

System Identification

Goal: Determine a mathematical model for an unknown system (or target system) by observing its input-output data pairs.

Purposes:
 To predict a system's behavior, as in time series prediction and weather forecasting;
 To explain the interactions and relationships between the inputs and outputs of a system;
 To design a controller based on the model of a system, as in aircraft or ship control;
 To simulate the system under control once the model is known.

System Identification

The problem of determining a mathematical model for an unknown system by observing its input-output data pairs is generally referred to as system identification.

Two top-down steps:
 Structure identification
 Parameter identification

Structure identification

Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is to be conducted; this class of models is denoted by a function y = f(u;θ) with:
 y the model output
 u the input vector
 θ the parameter vector

f depends on the problem at hand, on the designer's experience and on the laws of nature governing the target system.

Parameter identification

The structure of the model is known; however, we need to apply optimization techniques to determine the parameter vector θ = θ̂ such that the resulting model ŷ = f(u, θ̂) describes the system appropriately:

y_i − ŷ_i → 0, with the output y_i assigned to the input u_i

[Figure: identification scheme — each input u_i is applied both to the target system to be identified (output y_i) and to the mathematical model f(u;θ) (output ŷ_i); the error y_i − ŷ_i drives the identification technique that adjusts θ.]
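To make the parameter identification step concrete, here is a minimal sketch (not from the original slides) that assumes a hypothetical model class ŷ = f(u, θ) = θ0 + θ1·u, which is linear in its parameters, and invented data; it fits θ̂ so that the errors y_i − ŷ_i vanish in the least-squares sense:

import numpy as np

# Invented input-output pairs (u_i, y_i) observed from the target system.
u = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# Assumed model class (structure identification): y_hat = f(u, theta) = theta[0] + theta[1]*u.
def f(u, theta):
    return theta[0] + theta[1] * u

# Parameter identification: choose theta so that y_i - y_hat_i -> 0 in the least-squares sense.
# For this linear-in-parameters class the optimum is given by ordinary least squares.
A = np.column_stack([np.ones_like(u), u])        # regressor matrix
theta_hat, *_ = np.linalg.lstsq(A, y, rcond=None)

print("theta_hat =", theta_hat)
print("residuals =", y - f(u, theta_hat))

For model classes that are nonlinear in θ, the same criterion leads to the derivative-based methods covered in the remainder of this section.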

System Identification

System identification needs to do both structure and parameter identification repeatedly until a satisfactory model is found:
1. Specify and parameterize a class of mathematical models representing the system to be identified;
2. Perform parameter identification to choose the parameters that best fit the training data set;
3. Conduct a validation test to see if the model identified responds correctly to an unseen data set;
4. Terminate the procedure once the results of the validation test are satisfactory. Otherwise, select another class of models and repeat steps 2 to 4.

Optimization

❑ Derivative-based (nonlinear) optimization
  ▫ Descent Methods
  ▫ The Method of Steepest Descent
  ▫ Newton's Methods (NM)
❑ Derivative-free optimization
  ▫ Simplex
  ▫ Simulated Annealing
  ▫ Genetic Algorithms
  ▫ Ant Colony Optimization
  ▫ ...

Derivative-based optimization

Goal: Solving nonlinear minimization problems using derivative information.

They are used in:
 Optimization of nonlinear neuro-fuzzy models
 Neural network learning
 Regression analysis in nonlinear models

We cover:
 Gradient-based optimization techniques
 Steepest descent methods
 Newton methods
 Conjugate gradient methods
 Nonlinear least-squares problems

Descent methods

Goal: Determine a point θ = θ* = [θ1*, θ2*, …, θn*]^T such that f(θ1, θ2, …, θn) is minimum at θ = θ*.

 We are looking for a local and not necessarily a global minimum θ*;
 Let f(θ1, θ2, …, θn) = E(θ1, θ2, …, θn) be the objective function;
 The search for this minimum is performed along a certain direction d, starting from an initial value θ = θ0 (iterative scheme!):

θnext = θnow + ηd   (η > 0 is a step size regulating the search in the direction d)

or: θk+1 = θk + ηk dk   (k = 1, 2, …)

 The series {θk}, k = 1, 2, …, should converge to a local minimum θ*;
 We first need to determine the next direction d and then compute the step size η.
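A minimal skeleton of the iterative scheme θk+1 = θk + ηk dk (illustrative, with an invented quadratic objective, a steepest-descent direction and a fixed step size standing in for the direction and step-size procedures discussed next):

import numpy as np

def descent(grad, theta0, direction, step_size, max_iter=100, tol=1e-8):
    """Generic descent scheme: theta_{k+1} = theta_k + eta_k * d_k.

    direction(theta, g) must return a downhill direction d (with g^T d < 0);
    step_size(theta, d)  must return eta > 0 (e.g. by line minimization).
    """
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        g = grad(theta)
        if np.linalg.norm(g) < tol:        # practical stopping rule on the gradient length
            break
        d = direction(theta, g)            # first determine the direction d ...
        eta = step_size(theta, d)          # ... then compute the step size eta
        theta = theta + eta * d
    return theta

# Illustrative objective E(theta) = (theta1 - 1)^2 + 10*(theta2 + 2)^2, minimum at [1, -2].
grad = lambda th: np.array([2.0 * (th[0] - 1.0), 20.0 * (th[1] + 2.0)])

theta_star = descent(grad, theta0=[0.0, 0.0],
                     direction=lambda th, g: -g,     # steepest-descent direction
                     step_size=lambda th, d: 0.05)   # fixed step size, for simplicity
print(theta_star)                                    # close to [1, -2]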

ηk dk is called the k-th step, whereas ηk is the k-th step size;

We should have E(θnext) = E(θnow + ηd) < E(θnow);

The principal differences between the various descent algorithms lie in the first procedure, i.e. how the successive directions d are determined;

Once d is determined, η is computed by line minimization:

η* = arg min_{η > 0} φ(η), where φ(η) = E(θnow + ηd)

Gradient-based methods

Definition: The gradient of a differentiable function E: ℝn → ℝ at θ is the vector of first derivatives of E, denoted as g. That is:

g(θ) = ∇E(θ) = [ ∂E(θ)/∂θ1, ∂E(θ)/∂θ2, …, ∂E(θ)/∂θn ]^T

Based on a given gradient, downhill directions adhere to the following condition for feasible descent directions (it does not guarantee convergence of the algorithms):

φ'(0) = dE(θnow + ηd)/dη |_{η=0} = g^T d = ‖g‖ ‖d‖ cos(ξ(θnow)) < 0

where ξ is the angle between g and d, and ξ(θnow) is the angle between gnow and d at the point θnow.

The previous equation is justified by the Taylor series expansion:

E(θnow + ηd) = E(θnow) + η g^T d + O(η²)

A class of gradient-based descent methods has the following form, in which feasible descent directions can be found by gradient deflection:

θnext = θnow − ηGg   (η > 0, G a positive definite matrix)   (*)

Gradient deflection consists of multiplying the gradient g by a positive definite matrix (pdm) G:

d = −Gg ⇒ g^T d = −g^T G g < 0
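A small illustration (not from the slides) of the gradient-deflection idea d = −Gg: for any positive definite G the product g^T d = −g^T G g is negative, so d is a feasible descent direction. A finite-difference gradient is used only to keep the example self-contained; the objective and G are invented:

import numpy as np

def numerical_gradient(E, theta, h=1e-6):
    """Central finite-difference approximation of g(theta) = dE/dtheta."""
    theta = np.asarray(theta, dtype=float)
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (E(theta + e) - E(theta - e)) / (2.0 * h)
    return g

# Illustrative objective and a positive definite deflection matrix G.
E = lambda th: (th[0] - 1.0) ** 2 + 10.0 * (th[1] + 2.0) ** 2
G = np.array([[1.0, 0.0],
              [0.0, 0.1]])      # pdm; G = I would give plain steepest descent

theta_now = np.array([0.0, 0.0])
g = numerical_gradient(E, theta_now)
d = -G @ g                      # gradient deflection: d = -G g

# Feasible-descent check: g^T d = -g^T G g must be negative.
print("g^T d =", g @ d)         # < 0, so d is a downhill direction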

Gradient-based methods

Theoretically, we wish to determine a value θnext such that (a necessary but not sufficient condition):

g(θnext) = ∂E(θ)/∂θ |_{θ=θnext} = 0

but this is difficult to solve. In practice, we stop the algorithm if:
1. The objective function value is sufficiently small;
2. The length of the gradient vector g is smaller than a threshold;
3. The specified computation time is exceeded.

The Method of Steepest Descent

Despite its slow convergence, this method is the most frequently used nonlinear optimization technique, due to its simplicity.

If G = I (identity matrix), then equation (*) expresses the steepest descent scheme:

θnext = θnow − ηg

If cos ξ = −1 (meaning that d points in the same direction as the vector −g), then the objective function E can be decreased locally by the biggest amount at the point θnow.

Therefore, the negative gradient direction (−g) points in the locally steepest downhill direction. This direction may not be a shortcut to reach the minimum point θ*.

If the contours of the objective function E form hyperspheres (or circles in a two-dimensional space), steepest descent leads to the minimum in a single step. Otherwise the method does not reach the minimum point in a single step.

However, if steepest descent uses the line minimization technique (min φ(η)), then φ'(η) = 0:

φ'(η) = dE(θnow − η gnow)/dη = −∇^T E(θnow − η gnow) gnow = −gnext^T gnow = 0

⇒ gnext is orthogonal to the current gradient vector gnow.
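A sketch of steepest descent with exact line minimization on an invented quadratic objective E(θ) = ½ θ^T A θ − b^T θ; for such an E the optimal step along d = −g has the closed form η = (g^T g)/(g^T A g), and the printout shows that successive gradients are (numerically) orthogonal, as stated above:

import numpy as np

# Quadratic objective E(theta) = 0.5 theta^T A theta - b^T theta (illustrative values).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])
b = np.array([1.0, -2.0])
grad = lambda th: A @ th - b

theta = np.array([0.0, 0.0])
g_prev = None
for k in range(20):
    g = grad(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    if g_prev is not None:
        # Successive gradients are orthogonal: g_next^T g_now ~ 0.
        print(f"k={k:2d}  g_next^T g_now = {g @ g_prev:+.2e}")
    eta = (g @ g) / (g @ A @ g)      # exact line minimization for a quadratic E
    theta = theta - eta * g          # steepest descent step: theta_next = theta_now - eta*g
    g_prev = g

print("theta* ~", theta, " exact:", np.linalg.solve(A, b))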

Newton's Methods

Classical Newton's Method

Principle: the descent direction d is determined by using the second derivatives of the objective function E, if available.

If the starting position θnow is sufficiently close to a local minimum, the objective function E can be approximated by a quadratic form:

E(θ) ≅ E(θnow) + g^T (θ − θnow) + ½ (θ − θnow)^T H (θ − θnow)

where H = ∇²E(θ) = ∂²E/∂θ² is the Hessian matrix.

Since this equation defines a quadratic function E(θ) in the neighborhood of θnow, its minimum θ̂ can be determined by differentiating and setting the result to 0, which gives:

0 = g + H(θ̂ − θnow), equivalent to θ̂ = θnow − H⁻¹g

It is a gradient-based method for η = 1 and G = H⁻¹.

Only when the minimum point θ̂ of the approximated quadratic function is chosen as the next point θnext do we have the so-called Newton's Method (NM), or Newton-Raphson method:

θnext = θnow − H⁻¹g

If H is positive definite and E(θ) is quadratic, then NM reaches a local minimum directly in a single Newton step (−H⁻¹g).

If E(θ) is not quadratic, then the minimum may not be reached in a single step and NM should be iteratively repeated.
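A sketch of the classical Newton iteration θnext = θnow − H⁻¹g on an invented smooth convex objective; the Hessian is given analytically and the linear system H·step = g is solved rather than inverting H explicitly:

import numpy as np

# Illustrative smooth, convex objective: E(theta) = log(cosh(theta1)) + theta2^2.
grad = lambda th: np.array([np.tanh(th[0]), 2.0 * th[1]])
hess = lambda th: np.array([[1.0 / np.cosh(th[0]) ** 2, 0.0],
                            [0.0,                        2.0]])

theta = np.array([0.5, 3.0])          # starting point close enough to the minimum
for k in range(10):
    g, H = grad(theta), hess(theta)
    if np.linalg.norm(g) < 1e-10:
        break
    step = np.linalg.solve(H, g)      # solve H*step = g instead of forming H^{-1}
    theta = theta - step              # Newton step: theta_next = theta_now - H^{-1} g

print("theta* ~", theta)              # approaches the minimum at [0, 0]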

Step Size Determination

Formula of a class of gradient-based descent methods:

θnext = θnow + ηd = θnow − ηGg

This formula entails effectively determining the step size η; the efficiency of the step size determination affects the entire minimization process.

φ'(η) = 0, with φ(η) = E(θnow + ηd), is often impossible to solve analytically.

Initial bracketing

We assume that the search area (or specified interval) contains a single relative minimum, i.e. E is unimodal over the closed interval.

Determining the initial interval in which a relative minimum must lie is of critical importance. Two possible schemes exist; one of them, by function evaluation, consists of finding three points that satisfy E(θk−1) > E(θk) < E(θk+1), with θk−1 < θk < θk+1 (see the bracketing sketch after the line-search discussion below).

Line searches

The process of determining the η* that minimizes a one-dimensional function φ(η) is achieved by searching on the line for the minimum.

Line search algorithms usually include two components: sectioning (or bracketing) and polynomial interpolation.

 Newton's method
When φ(ηk), φ'(ηk) and φ''(ηk) are available, the classical Newton method (defined by θ̂ = θnow − H⁻¹g) can be applied to solving the equation φ'(ηk) = 0:

ηk+1 = ηk − φ'(ηk) / φ''(ηk)   (**)
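A sketch of the bracketing scheme mentioned above, written for the one-dimensional function φ(η) along the search direction: starting from an initial point and step, it keeps expanding until three points with φ(a) > φ(b) < φ(c) are found (the expansion factor and the example φ are illustrative choices, not from the slides):

def bracket_minimum(phi, eta0=0.0, step=0.1, expand=2.0, max_iter=50):
    """Find (a, b, c) with a < b < c and phi(a) > phi(b) < phi(c)."""
    a, b = eta0, eta0 + step
    fa, fb = phi(a), phi(b)
    if fb > fa:                      # make sure we are walking downhill
        a, b, fa, fb = b, a, fb, fa
        step = -step
    for _ in range(max_iter):
        c = b + step
        fc = phi(c)
        if fc > fb:                  # phi(a) > phi(b) < phi(c): minimum is bracketed
            return (a, b, c) if a < c else (c, b, a)
        a, b, fa, fb = b, c, fb, fc  # keep expanding in the same direction
        step *= expand
    raise RuntimeError("no bracket found (phi may not be unimodal here)")

# Illustrative one-dimensional function phi(eta) = E(theta_now + eta*d) along some line.
phi = lambda eta: (eta - 1.3) ** 2 + 0.5
print(bracket_minimum(phi))          # e.g. a bracket around eta* = 1.3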

 Secant method (or method of false position)
If we use both ηk and ηk−1 to approximate the second derivative in equation (**), and if only the first derivatives are available, then we have an estimated ηk+1 defined as:

ηk+1 = ηk − φ'(ηk) / [ (φ'(ηk) − φ'(ηk−1)) / (ηk − ηk−1) ]

Both Newton's method and the secant method are illustrated in the following figure.

[Figure: Newton's and secant methods]
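A sketch of the secant update for solving φ'(η) = 0 when only first derivatives are available; the derivative of the illustrative φ is supplied analytically, whereas in practice one would use φ'(η) = ∇E(θnow + ηd)^T d:

def secant_line_search(dphi, eta0, eta1, tol=1e-8, max_iter=50):
    """Solve dphi(eta) = 0 with the secant update
    eta_{k+1} = eta_k - dphi(eta_k) / ((dphi(eta_k) - dphi(eta_{k-1})) / (eta_k - eta_{k-1}))."""
    g0, g1 = dphi(eta0), dphi(eta1)
    for _ in range(max_iter):
        denom = (g1 - g0) / (eta1 - eta0)   # finite-difference estimate of phi''(eta_k)
        eta2 = eta1 - g1 / denom
        if abs(eta2 - eta1) < tol:
            return eta2
        eta0, eta1, g0, g1 = eta1, eta2, g1, dphi(eta2)
    return eta1

# Illustrative phi(eta) = (eta - 1.3)^2 + 0.5, so phi'(eta) = 2*(eta - 1.3).
dphi = lambda eta: 2.0 * (eta - 1.3)
print(secant_line_search(dphi, 0.0, 1.0))   # converges to eta* = 1.3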

Line searches

 Polynomial interpolation
• This method is based on curve-fitting procedures.
• Quadratic interpolation is the variant most often used in practice.
• It constructs a smooth quadratic curve q that passes through three points (η1, φ1), (η2, φ2) and (η3, φ3):

q(η) = Σ_{i=1}^{3} φi · [ ∏_{j≠i} (η − ηj) ] / [ ∏_{j≠i} (ηi − ηj) ]

where φi = φ(ηi), i = 1, 2, 3.

 The condition for obtaining a unique minimum point is q'(η) = 0; therefore the next point ηnext is:

η*next = ½ · [ (η2² − η3²)φ1 + (η3² − η1²)φ2 + (η1² − η2²)φ3 ] / [ (η2 − η3)φ1 + (η3 − η1)φ2 + (η1 − η2)φ3 ]
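A sketch of one quadratic-interpolation step: given three points and their φ values, it returns the minimizer of the fitted parabola using the formula above (the bracket values are the illustrative ones from the earlier bracketing sketch):

def quadratic_interp_step(e1, e2, e3, p1, p2, p3):
    """Minimizer of the parabola through (e1, p1), (e2, p2), (e3, p3)."""
    num = (e2**2 - e3**2) * p1 + (e3**2 - e1**2) * p2 + (e1**2 - e2**2) * p3
    den = (e2 - e3) * p1 + (e3 - e1) * p2 + (e1 - e2) * p3
    return 0.5 * num / den

# Illustrative check with phi(eta) = (eta - 1.3)^2 + 0.5 and the bracket (0.8, 1.6, 3.2):
phi = lambda eta: (eta - 1.3) ** 2 + 0.5
print(quadratic_interp_step(0.8, 1.6, 3.2, phi(0.8), phi(1.6), phi(3.2)))  # -> 1.3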

 Sectioning methods
• A sectioning method starts with an interval [a1, b1] in which the minimum η* must lie, and then reduces the length of the interval at each iteration by evaluating the value of φ at a certain number of points.
• The two endpoints a1 and b1 can be found by the initial bracketing described previously.
• The bisection method is one of the simplest sectioning methods for solving φ'(η*) = 0, if first derivatives are available!
• Others: the golden section search method (which does not require φ to be differentiable).

Termination rules

Line search methods do not provide the exact minimum point of the function φ.

We need a termination rule that accelerates the entire minimization process without affecting precision too much.

Termination rules – Goldstein test
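A sketch of a Goldstein-type acceptance test for a trial step size. The two-sided conditions used here, φ(0) + (1 − c)·η·φ'(0) ≤ φ(η) ≤ φ(0) + c·η·φ'(0) with 0 < c < ½, are the usual textbook formulation and are an assumption for illustration rather than a quotation from the slides:

def goldstein_ok(phi0, dphi0, phi_eta, eta, c=0.25):
    """Accept step eta if phi(eta) lies between the two Goldstein lines.

    phi0    = phi(0)   = E(theta_now)
    dphi0   = phi'(0)  = g^T d (negative for a descent direction)
    phi_eta = phi(eta) = E(theta_now + eta*d)
    """
    upper = phi0 + c * eta * dphi0          # enough decrease
    lower = phi0 + (1.0 - c) * eta * dphi0  # step not too long
    return lower <= phi_eta <= upper

# Illustrative use for phi(eta) = (eta - 1.3)^2 + 0.5, where phi'(0) = -2.6:
phi = lambda eta: (eta - 1.3) ** 2 + 0.5
print(goldstein_ok(phi(0.0), -2.6, phi(1.0), 1.0))   # True: eta = 1.0 is acceptable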

Nonlinear Least-Squares Problems

Goal: Optimize a model by minimizing a squared error measure between the desired outputs and the model's output y = f(x, θ).

Given a set of m training data pairs (xp; tp), p = 1, …, m, we can write:

E(θ) = Σ_{p=1}^{m} (tp − yp)² = Σ_{p=1}^{m} (tp − f(xp, θ))² = Σ_{p=1}^{m} rp(θ)² = r(θ)^T r(θ)
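A sketch of the squared error measure E(θ) = r(θ)^T r(θ) for a hypothetical nonlinear model f(x, θ) = θ1·(1 − e^(−θ2·x)); the training pairs are invented (they roughly follow θ ≈ [3, 0.5]):

import numpy as np

# Hypothetical nonlinear model and training pairs (x_p, t_p), p = 1..m.
f = lambda x, th: th[0] * (1.0 - np.exp(-th[1] * x))
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
t = np.array([0.66, 1.18, 1.90, 2.59, 2.95])

def residuals(theta):
    """r_p(theta) = t_p - f(x_p, theta)."""
    return t - f(x, theta)

def E(theta):
    """E(theta) = sum_p r_p(theta)^2 = r^T r."""
    r = residuals(theta)
    return r @ r

print(E(np.array([3.0, 0.5])))   # squared error for one candidate theta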

Nonlinear Least-Squares Problems

The gradient is expressed as:

g = g(θ) = ∂E(θ)/∂θ = 2 Σ_{p=1}^{m} rp(θ) ∂rp(θ)/∂θ = 2 J^T r

where J is the Jacobian matrix of r. Since rp(θ) = tp − f(xp, θ), the p-th row of J is −∇θ^T f(xp, θ).

Gauss-Newton Method

Known also as the linearization method:
 Use Taylor series expansion to obtain a linear model that approximates the original nonlinear model;
 Use linear least-squares optimization to obtain the model parameters;
 The parameters θ^T = (θ1, θ2, …, θn) are computed iteratively.

Taylor expansion of y = f(x, θ) around θ = θnow:

y = f(x, θnow) + Σ_{i=1}^{n} [ ∂f(x, θ)/∂θi ]_{θ=θnow} (θi − θi,now)

y − f(x, θnow) is linear with respect to θi − θi,now, since the partial derivatives are constant.

Substituting the linearized model into the squared error measure gives:

E(θ) = ‖ t − f(x, θnow) − [∂f(x, θnow)/∂θ] (θ − θnow) ‖² = ‖ r + J(θ − θnow) ‖² = ‖ r + Jδ ‖²

where δ = θ − θnow.

The next point θnext is obtained by setting the derivative of E to zero:

∂E(θ)/∂θ |_{θ=θnext} = J^T { r + J(θnext − θnow) } = 0

Therefore, the Gauss-Newton formula is:

θnext = θnow − (J^T J)⁻¹ J^T r = θnow − ½ (J^T J)⁻¹ g   (since g = 2 J^T r)
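A sketch of the resulting Gauss-Newton iteration for the same hypothetical model: the Jacobian of r is formed analytically (its p-th row is −∇θ^T f(xp, θ)) and each step solves the normal equations (J^T J)δ = −J^T r; the data and starting point are illustrative:

import numpy as np

# Same hypothetical model as before: f(x, theta) = theta1 * (1 - exp(-theta2 * x)).
f = lambda x, th: th[0] * (1.0 - np.exp(-th[1] * x))
x = np.array([0.5, 1.0, 2.0, 4.0, 8.0])
t = np.array([0.66, 1.18, 1.90, 2.59, 2.95])   # invented data, roughly from theta = [3, 0.5]

def jacobian_r(th):
    """Jacobian of r(theta) = t - f(x, theta); p-th row is -grad_theta f(x_p, theta)."""
    df_dth1 = 1.0 - np.exp(-th[1] * x)           # d f / d theta1
    df_dth2 = th[0] * x * np.exp(-th[1] * x)     # d f / d theta2
    return -np.column_stack([df_dth1, df_dth2])

theta = np.array([2.5, 0.8])                     # illustrative starting point
for k in range(20):
    r = t - f(x, theta)                          # residual vector
    J = jacobian_r(theta)
    delta = np.linalg.solve(J.T @ J, -J.T @ r)   # Gauss-Newton step
    theta = theta + delta                        # theta_next = theta_now - (J^T J)^{-1} J^T r
    if np.linalg.norm(delta) < 1e-10:
        break

print("theta_hat =", theta)                      # close to [3, 0.5]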
