OPTIMIZATION
Derivative-Based Optimization
João Miguel da Costa Sousa / Alexandra Moutinho

Derivative-based (nonlinear) optimization
  Descent Methods
  The Method of Steepest Descent
  Newton's Methods (NM)
Derivative-free optimization

System Identification
Goal: Determine a mathematical model for an unknown system (or target system) by observing its input-output data pairs.
Purposes:
To predict a system's behavior, as in time series prediction and weather forecasting;
To explain the interactions and relationships between the inputs and outputs of a system;
To design a controller based on the model of a system, as in aircraft or ship control;
To simulate the system under control once the model is known.
System Identification
The problem of determining a mathematical model for an unknown system by observing its input-output data pairs is generally referred to as system identification.
Two top-down steps:
  Structure identification
  Parameter identification

Structure identification
Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is to be conducted. This class of models is denoted by a function y = f(u;θ), with:
  y the model output
  u the input vector
  θ the parameter vector
f depends on the problem at hand, on the designer's experience, and on the laws of nature governing the target system.

Parameter identification
The structure of the model is known; we apply optimization techniques to determine the parameter vector θ = θ̂ such that the resulting model ŷ = f(u;θ̂) describes the system appropriately:
  yi - ŷi → 0, with yi the target output assigned to the input ui

Identification techniques
[Block diagram: the input ui is applied both to the target system to be identified, producing yi, and to the mathematical model f(u;θ), producing ŷi; the error yi - ŷi drives the parameter adjustment. A minimal fitting sketch follows.]
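As a hedged illustration of parameter identification, the sketch below fits θ̂ for a hypothetical model class ŷ = f(u;θ) = θ1(1 - exp(-θ2 u)) to observed input-output pairs by least squares; the model form, the synthetic data, and the use of scipy.optimize.curve_fit are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model class y = f(u; theta) chosen by structure identification.
def f(u, theta1, theta2):
    return theta1 * (1.0 - np.exp(-theta2 * u))

# Observed input-output pairs (u_i, y_i) from the target system (synthetic here).
rng = np.random.default_rng(0)
u = np.linspace(0.0, 5.0, 50)
y = f(u, 2.0, 1.3) + 0.05 * rng.standard_normal(u.size)

# Parameter identification: choose theta_hat so that y_i - yhat_i -> 0
# in the least-squares sense.
theta_hat, _ = curve_fit(f, u, y, p0=[1.0, 1.0])
y_hat = f(u, *theta_hat)
print("theta_hat =", theta_hat, " max |y - yhat| =", np.abs(y - y_hat).max())
```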
System Identification
System identification requires doing both structure and parameter identification repeatedly until a satisfactory model is found:
1. Specify and parameterize a class of mathematical models representing the system to be identified
2. Perform parameter identification to choose the parameters that best fit the training data set
3. Conduct a validation test to see whether the identified model responds correctly to an unseen data set
4. Terminate the procedure once the results of the validation test are satisfactory; otherwise, select another class of models and repeat steps 2 to 4
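To make the four-step loop concrete, here is a minimal sketch (not from the slides) that treats polynomial models of increasing degree as the candidate model classes and uses a held-out validation set; the data, the degree choices, and the error threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
u = np.linspace(-1.0, 1.0, 80)
y = np.sin(2.0 * u) + 0.05 * rng.standard_normal(u.size)   # unknown target system

# Split the observed pairs into training and validation sets.
u_tr, y_tr = u[::2], y[::2]
u_va, y_va = u[1::2], y[1::2]

for degree in (1, 3, 5):                        # step 1: candidate model structures
    theta = np.polyfit(u_tr, y_tr, degree)      # step 2: parameter identification
    val_err = np.mean((y_va - np.polyval(theta, u_va)) ** 2)   # step 3: validation
    print(f"degree {degree}: validation MSE = {val_err:.4f}")
    if val_err < 1e-2:                          # step 4: terminate when satisfactory
        break
```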
Optimization
Derivative-based (nonlinear) optimization
  Descent Methods
  The Method of Steepest Descent
  Newton's Methods (NM)
Derivative-free optimization
  Simplex, Simulated Annealing, Genetic Algorithms, Ant Colony Optimization, ...

Derivative-based optimization
Goal: Solving nonlinear minimization problems using derivative information.
These methods are used in:
  Optimization of nonlinear neuro-fuzzy models
  Neural network learning
  Regression analysis in nonlinear models
We cover:
  Gradient-based optimization techniques
  Steepest descent methods
  Newton methods
  Conjugate gradient methods
  Nonlinear least-squares problems

Descent methods
Goal: Determine a point θ = θ* = [θ1*, θ2*, …, θn*]ᵀ such that f(θ1, θ2, …, θn) is minimized at θ = θ*.
We are looking for a local, not necessarily global, minimum θ*.
Let f(θ1, θ2, …, θn) = E(θ1, θ2, …, θn) be the objective function.
The search for this minimum is performed along a certain direction d, starting from an initial value θ = θ0 (iterative scheme!):
  θnext = θnow + ηd   (η > 0 is a step size regulating the search in the direction d)
or: θk+1 = θk + ηk dk   (k = 1, 2, …)
The series {θk}, k = 1, 2, …, should converge to a local minimum θ*.
We first need to determine the next direction d and then compute the step size η (a minimal sketch of this iterative scheme follows).
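A minimal sketch of the iterative scheme θk+1 = θk + ηk dk, assuming a fixed step size, the negative gradient as the direction d, and an illustrative quadratic objective.

```python
import numpy as np

def E(theta):                       # illustrative objective E(theta1, theta2)
    return (theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2

def grad_E(theta):                  # its gradient g(theta)
    return np.array([2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)])

theta = np.array([5.0, 5.0])        # initial value theta_0
eta = 0.04                          # step size eta > 0
for k in range(200):
    d = -grad_E(theta)              # descent direction
    theta = theta + eta * d         # theta_next = theta_now + eta * d
print("theta* ~", theta, " E(theta*) ~", E(theta))
```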
Descent methods
ηk dk is called the k-th step, whereas ηk is the k-th step size.
We should have E(θnext) = E(θnow + ηd) < E(θnow).
The principal differences between the various descent algorithms lie in the procedure for determining successive directions.
Once d is determined, η is computed by line minimization:
  η* = arg min{η>0} φ(η),  where  φ(η) = E(θnow + ηd)

Gradient-based methods
Definition: the gradient of a differentiable function E: ℝⁿ → ℝ at θ is the vector of first derivatives of E, denoted g. That is:
  g(θ) = ∇E(θ) = [∂E(θ)/∂θ1, ∂E(θ)/∂θ2, …, ∂E(θ)/∂θn]ᵀ
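As a side illustration (not in the slides), the gradient definition can be checked numerically with central finite differences; the objective E below is an arbitrary example.

```python
import numpy as np

def E(theta):
    return np.sin(theta[0]) + theta[0] * theta[1] ** 2

def numerical_gradient(E, theta, h=1e-6):
    """Central-difference approximation of g(theta) = [dE/dtheta_1, ..., dE/dtheta_n]^T."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (E(theta + e) - E(theta - e)) / (2.0 * h)
    return g

theta = np.array([0.5, -1.0])
print(numerical_gradient(E, theta))     # approximately [cos(0.5) + 1.0, -1.0]
```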
Gradient-based methods
Based on a given gradient, downhill directions adhere to the following condition for feasible descent directions (it does not guarantee convergence of the algorithms):
  φ'(0) = dE(θnow + ηd)/dη |η=0 = gᵀd = ‖g‖ ‖d‖ cos(ξ(θnow)) < 0
where ξ is the angle between g and d, and ξ(θnow) is the angle between gnow and d at the point θnow.
The previous condition is justified by the Taylor series expansion:
  E(θnow + ηd) = E(θnow) + ηgᵀd + O(η²)
Gradient-based methods
A class of gradient-based descent methods has the following form, in which feasible descent directions are found by gradient deflection:
  θnext = θnow + ηd = θnow - ηGg   (η > 0, G a positive definite matrix)   (*)
Gradient deflection consists of multiplying the gradient g by a positive definite matrix (pdm) G:
  d = -Gg  ⇒  gᵀd = -gᵀGg < 0
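A small numerical check, assuming an illustrative diagonal positive definite G, that the deflected direction d = -Gg satisfies the feasible-descent condition gᵀd = -gᵀGg < 0.

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(3)                 # a nonzero gradient vector
G = np.diag([0.5, 2.0, 1.5])               # a positive definite deflection matrix
d = -G @ g                                 # deflected direction d = -G g
print("g^T d =", g @ d)                    # always negative: feasible descent direction
```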
Gradient-based methods
Theoretically, we wish to determine a value θnext such that (a necessary but not sufficient condition):
  g(θnext) = ∂E(θ)/∂θ |θ=θnext = 0
but this is difficult to solve. In practice, we stop the algorithm if:
1. The objective function value is sufficiently small;
2. The length of the gradient vector g is smaller than a threshold;
3. The specified computation time is exceeded.
The method of Steepest Descent
Despite its slow convergence, this method is the most frequently used nonlinear optimization technique due to its simplicity.
If G = I (the identity matrix), then equation (*) reduces to the steepest descent scheme:
  θnext = θnow - ηg
If cos ξ = -1 (meaning that d points in the same direction as the vector -g), then the objective function E can be decreased locally by the largest amount at the point θnow.
Therefore, the negative gradient direction (-g) points in the locally steepest downhill direction.
This direction may not be a shortcut to reach the minimum point θ*.
If the contours of the objective function E form hyperspheres (or circles in a two-dimensional space), steepest descent leads to the minimum in a single step; otherwise it does not.
However, if steepest descent uses the line minimization technique (min φ(η)), then at the chosen step size φ'(η) = 0:
  φ'(η) = dE(θnow - ηgnow)/dη = -gnextᵀ gnow = 0
⇒ the next gradient gnext is orthogonal to the current gradient vector gnow.
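A minimal sketch of steepest descent with exact line minimization on an assumed quadratic objective, using scipy.optimize.minimize_scalar for the one-dimensional search; the printout checks that successive gradients are (numerically) orthogonal.

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # SPD matrix: E(theta) = 0.5 theta^T A theta
E = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

theta = np.array([4.0, -3.0])
for k in range(5):
    g_now = grad(theta)
    # line minimization: eta* = argmin_{eta>0} phi(eta) = E(theta - eta * g_now)
    eta = minimize_scalar(lambda eta: E(theta - eta * g_now),
                          bounds=(0.0, 10.0), method="bounded").x
    theta = theta - eta * g_now
    g_next = grad(theta)
    print(f"k={k}  E={E(theta):.3e}  g_next.g_now={g_next @ g_now:.2e}")
```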
Newton's Methods
Classical Newton's Method
Principle: the descent direction d is determined using the second derivatives of the objective function E, if available.
If the starting position θnow is sufficiently close to a local minimum, the objective function E can be approximated by a quadratic form:
  E(θ) ≅ E(θnow) + gᵀ(θ - θnow) + ½ (θ - θnow)ᵀ H (θ - θnow)
where H = ∇²E(θ) = ∂²E/∂θ² is the Hessian matrix.
Since this equation defines a quadratic function E(θ) in the neighborhood of θnow, its minimum θ̂ can be determined by differentiating and setting the result to 0, which gives:
  0 = g + H(θ̂ - θnow),  equivalent to  θ̂ = θnow - H⁻¹g
This is a gradient-based method with η = 1 and G = H⁻¹.
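A minimal sketch of the classical Newton iteration θnext = θnow - H⁻¹g, assuming an illustrative smooth objective E(x, y) = x⁴ + x²y² + y² - 2x with analytic gradient and Hessian; on a quadratic objective a single step would suffice, here a few iterations are shown.

```python
import numpy as np

# E(x, y) = x^4 + x^2 y^2 + y^2 - 2x   (illustrative objective)
def grad(th):
    x, y = th
    return np.array([4 * x ** 3 + 2 * x * y ** 2 - 2, 2 * x ** 2 * y + 2 * y])

def hess(th):
    x, y = th
    return np.array([[12 * x ** 2 + 2 * y ** 2, 4 * x * y],
                     [4 * x * y, 2 * x ** 2 + 2]])

theta = np.array([1.5, 1.0])
for k in range(6):
    g, H = grad(theta), hess(theta)
    theta = theta - np.linalg.solve(H, g)   # Newton step: theta - H^{-1} g
    print(k, theta, np.linalg.norm(g))
```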
Newton's Methods
Only when the minimum point θ̂ of the approximated quadratic function is chosen as the next point θnext do we have the so-called Newton's Method (NM), or Newton-Raphson method:
  θnext = θnow - H⁻¹g
If H is positive definite and E(θ) is quadratic, then NM reaches a local minimum in a single Newton step (a single step of -H⁻¹g).
If E(θ) is not quadratic, then the minimum may not be reached in a single step, and NM should be repeated iteratively.
Step Size Determination
Formula of the class of gradient-based descent methods:
  θnext = θnow + ηd = θnow - ηGg
This formula entails effectively determining the step size η; the efficiency of the step size determination affects the entire minimization process.
φ'(η) = 0, with φ(η) = E(θnow + ηd), is often impossible to solve analytically.

Initial bracketing
We assume that the search area (or specified interval) contains a single relative minimum, i.e. E is unimodal over the closed interval.
Determining the initial interval in which a relative minimum must lie is of critical importance. Two possible schemes:
1. A scheme based on function evaluations: find three points θk-1 < θk < θk+1 such that E(θk-1) > E(θk) < E(θk+1);
2. A scheme based on first derivatives: find two points θk < θk+1 such that E'(θk) < 0 and E'(θk+1) > 0.

Line searches
The process of determining the η* that minimizes the one-dimensional function φ(η) is achieved by searching for the minimum along the line.
Line search algorithms usually include two components: sectioning (or bracketing) and polynomial interpolation.
Newton's method
When φ(ηk), φ'(ηk) and φ''(ηk) are available, the classical Newton method (defined by θ̂ = θnow - H⁻¹g) can be applied to solving the equation φ'(η) = 0:
  ηk+1 = ηk - φ'(ηk)/φ''(ηk)   (*)
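A minimal sketch of the Newton line-search update ηk+1 = ηk - φ'(ηk)/φ''(ηk) applied to φ'(η) = 0, for an assumed one-dimensional function φ.

```python
def phi(eta):                    # illustrative one-dimensional function
    return (eta - 1.3) ** 2 + 0.1 * eta ** 4

def dphi(eta):                   # phi'(eta)
    return 2.0 * (eta - 1.3) + 0.4 * eta ** 3

def ddphi(eta):                  # phi''(eta)
    return 2.0 + 1.2 * eta ** 2

eta = 0.0
for k in range(6):
    eta = eta - dphi(eta) / ddphi(eta)     # Newton update on phi'(eta) = 0
    print(k, eta, dphi(eta))
```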
Line searches
Secant method (or method of false position)
If we use ηk and ηk-1 to approximate the second derivative in equation (*), so that only first derivatives are needed, we obtain the estimate ηk+1 defined as:
  ηk+1 = ηk - φ'(ηk) / [(φ'(ηk) - φ'(ηk-1)) / (ηk - ηk-1)]
Both the Newton and the secant methods are illustrated in the following figure.
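The same search written with the secant update, which needs only first derivatives; φ'(η) is the same illustrative function used in the previous sketch, and the two starting points are arbitrary.

```python
def dphi(eta):                   # phi'(eta) for the same illustrative phi as above
    return 2.0 * (eta - 1.3) + 0.4 * eta ** 3

eta_prev, eta = 0.0, 0.5         # two starting points eta_{k-1}, eta_k
for k in range(8):
    slope = (dphi(eta) - dphi(eta_prev)) / (eta - eta_prev)   # secant estimate of phi''
    eta_prev, eta = eta, eta - dphi(eta) / slope              # secant update
    print(k, eta, dphi(eta))
```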
Line searches
Polynomial interpolation
This method is based on curve-fitting procedures.
Quadratic interpolation is the variant most often used in practice.
It constructs a smooth quadratic curve q that passes through three points (η1, φ1), (η2, φ2) and (η3, φ3):
  q(η) = Σ{i=1..3} φi · Π{j≠i} (η - ηj) / Π{j≠i} (ηi - ηj)
where φi = φ(ηi), i = 1, 2, 3.
The condition for obtaining a unique minimum point is q'(η) = 0; therefore the next point ηnext is:
  ηnext = η* = ½ · [(η2² - η3²)φ1 + (η3² - η1²)φ2 + (η1² - η2²)φ3] / [(η2 - η3)φ1 + (η3 - η1)φ2 + (η1 - η2)φ3]

Line searches
Sectioning methods
• Start with an interval [a1, b1] in which the minimum η* must lie, and then reduce the length of the interval at each iteration by evaluating φ at a certain number of points.
• The two endpoints a1 and b1 can be found by the initial bracketing described previously.
• The bisection method is one of the simplest sectioning methods for solving φ'(η*) = 0, if first derivatives are available.
• Others: the golden section search method (does not require φ to be differentiable).

Termination rules
Line search methods do not provide the exact minimum point of the function φ.
We need a termination rule that accelerates the entire minimization process without affecting the precision too much.
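A minimal sketch of one quadratic-interpolation step: given three points, it evaluates the closed-form minimizer ηnext of the interpolating parabola; the test function φ and the three points are illustrative (for a quadratic φ the step is exact).

```python
def eta_next(etas, phis):
    """Minimizer of the quadratic through (eta_i, phi_i), i = 1, 2, 3."""
    (e1, e2, e3), (p1, p2, p3) = etas, phis
    num = (e2**2 - e3**2) * p1 + (e3**2 - e1**2) * p2 + (e1**2 - e2**2) * p3
    den = (e2 - e3) * p1 + (e3 - e1) * p2 + (e1 - e2) * p3
    return 0.5 * num / den

phi = lambda eta: (eta - 0.7) ** 2 + 1.0        # illustrative phi with minimum at 0.7
pts = (0.0, 0.5, 1.5)
print(eta_next(pts, tuple(phi(e) for e in pts)))   # exactly 0.7 for a quadratic phi
```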
Termination rules – Goldstein test
Nonlinear Least-Squares Problems
Goal: Optimize a model by minimizing a squared error measure between the desired outputs and the model's output y = f(x, θ).
Given a set of m training data pairs (xp, tp), p = 1, …, m, we can write:
  E(θ) = Σ{p=1..m} (tp - yp)² = Σ{p=1..m} (tp - f(xp, θ))² = Σ{p=1..m} rp(θ)² = r(θ)ᵀ r(θ)
The gradient is expressed as:
  g(θ) = ∂E(θ)/∂θ = 2 Σ{p=1..m} rp(θ) ∂rp(θ)/∂θ = 2Jᵀr
where J is the Jacobian matrix of r. Since rp(θ) = tp - f(xp, θ), the p-th row of J is -∇θᵀ f(xp, θ).

Gauss-Newton Method
Also known as the linearization method:
  Use a Taylor series expansion to obtain a linear model that approximates the original nonlinear model;
  Use linear least-squares optimization to obtain the model parameters.
The parameters θᵀ = (θ1, θ2, …, θn) are computed iteratively.
Taylor expansion of y = f(x, θ) around θ = θnow:
  y = f(x, θnow) + Σ{i=1..n} ∂f(x, θ)/∂θi |θ=θnow (θi - θi,now)
y - f(x, θnow) is linear with respect to θi - θi,now, since the partial derivatives are constant (evaluated at θnow).
Substituting the linearized model into the squared error gives:
  E(θ) = ‖t - f(x, θnow) - (∂f(x, θnow)/∂θ)(θ - θnow)‖² = ‖r + J(θ - θnow)‖² = ‖r + Jδ‖²
where δ = θ - θnow.
The next point θnext is obtained by:
  ∂E(θ)/∂θ |θ=θnext = Jᵀ{r + J(θnext - θnow)} = 0
Therefore, the Gauss-Newton formula is expressed as:
  θnext = θnow - (JᵀJ)⁻¹Jᵀr = θnow - ½ (JᵀJ)⁻¹g   (since g = 2Jᵀr)
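A minimal sketch of the Gauss-Newton iteration θnext = θnow - (JᵀJ)⁻¹Jᵀr, assuming a hypothetical two-parameter model f(x; θ) = θ1·exp(-θ2·x) and synthetic training data; both the model and the data are assumptions for illustration.

```python
import numpy as np

def f(x, th):                                   # hypothetical model f(x; theta)
    return th[0] * np.exp(-th[1] * x)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 4.0, 40)
t = f(x, np.array([2.5, 0.8])) + 0.02 * rng.standard_normal(x.size)

theta = np.array([2.0, 1.0])                    # initial guess theta_now
for k in range(10):
    r = t - f(x, theta)                         # residual vector r(theta)
    # Jacobian of r: p-th row is -grad_theta f(x_p, theta)
    J = -np.column_stack([np.exp(-theta[1] * x),
                          -theta[0] * x * np.exp(-theta[1] * x)])
    delta = np.linalg.solve(J.T @ J, -J.T @ r)  # Gauss-Newton step: -(J^T J)^{-1} J^T r
    theta = theta + delta

r = t - f(x, theta)
print("theta =", theta, " E =", float(r @ r))
```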