OPTIMIZATION
Derivative-Based Optimization
João Miguel da Costa Sousa / Alexandra Moutinho

Derivative-based (nonlinear) optimization
  Descent Methods
  The Method of Steepest Descent
  Newton's Methods (NM)
Derivative-free optimization

System Identification
Goal: Determine a mathematical model for an unknown system (or target system) by observing its input-output data pairs.
Purposes:
To predict a system's behavior, as in time series prediction and weather forecasting;
To explain the interactions and relationships between the inputs and outputs of a system;
To design a controller based on the model of a system, as in aircraft or ship control;
To simulate the system under control once the model is known.
System Identification
The problem of determining a mathematical model for an unknown system by observing its input-output data pairs is generally referred to as system identification.
Two top-down steps:
  Structure identification
  Parameter identification

Structure identification
Apply a priori knowledge about the target system to determine a class of models within which the search for the most suitable model is to be conducted. This class of models is denoted by a function y = f(u;θ), with:
  y the model output
  u the input vector
  θ the parameter vector
f depends on the problem at hand, on the designer's experience, and on the laws of nature governing the target system.

Parameter identification
The structure of the model is known; we apply optimization techniques to determine the parameter vector θ = θ̂ such that the resulting model ŷ = f(u;θ̂) describes the system appropriately:
  yi - ŷi → 0, with yi the target output assigned to the input ui

Identification techniques
[Block diagram: the input ui is applied both to the target system to be identified, producing yi, and to the mathematical model f(u;θ), producing ŷi; the error yi - ŷi drives the parameter adjustment. A minimal fitting sketch follows.]
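As a hedged illustration of parameter identification, the sketch below fits θ̂ for a hypothetical model class ŷ = f(u;θ) = θ1(1 - exp(-θ2 u)) to observed input-output pairs by least squares; the model form, the synthetic data, and the use of scipy.optimize.curve_fit are assumptions made for this example.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical model class y = f(u; theta) chosen by structure identification.
def f(u, theta1, theta2):
    return theta1 * (1.0 - np.exp(-theta2 * u))

# Observed input-output pairs (u_i, y_i) from the target system (synthetic here).
rng = np.random.default_rng(0)
u = np.linspace(0.0, 5.0, 50)
y = f(u, 2.0, 1.3) + 0.05 * rng.standard_normal(u.size)

# Parameter identification: choose theta_hat so that y_i - yhat_i -> 0
# in the least-squares sense.
theta_hat, _ = curve_fit(f, u, y, p0=[1.0, 1.0])
y_hat = f(u, *theta_hat)
print("theta_hat =", theta_hat, " max |y - yhat| =", np.abs(y - y_hat).max())
```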
System Identification
System identification requires doing both structure and parameter identification repeatedly until a satisfactory model is found:
1. Specify and parameterize a class of mathematical models representing the system to be identified
2. Perform parameter identification to choose the parameters that best fit the training data set
3. Conduct a validation test to see whether the identified model responds correctly to an unseen data set
4. Terminate the procedure once the results of the validation test are satisfactory; otherwise, select another class of models and repeat steps 2 to 4
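To make the four-step loop concrete, here is a minimal sketch (not from the slides) that treats polynomial models of increasing degree as the candidate model classes and uses a held-out validation set; the data, the degree choices, and the error threshold are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
u = np.linspace(-1.0, 1.0, 80)
y = np.sin(2.0 * u) + 0.05 * rng.standard_normal(u.size)   # unknown target system

# Split the observed pairs into training and validation sets.
u_tr, y_tr = u[::2], y[::2]
u_va, y_va = u[1::2], y[1::2]

for degree in (1, 3, 5):                        # step 1: candidate model structures
    theta = np.polyfit(u_tr, y_tr, degree)      # step 2: parameter identification
    val_err = np.mean((y_va - np.polyval(theta, u_va)) ** 2)   # step 3: validation
    print(f"degree {degree}: validation MSE = {val_err:.4f}")
    if val_err < 1e-2:                          # step 4: terminate when satisfactory
        break
```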
Optimization
Derivative-based (nonlinear) optimization
  Descent Methods
  The Method of Steepest Descent
  Newton's Methods (NM)
Derivative-free optimization
  Simplex, Simulated Annealing, Genetic Algorithms, Ant Colony Optimization, ...

Derivative-based optimization
Goal: Solving nonlinear minimization problems using derivative information.
These methods are used in:
  Optimization of nonlinear neuro-fuzzy models
  Neural network learning
  Regression analysis in nonlinear models
We cover:
  Gradient-based optimization techniques
  Steepest descent methods
  Newton methods
  Conjugate gradient methods
  Nonlinear least-squares problems

Descent methods
Goal: Determine a point θ = θ* = [θ1*, θ2*, …, θn*]ᵀ such that f(θ1, θ2, …, θn) is minimized at θ = θ*.
We are looking for a local, not necessarily global, minimum θ*.
Let f(θ1, θ2, …, θn) = E(θ1, θ2, …, θn) be the objective function.
The search for this minimum is performed along a certain direction d, starting from an initial value θ = θ0 (iterative scheme!):
  θnext = θnow + ηd   (η > 0 is a step size regulating the search in the direction d)
or: θk+1 = θk + ηk dk   (k = 1, 2, …)
The series {θk}, k = 1, 2, …, should converge to a local minimum θ*.
We first need to determine the next direction d and then compute the step size η (a minimal sketch of this iterative scheme follows).
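A minimal sketch of the iterative scheme θk+1 = θk + ηk dk, assuming a fixed step size, the negative gradient as the direction d, and an illustrative quadratic objective.

```python
import numpy as np

def E(theta):                       # illustrative objective E(theta1, theta2)
    return (theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2

def grad_E(theta):                  # its gradient g(theta)
    return np.array([2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 2.0)])

theta = np.array([5.0, 5.0])        # initial value theta_0
eta = 0.04                          # step size eta > 0
for k in range(200):
    d = -grad_E(theta)              # descent direction
    theta = theta + eta * d         # theta_next = theta_now + eta * d
print("theta* ~", theta, " E(theta*) ~", E(theta))
```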
Descent methods
ηk dk is called the k-th step, whereas ηk is the k-th step size.
We should have E(θnext) = E(θnow + ηd) < E(θnow).
The principal differences between the various descent algorithms lie in the procedure for determining successive directions.
Once d is determined, η is computed by line minimization:
  η* = arg min{η>0} φ(η),  where  φ(η) = E(θnow + ηd)

Gradient-based methods
Definition: the gradient of a differentiable function E: ℝⁿ → ℝ at θ is the vector of first derivatives of E, denoted g. That is:
  g(θ) = ∇E(θ) = [∂E(θ)/∂θ1, ∂E(θ)/∂θ2, …, ∂E(θ)/∂θn]ᵀ
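As a side illustration (not in the slides), the gradient definition can be checked numerically with central finite differences; the objective E below is an arbitrary example.

```python
import numpy as np

def E(theta):
    return np.sin(theta[0]) + theta[0] * theta[1] ** 2

def numerical_gradient(E, theta, h=1e-6):
    """Central-difference approximation of g(theta) = [dE/dtheta_1, ..., dE/dtheta_n]^T."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (E(theta + e) - E(theta - e)) / (2.0 * h)
    return g

theta = np.array([0.5, -1.0])
print(numerical_gradient(E, theta))     # approximately [cos(0.5) + 1.0, -1.0]
```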
Gradient-based methods
Based on a given gradient, downhill directions adhere to the following condition for feasible descent directions (it does not guarantee convergence of the algorithms):
  φ'(0) = dE(θnow + ηd)/dη |η=0 = gᵀd = ‖g‖ ‖d‖ cos(ξ(θnow)) < 0
where ξ is the angle between g and d, and ξ(θnow) is the angle between gnow and d at the point θnow.
The previous condition is justified by the Taylor series expansion:
  E(θnow + ηd) = E(θnow) + ηgᵀd + O(η²)
Gradient-based methods
A class of gradient-based descent methods has the following form, in which feasible descent directions are found by gradient deflection:
  θnext = θnow + ηd = θnow - ηGg   (η > 0, G a positive definite matrix)   (*)
Gradient deflection consists of multiplying the gradient g by a positive definite matrix (pdm) G:
  d = -Gg  ⇒  gᵀd = -gᵀGg < 0
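A small numerical check, assuming an illustrative diagonal positive definite G, that the deflected direction d = -Gg satisfies the feasible-descent condition gᵀd = -gᵀGg < 0.

```python
import numpy as np

rng = np.random.default_rng(2)
g = rng.standard_normal(3)                 # a nonzero gradient vector
G = np.diag([0.5, 2.0, 1.5])               # a positive definite deflection matrix
d = -G @ g                                 # deflected direction d = -G g
print("g^T d =", g @ d)                    # always negative: feasible descent direction
```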
Gradient-based methods
Theoretically, we wish to determine a value θnext such that (a necessary but not sufficient condition):
  g(θnext) = ∂E(θ)/∂θ |θ=θnext = 0
but this is difficult to solve. In practice, we stop the algorithm if:
1. The objective function value is sufficiently small;
2. The length of the gradient vector g is smaller than a threshold;
3. The specified computation time is exceeded.
The method of Steepest Descent
Despite its slow convergence, this method is the most frequently used nonlinear optimization technique due to its simplicity.
If G = I (the identity matrix), then equation (*) reduces to the steepest descent scheme:
  θnext = θnow - ηg
If cos ξ = -1 (meaning that d points in the same direction as the vector -g), then the objective function E can be decreased locally by the largest amount at the point θnow.
Therefore, the negative gradient direction (-g) points in the locally steepest downhill direction.
This direction may not be a shortcut to reach the minimum point θ*.
If the contours of the objective function E form hyperspheres (or circles in a two-dimensional space), steepest descent leads to the minimum in a single step; otherwise it does not.
However, if steepest descent uses the line minimization technique (min φ(η)), then at the chosen step size φ'(η) = 0:
  φ'(η) = dE(θnow - ηgnow)/dη = -gnextᵀ gnow = 0
⇒ the next gradient gnext is orthogonal to the current gradient vector gnow.
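A minimal sketch of steepest descent with exact line minimization on an assumed quadratic objective, using scipy.optimize.minimize_scalar for the one-dimensional search; the printout checks that successive gradients are (numerically) orthogonal.

```python
import numpy as np
from scipy.optimize import minimize_scalar

A = np.array([[3.0, 1.0], [1.0, 2.0]])     # SPD matrix: E(theta) = 0.5 theta^T A theta
E = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

theta = np.array([4.0, -3.0])
for k in range(5):
    g_now = grad(theta)
    # line minimization: eta* = argmin_{eta>0} phi(eta) = E(theta - eta * g_now)
    eta = minimize_scalar(lambda eta: E(theta - eta * g_now),
                          bounds=(0.0, 10.0), method="bounded").x
    theta = theta - eta * g_now
    g_next = grad(theta)
    print(f"k={k}  E={E(theta):.3e}  g_next.g_now={g_next @ g_now:.2e}")
```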
Newton's Methods
Classical Newton's Method
Principle: the descent direction d is determined using the second derivatives of the objective function E, if available.
If the starting position θnow is sufficiently close to a local minimum, the objective function E can be approximated by a quadratic form:
  E(θ) ≅ E(θnow) + gᵀ(θ - θnow) + ½ (θ - θnow)ᵀ H (θ - θnow)
where H = ∇²E(θ) = ∂²E/∂θ² is the Hessian matrix.
Since this equation defines a quadratic function E(θ) in the neighborhood of θnow, its minimum θ̂ can be determined by differentiating and setting the result to 0, which gives:
  0 = g + H(θ̂ - θnow),  equivalent to  θ̂ = θnow - H⁻¹g
This is a gradient-based method with η = 1 and G = H⁻¹.
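A minimal sketch of the classical Newton iteration θnext = θnow - H⁻¹g, assuming an illustrative smooth objective E(x, y) = x⁴ + x²y² + y² - 2x with analytic gradient and Hessian; on a quadratic objective a single step would suffice, here a few iterations are shown.

```python
import numpy as np

# E(x, y) = x^4 + x^2 y^2 + y^2 - 2x   (illustrative objective)
def grad(th):
    x, y = th
    return np.array([4 * x ** 3 + 2 * x * y ** 2 - 2, 2 * x ** 2 * y + 2 * y])

def hess(th):
    x, y = th
    return np.array([[12 * x ** 2 + 2 * y ** 2, 4 * x * y],
                     [4 * x * y, 2 * x ** 2 + 2]])

theta = np.array([1.5, 1.0])
for k in range(6):
    g, H = grad(theta), hess(theta)
    theta = theta - np.linalg.solve(H, g)   # Newton step: theta - H^{-1} g
    print(k, theta, np.linalg.norm(g))
```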
Newton's Methods
Only when the minimum point θ̂ of the approximated quadratic function is chosen as the next point θnext do we have the so-called Newton's Method (NM), or Newton-Raphson method:
  θnext = θnow - H⁻¹g
If H is positive definite and E(θ) is quadratic, then NM reaches a local minimum in a single Newton step (a single step of -H⁻¹g).
If E(θ) is not quadratic, then the minimum may not be reached in a single step, and NM should be repeated iteratively.
Step Size Determination
Formula of the class of gradient-based descent methods:
  θnext = θnow + ηd = θnow - ηGg
This formula entails effectively determining the step size η; the efficiency of the step size determination affects the entire minimization process.
φ'(η) = 0, with φ(η) = E(θnow + ηd), is often impossible to solve analytically.

Initial bracketing
We assume that the search area (or specified interval) contains a single relative minimum, i.e. E is unimodal over the closed interval.
Determining the initial interval in which a relative minimum must lie is of critical importance. Two possible schemes:
1. A scheme based on function evaluations: find three points θk-1 < θk < θk+1 such that E(θk-1) > E(θk) < E(θk+1);
2. A scheme based on first derivatives: find two points θk < θk+1 such that E'(θk) < 0 and E'(θk+1) > 0.

Line searches
The process of determining the η* that minimizes the one-dimensional function φ(η) is achieved by searching for the minimum along the line.
Line search algorithms usually include two components: sectioning (or bracketing) and polynomial interpolation.
Newton's method
When φ(ηk), φ'(ηk) and φ''(ηk) are available, the classical Newton method (defined by θ̂ = θnow - H⁻¹g) can be applied to solving the equation φ'(η) = 0:
  ηk+1 = ηk - φ'(ηk)/φ''(ηk)   (*)
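A minimal sketch of the Newton line-search update ηk+1 = ηk - φ'(ηk)/φ''(ηk) applied to φ'(η) = 0, for an assumed one-dimensional function φ.

```python
def phi(eta):                    # illustrative one-dimensional function
    return (eta - 1.3) ** 2 + 0.1 * eta ** 4

def dphi(eta):                   # phi'(eta)
    return 2.0 * (eta - 1.3) + 0.4 * eta ** 3

def ddphi(eta):                  # phi''(eta)
    return 2.0 + 1.2 * eta ** 2

eta = 0.0
for k in range(6):
    eta = eta - dphi(eta) / ddphi(eta)     # Newton update on phi'(eta) = 0
    print(k, eta, dphi(eta))
```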
Line searches
Secant method (or method of false position)
If we use ηk and ηk-1 to approximate the second derivative in equation (*), so that only first derivatives are needed, we obtain the estimate ηk+1 defined as:
  ηk+1 = ηk - φ'(ηk) / [(φ'(ηk) - φ'(ηk-1)) / (ηk - ηk-1)]
Both the Newton and the secant methods are illustrated in the following figure.
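The same search written with the secant update, which needs only first derivatives; φ'(η) is the same illustrative function used in the previous sketch, and the two starting points are arbitrary.

```python
def dphi(eta):                   # phi'(eta) for the same illustrative phi as above
    return 2.0 * (eta - 1.3) + 0.4 * eta ** 3

eta_prev, eta = 0.0, 0.5         # two starting points eta_{k-1}, eta_k
for k in range(8):
    slope = (dphi(eta) - dphi(eta_prev)) / (eta - eta_prev)   # secant estimate of phi''
    eta_prev, eta = eta, eta - dphi(eta) / slope              # secant update
    print(k, eta, dphi(eta))
```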
Line searches
Polynomial interpolation
This method is based on curve-fitting procedures.
Quadratic interpolation is the variant most often used in practice.
It constructs a smooth quadratic curve q that passes through three points (η1, φ1), (η2, φ2) and (η3, φ3):
  q(η) = Σ{i=1..3} φi · Π{j≠i} (η - ηj) / Π{j≠i} (ηi - ηj)
where φi = φ(ηi), i = 1, 2, 3.
The condition for obtaining a unique minimum point is q'(η) = 0; therefore the next point ηnext is:
  ηnext = η* = ½ · [(η2² - η3²)φ1 + (η3² - η1²)φ2 + (η1² - η2²)φ3] / [(η2 - η3)φ1 + (η3 - η1)φ2 + (η1 - η2)φ3]

Line searches
Sectioning methods
• Start with an interval [a1, b1] in which the minimum η* must lie, and then reduce the length of the interval at each iteration by evaluating φ at a certain number of points.
• The two endpoints a1 and b1 can be found by the initial bracketing described previously.
• The bisection method is one of the simplest sectioning methods for solving φ'(η*) = 0, if first derivatives are available.
• Others: the golden section search method (does not require φ to be differentiable).

Termination rules
Line search methods do not provide the exact minimum point of the function φ.
We need a termination rule that accelerates the entire minimization process without affecting the precision too much.
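A minimal sketch of one quadratic-interpolation step: given three points, it evaluates the closed-form minimizer ηnext of the interpolating parabola; the test function φ and the three points are illustrative (for a quadratic φ the step is exact).

```python
def eta_next(etas, phis):
    """Minimizer of the quadratic through (eta_i, phi_i), i = 1, 2, 3."""
    (e1, e2, e3), (p1, p2, p3) = etas, phis
    num = (e2**2 - e3**2) * p1 + (e3**2 - e1**2) * p2 + (e1**2 - e2**2) * p3
    den = (e2 - e3) * p1 + (e3 - e1) * p2 + (e1 - e2) * p3
    return 0.5 * num / den

phi = lambda eta: (eta - 0.7) ** 2 + 1.0        # illustrative phi with minimum at 0.7
pts = (0.0, 0.5, 1.5)
print(eta_next(pts, tuple(phi(e) for e in pts)))   # exactly 0.7 for a quadratic phi
```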
Termination rules – Goldstein test
Nonlinear Least-Squares Problems
Goal: Optimize a model by minimizing a squared error measure between the desired outputs and the model's output y = f(x, θ).
Given a set of m training data pairs (xp, tp), p = 1, …, m, we can write:
  E(θ) = Σ{p=1..m} (tp - yp)² = Σ{p=1..m} (tp - f(xp, θ))² = Σ{p=1..m} rp(θ)² = r(θ)ᵀ r(θ)
The gradient is expressed as:
  g(θ) = ∂E(θ)/∂θ = 2 Σ{p=1..m} rp(θ) ∂rp(θ)/∂θ = 2Jᵀr
where J is the Jacobian matrix of r. Since rp(θ) = tp - f(xp, θ), the p-th row of J is -∇θᵀ f(xp, θ).

Gauss-Newton Method
Also known as the linearization method:
  Use a Taylor series expansion to obtain a linear model that approximates the original nonlinear model;
  Use linear least-squares optimization to obtain the model parameters.
The parameters θᵀ = (θ1, θ2, …, θn) are computed iteratively.
Taylor expansion of y = f(x, θ) around θ = θnow:
  y = f(x, θnow) + Σ{i=1..n} ∂f(x, θ)/∂θi |θ=θnow (θi - θi,now)
y - f(x, θnow) is linear with respect to θi - θi,now, since the partial derivatives are constant (evaluated at θnow).
Substituting the linearized model into the squared error gives:
  E(θ) = ‖t - f(x, θnow) - (∂f(x, θnow)/∂θ)(θ - θnow)‖² = ‖r + J(θ - θnow)‖² = ‖r + Jδ‖²
where δ = θ - θnow.
The next point θnext is obtained by:
  ∂E(θ)/∂θ |θ=θnext = Jᵀ{r + J(θnext - θnow)} = 0
Therefore, the Gauss-Newton formula is expressed as:
  θnext = θnow - (JᵀJ)⁻¹Jᵀr = θnow - ½ (JᵀJ)⁻¹g   (since g = 2Jᵀr)
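A minimal sketch of the Gauss-Newton iteration θnext = θnow - (JᵀJ)⁻¹Jᵀr, assuming a hypothetical two-parameter model f(x; θ) = θ1·exp(-θ2·x) and synthetic training data; both the model and the data are assumptions for illustration.

```python
import numpy as np

def f(x, th):                                   # hypothetical model f(x; theta)
    return th[0] * np.exp(-th[1] * x)

rng = np.random.default_rng(3)
x = np.linspace(0.0, 4.0, 40)
t = f(x, np.array([2.5, 0.8])) + 0.02 * rng.standard_normal(x.size)

theta = np.array([2.0, 1.0])                    # initial guess theta_now
for k in range(10):
    r = t - f(x, theta)                         # residual vector r(theta)
    # Jacobian of r: p-th row is -grad_theta f(x_p, theta)
    J = -np.column_stack([np.exp(-theta[1] * x),
                          -theta[0] * x * np.exp(-theta[1] * x)])
    delta = np.linalg.solve(J.T @ J, -J.T @ r)  # Gauss-Newton step: -(J^T J)^{-1} J^T r
    theta = theta + delta

r = t - f(x, theta)
print("theta =", theta, " E =", float(r @ r))
```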