Comput Optim Appl (2007) 38: 195–216 DOI 10.1007/s10589-007-9047-7
A recursive algorithm for nonlinear least-squares problems
A. Alessandri · M. Cuneo · S. Pagnan · M. Sanguineti
Published online: 23 May 2007 © Springer Science+Business Media, LLC 2007
Abstract The solution of nonlinear least-squares problems is investigated. The asymptotic behavior is studied and conditions for convergence are derived. To deal with such problems in a recursive and efficient way, an algorithm is proposed that is based on a modified extended Kalman filter (MEKF). The error of the MEKF algorithm is proved to be exponentially bounded. Batch and iterated versions of the algorithm are given, too. As an application, the algorithm is used to optimize the parameters in certain nonlinear input–output mappings. Simulation results on interpolation of real data and prediction of chaotic time series are shown.
A. Alessandri and M. Cuneo were partially supported by the EU and the Regione Liguria through the Regional Programmes of Innovative Action (PRAI) of the European Regional Development Fund (ERDF). M. Sanguineti was partially supported by a grant from the PRIN project 'New Techniques for the Identification and Adaptive Control of Industrial Systems' of the Italian Ministry of University and Research.
A. Alessandri: Department of Production Engineering, Thermoenergetics, and Mathematical Models (DIPTEM), University of Genoa, P.le Kennedy Pad. D, 16129 Genova, Italy
M. Cuneo, S. Pagnan: Institute of Intelligent Systems for Automation (ISSIA-CNR), National Research Council of Italy, Via De Marini 6, 16149 Genova, Italy
M. Sanguineti: Department of Communications, Computer and System Sciences (DIST), University of Genoa, Via Opera Pia 13, 16145 Genova, Italy
Keywords Nonlinear programming · Nonlinear least squares · Extended Kalman filter · Recursive optimization · Batch algorithms

1 Introduction

Recursive nonlinear least-squares algorithms have gained a lot of attention for their extensive use in a number of different research areas [1, 2]. The investigations were aimed both at deriving convergence results and at improving algorithmic efficiency. Connections were established between nonlinear least-squares methods and various techniques based on the extended Kalman filter (EKF) [3–5]. The difficulties in attaining convergence results in nonlinear least squares are well known in statistics, and most of the available results require strong assumptions on the distribution of the regression errors (see, e.g., [6] and the references therein).

In this paper, we present a recursive algorithm for nonlinear least-squares problems. The algorithm is based on the EKF and, along the lines of results on EKF-based estimation for nonlinear systems [7], we prove that the algorithm's estimation error is exponentially bounded. In contrast to the stochastic analysis made in [7], we study convergence conditions in a deterministic context. This allows us to avoid assumptions on the statistics of the processes, which are difficult to verify in most nonlinear least-squares problems. As the proposed algorithm is a slight modification of the standard EKF algorithm, we have called it "modified extended Kalman filter (MEKF)."

In [3] the focus is on the possibility of taking advantage of the EKF algorithm when processing data in blocks via a nonlinear least-squares approach. Taking the hint from this idea, we also present two extensions of the MEKF algorithm, aimed at dealing with large data sets in a batch way and by repeated iterations. The batch and iterated extensions are crucial in applications such as machine learning, where large amounts of data have to be processed (see, e.g., [8] and the references therein). In neural network learning, for example, after the appearance of backpropagation (BP) [9, 10], various methods have been proposed to optimize the neural network parameters by performing recursive optimization (i.e., using at each step only the data that become available at that step). Such approaches are well suited to processing many data sets on line, but may suffer from poor performance. As compared with these techniques, the EKF provides a nice framework to perform batch optimization (i.e., using one data block at a time), with advantages over BP [3]. The good performance obtained by EKF-based learning algorithms may be ascribed to the information available from the covariance matrix, which, however, turns out to be quite demanding from the computational point of view. For this reason, we have devoted much attention to an efficient coding of our MEKF algorithm. Numerical results show that our algorithm performs very satisfactorily when applied to machine learning by neural networks.

The paper is organized as follows. The MEKF algorithm for solving nonlinear least-squares problems is presented in Sect. 2, together with the analysis of its properties. The batch and iterated versions, called BMEKF and IBMEKF, respectively, are described in Sect. 3. Section 4 is focused on the application of the proposed algorithms to the optimization of parameters in neural networks, and reports simulation results on two test-beds (namely, interpolation of real data and chaotic time-series prediction). Some conclusions are drawn in Sect. 5.
2 Statement and analysis of the algorithm

The following definitions and notations will be used throughout the paper. Let n, m be positive integers. For a real vector x ∈ R^n, we denote by ‖x‖ the Euclidean norm of x, i.e., ‖x‖ = (x_1² + ··· + x_n²)^{1/2}. For a symmetric matrix S ∈ R^{n×n}, we denote by λ_min(S) and λ_max(S) its minimum and maximum eigenvalue, respectively. Given a matrix M ∈ R^{n×m}, we denote by ‖M‖ the norm of M defined as ‖M‖ = [λ_max(M M^T)]^{1/2} = [λ_max(M^T M)]^{1/2}; for a symmetric positive-definite matrix S, ‖S‖ = λ_max(S) (see [11]). Given two symmetric matrices S_1 and S_2, S_1 > S_2 (S_1 ≥ S_2) means that the matrix S_1 − S_2 is positive definite (semidefinite). Sequences (of vectors, matrices, etc.) are denoted by curly brackets, e.g., {s_i}.

Consider a set {(x_t, y_t), t = 0, 1, ...} of data, where x_t ∈ X ⊂ R^m and y_t ∈ Y ⊂ R^p, with X and Y compact sets. Assume that each input–output pair is randomly generated via an unknown continuous function f : R^m → R^p, i.e., y_t = f(x_t). A function γ(w, x), γ : R^n × R^m → R^p, where w ∈ R^n is a parameter vector, can be used to interpolate the input–output pairs by solving the following nonlinear least-squares (NLS) problem.

Problem NLS_t  Given a sequence {R_t} of positive definite matrices and a data set {(x_i, y_i), i = 0, 1, ...}, find w°_t ∈ R^n that minimizes

    J_t(w) = [1/(t + 1)] Σ_{i=0}^{t} ‖y_i − γ(w, x_i)‖²_{R_i}.

Problem NLS_t is a nonlinear programming problem, and various techniques are available to solve it for each t = 0, 1, ... [2]. It is quite common to need to solve Problem NLS_t recursively (i.e., to find at each step t the solution w°_{t+1} of Problem NLS_{t+1} using only w°_t and the new datum (x_{t+1}, y_{t+1})). This need led to a number of methods that have obtained a wide diffusion in various applications. Among such approaches, we focus on recursive extended Kalman filter (EKF) methods [3, 5]. The use of a Kalman-like algorithm is suggested by the following observation: roughly speaking, the cost J_t(w) is related to the expected value of the regression error in Problem NLS_t. Moreover, the linear version of Problem NLS_t can be efficiently solved by using a Kalman filter [12]. Taking the hint from this, we consider the following recursive algorithm.

Modified EKF (MEKF) Algorithm  For t = 0, 1, ...,

    ŵ_{t+1} = ŵ_t + K_t [y_t − γ(ŵ_t, x_t)],                                  (1)

where

    H_t = ∂γ(w, x)/∂w |_{w=ŵ_t, x=x_t},                                       (2)

    K_t = P_t H_t^T (H_t P_t H_t^T + R_t)^{−1},                               (3)

    P_{t+1} = (α + 1)(P_t − K_t H_t P_t + δI),                                (4)
δ > 0, α > 0, P_0 and R_t are symmetric positive definite matrices, and (1) is initialized with a given ŵ_0.

For each t = 0, 1, ..., ŵ_t plays the role of an estimate of the (unknown) solution w°_t to Problem NLS_t. The main difference between the standard EKF algorithm and the MEKF algorithm consists in Eq. 4, which for the EKF is a Riccati equation P_{t+1} = P_t − K_t H_t P_t + Q_t, where each matrix of the sequence {Q_t} is positive semidefinite. In our case, this choice is not admissible because of technical reasons arising in the proof of the results given later on (see Lemma 3 in the Appendix). For simplicity, we take Q_t = δI, but the only requirement is the choice of a positive-definite matrix Q_t. In a stochastic interpretation of the regression problem, the role of the factor α + 1 consists in increasing the value of the covariance. (An illustrative code sketch of the recursion (1)–(4) is given after Assumption 4 below.)

In order to investigate the convergence properties of the MEKF algorithm, we first study properties of the solution of Problem NLS_t. Toward this end, we make the following assumptions.

Assumption 1  For t = 0, 1, ..., Problem NLS_t has a solution w°_t ∈ W ⊂ R^n.

By C(X, Y) we denote the normed linear space of continuous functions defined on X ⊂ R^m and with values in Y ⊂ R^p, equipped with the supremum norm.

Assumption 2  The set of functions Γ := {γ(w, ·) : R^m → R^p, w ∈ W} is dense in C(X, Y).

Assumption 2 corresponds to assuming that, for every desired accuracy ε, there exists a vector w* ∈ W such that the function γ(w*, ·) approximates within ε in the supremum norm the unknown continuous mapping f generating the data, i.e., for every (x, y) ∈ X × Y one has ‖γ(w*, x) − y‖ ≤ ε. In the neural-network parlance this is called the universal approximation property, which is satisfied by a large variety of neural mappings γ [13, 14].

Assumption 3  The mapping γ satisfies the following conditions: (i) for every w ∈ W, the function γ(w, ·) : R^m → R^p is continuously differentiable; (ii) for every x ∈ X, the function γ(·, x) : R^n → R^p is Lipschitz, i.e., there exists L > 0 such that for all w, w′ ∈ W, ‖γ(w, x) − γ(w′, x)‖ ≤ L‖w − w′‖.

Assumption 4  The cost function J_t : W → R is twice continuously differentiable and there exists l_min > 0 such that for t = 0, 1, ... one has

    λ_min( ∫_0^1 H_{J_t}(s w°_t + (1 − s) w°_{t+1}) ds ) ≥ l_min,

where H_{J_t} : W → R^{n×n} is the Hessian of J_t.
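Before turning to the analysis, the recursion (1)–(4) can be sketched in a few lines of Python with NumPy. This is only a minimal illustration of the update equations, not the authors' implementation: the model gamma, its Jacobian jac_gamma, and all parameter values are placeholders to be supplied by the user.

    import numpy as np

    def mekf_step(w_hat, P, x_t, y_t, gamma, jac_gamma, R_t, alpha, delta):
        """One step of the MEKF recursion (1)-(4).

        w_hat     : current estimate of the parameter vector, shape (n,)
        P         : current matrix P_t, shape (n, n)
        gamma     : callable gamma(w, x) -> model output, shape (p,)
        jac_gamma : callable returning H_t = d gamma / d w at (w_hat, x_t), shape (p, n)
        R_t       : symmetric positive definite weighting matrix, shape (p, p)
        alpha, delta : positive scalars of the algorithm
        """
        H = jac_gamma(w_hat, x_t)                                   # Eq. (2)
        S = H @ P @ H.T + R_t
        K = P @ H.T @ np.linalg.inv(S)                              # Eq. (3)
        w_new = w_hat + K @ (y_t - gamma(w_hat, x_t))               # Eq. (1)
        P_new = (alpha + 1.0) * (P - K @ H @ P + delta * np.eye(P.shape[0]))   # Eq. (4)
        return w_new, P_new

    def nls_cost(w, xs, ys, Rs, gamma):
        # Cost J_t(w) of Problem NLS_t; useful only for monitoring the estimates.
        t = len(xs) - 1
        residuals = [ys[i] - gamma(w, xs[i]) for i in range(t + 1)]
        return sum(r @ Rs[i] @ r for i, r in enumerate(residuals)) / (t + 1)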
Now we are ready to prove the following theorem.

Theorem 1  Suppose that Assumptions 1, 2, 3, and 4 hold. Then

    w°_{t+1} = w°_t + φ_t,   t = 0, 1, ...,                                   (5)

where φ_t := (∫_0^1 H_{J_t}(s w°_{t+1} + (1 − s) w°_t) ds)^{−1} ∇J_t(w°_{t+1}), lim_{t→+∞} φ_t = 0, sup_{t≥0} ‖φ_t‖ < ∞, and there exists w° ∈ R^n such that lim_{t→+∞} w°_t = w°.

Proof  By the definition of the cost, we have

    J_{t+1}(w) = [(t + 1)/(t + 2)] J_t(w) + [1/(t + 2)] ‖y_{t+1} − γ(w, x_{t+1})‖²_{R_{t+1}}.

Hence

    ∇J_{t+1}(w) = [(t + 1)/(t + 2)] ∇J_t(w) + [2/(t + 2)] R_{t+1} ∇_w γ(w, x_{t+1}) [y_{t+1} − γ(w, x_{t+1})],

where the gradients ∇J_t(w) and ∇_w γ_i(w, x), i = 1, 2, ..., p, are considered as column vectors. For w = w°_{t+1}, we have ∇J_{t+1}(w°_{t+1}) = 0 as a necessary optimality condition, so the previous equation gives

    ∇J_t(w°_{t+1}) = −[2/(t + 1)] R_{t+1} ∇_w γ(w°_{t+1}, x_{t+1}) [y_{t+1} − γ(w°_{t+1}, x_{t+1})].

By the properties of matrix norms [11], we obtain

    ‖∇J_t(w°_{t+1})‖ ≤ [2/(t + 1)] ‖R_{t+1}‖ ‖∇_w γ(w°_{t+1}, x_{t+1})‖ ‖y_{t+1} − γ(w°_{t+1}, x_{t+1})‖.

Take ε > 0. By Assumption 2, there exists w* ∈ W such that ‖y_{t+1} − γ(w*, x_{t+1})‖ = ‖f(x_{t+1}) − γ(w*, x_{t+1})‖ ≤ ε. Thus, by the triangle inequality and Assumption 3(ii), we obtain ‖y_{t+1} − γ(w°_{t+1}, x_{t+1})‖ ≤ ‖y_{t+1} − γ(w*, x_{t+1})‖ + ‖γ(w*, x_{t+1}) − γ(w°_{t+1}, x_{t+1})‖ ≤ ε + L‖w* − w°_{t+1}‖ ≤ ε + L r_W, where r_W is the radius of the set W and L is the Lipschitz constant of γ(·, x_t). Hence,

    ‖∇J_t(w°_{t+1})‖ ≤ [2/(t + 1)] ‖R_{t+1}‖ ‖∇_w γ(w°_{t+1}, x_{t+1})‖ (ε + L r_W).    (6)

Since each matrix of the sequence {R_t} is positive definite and ‖∇_w γ(w°_{t+1}, x_{t+1})‖ admits a maximum over the compact set W by Assumption 3(ii), inequality (6) implies

    lim_{t→+∞} ∇J_t(w°_{t+1}) = 0.                                            (7)

Let k : R → R^n be defined as k(s) := ∇J_t[s w°_{t+1} + (1 − s) w°_t], with derivative k′. By the Mean-Value Theorem, we have k(1) − k(0) = ∫_0^1 k′(s) ds and so

    ∇J_t(w°_{t+1}) − ∇J_t(w°_t) = ∫_0^1 H_{J_t}[s w°_{t+1} + (1 − s) w°_t] (w°_{t+1} − w°_t) ds.    (8)
As ∇J_t(w°_t) = 0, by Assumption 4 the equality (8) gives

    w°_{t+1} = w°_t + ( ∫_0^1 H_{J_t}[s w°_{t+1} + (1 − s) w°_t] ds )^{−1} ∇J_t(w°_{t+1}).

By Assumption 4, let φ_t := (∫_0^1 H_{J_t}[s w°_{t+1} + (1 − s) w°_t] ds)^{−1} ∇J_t(w°_{t+1}), so w°_{t+1} = w°_t + φ_t. By (7) and Assumption 4, lim_{t→+∞} φ_t = 0. So the sequence {w°_t} is Cauchy in R^n, hence it converges to some w° ∈ R^n. □

So far, we have investigated the asymptotic behavior of Problem NLS_t. In the following, we shall study how close ŵ_t is to w°_t. Toward this end, we need some additional assumptions.

We denote by η_t = y_t − γ(w°_t, x_t) the error made in approximating with the value γ(w°_t, x_t) the output y_t associated with the input x_t. The vector η_t is bounded. Indeed, by Assumption 2, for every ε > 0 there exists w* ∈ W such that ‖y_t − γ(w*, x_t)‖ < ε. Hence ‖η_t‖ = ‖y_t − γ(w°_t, x_t)‖ ≤ ‖y_t − γ(w*, x_t)‖ + ‖γ(w*, x_t) − γ(w°_t, x_t)‖ < ε + L r_W, where L is the Lipschitz constant of γ(·, x_t) and r_W is the radius of the set W.

Assumption 5  There exist positive constants η_max, h_max, p_min, p_max, and r_min such that for t = 0, 1, ... one has:

(i) η_t η_t^T ≤ η_max I;
(ii) ‖H_t‖ ≤ h_max;
(iii) p_min I ≤ P_t ≤ p_max I;
(iv) R_t ≥ r_min I.

Assumption 6  For t = 0, 1, ..., let Φ_t := ∫_0^1 (∂γ/∂w)[s w°_t + (1 − s) ŵ_t, x_t] ds, F_t := (I − K_t Φ_t)^T P_{t+1}^{−1} (I − K_t Φ_t), and G_t := (I − K_t H_t)^T P_{t+1}^{−1} (I − K_t H_t). Then λ_max(F_t) ≤ λ_min(G_t).

Finally, we introduce one definition.

Definition 1  A sequence {v_t} is exponentially bounded if there exist c_0, c_2 > 0 and c_1 ∈ (0, 1) such that ‖v_t‖² ≤ c_0 ‖v_0‖² c_1^t + c_2, t = 0, 1, ....

The next theorem states the boundedness of the error e_t := w°_t − ŵ_t between the solution w°_t of Problem NLS_t and the vector ŵ_t obtained by the MEKF algorithm.

Theorem 2  Let w°_t be the solution of Problem NLS_t and ŵ_t be given by the MEKF algorithm (see Eqs. 1–4). If Assumptions 1–6 hold, then the sequence {e_t} of the estimation errors is exponentially bounded.

Proof  By Eqs. 1 and 5, the error sequence is given by

    e_{t+1} = e_t − K_t [y_t − γ(ŵ_t, x_t)] + φ_t.                            (9)
As y_t = γ(w°_t, x_t) + η_t, by the Mean-Value Theorem we obtain

    y_t = γ(ŵ_t, x_t) + Φ_t (w°_t − ŵ_t) + η_t.                               (10)

Therefore, Eq. 9 yields e_{t+1} = (I − K_t Φ_t) e_t − K_t η_t + φ_t. Let Π_t = P_t^{−1}, t = 0, 1, .... Assumption 5(iii) enables us to introduce the Lyapunov function V_t(e_t) = e_t^T Π_t e_t. By a little algebra, we obtain

    V_{t+1}(e_{t+1}) = e_t^T (I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) e_t − 2[(I − K_t Φ_t) e_t]^T Π_{t+1} K_t η_t + η_t^T K_t^T Π_{t+1} K_t η_t − φ_t^T Π_{t+1} K_t η_t + 2[(I − K_t Φ_t) e_t]^T Π_{t+1} φ_t + φ_t^T Π_{t+1} φ_t.    (11)

By using the upper bounds of Lemma 1 (see the Appendix) with a = b = α/2, from Eq. 11 we have

    V_{t+1}(e_{t+1}) ≤ (1 + α) e_t^T (I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) e_t + (2 + 2/α) η_t^T K_t^T Π_{t+1} K_t η_t + (2 + 2/α) φ_t^T Π_{t+1} φ_t

and, by Lemma 3 (see the Appendix), for β = k_2/(1 + k_2) and k_2 = δ[p_max (1 + p_max h_max²/r_min)²]^{−1}, we get

    V_{t+1}(e_{t+1}) ≤ (1 − β) e_t^T Π_t e_t + (2 + 2/α) η_t^T K_t^T Π_{t+1} K_t η_t + (2 + 2/α) φ_t^T Π_{t+1} φ_t.    (12)

As Π_t = P_t^{−1}, by Lemma 2 (see the Appendix) and Assumption 5(iii) we have

    η_t^T K_t^T Π_{t+1} K_t η_t ≤ (1/p_min) η_t^T K_t^T K_t η_t ≤ (1/p_min) (p_max h_max/r_min)² η_t^T η_t ≤ (η_max/p_min) (p_max h_max/r_min)²    (13)

and

    φ_t^T Π_{t+1} φ_t ≤ φ_sup²/p_min,                                         (14)

where φ_sup := sup_t ‖φ_t‖ < ∞, as ‖φ_t‖ ≤ ‖[∫_0^1 H_{J_t}(s w°_{t+1} + (1 − s) w°_t) ds]^{−1}‖ ‖∇J_t(w°_{t+1})‖ (the first term is bounded by Assumption 4, and the second one by inequality (6)).
To sum up, Eqs. 12, 13, and 14 give

    V_{t+1}(e_{t+1}) ≤ (1 − β) V_t(e_t) + k_4,                                (15)

where

    k_4 := (2 + 2/α) (p_max h_max/r_min)² η_max/p_min + (2 + 2/α) φ_sup²/p_min.    (16)

By applying t times the inequality (15), we get V_t(e_t) ≤ (1 − β)^t V_0(e_0) + k_4 Σ_{i=0}^{t−1} (1 − β)^i. As V_t(e_t) = e_t^T Π_t e_t, by Assumption 5(iii) we get

    ‖e_t‖² ≤ (p_max/p_min) ‖e_0‖² (1 − β)^t + (k_4/p_min) Σ_{i=0}^{t−1} (1 − β)^i.    (17)

Since β ∈ (0, 1), Σ_{i=0}^{∞} (1 − β)^i = 1/β and so, as all the terms of the series are positive, Σ_{i=0}^{t−1} (1 − β)^i ≤ 1/β for all t. Thus, Eq. 17 implies

    ‖e_t‖² ≤ (p_max/p_min) ‖e_0‖² (1 − β)^t + k_4/(β p_min),

which proves that the estimation error is exponentially bounded with c_0 = p_max/p_min, c_1 = 1 − β, and c_2 = k_4/(β p_min) (see Definition 1). □
Theorems 1 and 2 allow one to figure out how the estimate ŵ_t provided by the MEKF algorithm behaves in comparison with the optimal solution; this is pictorially described in Fig. 1.
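As a purely numerical illustration of the behavior depicted in Fig. 1, the step function sketched after Assumption 4 can be run on a toy problem. The scalar model y = 2x and every numerical value below are hypothetical and serve only to show how the estimation error stays bounded; they do not reproduce any experiment of the paper.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical scalar model: gamma(w, x) = w * x, data generated by f(x) = 2x.
    gamma = lambda w, x: np.array([w[0] * x[0]])
    jac_gamma = lambda w, x: np.array([[x[0]]])

    w_hat = np.zeros(1)
    P = 1e-2 * np.eye(1)
    alpha, delta = 1e-2, 1e-2

    for t in range(200):
        x_t = rng.uniform(-1.0, 1.0, size=1)
        y_t = np.array([2.0 * x_t[0]])
        w_hat, P = mekf_step(w_hat, P, x_t, y_t, gamma, jac_gamma,
                             np.eye(1), alpha, delta)

    print(abs(2.0 - w_hat[0]))   # error of the final estimate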
Fig. 1 Qualitative behavior of the sequences {w°_t} and {ŵ_t}, on the basis of Theorems 1 and 2
Some comparisons between Theorem 2 and [7, Theorem 3.1, p. 715] have to be made. A major point consists in the fact that we regard {e_t} and {η_t} as sequences of deterministic variables instead of stochastic variables as in [7]. Note also that, in a stochastic context, Assumption 5(i) implies that the covariance of the regression error is finite.

3 Batch and iterated versions: the BMEKF and IBMEKF algorithms

Batch training enables one to deal more efficiently with large data sets, by dividing the patterns into data batches of fixed length N and by applying the algorithms to one batch at a time [15]. In Fig. 2, the data batch is shifted by d data at each step, where d ≤ N is a positive integer. The MEKF algorithm corresponds to d = N. Given a fixed number of steps, say t, if d = 1, then N + t input–output data are explored; if d = N, then N(t + 1) data are explored. In general, the amount of data processed at step t is equal to N + dt (see Fig. 3). We also define

    G(w, x^{t−1}_{t−N}) := col( γ(w, x_{t−N}), γ(w, x_{t−N+1}), ..., γ(w, x_{t−1}) ) ∈ R^{pN},    (18)

where x^{t−1}_{t−N} := col(x_{t−N}, x_{t−N+1}, ..., x_{t−1}) ∈ X^N (recall that x_t ∈ X ⊂ R^m). Similarly, let y^{t−1}_{t−N} := col(y_{t−N}, y_{t−N+1}, ..., y_{t−1}) ∈ Y^N (recall that y_t ∈ Y ⊂ R^p). A brief code sketch of this windowing and of the stacked map G is given below, after Figs. 2 and 3.
Fig. 2 The d-step batch-mode optimization
Fig. 3 Comparison between the one-step and the N -step batch-mode optimizations
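The sliding-window bookkeeping of Fig. 2 and the stacked map G of Eq. 18 can be made concrete as follows. This is only a sketch under the notation above; gamma and jac_gamma are the same user-supplied placeholders as in the earlier MEKF sketch.

    import numpy as np

    def batch_window(data_x, data_y, step, N, d):
        """Window of N input-output pairs used at batch step `step` = t - N + 1
        (step = 0, 1, ...), shifted by d data at each step, as in Fig. 2."""
        start = d * step
        return data_x[start:start + N], data_y[start:start + N]

    def G(w, x_window, gamma):
        # Stacked map of Eq. (18): concatenates gamma(w, x_i) over the window, shape (p*N,).
        return np.concatenate([gamma(w, x_i) for x_i in x_window])

    def jac_G(w, x_window, jac_gamma):
        # Corresponding stacked Jacobian, shape (p*N, n), used as H_t in Eq. (20).
        return np.vstack([jac_gamma(w, x_i) for x_i in x_window])

After t batch steps, the total amount of data touched is N + d·t, matching the comparison of Fig. 3.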
The extension of the MEKF algorithm to the d-step framework can be expressed as follows.

Batch-mode MEKF (BMEKF) Algorithm  For t = N − 1, N, ...:

    ŵ_{t+1} = ŵ_t + K_t [ y^{d(t−N+1)+N−1}_{d(t−N+1)} − G(ŵ_t, x^{d(t−N+1)+N−1}_{d(t−N+1)}) ],    (19)

where

    H_t = ∂G(w, x)/∂w |_{w=ŵ_t, x=x^{d(t−N+1)+N−1}_{d(t−N+1)}},               (20)

    K_t = P_t H_t^T (H_t P_t H_t^T + R_t)^{−1},                               (21)

    P_{t+1} = (α + 1)(P_t − K_t H_t P_t + δI),                                (22)

δ > 0, α > 0, P_0 and R_t are symmetric positive definite matrices, and Eq. 19 is initialized with a given ŵ_{N−1}.

Note that if d and N are both taken equal to 1, then the BMEKF algorithm corresponds to the MEKF algorithm. It is important to remark that the convergence properties of MEKF (see Sect. 2) apply to BMEKF, too. In the latter, however, at each time t, N input–output pairs are processed instead of one, as in the MEKF algorithm, thus more computations are involved: the matrix to be inverted is of dimension pN × pN instead of p × p (see Eq. 21). Since in BMEKF one has to deal with larger matrices, efficient coding is crucial.

A generalization of the BMEKF algorithm consists in repeating the estimate and covariance updates by using the same batch of input–output patterns. By following the terminology used in [16], the repetitions are called epochs and the corresponding algorithm is called the Iterated Batch-mode Modified EKF (IBMEKF) algorithm, which is described below (a code sketch follows the listing).

Iterated Batch-mode MEKF (IBMEKF) Algorithm  For t = N − 1, N, ...:

    w̃_1 = ŵ_t
    for i = 1, 2, ..., N_E
        H_i = ∂G(w, x)/∂w |_{w=w̃_i, x=x^{d(t−N+1)+N−1}_{d(t−N+1)}}
        K_i = P_i H_i^T (H_i P_i H_i^T + R_i)^{−1}
        P_{i+1} = (α + 1)(P_i − K_i H_i P_i + δI)
        w̃_{i+1} = w̃_i + K_i [ y^{d(t−N+1)+N−1}_{d(t−N+1)} − G(w̃_i, x^{d(t−N+1)+N−1}_{d(t−N+1)}) ]
    end
    ŵ_{t+1} = w̃_{N_E+1},

where δ > 0, α > 0, P_0, and R_i are symmetric positive definite matrices, and N_E is the number of epochs.
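A compact sketch of the IBMEKF inner loop, reusing the stacked map G and Jacobian jac_G sketched after Figs. 2 and 3, is given below. It is meant only to mirror the pseudocode above; gamma, jac_gamma, and the dimensions of R are user-supplied assumptions, not the authors' implementation.

    import numpy as np

    def ibmekf_update(w_hat, P, x_win, y_win, gamma, jac_gamma, R, alpha, delta, n_epochs):
        """One outer step of IBMEKF on a single data window (BMEKF for n_epochs = 1)."""
        w = w_hat.copy()
        y_stack = np.concatenate(list(y_win))                  # stacked outputs, shape (p*N,)
        for _ in range(n_epochs):                              # repeat over the same batch
            H = jac_G(w, x_win, jac_gamma)                     # stacked Jacobian, Eq. (20)
            S = H @ P @ H.T + R                                # here R must be (p*N, p*N)
            K = P @ H.T @ np.linalg.inv(S)                     # Eq. (21)
            P = (alpha + 1.0) * (P - K @ H @ P + delta * np.eye(P.shape[0]))   # Eq. (22)
            w = w + K @ (y_stack - G(w, x_win, gamma))         # Eq. (19)
        return w, P

The only structural difference with respect to the plain MEKF step is that the residual, the Jacobian, and the weighting matrix refer to the whole window, so the matrix inverted at each epoch grows from p × p to pN × pN.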
For N_E = 1, the IBMEKF algorithm reduces to BMEKF. Note that the convergence results of Sect. 2 apply to the IBMEKF algorithm, too. Note also that IBMEKF results from applying a slight modification of the so-called "iterated extended Kalman filter" (see, e.g., [17]) to Problem NLS_t. The possibility of using iterated Kalman filtering techniques to solve nonlinear programming problems is discussed in [3].
4 Application and numerical results

In this section, we apply the MEKF algorithm to the problem of optimizing the weights in nonlinear input–output mappings implemented by feedforward neural networks.

4.1 Optimization of parameters in feedforward neural networks

Feedforward neural networks (in the following, for the sake of brevity, often called "neural networks" or simply "networks") are nonlinear mappings composed of L layers, with ν_s computational units in the layer s (s = 1, ..., L). The input–output mapping of the q-th unit of the s-th layer is given by

    y_q(s) = g( Σ_{p=1}^{ν_{s−1}} w_{pq}(s) y_p(s − 1) + w_{0q}(s) ),   s = 1, ..., L;  q = 1, ..., ν_s,    (23)

where g : R → R is called the activation function. The coefficients w_{pq}(s) and the so-called biases w_{0q}(s) are lumped together into the weights vector w_s. We let w := col(w_1, w_2, ..., w_L) ∈ W ⊂ R^n, where

    n = Σ_{s=0}^{L} ν_{s+1}(ν_s + 1)

is the total number of weights, ν_0 = m, and ν_{L+1} = p. The function implemented by a feedforward neural network with the weights vector w is denoted by γ(w, x), γ : W × X → R^p, where x ∈ X ⊂ R^m is the network input vector.

In applications, the elements of w are optimized on the basis of a data set consisting of input–output pairs (x_i, y_i), i = 0, 1, ..., where x_i ∈ X ⊂ R^m and y_i ∈ Y ⊂ R^p represent the input and the desired output of the network γ, respectively. In the neurocomputing parlance, the process of parameter optimization is called "training" or "learning process."

As recalled in the Introduction, BP is the most popular method for the optimization of parameters in feedforward neural networks [9]. Although BP has been successfully applied in a variety of areas, it suffers from slow convergence, which makes high-dimensional problems intractable. The slowness is to be ascribed both to the use of
the steepest-descent method, which performs poorly in terms of convergence in high-dimensional settings [18], and to the fixed, arbitrarily chosen step length. For these reasons, algorithms using also the second derivatives were developed [19] and modifications to BP have been proposed (see, e.g., the acceleration technique presented in [20] and the approach described in [21], which is aimed at restoring the dependence of the learning rate on time). The determination of the search direction and of the step length by using methods of nonlinear optimization has been considered, for example, in [22].

Deeper insights can be gained by regarding the learning of feedforward neural networks as a problem of parameter estimation. If one assumes that the data are generated by a continuous function f : R^m → R^p, i.e., y_t = f(x_t), then, for every X ⊂ R^m, Assumption 2 holds, i.e., there exists a vector w* ∈ W ⊂ R^n such that

    f(x) = γ(w*, x) + η,   ∀x ∈ X,                                            (24)

where η ∈ K ⊂ R^p is the network approximation error (see, e.g., [23]). Let η_t be the error made in approximating f by the neural network implementing the mapping γ, in correspondence of the input x_t. Then

    y_t = γ(w*, x_t) + η_t,   t = 0, 1, ...,                                  (25)

where η_t is regarded as a disturbance associated with y_t. Equation 25 can be regarded as a nonlinear dynamic system; its state vector is given by the weights vector w, which evolves according to a fictitious dynamics, i.e.,

    w_{t+1} = w_t,   t = 0, 1, ...,                                           (26)

where w_0 = w*. If the activation function in Eq. 23 is differentiable, then Eqs. 25 and 26 motivate using recursive state estimators for neural network training. Following this approach, training algorithms based on the extended Kalman filter (EKF) and showing faster convergence than BP were proposed (see, e.g., [24–29]). However, the advantages of EKF-based training are obtained at the cost of a notable computational burden (as matrix inversions are required) and of a large amount of memory. In [30], an optimization-based learning algorithm for feedforward neural networks was developed that copes with some limitations of BP and does not require matrix inversions.

When applying the MEKF algorithm to the optimization of parameters in neural networks, the choice of R_t may be critical, as it is related to η_t (the "approximation error") and depends on the structure of the network and on the initial choice of the weights. Usually, the matrix R_t is chosen quite large and tuned on line (see, for example, [26, p. 962, formulas 34–36]). As to α, P_0, and δ, too large values may cause slow convergence; by contrast, if they are taken too small, then the training process may suffer from inadequate generalization, i.e., poor capability of "approximating" new data.

In the following, we report the numerical results obtained on two benchmark least-squares problems: interpolation of real data and prediction of chaotic time series. The MEKF algorithm (trainmekf) was compared with nine widely used training algorithms, available from the Matlab Neural Network Toolbox [16]: BFGS (Broyden-Fletcher-Goldfarb-Shanno) quasi-Newton backpropagation (trainbfg), Powell-Beale
conjugate gradient backpropagation (traincgb), Fletcher-Powell conjugate gradient backpropagation (traincgf), Polak-Ribiére conjugate gradient backpropagation (traincgp), gradient descent with momentum and adaptive learning rate backpropagation (traingdx), Levenberg-Marquardt backpropagation (trainlm), one-step secant backpropagation (trainoss), resilient backpropagation (trainrp), and scaled conjugate gradient backpropagation (trainscg). The mean square error (MSE), its standard deviation, and the processing time (in s) obtained by using a "Pentium 4" 1.6 GHz computer were considered as performance indexes with the purpose of comparing the various learning algorithms. The numerical results show that the MEKF outperforms backpropagation and other widespread training algorithms.

4.2 Interpolation of real data: the "Building" PROBEN1 benchmark

The PROBEN1 benchmark collection [31] contains a large set of real data, which can be used for the prediction of energy consumption in a building (electrical energy, hot and cold water) on the basis of date, time of day, outside temperature, outside air humidity, solar radiation, and wind speed. The data set is composed of 2104 records (each consisting of 14 inputs and 3 outputs) for training and 1052 records for testing. The network to be trained is a feedforward neural network with a linear output layer and one hidden layer with four neurons and a sigmoidal activation function.

The BMEKF algorithm was initialized with P_0 = 10^{−2} I. The parameters α and δ were chosen equal to 10^{−2} and 10^{−2}, respectively. Table 1 summarizes the results for different choices of the window size N and d = N. The columns show the MSEs with the corresponding standard deviations and the processing times obtained over 100 trials. Different initial weights were used in each trial, uniformly randomly distributed between 0 and 1. As can be seen from the table, the trainmekf algorithm performs quite well as regards the MSE evaluation of the training set, and outperforms all the other algorithms in terms of the test-set MSE. The processing time for BMEKF is much shorter than for all the other algorithms considered.

4.3 Prediction of chaotic time series

For an integer τ ≥ 1, the discrete-time Mackey-Glass series is given by the following delay-difference equation [32]:

    x_{t+1} = (1 − c_1) x_t + c_2 x_{t−τ}/(1 + (x_{t−τ})^{10}),   t = τ, τ + 1, ....

The training data were generated using the values c_1 = 0.1, c_2 = 0.2, and τ = 36; the initial value x_0 was randomly chosen. The data were arranged in sets made up of 100 series, each one with 2000 samples. The first 1000 time steps of each series were omitted. The successive 500 time steps were used for training and the remaining 500 for testing. The prediction was made by interpolating the values of x_{t+1} and x_{t+2} via the previous l + 1 samples, that is,

    (x_{t+2}, x_{t+1}) ← (x_t, x_{t−1}, x_{t−2}, ..., x_{t−l}),   t = l, l + 1, ...,

where l was chosen equal to 4.
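To make the experimental setup concrete, the sketch below generates a Mackey-Glass series with the stated values of c_1, c_2, and τ, builds the (l + 1)-lag regression pairs, and defines a one-hidden-layer network of the form (23) with sigmoidal hidden units and a linear output layer. The weight layout, the random seed, and the initialization ranges are assumptions chosen only for illustration, not the authors' exact code; the Jacobian ∂γ/∂w needed by MEKF/BMEKF is not shown and could be obtained analytically or by finite differences.

    import numpy as np

    def mackey_glass(n_samples, c1=0.1, c2=0.2, tau=36, x0=None, seed=0):
        # Delay-difference equation x_{t+1} = (1 - c1) x_t + c2 x_{t-tau} / (1 + x_{t-tau}^10).
        rng = np.random.default_rng(seed)
        x = np.empty(n_samples)
        x[:tau + 1] = x0 if x0 is not None else rng.uniform(0.0, 0.4)
        for t in range(tau, n_samples - 1):
            x[t + 1] = (1.0 - c1) * x[t] + c2 * x[t - tau] / (1.0 + x[t - tau] ** 10)
        return x

    def lag_dataset(x, l=4):
        # Inputs (x_t, ..., x_{t-l}) and targets (x_{t+2}, x_{t+1}), t = l, l+1, ...
        X, Y = [], []
        for t in range(l, len(x) - 2):
            X.append(x[t - l:t + 1][::-1])
            Y.append(np.array([x[t + 2], x[t + 1]]))
        return np.array(X), np.array(Y)

    def make_net(n_in, n_hidden, n_out, seed=0):
        # One-hidden-layer network of the form (23): sigmoidal hidden layer, linear output.
        rng = np.random.default_rng(seed)
        n_w = n_hidden * (n_in + 1) + n_out * (n_hidden + 1)   # total number of weights n
        w0 = rng.uniform(0.0, 1.0, size=n_w)                   # initial weights drawn in [0, 1]

        def gamma(w, x):
            W1 = w[:n_hidden * (n_in + 1)].reshape(n_hidden, n_in + 1)
            W2 = w[n_hidden * (n_in + 1):].reshape(n_out, n_hidden + 1)
            h = 1.0 / (1.0 + np.exp(-(W1[:, :-1] @ x + W1[:, -1])))   # sigmoidal units
            return W2[:, :-1] @ h + W2[:, -1]                          # linear output layer
        return gamma, w0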
Table 1 MSEs and processing times for the PROBEN1 data set, using a 4-neuron one-hidden-layer feedforward neural network. For each training algorithm and for window sizes N = 30, 90, and 150, the table reports the training-set MSE ×100 (error ± std), the test-set MSE ×100 (error ± std), and the processing time (s), averaged over 100 trials.
Table 2 MSEs and processing times for Mackey-Glass series prediction, using a 5-neuron one-hidden-layer feedforward neural network. For each training algorithm and for window sizes N = 30, 90, and 150, the table reports the training-set MSE ×100 (error ± std), the test-set MSE ×100 (error ± std), and the processing time (s), averaged over 100 trials.
Fig. 4 Mackey-Glass series predictions of x_{t+1} by a 5-neuron one-hidden-layer neural network
Fig. 5 Mackey-Glass series predictions of x_{t+2} by a 5-neuron one-hidden-layer neural network
Table 2 summarizes the MSE and processing time results averaged over 100 different trials with 5-neuron one-hidden-layer neural networks. In each trial, the time series were initialized with different initial conditions uniformly distributed between 0 and 0.4; the initial neural weights were randomly chosen between 0 and 1 according to a uniform distribution. The MEKF algorithm was initialized with P_0 = 10^{−2} I. The parameters α and δ were chosen equal to 10^{−2} and 10^{−2}, respectively. As can be seen in Table 2, trainmekf outperforms all the other algorithms in terms of both the MSE and the computational load ("time" column). Figures 4 and 5 enable one to compare the prediction capabilities obtained by the four algorithms trainmekf, traincgf, traincgb, and traincgp in two simulation runs using the training and test sets, respectively.

5 Conclusions

The solution of nonlinear least-squares problems has been addressed via an algorithm based on the EKF. The proposed algorithm, called the "modified extended Kalman filter" (MEKF), has been analyzed and its estimation error has been proved to be exponentially bounded. In addition, the algorithm is well suited to being used in batch and iterated modes.

As an application, the problem of optimizing the parameters of nonlinear input–output mappings implemented by feedforward neural networks has been considered. Numerical results show interesting features and very good performance of MEKF, which outperforms many widespread algorithms for neural network training. This may be ascribed to the computation of the covariance matrix, which provides a useful measure of the uncertainty associated with the estimate of the optimal weights, though it does not correspond to the real covariance of the underlying stochastic process. However, in a fair evaluation of the pros and cons, the storage requirements of the covariance matrix have to be taken into account.

Appendix

In this appendix, the technical lemmas used for the proof of Theorem 2 are stated and proved.

Lemma 1  Let Π_t = P_t^{−1}. Under Assumption 5(iii), for every a, b > 0, the following inequalities hold for t = 0, 1, ...:

    2[−(I − K_t Φ_t) e_t]^T Π_{t+1} K_t η_t ≤ a e_t^T (I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) e_t + (1/a) η_t^T K_t^T Π_{t+1} K_t η_t,    (27)

    2[(I − K_t Φ_t) e_t]^T Π_{t+1} φ_t ≤ b e_t^T (I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) e_t + (1/b) φ_t^T Π_{t+1} φ_t,    (28)

    −φ_t^T Π_{t+1} K_t η_t ≤ φ_t^T Π_{t+1} φ_t + η_t^T K_t^T Π_{t+1} K_t η_t.    (29)

Proof  Recall that, for every positive integer s, every symmetric positive definite matrix X ∈ R^{s×s}, and every v_1, v_2 ∈ R^s, Young's inequality [33] gives

    2 v_1^T v_2 ≤ v_1^T X v_1 + v_2^T X^{−1} v_2.                             (30)

By applying Eq. 30 to the left-hand side of Eq. 27 with v_1 = −(I − K_t Φ_t) e_t, v_2 = Π_{t+1} K_t η_t, and X = a Π_{t+1}, a > 0, we obtain

    2[−(I − K_t Φ_t) e_t]^T Π_{t+1} K_t η_t ≤ a e_t^T (I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) e_t + (1/a) η_t^T K_t^T Π_{t+1} K_t η_t.

Similarly one can prove Eq. 28, by using Eq. 30 with v_1 = (I − K_t Φ_t) e_t, v_2 = Π_{t+1} φ_t, and X = b Π_{t+1}, b > 0, and Eq. 29, by using Eq. 30 with v_1 = (1/√2) φ_t, v_2 = (1/√2) Π_{t+1} K_t η_t, and X = 2 Π_{t+1}. □

Lemma 2  Under Assumptions 5(ii) and (iii), the following inequality holds for t = 0, 1, ...:

    ‖K_t‖ ≤ p_max h_max / r_min.                                              (31)

Proof  From Eq. 3 we have

    ‖K_t‖ = ‖P_t H_t^T (H_t P_t H_t^T + R_t)^{−1}‖ ≤ ‖P_t‖ ‖H_t^T‖ ‖(H_t P_t H_t^T + R_t)^{−1}‖.    (32)

As H_t P_t H_t^T is positive semidefinite and R_t is positive definite, we obtain H_t P_t H_t^T + R_t ≥ R_t ≥ r_min I, hence (H_t P_t H_t^T + R_t)^{−1} ≤ R_t^{−1} ≤ (1/r_min) I and so

    ‖(H_t P_t H_t^T + R_t)^{−1}‖ ≤ 1/r_min.

The above inequality, Assumptions 5(ii) and (iii), and Eq. 32 give Eq. 31. □

Lemma 3  Under Assumptions 5(ii) and 6, for every α > 0 the following inequality holds for t = 0, 1, ...:

    (α + 1)(I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) ≤ (1 − β) Π_t,

where Π_t = P_t^{−1}, β = k_2/(1 + k_2) < 1, and k_2 = δ[p_max (1 + p_max h_max²/r_min)²]^{−1}.

Proof  From Eqs. 3 and 4 we have

    P_{t+1} = (α + 1)(P_t − K_t H_t P_t + δI) = (α + 1)(P_t − P_t H_t^T (H_t P_t H_t^T + R_t)^{−1} H_t P_t + δI)    (33)
            = (α + 1)(P_t − P_t H_t^T K_t^T + δI) = (α + 1)((I − K_t H_t) P_t (I − K_t H_t)^T + K_t H_t P_t (I − K_t H_t)^T + δI).    (34)

Consider the term K_t H_t P_t (I − K_t H_t)^T in Eq. 34. By using Eq. 3 and the matrix equality (7B.5) in [12, p. 262], we obtain (P_t^{−1} + H_t^T R_t^{−1} H_t)^{−1} = P_t − P_t H_t^T (H_t P_t H_t^T + R_t)^{−1} H_t P_t = (I − P_t H_t^T (H_t P_t H_t^T + R_t)^{−1} H_t) P_t = (I − K_t H_t) P_t. Thus,

    I − K_t H_t = (P_t^{−1} + H_t^T R_t^{−1} H_t)^{−1} P_t^{−1} > 0.          (35)

As K_t H_t = P_t H_t^T (H_t P_t H_t^T + R_t)^{−1} H_t is positive semidefinite, by (35) the matrix K_t H_t P_t (I − K_t H_t)^T is positive semidefinite, too. Hence, from (34) we get P_{t+1} ≥ (α + 1)((I − K_t H_t) P_t (I − K_t H_t)^T + δI) and, by left and right multiplying by (I − K_t H_t)^{−1} and [(I − K_t H_t)^T]^{−1}, respectively, we have

    [1/(α + 1)] (I − K_t H_t)^{−1} P_{t+1} [(I − K_t H_t)^T]^{−1} ≥ P_t + δ (I − K_t H_t)^{−1} [(I − K_t H_t)^T]^{−1}.    (36)

Lemma 2 and Assumption 5(ii) enable us to obtain

    I − K_t H_t ≤ ‖I − K_t H_t‖ I ≤ (‖I‖ + ‖K_t H_t‖) I ≤ (1 + ‖K_t‖ ‖H_t‖) I ≤ (1 + p_max h_max²/r_min) I

and so, by inverting both sides,

    (I − K_t H_t)^{−1} ≥ [1/(1 + p_max h_max²/r_min)] I.

By Eq. 36, since I = P_t P_t^{−1} ≥ P_t/p_max by Assumption 5(iii), the previous inequality gives

    [1/(α + 1)] (I − K_t H_t)^{−1} P_{t+1} [(I − K_t H_t)^T]^{−1} ≥ (1 + k_2) P_t,

where k_2 = δ[p_max (1 + p_max h_max²/r_min)²]^{−1} > 0. By inverting both sides of this inequality, we obtain

    (α + 1)(I − K_t H_t)^T P_{t+1}^{−1} (I − K_t H_t) ≤ [1/(1 + k_2)] P_t^{−1}.    (37)

By Assumption 6 we can bound from above as follows:

    (I − K_t Φ_t)^T P_{t+1}^{−1} (I − K_t Φ_t) ≤ λ_max[(I − K_t Φ_t)^T P_{t+1}^{−1} (I − K_t Φ_t)] I ≤ λ_min[(I − K_t H_t)^T P_{t+1}^{−1} (I − K_t H_t)] I ≤ (I − K_t H_t)^T P_{t+1}^{−1} (I − K_t H_t).

By Eq. 37 the latter inequality gives (I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) ≤ [1/(α + 1)][1/(1 + k_2)] Π_t, that is, (α + 1)(I − K_t Φ_t)^T Π_{t+1} (I − K_t Φ_t) ≤ [1/(1 + k_2)] Π_t = (1 − β) Π_t. □

References

1. Bates, D.M., Watts, D.G.: Nonlinear Regression and Its Applications. Wiley, New York (1988)
2. Bertsekas, D.P.: Nonlinear Programming, 2nd edn. Athena Scientific, Belmont (1999)
3. Bertsekas, D.P.: Incremental least-squares methods and the extended Kalman filter. SIAM J. Optim. 6(3), 807–822 (1996)
4. Feldkamp, L.A., Prokhorov, D.V., Eagen, C.F., Yuan, F.: Enhanced multi-stream Kalman filter training for recurrent networks. In: Suykens, J., Vandewalle, J. (eds.) Nonlinear Modeling: Advanced Black-Box Techniques, pp. 29–53. Kluwer Academic, Dordrecht (1998)
5. Moriyama, H., Yamashita, N., Fukushima, M.: The incremental Gauss–Newton algorithm with adaptive stepsize rule. Comput. Optim. Appl. 26(2), 107–141 (2003)
6. Shuhe, H.: Consistency for the least squares estimator in nonlinear regression model. Stat. Probab. Lett. 67(2), 183–192 (2004)
7. Reif, K., Günter, S., Yaz, E., Unbehauen, R.: Stochastic stability of the discrete-time extended Kalman filter. IEEE Trans. Autom. Control 44(4), 714–728 (1999)
8. Kůrková, V., Sanguineti, M.: Learning with generalization capability by kernel methods of bounded complexity. J. Complex. 21(3), 350–367 (2005)
9. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning internal representation by error propagation. In: Rumelhart, D.E., McClelland, J.L., PDP Research Group (eds.) Parallel Distributed Processing: Explorations in the Microstructures of Cognition, vol. I: Foundations, pp. 318–362. MIT, Cambridge (1986)
10. Widrow, B., Lehr, M.A.: 30 years of adaptive neural networks: perceptron, madaline, and backpropagation. Proc. IEEE 78(9), 1415–1442 (1990)
11. Ortega, J.M.: Numerical Analysis: A Second Course. SIAM, Philadelphia (1990). Reprint of the 1972 edition by Academic, New York
12. Jazwinski, A.H.: Stochastic Processes and Filtering Theory. Academic, New York (1970)
13. Pinkus, A.: Approximation theory of the MLP model in neural networks. Acta Numer. 8, 143–196 (1999)
14. Park, J., Sandberg, I.W.: Universal approximation using radial-basis-function networks. Neural Comput. 3(2), 246–257 (1991)
15. Heskes, T., Wiegerinck, W.: A theoretical comparison of batch-mode, on-line, cyclic, and almost-cyclic learning. IEEE Trans. Neural Netw. 7(4), 919–925 (1996)
16. Demuth, H., Beale, M.: Neural Network Toolbox—User's Guide. The MathWorks, Natick (2000)
17. Bell, B.M., Cathey, F.W.: The iterated Kalman filter update as a Gauss–Newton method. IEEE Trans. Autom. Control 38(2), 294–297 (1993)
18. Fletcher, R.: Practical Methods of Optimization. Wiley, Chichester (1987)
19. Battiti, R.: First- and second-order methods for learning: between steepest descent and Newton's method. Neural Comput. 4(2), 141–166 (1992)
20. Tollenaere, T.: SuperSAB: fast adaptive backpropagation with good scaling properties. Neural Netw. 3(5), 561–573 (1990)
21. Jacobs, R.A.: Increased rates of convergence through learning rate adaptation. Neural Netw. 1(4), 295–307 (1988)
22. Denton, J.W., Hung, M.S.: A comparison of nonlinear optimization methods for supervised learning in multilayer feedforward neural networks. Eur. J. Oper. Res. 93(2), 358–368 (1996)
23. Stinchcombe, M., White, H.: Approximation and learning unknown mappings using multilayer feedforward networks with bounded weights. In: Proc. Int. Joint Conf. on Neural Networks IJCNN'90, pp. III7–III16 (1990)
24. Singhal, S., Wu, L.: Training multilayer perceptrons with the extended Kalman algorithm. In: Touretzky, D.S. (ed.) Advances in Neural Information Processing Systems 1, pp. 133–140. Morgan Kaufmann, San Mateo (1989)
25. Ruck, D.W., Rogers, S.K., Kabrisky, M., Maybeck, P.S., Oxley, M.E.: Comparative analysis of backpropagation and the extended Kalman filter for training multilayer perceptrons. IEEE Trans. Pattern Anal. Mach. Intell. 14(6), 686–691 (1992)
26. Iiguni, Y., Sakai, H., Tokumaru, H.: A real-time learning algorithm for a multilayered neural network based on the extended Kalman filter. IEEE Trans. Signal Process. 40(4), 959–966 (1992)
27. Schottky, B., Saad, D.: Statistical mechanics of EKF learning in neural networks. J. Phys. A: Math. Gen. 32(9), 1605–1621 (1999)
28. Nishiyama, K., Suzuki, K.: H∞-learning of layered neural networks. IEEE Trans. Neural Netw. 12(6), 1265–1277 (2001)
29. Leung, C.-S., Tsoi, A.-C., Chan, L.W.: Two regularizers for recursive least squared algorithms in feedforward multilayered neural networks. IEEE Trans. Neural Netw. 12(6), 1314–1332 (2001)
30. Alessandri, A., Sanguineti, M., Maggiore, M.: Optimization-based learning with bounded error for feedforward neural networks. IEEE Trans. Neural Netw. 13(2), 261–273 (2002)
31. Prechelt, L.: PROBEN 1—A set of neural network benchmark problems and benchmarking rules. Tech. Rep. 21/94, Fakultät für Informatik, Universität Karlsruhe, Germany, September 1994. Anonymous FTP: /pub/papers/techreports/1994/1994-21.ps.gz on ftp.ira.uka.de
32. Mackey, M.C., Glass, L.: Oscillation and chaos in physiological control systems. Science 197, 287–289 (1977)
33. Hardy, G., Littlewood, J.E., Polya, G.: Inequalities. Cambridge University Press, Cambridge (1989)