Optimal Selection of Embedding Parameters for Time Series Modelling

Michael Small∗ and Chi K. Tse∗

Abstract — Time delay embedding is the first step in the reconstruction of deterministic nonlinear dynamics from a time series. Unfortunately, there is no generic way to select the best time delay embedding. We show that for time series modelling it is possible to apply information theoretic arguments which lead to an optimal selection of the embedding window. Our results show that selection of embedding dimension and embedding lag should be considered not as part of the embedding process but as part of the modelling procedure. Nonlinear time series modelling results show qualitative and quantitative improvement in both long term and short term dynamics.

∗ Department of Electronic and Information Engineering, Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China. E-mail: [ensmall,encktse]@polyu.edu.hk; tel.: +852 2766 4744; fax: +852 2362 8439.
1 INTRODUCTION
Takens’ embedding theorem [1] is very often invoked as the motivation for applying a time delay embedding to reconstruct multi-dimensional dynamics from a scalar variable. Let x_t be the scalar observable observed at integer times t = 0, 1, 2, 3, . . . , N. The usual incarnation of the time delay embedding is to obtain vector variables v_t such that

$v_t = (x_t, x_{t-\tau}, x_{t-2\tau}, \ldots, x_{t-(d_e-1)\tau})$   (1)

and, by appealing to the theorem of Takens, one claims that for suitable τ and sufficiently large d_e and N the evolution of v_t is topologically equivalent to the underlying dynamical system.
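As a concrete illustration, the following is a minimal sketch of the embedding (1) in Python/numpy; the function name and interface are ours, for illustration only, and are not from the original paper.

```python
import numpy as np

def delay_embed(x, d_e, tau):
    """Time delay embedding (1): row k is v_t for t = (d_e - 1) * tau + k."""
    x = np.asarray(x, dtype=float)
    window = (d_e - 1) * tau  # how far each vector reaches into the past
    # stack the lagged copies x_{t - i*tau}, i = 0, ..., d_e - 1, as columns
    return np.column_stack(
        [x[window - i * tau : len(x) - i * tau] for i in range(d_e)]
    )

# example: a noisy sine wave embedded with d_e = 3 and tau = 2
x = np.sin(0.1 * np.arange(1000)) + 0.05 * np.random.randn(1000)
V = delay_embed(x, d_e=3, tau=2)  # V[k] = (x_t, x_{t-2}, x_{t-4})
```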
Unfortunately, N will normally be constrained and there is no generic rule for the selection of d_e and τ. Within the dynamical systems community, methods such as minimum mutual information, false nearest neighbours and plateau onset of dynamical invariants are commonly applied [2]. Engineers would be more familiar with the Nyquist limit, which implies an absolute criterion in the case of systems which exhibit finite bandwidth (this is not strictly applicable to deterministic aperiodic nonlinear systems). Very often, the aim of reconstructions such as (1) is to successfully estimate dynamic invariants of the underlying system (such as the correlation dimension and the leading Lyapunov exponent) [2].
In this case one can appeal to theoretical results that suggest d_e > 2d_c + 1 [1], d_e > d_c [3], or that only suitable selection of τ is significant [4]. Contradictory numerical results have also shown that the crucial parameter is actually the embedding window d_e τ [5]. In this paper we ask a slightly different question, and naturally arrive at a different answer. We are interested not in correct estimation of dynamic invariants but only in optimal reconstruction of the underlying dynamics for a specific finite noisy time series. We find that in this situation the choice of embedding lag τ should be left to the modelling algorithm. In fact, our results suggest that embedding and modelling are two parts of the same process and it is generally not possible to find the optimal embedding parameters without first building a model (and vice versa!). We derive an expression for the optimal embedding window as a function of the underlying dynamics, the system noise and the observation length N. Using this measure we provide an algorithm which can be used to estimate this embedding window and show that this method can produce superior modelling results. In section 2 we discuss the necessary theoretical framework. Section 3 describes the numerical modelling algorithm and in section 4 we present some modelling results.
2 THE CRITERION
We first need to define what we mean by the “best” model. Suppose that a time series x_t of N observations has been observed and that we wish to construct an embedding such that

$z_t = (x_{t-\ell_1}, x_{t-\ell_2}, x_{t-\ell_3}, \ldots, x_{t-\ell_n})$   (2)
where the embedding lags ℓ_i satisfy $0 \le \ell_1 \le \ell_i < \ell_{i+1} \le \ell_n = d_w$. Notice that (2) represents a slight generalisation of (1). Equation (2) is completely defined by d_w and a binary vector $a = (a_1, a_2, \ldots, a_{d_w}) \in \{0,1\}^{d_w}$ such that a_j = 1 ⇐⇒ j = ℓ_i for some i. Our objective is to obtain a model f of the underlying dynamics from (2) such that

$x_{t+1} = f(z_t) + e_t$   (3)

where the prediction errors e_t are minimal.
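Before continuing, here is a sketch of the generalised embedding (2) and the alignment of each z_t with its target x_{t+1} in (3); as before, the helper names and interfaces are illustrative rather than the authors' own.

```python
import numpy as np

def lagged_embed(x, lags):
    """Generalised embedding (2): row k is z_t, t = d_w + k, for the given lag set."""
    x = np.asarray(x, dtype=float)
    d_w = max(lags)  # the embedding window, l_n = d_w
    return np.column_stack(
        [x[d_w - l : len(x) - l] for l in sorted(lags)]
    )

# pair each z_t with the target x_{t+1} of the model (3)
x = np.random.randn(500)            # placeholder series
lags = [1, 2, 5]                    # i.e. a = (1, 1, 0, 0, 1) and d_w = 5
Z = lagged_embed(x, lags)           # z_t = (x_{t-1}, x_{t-2}, x_{t-5})
y = x[max(lags) + 1 :]              # x_{t+1} for t = d_w, ..., N - 1
Z = Z[: len(y)]                     # the final z_t has no target; drop it
```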
DL(x) = DL(x|f ) + DL(f )
(4)
is minimised. The first term, DL(x|f), is the description length of the data given the model; this is the description length of the model prediction errors e_t:

$DL(x|f) = -\ln P(x \,|\, N(0, \sigma^2))$   (5)
where we approximate e_t ∼ N(0, σ²). The second term, DL(f), is the cost of specifying the particular model we use: the model parameters and the initial conditions of that model. The term DL(f) is dependent on the embedding parameters a and d_w (defined above). The description length of the data is (roughly) the number of bits needed to describe the data to some fixed precision. One can either specify the data completely, or specify a model of that data together with the model prediction errors. The rationale of minimum description length is that a compact (efficient) description of the data is best. Previously, Judd and Mees [7] provided an algorithm to achieve an approximate solution to (4) for fixed a and d_w. In this contribution we investigate the optimisation of (4) with respect to a and d_w. If we restrict our attention to the variation of (4) with the embedding parameters, we can write the second term as
$DL(f) = -\ln P(x_0, x_1, \ldots, x_{d_w-1} \,|\, N(\mu_X, \sigma_X^2)) + DL(a) + DL(\mathcal{P}).$   (6)
The first term on the right hand side is the description length of the model initial conditions, which we approximate as d_w Gaussian random variables with mean µ_X and variance σ_X². The second term is the description length of the embedding parameter a, and clearly DL(a) = d_w. The third term is the description length of the model parameters P. Finally, combining (4), (5) and (6), and expanding the negative log likelihood terms, we obtain

$DL(x) = N \ln \sqrt{2\pi}\,\sigma + d_w\left(1 + \ln \frac{\sigma_X}{\sigma}\right) + \frac{N}{2} + DL(\mu_X) + DL(\mathcal{P}).$   (7)

For a given model all but the last term are readily computable, and this last term may be estimated using the methodology described in [7].
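Equation (7) is then straightforward to evaluate numerically. The following is a minimal sketch under the stated Gaussian approximations; the terms DL(µ_X) and DL(P) are passed in as a single constant, which suffices when comparing embedding windows (and is zero for the parameter free model of the next section). The function name and interface are ours.

```python
import numpy as np

def description_length(e, x, d_w, dl_rest=0.0):
    """Evaluate equation (7) for prediction errors e and data x.

    e       : model prediction errors e_t, approximated as N(0, sigma^2)
    x       : observed time series; its first d_w values are the initial
              conditions, modelled as N(mu_X, sigma_X^2) as in (6)
    dl_rest : the remaining DL(mu_X) + DL(P) terms, treated as a constant
    """
    N = len(e)
    sigma = np.std(e)    # scale of the prediction errors, cf. (5)
    sigma_X = np.std(x)  # scale of the data, cf. (6)
    return (
        N * np.log(np.sqrt(2.0 * np.pi) * sigma)  # DL(x|f), the errors
        + d_w * (1.0 + np.log(sigma_X / sigma))   # initial conditions and DL(a)
        + N / 2.0
        + dl_rest
    )
```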
Figure 1: Computation of description length as a function of embedding window d_w. The solid line is proportional to ln σ and is non-increasing (and constant for d_w ≥ 34). The upper line is an evaluation of (7), which exhibits a minimum at d_w = 15.

Notice that equation (7) does not depend on the actual embedding vector a. Therefore the description length is not dependent on the particular embedding chosen, and all candidate embedding lags may be presented to the modelling algorithm, which selects the optimal set. Hence, prior to modelling, it is only necessary to determine the maximum embedding lag d_w.

3 THE ALGORITHM
In an effort to make the minimisation of (7) tractable we make one substantial simplification. Instead of optimising over all possible models f, or over a large class of parameterised nonlinear functions, we restrict ourselves to local constant nonlinear models [8]. We choose s ≠ t such that ‖z_s − z_t‖ is minimal; then the “model” is given by

$f(z_t) = x_{s+1}.$   (8)
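A brute force sketch of this local constant predictor, using the illustrative lagged_embed convention above (row k of Z corresponds to time t = d_w + k):

```python
import numpy as np

def local_constant_errors(x, Z, d_w):
    """Prediction errors of the local constant model (8).

    For each z_t, the prediction of x_{t+1} is x_{s+1}, where z_s is the
    nearest neighbour of z_t with s != t.
    """
    x = np.asarray(x, dtype=float)
    n = len(Z) - 1  # restrict to rows with a known target x_{t+1}
    e = np.empty(n)
    for k in range(n):
        dist = np.linalg.norm(Z[:n] - Z[k], axis=1)  # ||z_s - z_t|| for all s
        dist[k] = np.inf                             # exclude s == t
        s = np.argmin(dist)                          # nearest neighbour
        e[k] = x[d_w + k + 1] - x[d_w + s + 1]       # e_t = x_{t+1} - x_{s+1}
    return e
```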
Clearly, this model is not predictive. But this formulation has the advantage that it is extremely robust and also provides a good characterisation of the topological properties of a chosen embedding. One further advantage of (8) is that this modelling scheme is entirely parameter free¹, and we may set DL(P) = 0. To choose the optimal d_w we must compute (7) as a function of d_w. To achieve this we employ the following algorithm (a sketch in code follows the description):

• Let d = 0 and let a = ∅ be the set of selected lags. Initialise the model prediction errors so that e_t = x_t.
• Repeat until a minimum of (7) is reached:
  – Compute the description length for d_w = d according to (7).
  – Compute the prediction error of the local constant model with time delay embedding (2) such that ℓ_i ∈ a ∪ {d + 1} for all i. If this is smaller than the current best, let a = a ∪ {d + 1} and update the current best model prediction error.
  – Increment d.

¹Alternatively, one could legitimately argue that the data are the parameters. In either case DL(P) is constant.
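Putting the pieces together, the loop below sketches our reading of this algorithm in terms of the illustrative helpers defined earlier (lagged_embed, local_constant_errors and description_length). The stopping rule, which detects that (7) has passed through a minimum, is a simple heuristic not specified in the paper.

```python
import numpy as np

def select_embedding(x, d_max=50):
    """Greedy selection of the lag set a and the embedding window minimising (7)."""
    a = []                                 # selected lags, initially empty
    e = np.asarray(x, dtype=float)         # initial prediction errors: e_t = x_t
    best = np.sum(e ** 2)
    dl = []                                # description length as a function of d_w
    for d in range(1, d_max + 1):
        dl.append(description_length(e, x, d_w=d))
        trial = sorted(a + [d])            # tentatively admit the next lag
        Z = lagged_embed(x, trial)
        e_new = local_constant_errors(x, Z, d_w=max(trial))
        if np.sum(e_new ** 2) < best:      # keep the lag only if it helps
            a, e, best = trial, e_new, np.sum(e_new ** 2)
        if len(dl) >= 3 and dl[-1] > dl[-2] > dl[-3]:
            break                          # (7) has been increasing: past its minimum
    d_w_opt = int(np.argmin(dl)) + 1
    return a, d_w_opt, dl
```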
At each iteration of the algorithm we recompute (7) for the new value of d_w, but only update the model to include the largest lag if it actually improves the result. This dependence on a is necessary only because the modelling algorithm is not sophisticated enough to select the optimal embedding parameters itself. Figure 1 demonstrates that this added complication is sufficient to achieve the required results. The curve depicting the model prediction error is smooth and a decreasing function of d_w, as expected. In figure 1 we also note that the optimal embedding window, in the sense of minimal prediction error, occurs at d_w = 34. For this data set we obtain a value of τ ≈ 8 by examining the autocorrelation curve, and the embedding dimension is known to be d_e = 3 or 4. Therefore we see good agreement between the new algorithm and the expected results. In the next section we describe the application of this algorithm to the selection of model lags for certain simple test systems and for experimental time series data.

4 THE MODEL
To test the application of this algorithm we consider data generated by the chaotic Rössler system and the Ikeda map. Both systems are affected by 5% noise (i.e. σ_X/σ = 20, the noise standard deviation being one twentieth of that of the signal) and we vary the length of the data. Figure 2 depicts the results of these calculations. As expected, figure 2 shows that the value of d_w which minimises the prediction error σ is an upper bound on the value which minimises (7). For the Rössler time series we see that the optimal embedding window increases with time series length. This is consistent with our intuition: as the amount of available data increases we become more confident of our predictive ability and look further into the past to predict the future. Furthermore, we see that the selected values correspond to significant time lags in the data: according to a calculation of the autocorrelation, its first zero occurs at a lag of about 8 and the period is approximately 31.
Figure 2: Variation of d_w with N for the Rössler (top panel) and Ikeda (bottom panel) systems with 5% observational noise. In each panel the horizontal axis is ln(time series length N) and the vertical axis is the embedding window d_w. The bar chart and asterisks depict, respectively, results for the algorithm described here and according to the local model.
Our calculation for the Ikeda time series shows very little variation with time series length, and no distinction between the value which minimises (5) and that which minimises (7). Because this is a low dimensional chaotic map, the best embedding lag is τ = 1. Although the map is 2 dimensional, it is highly “twisted” and requires 3 dimensions for a time delay embedding to have no self intersections. This is consistent with the optimal embedding window d_w = 3. Higher values of d_w for longer data sets are the result of additional information from previous time series observations being available to “smooth” the noise in the system. We have conducted similar calculations for various noise levels and found the results to be consistent with those presented here.

In this communication we also present results for experimental data: the sunspot time series and chaotic laser dynamics. The experimental origin of the data and further examples are given in [9]. Figure 3 depicts the experimental sunspot time series and nonlinear models built using the minimum description length nonlinear modelling scheme described in [10]. We compare the typical dynamic behaviour of models built using embedding parameters estimated in the standard way (d_e = 6 and τ = 3) and using the optimal embedding window (d_w = 6). The optimal embedding window actually
Figure 3: The top panel shows the sunspot time series data. The middle and bottom panels show representative iterated (noise-free) predictions from models built either with a standard embedding (centre) or with an embedding window estimated by the method described here (bottom).
uses less information about the system (in the sense that the embedding does not look as far into the past), but its dynamic performance is superior. The asymptotic dynamics exhibited by the model built using a standard embedding are a stable focus; for the window embedding technique we observe chaos. We also observed similar improvement in dynamic performance for models built from the Rössler system and the chaotic laser time series [9]. Finally, table 1 presents a summary of the modelling results for each of the three continuous systems considered in this communication. As the model building scheme is stochastic, the results quoted are the mean of 50 nonlinear models. We see that in each case the model built using the embedding window suggested by our new algorithm produced superior results: lower description length and lower prediction error. The model size (number of nonlinear terms in the model) was comparable in each case.

ACKNOWLEDGEMENTS

This work was supported by a Hong Kong Polytechnic University Research Grant (No. G-YW55).
model                        MDL      size    RMS
Rössler  (d_e = 4, τ = 8)    −655     15.6    0.158
Rössler  (d_w = 15)          −716     21.1    0.151
sunspots (d_e = 6, τ = 3)    1267.9   7.32    13.16
sunspots (d_w = 6)           1230.1   6.96    12.31
laser    (d_e = 5, τ = 2)    5753.6   100.8   2.405
laser    (d_w = 10)          5239.8   109.5   1.767
Table 1: Comparison of model performance with a standard constant lag embedding and with embedding over the window suggested by the new algorithm. MDL is the model description length, size the number of nonlinear terms in the model, and RMS the root mean square prediction error; each value is the mean over 50 models.

References

[1] F. Takens. Detecting strange attractors in turbulence. Lecture Notes in Mathematics, 898:366–381, 1981.
[2] H.D.I. Abarbanel. Analysis of Observed Chaotic Data. Institute for Nonlinear Science. Springer-Verlag, New York, 1996.
[3] M. Ding et al. Plateau onset for correlation dimension: when does it occur? Physical Review Letters, 70:3872–3875, 1993.
[4] Y.C. Lai and D. Lerner. Effective scaling regime for computing the correlation dimension from chaotic time series. Physica D, 115:1–18, 1998.
[5] H.S. Kim, R. Eykholt, and J.D. Salas. Delay time window and plateau onset of the correlation dimension for small data sets. Physical Review E, 58:5676–5682, 1998.
[6] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore, 1989.
[7] K. Judd and A. Mees. On selecting models for nonlinear time series. Physica D, 82:426–444, 1995.
[8] G. Sugihara and R.M. May. Nonlinear forecasting as a way of distinguishing chaos from measurement error in time series. Nature, 344:734–741, 1990.
[9] M. Small and C.K. Tse. Optimal embedding parameters: A modelling paradigm.
[10] M. Small and C.K. Tse. Minimum description length neural networks for time series prediction. Physical Review E, 66:066706, 2002.