THE MODEL BASED APPROACH TO SMOOTHING

by

SONIA V. T. MAZZI

Lic., Universidad Nacional de Cordoba, Argentina, 1989
M.Sc., The University of British Columbia, 1992

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF
THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY

in

THE FACULTY OF GRADUATE STUDIES
DEPARTMENT OF STATISTICS

We accept this thesis as conforming to the required standard

THE UNIVERSITY OF BRITISH COLUMBIA
January 1997

© Sonia V. T. Mazzi, 1997
In presenting this thesis in partial fulfilment of the requirements for an advanced degree at the University of British Columbia, I agree that the Library shall make it freely available for reference and study. I further agree that permission for extensive copying of this thesis for scholarly purposes may be granted by the head of my department or by his or her representatives. It is understood that copying or publication of this thesis for financial gain shall not be allowed without my written permission.

Department of Statistics
The University of British Columbia
Vancouver, Canada
The model based approach to smoothing

Abstract

Nonparametric regression methods are devised in order to obtain a smooth fit of a regression curve without having to specify a parametric model. This, in principle, provides a flexible approach to curve estimation. It is shown how splines are not such a flexible tool for smoothing. It is proposed to use the modern approach of state space modeling instead, to solve a variety of smoothing problems. This model based approach alleviates estimation and computational problems as well as provides insightful solutions to smoothing problems.
Contents

Abstract ii
Table of Contents iii
List of Tables viii
List of Figures ix
Acknowledgements x

1 Introduction 1
  1.1 Motivation 1
  1.2 Contributions 6
  1.3 Outline 8

2 The State space model 10
  2.1 Definition of the state space model 14
  2.2 Some examples of state space models 18
    2.2.1 The parametric regression model as a SSM 18
    2.2.2 Structural models 19
    2.2.3 ARIMA models as state space models 24
  2.3 Principles and tools of prediction and estimation for the SSM 28
    2.3.1 Minimum mean square error prediction 30
    2.3.2 Combining predictors and iterated prediction 32
    2.3.3 The prediction and Kalman update formulas 32
    2.3.4 The Kalman filter 34
    2.3.5 The smoothing filter 35
  2.4 The diffuse Kalman filter 39
    2.4.1 The diffuse likelihood 40
    2.4.2 Diffuse prediction 41
    2.4.3 Diffuse smoothing 42
  2.5 State space models which have applied since time immemorial 43
    2.5.1 Prediction and likelihood evaluation for non-stationary models 44
  2.6 Cross-validation and related methods of estimation 46
    2.6.1 Cross-validation for state space models 47
    2.6.2 Generalised cross-validation and state space models 48
    2.6.3 Generalised cross-validation and diffuse state space models 49
  2.7 Summary 50

3 Continuous state space models 51
  3.1 Definition of continuous state space models 53
  3.2 Discrete version of a continuous state space model 53
  3.3 Estimation using the continuous state space model 54
    3.3.1 The state transition matrix 54
    3.3.2 Variance matrix of the state error 55
    3.3.3 Initialization of the Diffuse Kalman filter when using a continuous state space model 63
  3.4 Smoothing L-Splines 67
    3.4.1 Definition of L-splines 68
    3.4.2 A stochastic model for L-spline smoothing 69
  3.5 Some continuous structural models 76
  3.6 Summary 82

4 Smoothing polynomial splines 84
  4.1 Definition of polynomial smoothing splines 86
  4.2 A basis for the space of natural splines 89
    4.2.1 Divided differences 89
    4.2.2 The Demmler-Reinsch basis 91
  4.3 Estimation of the smoothing parameter 95
  4.4 A Bayesian approach to spline smoothing 96
    4.4.1 Some Gaussian processes 96
    4.4.2 Wahba's approach 99
    4.4.3 A state space form of Wahba's model 101
  4.5 Summary 104

5 A closer look at polynomial splines 105
  5.1 The signal plus noise model 106
  5.2 Penalised least squares 107
  5.3 Smoothing splines, penalized least squares, signal extraction and state space models 108
  5.4 Smoothing splines and ARIMA models 111
  5.5 Related topics 114
    5.5.1 Smoothing in the presence of heteroskedasticity 114
    5.5.2 Repeated measurements 115
    5.5.3 Semiparametric regression 115
    5.5.4 Change point problems or intervention analysis 116
  5.6 Summary 116

Epilogue 117

References 118

A Stochastic differential equations 124
  A.1 Definition of the stochastic integral 124
  A.2 Stochastic differentials 125
  A.3 Stochastic differential equations, existence and uniqueness of solutions 126
  A.4 Linear stochastic differential equations 127

B Kronecker products 129

C Jordan canonical form of a matrix 131

D The matrix exponential 132

E Data sets 134
List of Tables

2.1 Different trend models 23
3.1 Estimated parameters of chlorine data model for two different modeling situations 73
3.2 Estimated parameters for the L-spline and polynomial spline of order 7 models for melanoma data 79
List of Figures

1.1 A good data analysis paradigm 4
3.1 The time immemorial assumption 63
3.2 Predicted signals with 95% confidence bands of chlorine data using two different models 74
3.3 Predicted innovation residuals with 95% confidence bands for chlorine data using two different models 75
3.4 Predicted signals for Case I and II, superimposed 75
3.5 Predicted signal and pointwise confidence bands joined by straight lines of incidence of melanoma data using an L-spline and a polynomial spline of order 7 80
3.6 Predicted innovation residuals of incidence of melanoma data with error bands for the L-spline and polynomial spline of order 7 models 81
3.7 Predicted signal and pointwise confidence bands joined by straight lines of incidence of melanoma data using a cubic spline and autocorrelation function of the innovation residuals 82
Acknowledgements

I am most grateful to my research supervisor, Dr. Piet de Jong, whose advice and words were always a source of inspiration and guidance. I thank him for his patience and care in transmitting not only the technical aspects of the statistical science but also the epistemological ones. Our extended discussions were most formative and have definitely marked these first steps of my career.

I wish to thank Dr. Nancy Heckman, Dr. John Petkau, Dr. Ruben Zamar, and Dr. Harry Joe for the invaluable advice and help provided throughout my graduate studies.

Thanks to Laurie MacWilliam, Cheryl Garden, Matias Salibian, Birgitte Rønn, and Felix Menden who, as friends and/or colleagues, provided disinterested support, love and care at different stages of my graduate student life.

I would also like to thank Piet de Jong for his financial support through an NSERC grant.

Finally, I want to thank my family. Without their love and trust this would have never been accomplished.
Chapter 1

Introduction

1.1 Motivation

Suppose that pairs of data values (y_i, x_i), i = 1, ..., n, are observed and one wishes to describe the relationship between the components of the pairs. This is called a smoothing problem. The reason for the name is that one would like to describe the relationship between the variables without taking into account too much of the inherent noise which is present in almost all data sets. For example, if the x's are the times (or dates) at which the y's were recorded, the set of pairs (y_i, x_i), i = 1, ..., n, is denoted a time series and the interest lies in describing how y evolves in time. A great deal of effort has been put in by researchers to study time series, and statistical models and inference tools have been devised for this purpose, for example, ARIMA models (see Box, Jenkins and Reinsel, 1994), the more modern scope of State Space models (see for example Anderson and Moore, 1979, Harvey, 1989 and 1993, West and Harrison, 1989) and in particular Structural Time Series models (see Harvey, 1989 and 1993). All these studies on time series can be translated with few modifications to the case where x is not necessarily time but any other equally spaced ordering variable for the y's. In this thesis no assumption
on the spacing of the indexing variable, x, will be made.

Another area of statistics where a lot of effort has been put into studying smoothing problems is nonparametric regression. In this case, the relationship between the variables is assumed to have an unspecified functional form g such that

    y_i = g(x_i) + e_i,    i = 1, ..., n,    (1.1)

where the e_i's are i.i.d. error terms. The main idea behind nonparametric regression is, in contrast to parametric models, not to make any assumptions on the form of g. In parametric models g would be known, at least implicitly, up to some parameters (e.g. polynomials). In non-parametric regression, qualitative assumptions on g, commonly the continuity of some derivative, are made in order to restrict the search for a suitable regression function to a reasonably small set.

There are different approaches to estimating g in a non-parametric context. Perhaps the most popular non-parametric approach is the one based on polynomial splines. The estimation criterion that gives rise to polynomial splines consists of finding a function g that satisfies certain regularity conditions (e.g. a degree of differentiability, square integrability of the derivatives, etc.), fits the data well and has some degree of smoothness (where smoothness or curvature is measured as the integral of the square of some derivative of g). It turns out that the solution, the polynomial spline of order 2m − 1, is a smooth piecewise polynomial of order 2m − 1 (see Chapter 4). It is claimed that because of the segmented nature of piecewise polynomials, polynomial smoothing splines will estimate the regression function with high fidelity and smoothness. It has been realised, though, that polynomial smoothing splines do not always provide satisfactory regression estimates. As a way of generalizing smoothing splines and at the same time covering the cases where smoothing polynomial splines fail to perform appropriately, attention is being paid to the so-called L-splines (see Chapter 3 for a definition). Most of the work done in the area of spline smoothing is not based on the use of stochastic models, but on the use of functional analysis theory. It is the claim of statisticians advocating nonparametric regression that by only supplying qualitative information about the regression function the data will speak for themselves concerning the actual form of the regression curve. As will be shown, parametric and nonparametric techniques do not cover the range of all possible approaches to solving smoothing problems.
From the literature in nonparametric regression (e.g. Eubank, 1988) it seems as though the dichotomy "parametric vs. nonparametric" comprises the universe of regression techniques so far developed. There are other approaches, based on stochastic modeling, which have coexisted most of the time with the deterministic modeling techniques but have been somewhat relegated to equally spaced time series analysis and ignored by researchers working on other regression (or smoothing) problems.
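The penalised least squares criterion sketched above can be illustrated numerically with a discrete analogue, in which the integral of the squared second derivative is replaced by a sum of squared second differences of the fitted values. The following Python sketch is purely illustrative and not part of the thesis's own development; the data and the value of the smoothing parameter lam are made up.

```python
import numpy as np

def discrete_smoother(y, lam):
    """Minimise sum((y - g)^2) + lam * sum of squared second differences of g.

    A discrete analogue of cubic spline smoothing for equally spaced x's:
    the minimiser is g = (I + lam * D'D)^{-1} y, where D is the
    second-difference operator.
    """
    n = len(y)
    D = np.diff(np.eye(n), n=2, axis=0)   # (n-2) x n second-difference matrix
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

# Noisy observations of a smooth signal (made-up data).
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(50)

g_rough = discrete_smoother(y, lam=0.01)    # small penalty: follows the data
g_smooth = discrete_smoother(y, lam=100.0)  # large penalty: nearly a line
```

As lam tends to 0 the fit reproduces the data, and as lam grows the second differences are driven to zero so the fit approaches a straight line, mirroring the role of the smoothing parameter for polynomial splines.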
The most important objective in data analysis is to understand the true law that dictates the behaviour of observed variables. For this purpose, the researcher is usually presented with not much more than observed data and little or almost nonexistent theory inherent to the variables. In this case a sensible paradigm to adopt would be as follows.

i) Recognize important characteristics of the data.

ii) Develop a way to analyse the data, i.e. a methodology for data analysis and inference, so that this adopted methodology can incorporate as much available perception and theoretical background as possible.

iii) After the data have been analysed according to step ii), check that the adopted methodology has captured the features in i) (by means of diagnostic checks) and look for further improvement if necessary.
One can represent the above paradigm schematically as in Figure 1.1.

[Figure 1.1: A good data analysis paradigm. A loop: i) recognise characteristics of data; ii) adopt methodology for analysis and perform data analysis; iii) verify the methodology is appropriate via diagnostic checks, returning to i) if necessary.]
It is important that the choices of methodologies in step ii) are abundant so that the features of the data recognised in step i) can be captured. These methodologies have to be not only abundant but also easy to implement in terms of the feasibility, amount and level of complexity of the computation involved. Also the methodologies of ii) should allow step iii) of the paradigm, so that the loop can continue until a satisfactory description of the data is obtained.

This type of paradigm, or the chart depicted in Figure 1.1, is not new in the literature. Box has presented similar schemes in his works (see for example Box, 1979, 1980, and Box, Jenkins and Reinsel, 1994, p. 17). What is different is that Box does not make explicit step i) of the model building process and starts in step ii). Other authors, inspired by Box's model building charts, have also designed and proposed more complete schemes (see Cook and Weisberg, 1982, p. 7, Wild, 1994, Tong, 1990).

For many regression problems it has been shown in the existing literature that the above paradigm is realizable by adopting in step ii) stochastic models such as state space models (see Chapter 2 for a definition and properties). The adoption of state space models permits the incorporation of observed features of data sets (for example in the form of components, like trends, seasonal and/or cyclic components, etc.) as well as existing background theory on the relationship of the variables (such as influence of covariates and the fulfillment of differential equations) in a specific and transparent way. Estimation within the context of state space models is relatively easy, and convenient algorithms such as the Kalman filter, diffuse Kalman filter and smoothing algorithms are nowadays easily implemented on practically any computer. Diagnostic tools and tests are readily available to perform part iii) of the paradigm.

On the other hand, if one wishes to adopt the above data analysis paradigm and decides to use polynomial spline smoothing as a methodology for data analysis, two problems arise. The first is that it is in no obvious way clear how to incorporate observed features or underlying theory on the behaviour of the data into the data analysis methodology. The second is that, because of the lack of diagnostic tools and the lack of a model, the cycle indicated in Figure 1.1 cannot be completed. As will be shown, polynomial spline smoothing is a rather narrow class of methodologies to adopt for data analysis. The amount and level of complexity of the computations involved in spline methods is not to be overlooked: only for the case of cubic polynomial splines do there exist relatively easy algorithms and methods of computation (see Green and Silverman, 1994).

Examples of articles where it is shown that smoothing by means of splines (either polynomial or L-splines) can be posed as a signal extraction problem using a particular stochastic representation of the unobserved signal or, in other words, as a smoothing problem in the context of a particular state space model, are those by Weinert, Byrd and Sidhu (1980), Wecker and Ansley (1983), Ansley and Kohn (1986), Kohn and Ansley (1987), and Ansley, Kohn and Wong (1993). Weinert, Byrd and Sidhu (1980) studied the relationship between L-splines and stochastic processes and gave a state space model to compute L-splines. Wecker and Ansley (1983) found a state space model for polynomial splines. Ansley and Kohn (1986), Kohn and Ansley (1987), and Ansley, Kohn and Wong (1993), among others, building on Weinert, Byrd and Sidhu's work of 1980, explore further properties and extensions of the relationship between L-splines and stochastic processes.

As Box (1979) points out, it should be recognised that the choice of a methodology for data analysis is not an irrevocable decision, and the investigator need not attempt to allow for all contingencies a priori. A good data analysis methodology should allow for what Box calls a criticism stage that allows understanding of an initial analysis. This should lead the researcher to question the methodology adopted or to look for improvement of the initial analysis.

For all the reasons above, it is proposed in this thesis to adopt the paradigm represented in Figure 1.1 for regression analysis.
1.2 Contributions

In this thesis, the problem of smoothing using nonparametric tools and stochastic models is addressed. The nonparametric tools considered in this thesis are polynomial smoothing splines and L-splines, and the stochastic models are state space models. The relationship between the two approaches, nonparametric and stochastic models, is studied. A comparison of the performance of these methods is done by analysing real data.

In order to carry out the comparison of these two approaches, continuous state space models are studied in detail. First, it is studied how to derive a discrete state space model from a continuous state model. The most difficult part in this derivation is computing the state error covariance matrix. When the stochastic process defined by the state equation of the continuous model is stationary, a result exists that makes the computation of the matrix easy. The problem when the process is nonstationary has so far not been addressed in the literature. This problem has importance since most of the stochastic processes that are associated with smoothing splines (L- and polynomial) are nonstationary. A result, applicable to the nonstationary case, is shown that makes the computation of the state error covariance easier than the direct calculation in many practical cases. Also, it is shown how to initialise the diffuse Kalman filter to carry out estimation and prediction in the context of a discrete model derived from the continuous one. In general it is shown how to translate all the existing estimation and prediction tools for the discrete case to the continuous case.

Smoothing by minimising a penalised least squares criterion is associated with smoothing by using a certain continuous state space model. This state space model, in many cases, is not properly specified in that the initial conditions are always diffuse. By means of the analysis of an example it is shown that proper initial conditions should not be overlooked since they affect the estimation and prediction results to a great extent. It is also shown, in this same example and a second one, that a model based approach to smoothing has advantages. It is shown how different models can be compared and how the paradigm in Figure 1.1 can be realised using state space models.

In order to compare polynomial spline smoothing with a stochastic model based approach to smoothing, the penalised least squares criterion giving rise to polynomial splines is studied in detail. In particular it is shown that the criterion defining the smoothing polynomial spline is equivalent to another penalised least squares criterion. This result generalises a known result for cubic splines. The new penalised least squares criterion makes the properties of smoothing polynomial splines transparent. In particular it is shown that a stochastic model associated with smoothing polynomial splines is an ARIMA plus noise model where the only parameter to be estimated from the data is the noise variance.
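The connection with ARIMA plus noise modeling developed in Chapter 5 rests on the elementary fact that taking m-th differences annihilates any polynomial of degree m − 1. A small numerical check of this fact (illustrative only; the polynomials are made up):

```python
import numpy as np

# A polynomial of degree 1 (a straight line) sampled at equally spaced points.
x = np.arange(10.0)
line = 3.0 + 2.0 * x

# Second differences (m = 2) of a degree-1 polynomial are identically zero,
# which is why differencing twice can absorb a local linear trend.
assert np.allclose(np.diff(line, n=2), 0.0)

# A degree-2 polynomial is not annihilated by second differences...
quad = x ** 2
assert not np.allclose(np.diff(quad, n=2), 0.0)
# ...but it is by third differences (m = 3).
assert np.allclose(np.diff(quad, n=3), 0.0)
```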
1.3 Outline

The rest of this thesis is organized as follows.

Chapter 2 is a review of state space models, where the main ideas on state space modeling, definitions, estimation tools and results are presented. This chapter should convey how these stochastic local models can be conceived and how to carry out estimation and prediction in the state space model context. This chapter is also fundamental for the development of Chapter 3.

In Chapter 3, estimation within the continuous state space model is addressed. New results on the implementation of estimation and prediction tools (diffuse Kalman filter and diffuse smoothing) are presented. In particular, initialisation issues of the diffuse Kalman filter are studied. Also, L-splines, a nonparametric smoothing tool, are briefly reviewed. The focus is to view L-splines in the context of continuous state space models. Examples are given to illustrate how nonparametric regression methods can be easily implemented when using a model based approach. The advantages of the model based approach to nonparametric smoothing are illustrated with examples. It is shown how the exploration of a model, rather than the construction and minimisation of an objective function, can be a fruitful endeavour.

In Chapter 4, a review of polynomial smoothing splines is given. Mainly, a basis for the space of smoothing polynomial splines (of any pre-fixed order) is studied so as to extract and make more transparent certain properties of polynomial smoothing splines. It is shown that polynomial smoothing splines can be obtained by minimising a certain penalised least squares criterion, generalising a well known result so far proved for cubic splines only.

In Chapter 5 polynomial splines are further explored and a connection between polynomial splines, solutions to penalised least squares problems and ARIMA processes is shown. In particular it is shown that polynomial spline smoothing of order 2m − 1 is equivalent to ARIMA(0, m, m − 1) plus noise modeling where all the MA parameters are fixed. This result may shed some more light on the understanding of the polynomial spline smoothing tool.
Chapter 2

The State space model

A general class of models of much current interest is that of state space models, also referred to as SSM from now on. State space models were originally developed by control engineers, particularly for applications concerning navigation systems such as controlling the position of a space rocket. Currently, and especially due to the contributions of Anderson and Moore (1979), Harvey (e.g. 1989), West and Harrison (1989), and de Jong (e.g. 1987, 1988, 1991), among others, SSM have been found useful in time series problems. It is the purpose of this thesis to show that SSM can be useful in general regression problems as well.

Suppose that pairs of data values (y_i, x_i), i = 1, ..., n, are observed. The smoothing or regression problem consists of describing the relationship between the components of the pairs. By this notation, the variable x_i plays the role of an indexing variable. That is, the y's are ordered according to the x's and x_1 < ... < x_n. If x is time, then the data are denoted a time series and the interest lies in describing the evolution of y in time. If y is energy consumption and x is temperature then the interest lies in the study of the changes of energy consumption as the temperature changes. If y is the height of a child and x its age in months then the interest is to study how the child grows in height as it grows older. The indexing variable plays the role of imparting structure to the data; it defines contiguity of the y's. Think, for example, of the y's being energy consumption of a household. Ordering the y's in the (time) sequence they were observed usually gives a completely different structure to the data than if the y's had been ordered by a temperature indexing variable.

When the researcher measures any sort of (unobservable) signal, it will typically be disturbed by noise, so that the actual observation is given by

    observation = signal + noise.
In state space models the signal is taken to be a linear combination of a set of variables, which constitute what is called the state vector. The state vector at x_i denotes the so called "state of nature" at x_i. Note that the state vector is unobservable and has to be predicted from the data.

Often the state has a physical meaning. For example, the progress of a spacecraft can be monitored using a SSM with a state whose components describe the velocity, acceleration, coordinates, and rate of fuel consumption of the spacecraft. The NASA Space Program utilises SSM to control its spacecraft. In many economic contexts the state may represent constructs such as trends or seasonalities. For example, Crafts et al. (1989) employ the SSM to predict the historical unobserved trend and cycle components of the British industrial production index. In other situations, where physical information is not available, the state can represent an unobserved and/or unknown process driving the observed system. For example Watson and Engle (1983) use the SSM to estimate unobserved wage rates. Other applications of the SSM include a study in inventory control (Downing et al., 1980) and one in statistical quality control (Phadke, 1981). In the area of policy, Harvey and Durbin (1986) conducted an interesting study of the impact of seat belt legislation on road casualties.

In parametric regression the state vector stays constant as the indexing variable, x, varies, and the constants that enter in the linear combination of state variables change. In both parametric and nonparametric regression, the signal is assumed to be deterministic. In parametric regression the functional form of the signal is assumed to be completely known up to some parameters. In nonparametric regression the signal is not assumed to have a specific form but to have some general qualities, such as a certain degree of differentiability. It is usual to read, especially in the nonparametric literature, that parametric vs. nonparametric regression constitute an ambivalence that covers in great extension the universe of alternatives for the solution of regression problems. As will be seen throughout this chapter and the following ones, there exist other options for solving regression problems.

A great deal of effort has been put into studying signal plus noise models where the signal is stochastic, that is, the signal is assumed to be a stochastic process rather than a deterministic function. Hence the way that the signal at x_{i+1} is related to the signal at x_i is established in a probabilistic rather than in a functional manner. The study of the signal as a stochastic process has yielded interesting results that allow insightful solutions of the problem of estimating the unknown signal. Perhaps because these studies were motivated by econometric applications, they were relegated to the study of equally spaced time series.
Typically, and this is a strength of state space models, the state vector changes as the indexing variable, x, changes (whereas the constants that enter in the linear combination of the components of the state vector to give the signal are usually fixed). This is because the state is modeled as a stochastic process on x that suits, or is thought to suit, the data well. Hence, SSM are stochastic local models for the data, where local refers to the adjacency of the data dictated by the ordering variable, x. As a consequence, SSM are robust in the sense that isolated departures from the bulk of data will have a limited influence on the global results of estimation. In this way, smooth estimates of the signal that produce a good local representation of the data can be obtained.

As an example, suppose that the ordering variable, x, is equally spaced, with x_{i+1} − x_i = 1, and that it is believed that the relationship between y and x is linear. Then, y_i = a + b x_i, i = 1, ..., n, for some a, b in R. This is equivalent to saying that y_{i+1} = y_i + b, with starting condition y_0 = a. If one wants a more flexible model than the initial linear model for the relationship between y and x, then, starting from the linear model, one could add a disturbance, e, and postulate that y_i = a + b x_i + e_i. Or, one could add a disturbance, η, to the alternative formulation and say that y_{i+1} = y_i + b + η_i. These two new "more flexible" formulations are very different. The first one, y_i = a + b x_i + e_i, is the usual linear regression paradigm, whose drawbacks, for example lack of flexibility or lack of robustness, are well known. The second one, y_{i+1} = y_i + b + η_i, is a state space description that produces a local linear description of the data; it represents local characteristics of the data. Therefore, an initial statement of a model and the placement of disturbances is crucial in making this initial model more flexible. The construction of SSM is concerned with identifying possible components (the state) of an initial global formulation of the signal and the placement of random disturbances in the components in order to obtain a local description.
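For the local formulation y_{i+1} = y_i + b + η_i, observed with additional noise, the level can be predicted recursively with a Kalman filter (treated formally in Section 2.3). The following Python sketch is only an illustration of the idea, not one of the thesis's algorithms; the model y_i = μ_i + e_i, μ_{i+1} = μ_i + b + η_i, the variances, and the simulated data are all assumptions made up for the example.

```python
import numpy as np

def kalman_filter_rw_drift(y, b, var_e, var_eta, m0, P0):
    """Kalman filter for y_i = mu_i + e_i, mu_{i+1} = mu_i + b + eta_i.

    Returns the filtered means E(mu_i | y_1, ..., y_i) and their variances.
    """
    m, P = m0, P0
    filtered, variances = [], []
    for obs in y:
        # Update: combine the one-step prediction with the new observation.
        K = P / (P + var_e)        # Kalman gain
        m = m + K * (obs - m)      # filtered mean
        P = (1 - K) * P            # filtered variance
        filtered.append(m)
        variances.append(P)
        # Predict: propagate the state through the transition equation.
        m = m + b
        P = P + var_eta
    return np.array(filtered), np.array(variances)

# Simulate data from the model (made-up parameter values).
rng = np.random.default_rng(1)
n, b = 100, 0.5
mu = np.cumsum(np.full(n, b) + 0.1 * rng.standard_normal(n))  # local level
y = mu + 0.5 * rng.standard_normal(n)                         # noisy obs

m_filt, P_filt = kalman_filter_rw_drift(y, b, var_e=0.25, var_eta=0.01,
                                        m0=0.0, P0=10.0)
```

Because the gain K is bounded, each observation moves the level estimate only part of the way, which is one way of seeing the robustness of local models to isolated departures from the bulk of the data.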
In what follows, a review of discrete state space models is presented. Discrete state space models have been thoroughly studied, especially for use in econometric applications, e.g. time series. As will be seen, these models can be used in most regression situations as well. The formulation and treatment of discrete SSM involves keeping track of the particular spacing of the indexing or ordering variable, x. The next chapter introduces continuous state space models, which are formulated independently of any particular spacing of x's. Once the data are collected, a discrete model is derived from the continuous one in order to suit the data. Most of the results derived for the discrete case will apply fairly directly.

In light of the above paragraph, the review of results on discrete SSM that follows is essential to the development of Chapter 3. In Section 2.1 a formal definition of SSM is given. Section 2.2 shows some examples and explores in them the usefulness of the formulation of SSM in certain situations. The rest of the chapter deals with technical details about the implementation of the prediction and estimation tools used with the SSM. Section 2.3 deals with the criteria that are used for estimation in the context of SSM and with the Kalman and smoothing filters, which are the tools used to implement the estimation. Sections 2.4 and 2.5 deal with diffuse SSM, and with SSM which have applied since time immemorial. These models are useful generalisations of the original SSM of Section 2.1. Section 2.6 shows how generalised cross-validation ideas can be implemented in the context of SSM if a distribution free method for estimation is preferred.
2.1 Definition of the state space model

State space models offer the possibility of constructing useful local models to solve smoothing problems. The advantage of explicit stochastic local models is that they are more flexible than parametric regression models. In SSM the data play an important role in determining the local changes. This also brings the advantage of having models that are robust to local departures from the bulk of the data. Also, important observed features of the data, such as trends or cycles, may be incorporated in the model. In this section the following notation is used:

• If a is a random vector then Var(a) denotes its variance-covariance matrix.

• γ ~ (c, σ²C) indicates that γ is a random vector with E(γ) = c and Var(γ) = σ²C.

... then we allow the slope of the trend to vary in each of the intervals determined by the x's. So, one can appreciate how, starting from an idealised and global deterministic signal and by adding stochastic disturbances appropriately, one can construct a flexible stochastic model which will focus on local characteristics of the data.
Adding disturbances to components is more effective than just adding a single disturbance to the deterministic signal. Note how a completely deterministic treatment of the signal, as in the previous section in the case of parametric regression, does not suggest other possible ways of making the model more flexible, except for turning a parametric signal into a more unspecified one.

For the seasonal component, one may assume that it is approximately a trigonometric function of x. If it were perfectly trigonometric then it would be of the form

    γ(x) = a sin(ωx) + b cos(ωx)

for some a, b and ω, where ω is a frequency corresponding to period p, that is, ω = 2π/p. This ensures γ(x) = γ(x + p). Then, defining γ*(x) = a cos(ωx) − b sin(ωx), one has that

    γ(x_i + h_i) = a sin(ω(x_i + h_i)) + b cos(ω(x_i + h_i))
                 = [a sin(ωx_i) + b cos(ωx_i)] cos(ωh_i) + [a cos(ωx_i) − b sin(ωx_i)] sin(ωh_i)

and

    γ*(x_i + h_i) = a cos(ω(x_i + h_i)) − b sin(ω(x_i + h_i))
                  = [a cos(ωx_i) − b sin(ωx_i)] cos(ωh_i) − [a sin(ωx_i) + b cos(ωx_i)] sin(ωh_i).

Letting γ_i = γ(x_i) and γ*_i = γ*(x_i), one has that

    ( γ_{i+1}  )   (  cos(ωh_i)   sin(ωh_i) ) ( γ_i  )
    ( γ*_{i+1} ) = ( −sin(ωh_i)   cos(ωh_i) ) ( γ*_i )                    (2.6)

Since in real life one is unlikely to observe an exactly trigonometric seasonal signal, one may add disturbances in (2.6) so that

    ( γ_{i+1}  )   (  cos(ωh_i)   sin(ωh_i) ) ( γ_i  )   ( κ_i  )
    ( γ*_{i+1} ) = ( −sin(ωh_i)   cos(ωh_i) ) ( γ*_i ) + ( κ*_i )

where κ_i and κ*_i are (0, k_i) and (0, k*_i) variables, respectively, whose variances may depend on h_i. In this way, once one has a dynamic linear representation for an idealised signal, one may construct a more flexible model by adding disturbances appropriately.
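The recursion (2.6) can be checked numerically. The following Python sketch (with illustrative values for a, b and the period p) propagates the pair (γ_i, γ*_i) through the rotation matrices at unequally spaced points and confirms that, with no disturbances added, it reproduces the trigonometric signal exactly.

```python
import numpy as np

# Illustrative parameters: period p and coefficients a, b.
p, a, b = 12.0, 1.5, -0.7
w = 2 * np.pi / p

# Unequally spaced ordering variable x_1, ..., x_n.
x = np.cumsum(np.random.default_rng(0).uniform(0.2, 2.0, size=50))
h = np.diff(x)

# Initialise (gamma, gamma*) at x_1 and propagate with the rotation (2.6).
g = np.array([a * np.sin(w * x[0]) + b * np.cos(w * x[0]),
              a * np.cos(w * x[0]) - b * np.sin(w * x[0])])
gammas = [g[0]]
for hi in h:
    T = np.array([[np.cos(w * hi),  np.sin(w * hi)],
                  [-np.sin(w * hi), np.cos(w * hi)]])
    g = T @ g
    gammas.append(g[0])

# Without disturbances the recursion matches the trigonometric signal.
direct = a * np.sin(w * x) + b * np.cos(w * x)
assert np.allclose(gammas, direct)
```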
Following are some of the most common structural models used in practice.
Pure local linear trend model

In this case we have a model where only a trend is observed. This trend is assumed to be locally linear with added disturbances. Expressed in state space form, the model is

    y_i = μ_i + ε_i
    μ_{i+1} = μ_i + h_i β_i + η_i                                         (2.8)
    β_{i+1} = β_i + ζ_i

where the ε_i's are independent of the η_i's and ζ_i's, the ε_i's are i.i.d. (0, σ²), and η_i and ζ_i are (0, σ²_{η,i}) and (0, σ²_{ζ,i}) variables whose variances may depend on h_i. Special cases of the model are obtained by restricting the disturbance variances:

    σ²_η    σ²_ζ            Resulting trend
    > 0     = 0             trend with fixed slope
    = 0     > 0             smooth trend
    = 0     = 0             deterministic linear trend
    > 0     (β_i ≡ 0)       stochastic (or local) level

Table 2.1: Different trend models.
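A minimal simulation sketch of model (2.8) illustrates the special cases in Table 2.1. The √h_i scaling of the disturbances below is one possible modelling choice, not prescribed by the text. With all disturbance variances set to zero, the simulated trend reduces to a deterministic straight line, as the table indicates.

```python
import numpy as np

def simulate_trend(x, mu0, beta0, sd_eta, sd_zeta, sd_eps, rng):
    """Simulate the local linear trend model (2.8) at ordered points x.

    State: level mu and slope beta; spacing h_i = x_{i+1} - x_i.
    The sd * sqrt(h_i) disturbance scaling is an illustrative,
    Brownian-motion-like choice.
    """
    n = len(x)
    mu, beta = mu0, beta0
    y = np.empty(n)
    for i in range(n):
        y[i] = mu + sd_eps * rng.standard_normal()
        if i < n - 1:
            h = x[i + 1] - x[i]
            mu = mu + h * beta + sd_eta * np.sqrt(h) * rng.standard_normal()
            beta = beta + sd_zeta * np.sqrt(h) * rng.standard_normal()
    return y

rng = np.random.default_rng(1)
x = np.cumsum(rng.uniform(0.5, 1.5, size=30))
# Deterministic linear case of Table 2.1: all disturbance variances zero.
y = simulate_trend(x, mu0=2.0, beta0=0.5, sd_eta=0.0, sd_zeta=0.0, sd_eps=0.0, rng=rng)
assert np.allclose(y, 2.0 + 0.5 * (x - x[0]))
```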
Autoregressive component plus noise model

In (2.5), let μ_i = 0 for all i and, moreover, assume that β_i = ρβ_{i−1} + ξ_i, 0 < ρ ≤ 1. The signal then follows an autoregressive process of order 1. If ρ = 1, the signal follows a random walk process. The corresponding state space formulation of the model is

    y_i = α_i + ε_i
    α_{i+1} = ρα_i + ξ_i

where the ε_i's and the ξ_i's are independent, the ε_i's are i.i.d. (0, σ²), and the ξ_i's are (0, σ²_{ξ,i}), where σ²_{ξ,i} depends on h_i. This model is frequently used in the absence of physical information about the signal.
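The model can be sketched in a few lines of Python (equal spacing and illustrative parameter values are assumed here). Switching all disturbances off exposes the geometric decay of the AR(1) signal.

```python
import numpy as np

def simulate_ar1_plus_noise(n, rho, alpha0, sd_eps, sd_xi, seed=0):
    """Simulate y_i = alpha_i + eps_i with alpha_{i+1} = rho * alpha_i + xi_i.

    Equal spacing is assumed for simplicity; with unequal spacing the
    variance of xi_i would depend on h_i, as noted in the text.
    """
    rng = np.random.default_rng(seed)
    alpha = alpha0
    y = np.empty(n)
    for i in range(n):
        y[i] = alpha + sd_eps * rng.standard_normal()
        alpha = rho * alpha + sd_xi * rng.standard_normal()
    return y

# With all disturbances switched off the signal decays geometrically.
y = simulate_ar1_plus_noise(5, rho=0.5, alpha0=8.0, sd_eps=0.0, sd_xi=0.0)
assert np.allclose(y, 8.0 * 0.5 ** np.arange(5))
```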
Trigonometric cycle or seasonality model

Since seasonality is a periodic phenomenon, it is natural to model it using trigonometric functions. The corresponding state space formulation is

    y_i = (1, 0) α_i + ε_i

               (  cos(ωh_i)   sin(ωh_i) )        ( κ_i  )
    α_{i+1} = ρ( −sin(ωh_i)   cos(ωh_i) ) α_i + ( κ*_i )

where κ_i and κ*_i are uncorrelated white noise disturbances, usually assumed to have a common variance σ²_{κ,i}, with σ²_{κ,i} depending on h_i, and ρ is a "damping factor" which lies in the range 0 < ρ < 1. There are further variations of these models; only some of them have been illustrated here.
2.2.3 ARIMA models as state space models
A complete reference for ARIMA models is Box, Jenkins and Reinsel (1994). ARIMA models are based on the idea that a time series y_i, i = 1, 2, 3, …, in which successive values are highly dependent, can frequently be regarded as generated from a series of independent (0, σ²) "shocks" or disturbances u_i, u_{i−1}, u_{i−2}, …. Such a process u_i is called a discrete white noise process. It is assumed that some sort of "linear filter" transforms the white noise process into the stochastic process of interest, yielding

    y_i = μ + u_i + ψ₁u_{i−1} + ψ₂u_{i−2} + ⋯

where μ represents the level of the process. Attention is focused on modeling the signal as a stationary process, an assumption which is crucial for estimation and modeling purposes. The above idea is conceptually based on a result by Wold (1938), who established that any zero-mean purely non-deterministic stationary process y_i possesses a linear representation y_i = u_i + Σ_{j=1}^∞ ψ_j u_{i−j}, with the u_i's being (0, σ²) and uncorrelated, and Σ_{j=1}^∞ ψ_j² < ∞.
In this sense, the models that are studied are moving average (MA) models, where the process y_i is assumed to be a finite linear combination of i.i.d. disturbances, and autoregressive (AR) models, where y_i is assumed to be a finite linear combination of past values of y_i plus a disturbance. Back substitution in the past values of y_i allows an AR process to be expressed as an infinite linear combination of i.i.d. disturbances. To achieve greater flexibility in fitting an actual time series, y_i can also be assumed to be the sum of an autoregressive process and a moving average process, or in the Box and Jenkins terminology, an ARMA process.

Box and Jenkins recognise the importance of having at hand some nonstationary models, in addition to stationary ARMA models, since some real life processes in industry, business, etc., are better represented as such. In particular they consider a special type of nonstationarity: they assume that if the process is nonstationary then, after differencing it a few times, it will have a stationary ARMA representation. These types of nonstationary models are called autoregressive integrated moving average models, or ARIMA for short. In using ARIMA models, it is assumed that the values of the indexing variable, x, are equally spaced.
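The effect of differencing is easy to illustrate: a deterministic quadratic trend, which is clearly nonstationary in level, is reduced to a constant by two applications of (1 − B). The short sketch below uses np.diff for the differencing.

```python
import numpy as np

i = np.arange(20, dtype=float)
y = 3.0 + 2.0 * i + i ** 2   # deterministic quadratic trend (nonstationary level)

d1 = np.diff(y)              # (1 - B) y_i = y_i - y_{i-1}, still trending
d2 = np.diff(y, n=2)         # (1 - B)^2 y_i

# Two rounds of differencing reduce the quadratic trend to a constant,
# so a stationary disturbance added to y would dominate d2.
assert np.allclose(d2, 2.0)
```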
Also, let us define the backward shift operator B by B^j y_i = y_{i−j}. This operator is used extensively in the ARIMA literature. A formal definition of the processes mentioned above follows.
Definition 2.2 The process y_i is said to follow an ARMA(p, q) model if

    y_i = δ + a₁y_{i−1} + a₂y_{i−2} + ⋯ + a_p y_{i−p} + e_i + b₁e_{i−1} + ⋯ + b_q e_{i−q},

where e_i ~ (0, σ²), the values y₀, y₋₁, … are unobserved and may be either fixed or random, and a_p ≠ 0, b_q ≠ 0.

The process y_i is said to follow an ARIMA(p, d, q) model if (1 − B)^d y_i follows an ARMA(p, q) model.
As a consequence of the definition, if y_i follows an ARIMA(p, d, q) model it also follows an ARMA(p + d, q) (not necessarily stationary) model. It turns out that ARMA models can be cast into SSM form, as the following result shows.
Theorem 2.1 (Gardner et al., 1980.) Suppose that y_i follows an ARMA(p, q) model. Put c = max(p, q + 1) and define

    T = { (a₁; …; a_c), (I; 0) },   H = (1; b₁; …; b_{c−1})   and   W = (1; 0; …; 0),

that is, the first column of T is (a₁; …; a_c) and its remaining columns consist of the identity matrix I_{c−1} stacked over a row of zeros, where the a_j or b_j for, respectively, j > p and j > q are defined as zero. Then

    y_i = (1, 0, …, 0) α_i
    α_{i+1} = Wδ + Tα_i + Hu_i

where u_i ~ (0, σ²).
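The companion form of Theorem 2.1 can be verified numerically. The sketch below (with δ = 0 and illustrative coefficients) simulates an ARMA(1,1) through the state recursion and checks that the first state component satisfies the ARMA difference equation; note that with this timing convention the shock paired with y_{i+1} is u_i.

```python
import numpy as np

# ARMA(1,1) in the companion state space form of Theorem 2.1 (delta = 0):
# c = max(p, q + 1) = 2, T = [[a1, 1], [a2, 0]] with a2 = 0, H = (1, b1)'.
a1, b1 = 0.6, 0.3
T = np.array([[a1, 1.0], [0.0, 0.0]])
H = np.array([1.0, b1])

rng = np.random.default_rng(2)
n = 200
u = rng.standard_normal(n)

alpha = np.zeros(2)                 # start from a zero state for this check
y = np.empty(n)
for i in range(n):
    y[i] = alpha[0]                 # y_i = (1, 0) alpha_i
    alpha = T @ alpha + H * u[i]    # alpha_{i+1} = T alpha_i + H u_i

# The first state component satisfies the ARMA(1,1) recursion
# y_{i+1} = a1 * y_i + u_i + b1 * u_{i-1}:
assert np.allclose(y[2:], a1 * y[1:-1] + u[1:-1] + b1 * u[:-2])
```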
Example 2.1 (ARIMA(0,2,1) models as SSM.) Let y_i be generated by an ARIMA(0, 2, 1) process. Then

    y_i − 2y_{i−1} + y_{i−2} = u_i + b₁u_{i−1},

where the u_i's are independent and (0, σ²).
• Prediction: given (y₁, x₁), …, (y_n, x_n), it is desired to obtain a prediction of y_{n+k}, k > 0.

• Interpolation: given (y₁, x₁), …, (y_{i−1}, x_{i−1}), (y_{i+1}, x_{i+1}), …, (y_n, x_n), one wants to predict y_i at x_i.
• Smoothing: if the observations are generated by a SSM, a prediction of the unobserved state α_i at x_i is desired. Another smoothing problem is the signal extraction problem, where a prediction of the signal X_i b + Z_i α_i at x_i is desired. Also, an estimate of the measurement and/or state error components is sometimes desired.
Frequently the SSM contains unknown parameters, for example some disturbance variances and covariances. Prediction and smoothing can only be carried out once these parameters have been estimated. Ways to proceed with the estimation of these parameters include maximum likelihood and cross-validation. It will be shown that the same tool that is used for prediction can be used to obtain an expression for the likelihood of the y's and/or an expression for the cross-validation criterion. Maximum likelihood is to be used when the disturbances, the u_i's, are assumed to be normally distributed. If the normality of the disturbances is in doubt, then a distribution-free method like cross-validation can be used.

The prediction criterion to be used is the mean square error (Mse), and the estimators that will be considered are a constant plus a linear combination of the observations. Henceforth, "predictors" will denote best linear predictors, where "best" is relative to minimisation of the Mse. Best linear prediction and minimum mean squared error are the key words in the implementation of the prediction procedures in the context of the SSM. The tools to be used are the Kalman and smoothing filters. These algorithms permit the incorporation of information sequentially, which makes prediction and estimation methods easy to implement. In prediction procedures where all the data are incorporated at once, as in classical linear regression, quite cumbersome operations (for example, the inversion of a large matrix) may have to be performed.
In this section, definitions and technical aspects of inference in the context of the SSM, as per Definition 2.1, will be given. The proofs of the results will be omitted; they can be found in textbooks like Anderson and Moore (1979).

First, in Subsection 2.3.1 the concept and some properties of best linear prediction are formally stated. Subsections 2.3.2 and 2.3.3 contain some important properties of best linear predictors in terms of combining information. This is the preparation and basic material for the Kalman filter algorithm, which is explained in Subsection 2.3.4. The Kalman filter provides predictions of the state at x_i given all the information collected at x₁, …, x_{i−1}. The smoothing filter gives predictions of the state given all the available information (fixed interval smoothing), a prediction of the state at a particular point x_i as the sample size increases (fixed point smoothing), and/or a prediction of the signal at a point x_i where no observation of y was made (interpolation). These smoothing situations are reviewed in Subsection 2.3.5.
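A minimal Kalman filter sketch for the local level model (a special case of the trend models above; the notation and initialisation here are illustrative, not the thesis's) shows how information is incorporated one observation at a time, with no large matrix inversion required.

```python
import numpy as np

def kalman_local_level(y, var_eps, var_eta, a0=0.0, p0=1e7):
    """Minimal Kalman filter for the local level model
       y_i = mu_i + eps_i,  mu_{i+1} = mu_i + eta_i.

    Returns the filtered estimates of mu_i, updated sequentially.
    A large p0 stands in for a diffuse prior on the initial level.
    """
    a, p = a0, p0
    filtered = []
    for yi in y:
        # Update: combine the current prediction with the new observation.
        f = p + var_eps              # prediction error variance
        k = p / f                    # Kalman gain
        a = a + k * (yi - a)
        p = p * (1 - k)
        filtered.append(a)
        # Predict the next state (random walk: the mean is unchanged).
        p = p + var_eta
    return np.array(filtered)

# With a constant series and no state noise the filtered level
# settles on that constant.
est = kalman_local_level(np.full(100, 5.0), var_eps=1.0, var_eta=0.0)
assert abs(est[-1] - 5.0) < 1e-4
```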
2.3.1 Minimum mean square error prediction
This subsection contains the definitions and some of the properties of best linear predictors, the type of estimators that will be considered for estimation within the SSM.

Suppose that X and Y are random vectors such that E(X) = μ_x, E(Y) = μ_y, Var(X) = σ²Σ_x, Var(Y) = σ²Σ_y and Cov(X, Y) = σ²Σ_xy. The (best linear) predictor of X given Y is P(X|Y) = c + b′Y, where c and b are such that E(X − c − b′Y)² is minimum.
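The minimisation has the familiar closed form b′ = Σ_xy Σ_y⁻¹ and c = μ_x − b′μ_y, which the following sketch computes for illustrative moment matrices, verifying the defining orthogonality property that the prediction error is uncorrelated with Y.

```python
import numpy as np

# Illustrative moments (not from the text): 1-dimensional X, 2-dimensional Y.
mu_x = np.array([1.0])
mu_y = np.array([0.5, -0.2])
S_y = np.array([[2.0, 0.3],
                [0.3, 1.0]])     # Var(Y) up to the common sigma^2 factor
S_xy = np.array([[0.8, 0.1]])    # Cov(X, Y) up to the same factor

# Minimising E(X - c - b'Y)^2 gives b' = S_xy S_y^{-1}, c = mu_x - b' mu_y.
B = S_xy @ np.linalg.inv(S_y)
c = mu_x - B @ mu_y

# The prediction error X - c - b'Y is uncorrelated with Y:
# Cov(X - b'Y, Y) = S_xy - b' S_y = 0.
assert np.allclose(S_xy - B @ S_y, 0.0)
```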
2.4.2 Diffuse prediction
The DKF can be used to compute diffuse predictors of α_i and y_i in the context of the SSM. These are predictors constructed under the assumption that γ ~ (c, σ²C) with C → ∞. If S_i is nonsingular then, as C → ∞, γ̂ → −S_i⁻¹s_i and Mse(γ̂) → σ²S_i⁻¹, for i = 1, …, n + 1.

If the rows of A_i are in the row space of S_i, then as C → ∞, α̂_i → A_i(−S_i⁻¹s_i; 1) and Mse(α̂_i) → σ²(P_i + A_i S_i⁻¹ A_i′).

If the rows of X_i(−B, b) + Z_i A_i are in the row space of S_i, then as C → ∞, ŷ_i → (X_i(−B, b) + Z_i A_i)(−S_i⁻¹s_i; 1) and Mse(ŷ_i) → σ²(D_i + E_i S_i⁻¹ E_i′).
2.4.3 Diffuse smoothing

Smoothing refers to predicting the state vector α_i, i = 1, …, n, using the entire observation vector y, where y