1. Time Series data
Reference: Introductory Time Series with R, Paul S.P. Cowpertwait and Andrew V. Metcalfe, 2009.
1.1. Purpose
1. Time series are analysed to understand the past and to predict the future.
2. Examples: Kyoto Protocol; Singapore Airlines.
3. Time series methods are used in everyday operational decisions, e.g., purchasing oil or gas, or trading goods in futures markets. Time series models often form the basis of computer simulations.
1.2. Time series
1. Examples include interest rates, exchange rates, stock prices, stock indices, consumption, GDP, and economic growth rates.
2. When a variable is measured sequentially in time over or at a fixed interval, known as the sampling interval, the resulting data form a time series. Symbolically, $X_t(\omega)$, $\omega \in \Omega$, where $X_t$, $t = 1, \ldots, T$, are random variables.
3. Time series vs. (discrete-time) stochastic processes.
4. A time series $y_t$ can be decomposed into three parts:
   $y_t = \text{trend} + \text{cycle} + \text{randomness}$
   The main features of many time series are trends and seasonal variations, so we can model $y_t$ as a function of time, $y_t = f(t)$. In most time series, the random components of $y_t$ are correlated, i.e., serially dependent.
5. Aggregation often causes serial dependence in aggregated data, such as macroeconomic variables. Impacts in financial markets often appear in clusters, so financial time series are often autocorrelated and time-varying.
1.3. R language
1. Write a function that computes the factorial n! (p. 3). t(): matrix transpose (an r × c matrix can be defined with matrix(nrow = r, ncol = c)). solve(A): the inverse of a square matrix A.
2. The multiple regression model: $\hat{\beta} = (X'X)^{-1} X'Y$.
1.4. Plots, trends, and seasonal variation
1. Type the following commands in R, and check your results against the output shown here. To save on typing, the data are assigned to a variable called AP.
> data(AirPassengers)
> AP <- AirPassengers
> AP
2. The key thing to bear in mind is that generic functions in R, such as plot or summary, will attempt to give the most appropriate output to any given input object; try typing summary(AP) now to see what happens.
> summary(AP)
> plot(AP, xlab = "...", ylab = "...")
Try also:
> class(AP); start(AP); end(AP); frequency(AP)
3. In general, a systematic change in a time series that does not appear to be periodic is known as a trend. A repeating pattern within each year is known as seasonal variation. Random, or stochastic, trends are common in economic and financial time series. A regression model would not be appropriate for a stochastic trend. Why?
4. To get a clearer view of the trend, the seasonal effect can be removed by aggregating the data to the annual level, which can be achieved in R using the aggregate function.
> layout(1:2)
> plot(aggregate(AP))                    # note: nfrequency = 1, 4, or 12; FUN = mean or sum
> plot(aggregate(AP, nfrequency = 4))
> plot(aggregate(AP, nfrequency = 12))   # i.e., plot(AP)
To see the seasonal variation, we can use
> boxplot(AP ~ cycle(AP))                # monthly variation
> boxplot(aggregate(AP, nfrequency = 4) ~ cycle(aggregate(AP, nfrequency = 4)))   # quarterly variation
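As a rough illustration of the aggregation and cycle ideas above, the following sketch uses only the built-in AirPassengers data. The variable names (annual_total, annual_mean, monthly_mean) are illustrative and do not come from the text.

```r
# Annual aggregation and month-by-month averages for AirPassengers.
AP <- AirPassengers

# Annual totals (FUN = sum is the default) and annual means.
annual_total <- aggregate(AP, nfrequency = 1, FUN = sum)
annual_mean  <- aggregate(AP, nfrequency = 1, FUN = mean)

# cycle() labels each observation with its month (1 = Jan, ..., 12 = Dec),
# so averaging within each label gives a numeric view of the boxplot.
month_index  <- as.integer(cycle(AP))
monthly_mean <- tapply(as.numeric(AP), month_index, mean)

print(annual_total)
print(round(monthly_mean, 1))
```

Dividing the annual totals by 12 gives the annual means, which is the same idea used for the Maine unemployment series in the next section.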
1.5. Unemployment: Maine
1. > www <- "http://www.massey.ac.nz/~pscowper/ts/Maine.dat"
> Maine.month <- read.table(www, header = TRUE)
> attach(Maine.month)   # makes the variable unemploy (named in the header row) available
> class(Maine.month)    # i.e., it is a data.frame
2. The ts function is used to convert the data to a time series object.
3. > Maine.month.ts <- ts(unemploy, start = c(1996, 1), freq = 12)
Note that the sample size is 128. If you set freq = 4, R treats the observations as quarterly data.
4. If we wish to analyse trends in the unemployment rate, annual data will suffice. The average (mean) over the twelve months of each year is another example of aggregated data, but this time we divide by 12 to give a mean annual rate.
> Maine.annual.ts <- aggregate(Maine.month.ts)/12
Note: the aggregate function ignores any year for which the data are incomplete.
5. To get plots, we can type
> layout(1:2)
> plot(Maine.month.ts, ylab = "unemployed")
> plot(Maine.annual.ts, ylab = "unemployed")
6. We can calculate the precise percentages (how far February and August unemployment deviate from the overall mean) in R, using window. This function extracts the part of the time series between specified start and end points, and samples with an interval equal to frequency if its argument is set to TRUE. So, the first line below gives a time series of February figures.
> Maine.Feb <- window(Maine.month.ts, start = c(1996, 2), freq = TRUE)
> Maine.Aug <- window(Maine.month.ts, start = c(1996, 8), freq = TRUE)
> Feb.ratio <- mean(Maine.Feb) / mean(Maine.month.ts)
> Aug.ratio <- mean(Maine.Aug) / mean(Maine.month.ts)
To see the ratios, you can use
> Feb.ratio
> Aug.ratio
7. On average, unemployment is 22% higher in February and 18% lower in August. An explanation is that Maine attracts tourists during the summer, and this creates more jobs. Also, the period before Christmas and over the New Year's holiday tends to have higher employment rates than the first few months of the new year.
1.6. Multiple time series: Electricity, beer and chocolate data
1. Data for the three time series, from Jan. 1958 to the most recent date available, can be found at www.abs.gov.au.
2. > www <- "http://www.massey.ac.nz/~pscowper/ts/cbe.dat"
> CBE <- read.table(www, header = TRUE)
3. You can use attach(CBE) to make the variables' names available.
4. Check the following:
> plot(as.vector(beer))
> plot(as.ts(choc))
> plot(beer, choc)
> abline(lm(choc ~ beer))   # response is the y-axis variable, so the line matches the plot
5. If you omit end, R uses the full length of the vector, and if you omit the month in start, R assumes 1. You can use plot with cbind to plot several series on one figure (Fig. 1.5).
> Elec.ts <- ts(CBE[, 3], start = 1958, freq = 12)
> Beer.ts <- ts(CBE[, 2], start = 1958, freq = 12)
> Choc.ts <- ts(CBE[, 1], start = 1958, freq = 12)
> plot(cbind(Elec.ts, Beer.ts, Choc.ts))
6. The air passenger (AP) and electricity (Elec.ts) series are highly correlated over their overlapping period, as can be seen in the plots, with a correlation coefficient of 0.88.
> ap.elec <- ts.intersect(AP, Elec.ts)
> cor(ap.elec[, 1], ap.elec[, 2])
> plot(as.vector(ap.elec[, 1]), as.vector(ap.elec[, 2]))
> abline(reg = lm(ap.elec[, 2] ~ ap.elec[, 1]))
1.7. Quarterly exchange rate: GBP to NZ dollar
With financial data, exchange rates for example, such marked patterns are less likely to be seen, and different methods of analysis are usually required. A financial series may sometimes show a dramatic change that has a clear cause, such as a war or natural disaster. Day-to-day changes are more difficult to explain because the underlying causes are complex and impossible to isolate, and it will often be unrealistic to assume any deterministic component in the time series model.
1. > www <- "http://www.massey.ac.nz/~pscowper/ts/poundsnz.dat"
> Z <- read.table(www, header = TRUE)
> Z[1:4, ]
> Z.ts <- ts(Z, start = 1991, frequency = 4)
> plot(Z.ts)
2. The window function can be used to extract the subseries:
> Z.92.96 <- window(Z.ts, start = c(1992, 1), end = c(1996, 1))
> Z.96.98 <- window(Z.ts, start = c(1996, 1), end = c(1998, 1))
> layout(1:2)
> plot(Z.92.96, ylab = "Exchange rate in NZ/pound", xlab = "Time (years)")
> plot(Z.96.98, ylab = "Exchange rate in NZ/pound", xlab = "Time (years)")
1.8. Global temperature series
Several R functions are used in this section:
1. scan: reads data into a vector or list from the console or a file.
2. time: creates the vector of times at which a time series was sampled.
   cycle: gives the position in the cycle of each observation.
   frequency: returns the number of samples per unit time.
   deltat: returns the time interval between observations.
3. abline(reg = lm(y ~ t)): adds the fitted straight line y = a + b t to an existing plot.
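The functions listed above can be tried on any ts object. The sketch below assumes the built-in AirPassengers series rather than the global temperature data, and the names y, yr, pos, and fit are illustrative only.

```r
# Time-index helpers on the built-in AirPassengers series.
y   <- AirPassengers
yr  <- time(y)        # sampling times in decimal years
pos <- cycle(y)       # position of each observation within its year
f   <- frequency(y)   # observations per unit time (per year here)
dt  <- deltat(y)      # spacing between observations, in years

# Straight-line trend y = a + b * yr, fitted by least squares.
fit <- lm(y ~ yr)
print(coef(fit))      # intercept a and slope b

# Drawing is skipped when the script is run non-interactively.
if (interactive()) {
  plot(y, ylab = "Passengers (1000s)")
  abline(reg = fit)
}
```

For a monthly series, frequency and deltat are reciprocals of each other: 12 observations per year, spaced 1/12 of a year apart.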
2. Decomposition of series
2.1. Notation
So far, our analysis has been restricted to plotting the data and looking for features such as trend and seasonal variation. This is an important first step, but to progress we need to fit time series models, for which we require some notation.
1. We represent a time series of length n by $(x_t : t = 1, \ldots, n) = (x_1, x_2, \ldots, x_n)$. It consists of n values sampled at discrete times 1, 2, ..., n. The notation will be abbreviated to $(x_t)$ when the length n of the series does not need to be specified. The time series model is a sequence of random variables, and the observed time series is considered a realisation from the model. We use the same notation for both and rely on the context to make the distinction.
2. Sample mean: $\bar{x} = \sum_{t=1}^{n} x_t / n$.
3. The 'hat' notation will be used to represent a prediction or forecast. For example, with the series $(x_t : t = 1, \ldots, n)$, $\hat{x}_{t+k|t}$ is a forecast made at time t for a future value at time t + k. A forecast is a predicted future value, and the number of time steps into the future is the lead time (k).
2.2. Models
1. A simple additive decomposition model is given by
   $x_t = m_t + s_t + z_t$   (1.2)
   where, at time t, $x_t$ is the observed series, $m_t$ is the trend, $s_t$ is the seasonal effect, and $z_t$ is an error term that is, in general, a sequence of correlated random variables with mean zero.
2. We briefly outline two main approaches for extracting the trend $m_t$ and the seasonal effect $s_t$ in (1.2) and give the main R functions for doing this.
3. If the seasonal effect tends to increase as the trend increases, a multiplicative model may be more appropriate:
   $x_t = m_t \cdot s_t + z_t$   (1.3)
4. If the random variation is modelled by a multiplicative factor and the variable is positive, an additive decomposition model for $\log(x_t)$ can be used:
   $\log(x_t) = m_t + s_t + z_t$   (1.4)
5. Some care is required when the exponential function is applied to the predicted mean of $\log(x_t)$ to obtain a prediction for the mean value $x_t$, as the effect is usually to bias the predictions. If the random series $z_t$ are normally distributed with mean 0 and variance $\sigma^2$, then the predicted mean value at time t based on (1.4) is given by
   $\hat{x}_t = e^{m_t + s_t} e^{\frac{1}{2}\sigma^2}$
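The bias correction in the last item can be checked by simulation. The sketch below is only an illustration under assumed values: log_mean stands in for a hypothetical value of m_t + s_t, and sigma = 0.5 is an arbitrary choice.

```r
# Simulate x = exp(m + s + z) with z ~ N(0, sigma^2) and compare predictors.
set.seed(1)
sigma    <- 0.5      # assumed standard deviation of z_t
log_mean <- 2        # hypothetical value of m_t + s_t on the log scale

z <- rnorm(1e5, mean = 0, sd = sigma)
x <- exp(log_mean + z)

naive     <- exp(log_mean)                     # ignores the variance of z_t
corrected <- exp(log_mean) * exp(sigma^2 / 2)  # applies the factor e^{sigma^2 / 2}

print(c(sample_mean = mean(x), naive = naive, corrected = corrected))
```

The sample mean of x should sit close to the corrected value and above the naive one, since exponentiating a normal variable with mean zero gives a variable whose mean exceeds one.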
2.3. Estimating trends and seasonal effects
1. There are various ways to estimate the trend $m_t$ at time t, but a relatively simple procedure, which is available in R and does not assume any specific form, is to calculate a moving average centered on $x_t$.
2. This can be achieved by averaging the average of January up to December and the average of February (t = 2) up to January (t = 13). This average of two moving averages corresponds to t = 7, and the process is called centering. Thus the trend at time t can be estimated by the centered moving average
   $\hat{m}_t = \frac{\frac{1}{2}x_{t-6} + x_{t-5} + \ldots + x_{t-1} + x_t + x_{t+1} + \ldots + x_{t+5} + \frac{1}{2}x_{t+6}}{12}$,
   where t = 7, ..., n − 6.
3. The procedure generalizes for any seasonal frequency (e.g., quarterly series), provided the condition that the coefficients sum to unity is still met.
4. An estimate of the monthly additive effect ($s_t$) at time t can be obtained by subtracting $\hat{m}_t$:
   $\hat{s}_t = x_t - \hat{m}_t$
5. If the monthly effect is multiplicative, the estimate is given by division; i.e., $\hat{s}_t = x_t / \hat{m}_t$.
2.4. Smoothing
1. The centered moving average is an example of a smoothing procedure that is applied retrospectively to a time series with the objective of identifying an underlying signal or trend.
2. Smoothing procedures can, and usually do, use points before and after the time at which the smoothed estimate is to be calculated.
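A centered moving average for monthly data can be sketched with stats::filter, which applies a set of weights on both sides of each time point. The example assumes the built-in AirPassengers series, and the names w, trend_ma, by_hand, and season_add are illustrative.

```r
# Centered 12-month moving average: half weights at the two ends,
# full weights on the 11 interior points, divided by 12.
AP <- AirPassengers
w  <- c(0.5, rep(1, 11), 0.5) / 12      # 13 weights that sum to one

# sides = 2 centers the weights on each time point, so the smoother
# uses observations both before and after time t.
trend_ma <- stats::filter(AP, filter = w, sides = 2)

# The first available estimate is at t = 7 and uses x_1, ..., x_13.
by_hand <- sum(w * AP[1:13])

# Additive seasonal estimate: observed value minus estimated trend.
season_add <- AP - trend_ma

print(c(filter = as.numeric(trend_ma[7]), by_hand = by_hand))
```

The first six and last six values of trend_ma are NA because the window would run past the ends of the series.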
2.5. Decomposition in R
1. In R, the function decompose estimates trends and seasonal effects using a moving average method.
2. > plot(decompose(Elec.ts))
> plot(decompose(Elec.ts, type = "mult"))
> Elec.decom <- decompose(Elec.ts)
> Trend <- Elec.decom$trend
> Seasonal <- Elec.decom$seasonal
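To see what decompose returns without downloading the electricity data, the sketch below assumes the built-in AirPassengers series instead of Elec.ts. It checks that the estimated components recombine into the observed series. Note that, according to R's documentation for decompose, type = "multiplicative" treats the random component as a factor (trend × seasonal × random), which differs slightly from model (1.3).

```r
# Decompose AirPassengers and recombine the estimated components.
AP <- AirPassengers

# Additive: observed = trend + seasonal + random (where the trend is defined).
dec_add   <- decompose(AP)
recon_add <- dec_add$trend + dec_add$seasonal + dec_add$random

# Multiplicative in R: observed = trend * seasonal * random.
dec_mult   <- decompose(AP, type = "multiplicative")
recon_mult <- dec_mult$trend * dec_mult$seasonal * dec_mult$random

# The trend is NA for the first and last six months, so compare the rest.
ok <- !is.na(recon_add)
print(all.equal(as.numeric(recon_add[ok]),  as.numeric(AP[ok])))
print(all.equal(as.numeric(recon_mult[ok]), as.numeric(AP[ok])))

# One seasonal effect per month; additive effects are centered on zero.
print(round(dec_add$figure, 1))
```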
Finally, please note that Exercise 1 is assigned as homework #1; complete it and hand it in next week.