Efficient likelihood evaluation for VARMA processes with missing values

Kristján Jónasson
Sebastian E. Ferrando

Report VHÍ-01-2006 Reykjavík, September 2006

Cite as: Kristján Jónasson. Efficient likelihood evaluation for VARMA processes with missing values. Engineering Research Institute, University of Iceland, Technical report VHI-01-2006, September 2006.

Kristján Jónasson, Department of Computer Science, Hjardarhagi 4, IS-107 Reykjavik, Iceland. Email: [email protected].

Sebastian E. Ferrando, Department of Mathematics, Ryerson University, 350 Victoria Street, Toronto, Ontario M5B 2K3. Email: [email protected].

The authors are responsible for the opinions expressed in this report. These opinions do not necessarily represent the position of the Engineering Research Institute or the University of Iceland.

© Engineering Research Institute, University of Iceland, and the authors.

Engineering Research Institute, University of Iceland, Hjarðarhagi 4, IS-107 Reykjavík, Iceland

CONTENTS

1. INTRODUCTION
2. NOTATION AND THE CHOLESKY DECOMPOSITION METHOD
   2.1 Model notation
   2.2 Likelihood evaluation for complete data
   2.3 Operation count for complete data
3. MISSING VALUE CASE
   3.1 Likelihood evaluation via the Sherman-Morrison-Woodbury formula
   3.2 Estimating missing values and shocks
   3.3 Simplification for pure autoregressive models
   3.4 Operation count for missing value likelihood
4. DERIVATIVE OF THE LIKELIHOOD FUNCTION
   4.1 Derivatives of the r × r covariance matrices
   4.2 Remaining steps in likelihood gradient calculation
   4.3 Operation count for gradient calculation and possible savings
5. NUMERICAL EXPERIMENTS
   5.1 Timing of function evaluations
   5.2 Timing of gradient evaluations
APPENDICES
   A. Differentiation with respect to matrices
   B. Solution of the vector Yule-Walker equations
   C. Time series simulation
   D. Determinant of a low rank update
ACKNOWLEDGEMENT
REFERENCES

ÁGRIP

This report describes a method for evaluating the value and the derivative of the likelihood function for a vector time series of VARMA type (vector autoregressive moving average) when observations are missing. The method is based on combining the so-called Cholesky decomposition method for the VARMA likelihood of complete data with the Sherman-Morrison-Woodbury formula. It is described how savings can be achieved when there are no MA terms in the series, and furthermore it is explained how the missing values and the shock component of the series may be estimated. The report concludes with a description of numerical experiments carried out with a Matlab implementation of the methods. The programs, together with a description of their use, are in a separate report published alongside this one. The appendices then cover differentiation with respect to matrices, the solution of the vector Yule-Walker equations, simulation of VARMA models, and finally a theorem on matrix determinants is proved.

ABSTRACT

A detailed description of an algorithm for the evaluation and differentiation of the likelihood function for VARMA processes in the general case of missing values is presented. The method is based on combining the Cholesky decomposition method for complete data VARMA likelihood evaluation with the Sherman-Morrison-Woodbury formula. Potential savings for pure VAR processes are discussed and formulae for the estimation of missing values and shocks are provided. The report concludes with a description of numerical results obtained with a Matlab implementation of the algorithm, which is given in a companion report. Differentiation with respect to matrices, the solution of the vector Yule-Walker equations, VARMA model simulation and the determinant of a low rank update are discussed in appendices.

1. INTRODUCTION

A key aspect of the numerical treatment of autoregressive moving average (ARMA) processes is the efficient evaluation of the likelihood function for the parameters. It is necessary to treat the case of vector-valued processes (VARMA), as they are the ones prevailing in practice. There have been a large number of publications dealing with this subject, and there are a number of related approaches, some of which will be mentioned in due time. In order to make the results more relevant to practical applications it is important to study the more general case of missing values. Evaluation of the gradient of the likelihood function is also important for its maximization using traditional numerical optimization methods. This report's main contribution is to present formulae for the calculation of the likelihood function and its gradient, both for the complete data case and when there are missing values. We concentrate on the exact likelihood function, not the conditional likelihood (where the initial shocks are assumed to be zero), both because the latter is not easily applicable when values are missing or when the model has moving average terms, and also because the exact likelihood does in many practical cases give significantly better parameter estimates. Three different approaches for evaluating the exact likelihood function of univariate ARMA processes have been described in the literature: (A) one that we shall refer to as the presample method, described by Siddiqui [1958] for pure MA processes, (B) the Cholesky decomposition method, first described by Phadke and Kedem [1978], and (C) a state space Kalman filter method described by Harvey and Phillips [1979]. Several authors have described improvements and generalizations of the originally proposed methods; in particular, all three approaches have been generalized to VARMA models and to ARMA models with missing values. An overview of the developments is given by Penzer and Shea [1997].
Among the papers discussed there are [Ljung and Box 1979], describing a computationally efficient VARMA implementation of the presample method, and [Jones 1980], with a Kalman filter missing value ARMA method. In addition to the references in [Penzer and Shea 1997], Ljung [1989] discusses estimation of missing values for ARMA processes and Mauricio [2002] gives details of a VARMA implementation of the Cholesky decomposition method. Two Fortran programs for VARMA likelihood evaluation in the complete data case have been published: the Kalman filter method is implemented by Shea [1989], and the presample method by Mauricio [1997]. In addition, pure VAR models (with complete data) may be fitted using the Matlab package ARfit, described and published in the pair of papers by Neumaier and Schneider [2001]. In contrast to complete data VARMA and missing value ARMA, the case of VARMA processes with missing values has not been treated carefully in the literature. We only know of [Penzer and Shea 1997], and in that article only a sketch of a technique is presented. Formulae for the efficient evaluation of the likelihood gradient are also lacking in the published literature. In this report we take the Cholesky approach. It is considerably simpler and more direct than the other two approaches, and with complete data it is also in general more efficient [Penzer and Shea 1997; Mauricio 2002]. The original article of Phadke and Kedem [1978] treats VMA models; the extension to ARMA models is in [Ansley 1979]; Brockwell and Davis [1987, Ch. 11] describe a VARMA implementation (they and some other authors refer to the method as the innovation method); and Penzer and Shea [1997] provide a way of handling missing values in the ARMA case, albeit not the same as ours. Consider equations (2.1) and (2.2) and the associated notation for the definition of a VARMA process.
With the simple change of variables, w_t = x_t − µ for t ≤ p and w_t = y_t for t > p, the vectors w_t and w_{t+k} are independent for k > max(p, q), and thus the covariance matrix Ω of the combined w-vector has a block band structure. Since the likelihood function can be written solely in terms of Ω (cf. (2.12)), it may be evaluated efficiently through Cholesky decomposition of Ω [Golub and Van Loan 1983], and this is where the method gets its name. The presence of missing values corresponds to crossing out rows and columns of the covariance matrix S of the combined x-vector, giving an expression for the likelihood of the observed data (cf. (2.4)). One of the novelties of the present work is to show how the Cholesky factors of Ω, together with the Sherman-Morrison-Woodbury formula and other tools from numerical linear algebra, may be used to evaluate this missing value likelihood efficiently. We have included the mean of the series among the parameters, instead of assuming a zero-mean process as is customary in the literature. This is not important when there are no missing values: one can simply subtract the mean of the series. When there are missing values, this might however cause a bias. Suppose a weather station was out of operation during a cold spell. Then the mean of all observed temperature values would probably overestimate the true mean, but if other nearby stations were measuring during the cold spell, then maximizing the likelihood of a VARMA model with the mean as a free parameter would avoid this bias. We refer to the companion report [Jonasson 2006] for a Matlab implementation of the new methods. The programs follow very closely the notation and algorithms of the present report. An implementation of the methods using the C programming language is also under way. The report is organized as follows. Section 2 introduces the basic notation and reviews the Cholesky decomposition method for the complete data case. Section 3, the main section of the report, describes our approach to dealing with the missing value case. Section 4 describes the main ideas and techniques used to compute the derivative of the likelihood function. Section 5 presents some numerical experiments that complement the report. The appendices present technical material.
Appendix A describes our approach to differentiation with respect to matrices, Appendix B describes the solution of the vector Yule-Walker equations, Appendix C describes how to generate simulated time series, and, finally, Appendix D provides the proof of a result used in the report.

2. NOTATION AND THE CHOLESKY DECOMPOSITION METHOD

2.1 Model notation

A VARMA model describing a time series of values x_t ∈ ℝ^r for integer t is given by

  x_t − µ = Σ_{j=1}^p A_j (x_{t−j} − µ) + y_t      (2.1)

where

  y_t = ε_t + Σ_{j=1}^q B_j ε_{t−j},      (2.2)

µ is the expected value of x_t, the A_j's and the B_j's are r × r matrices, and the ε_t's are r-variate N(0, Σ), uncorrelated in time. Let θ denote the ((p + q)r² + r(r + 3)/2)-dimensional vector of all the parameters (the elements of the A_j's, the B_j's, Σ and µ; Σ being symmetric). If there are no missing values, observations x_t for t = 1,…, n are given, and x denotes the nr-vector (x_1^T,…, x_n^T)^T of all these values. When there are M missing values the observations are limited to a subvector x_o ∈ ℝ^N of x (with N = nr − M), and x_m ∈ ℝ^M is the vector of the missing values, say x_m = (x_{m_1},…, x_{m_M}). If the time series is stationary then the complete data log-likelihood function is given by

  l(θ) = −½ ( nr log 2π + log det S + (x − µ)^T S^{−1} (x − µ) )      (2.3)

where S = cov_θ(x) and (with slight abuse of notation) µ = E_θ(x) = (µ^T,…, µ^T)^T is the stacked mean vector. The log-likelihood function for the observed data is given by

  l_o(θ) = −½ ( N log 2π + log det S_o + (x_o − µ_o)^T S_o^{−1} (x_o − µ_o) )      (2.4)

where S_o = cov_θ(x_o) is obtained from S by removing rows m_1,…, m_M and columns m_1,…, m_M, and µ_o = E_θ(x_o) is obtained from µ by removing components m_1,…, m_M (see for example [Ljung 1989]).
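The observed-data likelihood (2.4) can be checked directly on a toy example. The following Python sketch (illustrative only; the report's companion code is in Matlab, and all names here are hypothetical) forms S_o and µ_o by deleting the missing rows, columns and components, and compares the result with a standard multivariate normal log-density:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
nr = 6                                   # length of the stacked vector x
A = rng.standard_normal((nr, nr))
S = A @ A.T + nr * np.eye(nr)            # stands in for S = cov_theta(x)
mu = rng.standard_normal(nr)             # stands in for the stacked mean

miss = np.array([1, 4])                  # missing indices m_1, m_2
obs = np.setdiff1d(np.arange(nr), miss)
x_o = rng.standard_normal(obs.size)      # some observed data

# (2.4): S_o and mu_o by deleting missing rows/columns and components
S_o = S[np.ix_(obs, obs)]
mu_o = mu[obs]
N = obs.size
d = x_o - mu_o
l_o = -0.5 * (N * np.log(2 * np.pi)
              + np.linalg.slogdet(S_o)[1]
              + d @ np.linalg.solve(S_o, d))

# Cross-check against a standard multivariate normal log-density
print(np.isclose(l_o, multivariate_normal(mu_o, S_o).logpdf(x_o)))  # True
```

The check works because deleting rows and columns of a Gaussian covariance matrix is exactly marginalization onto the observed components.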

2.2 Likelihood evaluation for complete data

We now turn attention to the evaluation of (2.3) and proceed in a similar vein as Mauricio [2002] and Brockwell and Davis [1987] (and as briefly suggested in [Penzer and Shea 1997]). From (2.1),

  y_t = x_t − µ − Σ_{j=1}^p A_j (x_{t−j} − µ)  for t > p.

Let w_t = x_t − µ for t ≤ p and w_t = y_t for t > p, and let w = (w_1^T,…, w_n^T)^T. Then w = Λ(x − µ) where Λ is the nr × nr lower triangular block-band matrix given by

      ⎡  I                            ⎤
      ⎢      ⋱                       ⎥
      ⎢          I                    ⎥
  Λ = ⎢ −A_p  ⋯  −A_1   I            ⎥ .      (2.5)
      ⎢      ⋱        ⋱      ⋱      ⎥
      ⎣          −A_p   ⋯   −A_1   I ⎦

Now let C_j = cov(x_t, ε_{t−j}), G_j = cov(y_t, x_{t−j}), W_j = cov(y_t, y_{t−j}) and S_j = cov(x_t, x_{t−j}) (all these are r × r matrices). Note that with this notation,

      ⎡ S_0      S_1^T  ⋯    S_{n−1}^T ⎤
  S = ⎢ S_1      S_0    ⋱    ⋮        ⎥ .      (2.6)
      ⎢ ⋮        ⋱     ⋱    S_1^T     ⎥
      ⎣ S_{n−1}  ⋯     S_1   S_0      ⎦

Furthermore, let A_i and B_j be zero for i and j outside the ranges implied by (2.1) and (2.2). By multiplying through (2.1) from the right with ε_{t−j}^T for j = 0,…, q and taking expectations, the following recurrence formulae for C_0, C_1, C_2,… are obtained:

  C_j = A_1 C_{j−1} + … + A_j C_0 + B_j Σ,  for j = 0, 1,…      (2.7)

(so C_0 = Σ). With B_0 = I, we have by (2.1) and (2.2):

  G_j = B_j C_0^T + … + B_q C_{q−j}^T,  for j = 0,…, q,      (2.8)

  W_j = B_j Σ B_0^T + … + B_q Σ B_{q−j}^T,  for j = 0,…, q.      (2.9)

For j < 0 or j > q, C_j, G_j and W_j are zero. By multiplying (2.1) from the right with (x_{t−j} − µ)^T for j = 0,…, p and taking expectations, one gets the following linear system (the vector Yule-Walker equations) for the r(r + 1)/2 + pr² elements of S_0,…, S_p (note that S_0 is symmetric):

  S_0 − A_1 S_1^T − … − A_p S_p^T = G_0
  S_1 − A_1 S_0 − A_2 S_1^T − … − A_p S_{p−1}^T = G_1
  S_2 − A_1 S_1 − A_2 S_0 − A_3 S_1^T − … − A_p S_{p−2}^T = G_2      (2.10)
  ⋮
  S_p − A_1 S_{p−1} − A_2 S_{p−2} − … − A_p S_0 = G_p

The solution of (2.10) is dealt with in Appendix B. If q ≤ p, the covariance matrix Ω of w will be given by the nr × nr matrix:

      ⎡ S_0      S_1^T  ⋯   S_{p−1}^T                            ⎤
      ⎢ S_1      S_0    ⋱   ⋮          G_q^T                     ⎥
      ⎢ ⋮        ⋱     ⋱   S_1^T      ⋮      ⋱                  ⎥
      ⎢ S_{p−1}  ⋯     S_1  S_0        G_1^T  ⋯   G_q^T          ⎥
  Ω = ⎢          G_q   ⋯    G_1        W_0    W_1^T ⋯ W_q^T      ⎥      (2.11)
      ⎢                ⋱    ⋮          W_1    W_0   ⋱      W_q^T ⎥
      ⎢                     G_q        ⋮      ⋱    ⋱      ⋮     ⎥
      ⎢                                W_q           ⋱     W_1^T ⎥
      ⎣                                      W_q  ⋯  W_1   W_0   ⎦

If q > p the depiction is slightly different: the pr × (n − p)r upper right partition of Ω will be

  ⎡ G_p^T  ⋯     G_q^T                    ⎤
  ⎢        ⋱            ⋱                ⎥ ,
  ⎣ G_1^T  G_2^T   ⋯     ⋯   G_q^T       ⎦

the lower left partition will be the transpose of this, but the upper left and lower right partitions are unaltered. Since Λ has unit diagonal, one finds that

  l(θ) = −½ ( nr log 2π + log det Ω + w^T Ω^{−1} w ).      (2.12)

To evaluate (2.12) it is most economical to calculate the Cholesky factorization Ω = LL^T, exploiting the block-band structure, and subsequently determine z = L^{−1}w using forward substitution. Then the log-likelihood function will be given by

  l(θ) = −½ ( nr log 2π + 2 Σ_i log l_ii + z^T z ).      (2.13)

We remark that the exposition in [Brockwell and Davis 1987] is significantly different from ours. They speak of the innovation algorithm, but it turns out that the actual calculations are identical to the Cholesky decomposition described here.

2.3 Operation count for complete data

Let h = max(p, q) and assume that q > 0 (see Section 3.4 for the q = 0 case). Given x it takes r²p(n − p) multiplications to calculate w. Determining the C_j's for j ≤ q and the G_j's and W_j's with (2.7), (2.8) and (2.9) costs about r³(min(p, q)²/2 + q²) multiplications, and solving the system (2.10) takes roughly r⁶p³/3 multiplications. The cost of the Cholesky factorization of Ω will be about (rh)³/6 multiplications for the upper left partition and r³(n − h)(q²/2 + 7/6) for the lower partition. Finally, the multiplication count for the forward substitution for z is about r²(h²/2 + (p/2 + q)(n − h)).


To take an example of the savings obtained by using (2.13) rather than (2.3), let p = q = 3, r = 8 and n = 1000. Then Cholesky factorization of S will cost 8000³/6 ≈ 8.5·10¹⁰ multiplications (and take about 7 min. on a typical Intel computer), but calculation with (2.13), including all the steps leading to it, will take 4.0·10⁶ multiplications (and take 0.02 s).
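To illustrate the structure that makes (2.13) cheap, here is a small Python sketch (hypothetical code, not the report's Matlab package) for the simplest case of a VAR(1) model (p = 1, q = 0). The Λ of (2.5) turns S into the block diagonal Ω = ΛSΛ^T, and both routes give the same log-likelihood; a real implementation would exploit the band structure instead of forming full matrices:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

rng = np.random.default_rng(0)
r, n = 2, 50
A1 = np.array([[0.5, 0.1],
               [0.0, 0.4]])              # stable VAR(1) coefficient A_1
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])           # shock covariance (mu = 0 here)

# Vector Yule-Walker equation for p = 1: S_0 = A_1 S_0 A_1^T + Sigma
S0 = solve_discrete_lyapunov(A1, Sigma)

# Full nr x nr covariance S of (2.6), with S_j = A_1^j S_0 for j >= 0
S = np.zeros((n * r, n * r))
Sj = S0.copy()
for j in range(n):
    for t in range(n - j):
        S[(t + j) * r:(t + j + 1) * r, t * r:(t + 1) * r] = Sj
        S[t * r:(t + 1) * r, (t + j) * r:(t + j + 1) * r] = Sj.T
    Sj = A1 @ Sj

# Lambda of (2.5): unit diagonal, -A_1 on the first subdiagonal block
Lam = np.eye(n * r)
for t in range(1, n):
    Lam[t * r:(t + 1) * r, (t - 1) * r:t * r] = -A1

x = rng.standard_normal(n * r)           # any data vector (x - mu) will do

def gauss_loglik(C, y):
    """-(1/2)(dim log 2pi + log det C + y^T C^{-1} y) via Cholesky of C."""
    L = np.linalg.cholesky(C)
    z = np.linalg.solve(L, y)            # z = L^{-1} y
    return -0.5 * (len(y) * np.log(2 * np.pi)
                   + 2 * np.sum(np.log(np.diag(L))) + z @ z)

l_S = gauss_loglik(S, x)                       # direct route, (2.3)
l_Om = gauss_loglik(Lam @ S @ Lam.T, Lam @ x)  # Omega route, (2.12)/(2.13)
print(np.isclose(l_S, l_Om))                   # True, since det(Lam) = 1
```

Here Ω is block diagonal (S_0 followed by Σ blocks), which is the q = 0 special case of the band structure discussed above.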

3. MISSING VALUE CASE

3.1 Likelihood evaluation via the Sherman-Morrison-Woodbury formula

We now consider the economical evaluation of (2.4) in the presence of some missing values. Consider first the term (x_o − µ_o)^T S_o^{−1} (x_o − µ_o). Let Ω̄, Λ̄ and S̄ be obtained from Ω, Λ and S by placing rows and columns m_1,…, m_M after the other rows and columns, and partition them as follows (with Ω_o, Λ_o and S_o being N × N, and Ω_m, Λ_m and S_m being M × M):

  Ω̄ = ⎡ Ω_o   Ω_om ⎤ ,   Λ̄ = ⎡ Λ_o   Λ_om ⎤ ,   and   S̄ = ⎡ S_o   S_om ⎤ .
       ⎣ Ω_mo  Ω_m  ⎦         ⎣ Λ_mo  Λ_m  ⎦              ⎣ S_mo  S_m  ⎦

By the definition of w, Ω̄ = Λ̄ S̄ Λ̄^T and therefore

  Ω_o = Λ_o S_o Λ_o^T + Λ_o S_om Λ_om^T + Λ_om S_mo Λ_o^T + Λ_om S_m Λ_om^T.      (3.1)

Λ_o is obtained from Λ by removing rows and corresponding columns, and it is therefore an invertible lower band matrix with unit diagonal and bandwidth at most rp, and Ω_o is obtained from Ω by removing rows and corresponding columns, so it is also a band matrix and its triangular factorization will be economical. It is thus attractive to operate with these matrices rather than the full matrix S_o. Defining

  Ω̃ = Λ_o S_o Λ_o^T      (3.2)

and w̃_o = Λ_o (x_o − µ_o), we have (x_o − µ_o)^T S_o^{−1} (x_o − µ_o) = w̃_o^T Ω̃^{−1} w̃_o. Also, from (3.1) and (3.2),

  Ω̃ = Ω_o − Λ_o S_om Λ_om^T − Λ_om S_om^T Λ_o^T − Λ_om S_m Λ_om^T,      (3.3)

(keep in mind that S̄ is symmetric). The matrices S_om, Λ_om and S_o Λ_om are N × M, so if the number of missing values, M, is (considerably) smaller than the number of observations, N, then (3.3) represents a low rank modification of Ω_o. This invites the use of the Sherman-Morrison-Woodbury (SMW) formula [Sherman and Morrison 1950; Woodbury 1950; cf. Golub and Van Loan 1983]. To retain symmetry of the matrices that need to be factorized, (3.3) may be rewritten as

  Ω̃ = Ω_o + U S_m^{−1} U^T − V S_m^{−1} V^T      (3.4)

where U = Λ_o S_om and V = Λ_o S_om + Λ_om S_m. It turns out that U is generally a full matrix but V is sparse, and it will transpire that it is possible to avoid forming U. To obtain V economically, select the observed rows and missing columns from ΛS. From (2.5), (2.6), and proceeding as when deriving (2.10), the following block representation of ΛS for the case q > p is obtained:


       ⎡ S_0      S_1^T  ⋯                      ⋯   S_{n−1}^T    ⎤
       ⎢ ⋮        ⋱                                 ⋮           ⎥
       ⎢ S_{p−1}  ⋯   S_0   S_1^T              ⋯   S_{n−p}^T    ⎥
  ΛS = ⎢ G_p  ⋯  G_1  G_0   G_{−1}             ⋯   G_{−n+p+1}   ⎥ .
       ⎢ ⋮       ⋱    ⋱     ⋱                      ⋮           ⎥
       ⎢ G_q          ⋱       ⋱                ⋱   G_{−1}       ⎥
       ⎣         ⋱        G_q   ⋯                  G_0          ⎦

For q ≤ p the upper partition is the same but the lower partition is

  ⎡ G_q  ⋯  G_0  ⋯   G_{−n+p+1} ⎤
  ⎢      ⋱       ⋱    ⋮        ⎥ .
  ⎢          ⋱       ⋱  ⋮      ⎥
  ⎣             G_q  ⋯  G_0     ⎦

For S_{p+1},…, S_{n−1}, multiply (2.1) from the right with (x_{t−j} − µ)^T for j = p + 1,…, n − 1 and take expectations (as when deriving (2.10)), giving

  S_j = A_1 S_{j−1} + A_2 S_{j−2} + … + A_p S_{j−p} + G_j      (3.5)

with G_j = 0 for j > q. The G_j for negative j may be obtained using

  G_{−j} = C_j^T + B_1 C_{j+1}^T + … + B_q C_{j+q}^T

where the C_i's are given by the recurrence (2.7). From (2.10) and (3.5) it follows that blocks (i, j) with i > j + q of ΛS are zero, giving almost 50% sparsity. In practice the missing values will often occur near the beginning of the observation period, and this implies that V will be sparser still. To take an example, if q = 1, r = 2, n = 6 and m = (2, 3, 4, 5, 9), then the sparsity pattern of V will be:

  ⎡ × × × × × ⎤
  ⎢   × × × × ⎥
  ⎢       × × ⎥
  ⎢       × × ⎥
  ⎢         × ⎥
  ⎢         × ⎥
  ⎣         × ⎦

The SMW formula applied to (3.4) gives

  Ω̃^{−1} = Ω̂^{−1} + Ω̂^{−1} V Q^{−1} V^T Ω̂^{−1}

where

  Ω̂ = Ω_o + U S_m^{−1} U^T      (3.6)

and Q is the M × M matrix S_m − V^T Ω̂^{−1} V. Moreover, if R is the M × M matrix S_m + U^T Ω_o^{−1} U, then (again by the SMW formula):

  Ω̂^{−1} = Ω_o^{−1} − Ω_o^{−1} U R^{−1} U^T Ω_o^{−1}.

If L_R is the Cholesky factor of R and K = L_R^{−1} U^T Ω_o^{−1} V, it follows that Q = S_m − V^T Ω_o^{−1} V + K^T K. The first method that springs to mind to evaluate V^T Ω_o^{−1} V efficiently is to Cholesky factorize Ω_o = LL^T, use forward substitution to obtain V̂ = L^{−1}V and form V̂^T V̂. However, with this procedure V̂ will be full and the computation of V̂^T V̂ will cost NM(M + 1)/2 multiplications. In contrast, if an L^T L factorization of Ω_o is employed instead of Cholesky factorization, the sparsity of V will be carried over to V̂, with large potential savings. This is a crucial observation because, with many missing values, multiplication with V̂ constitutes the bulk of the computation needed for the likelihood evaluation.

Thus the proposed method is: L^T L-factorize Ω_o = L_o^T L_o, and back-substitute to get V̂ = L_o^{−T} V and Λ̂_om = L_o^{−T} Λ_om, making use of known sparsity for all calculations (the sparsity structure of Λ̂_om will be similar to that of V̂). With R_V = V̂^T V̂, R_Λ = Λ̂_om^T Λ̂_om and P = Λ̂_om^T V̂ (again exploiting sparsity), we find that R = S_m + R_V + S_m R_Λ S_m − S_m P − P^T S_m (all matrices in this identity are full M × M), K = L_R^{−1}(R_V − S_m P) and Q = S_m − R_V + K^T K. Let further L_Q be the Cholesky factor of Q, ŵ_o = L_o^{−T} w̃_o, u = L_R^{−1}(V̂^T ŵ_o − S_m Λ̂_om^T ŵ_o) and v = L_Q^{−1}(V̂^T ŵ_o − K^T u). A little calculation then gives:

  (x_o − µ_o)^T S_o^{−1} (x_o − µ_o) = ŵ_o^T ŵ_o − u^T u + v^T v.

Now turn attention to the other nontrivial term in (2.4), log det S_o. From det(I + AB) = det(I + BA) (see Appendix D) we get det(X ± A Y^{−1} A^T) = det X · det Y^{−1} · det(Y ± A^T X^{−1} A). From (3.3), (3.6) and the definition of L_Q,

  det Ω̃ = det(Ω̂ − V S_m^{−1} V^T) = det Ω̂ det S_m^{−1} det(S_m − V^T Ω̂^{−1} V) = det Ω̂ det S_m^{−1} (det L_Q)².

Similarly,

  det Ω̂ = det(Ω_o + U S_m^{−1} U^T) = det Ω_o det S_m^{−1} det(S_m + U^T Ω_o^{−1} U) = det Ω_o det S_m^{−1} (det L_R)².

Since det Λ_o = 1, it now follows from (3.2) and the definition of L_o that

  log det S_o = 2 (log det L_o + log det L_R + log det L_Q − log det S_m).
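The two SMW identities and the determinant relation used above can be verified numerically. The sketch below (illustrative Python, with randomly generated stand-ins for Ω_o, S_m and V; not the report's code) checks the inverse update and the determinant of the low rank update:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 12, 3                          # observed / missing dimensions

A = rng.standard_normal((N, N))
Omega = A @ A.T + 2 * N * np.eye(N)   # SPD stand-in for Omega_o (or Omega-hat)
V = rng.standard_normal((N, M))       # stand-in for the N x M matrix V
B = rng.standard_normal((M, M))
Sm = B @ B.T + M * np.eye(M)          # SPD stand-in for S_m

Omega_inv = np.linalg.inv(Omega)
Q = Sm - V.T @ Omega_inv @ V          # as in the definition of Q above

# SMW: (Omega - V Sm^{-1} V^T)^{-1}
#        = Omega^{-1} + Omega^{-1} V Q^{-1} V^T Omega^{-1}
lhs = np.linalg.inv(Omega - V @ np.linalg.inv(Sm) @ V.T)
rhs = Omega_inv + Omega_inv @ V @ np.linalg.inv(Q) @ V.T @ Omega_inv
print(np.allclose(lhs, rhs))          # True

# Determinant of the low rank update (via det(I + AB) = det(I + BA)):
# det(Omega - V Sm^{-1} V^T) = det(Omega) det(Sm)^{-1} det(Q)
d1 = np.linalg.det(Omega - V @ np.linalg.inv(Sm) @ V.T)
d2 = np.linalg.det(Omega) / np.linalg.det(Sm) * np.linalg.det(Q)
print(np.allclose(d1, d2))            # True
```

In the algorithm proper these dense inverses are of course never formed; everything is expressed through the triangular factors L_o, L_R and L_Q.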

3.2 Estimating missing values and shocks

An obvious estimate of the vector of missing values is its expected value, x_m^E = E(x_m | x_o, θ), where θ is the maximum likelihood estimate of the parameters (this is also the maximum likelihood estimate of x_m). Since S_mo = cov(x_m, x_o) and S_o = var(x_o),

  x_m^E = S_mo S_o^{−1} (x_o − µ_o) + µ_m

(where µ_m consists of the missing components of µ). Similarly, the maximum likelihood estimate of the shocks ε_t is given by ε_t^E = E(ε_t | x_o, θ). For 0 ≤ j ≤ q, cov(ε_t, x_{t+j}) = C_j^T, and ε_t is independent of x_{t+j} for other j. It follows that ε^E = C̃ S_o^{−1} (x_o − µ_o), where ε^E is the column vector with blocks ε_1^E,…, ε_n^E and C̃ is obtained by removing the missing columns from the nr × nr matrix

  ⎡ C_0  C_1^T  ⋯  C_q^T               ⎤
  ⎢      C_0    ⋱         ⋱           ⎥
  ⎢             ⋱         C_q^T        ⎥ .
  ⎢                  ⋱    ⋮           ⎥
  ⎢                  C_0  C_1^T        ⎥
  ⎣                       C_0          ⎦

With some calculation one may verify that, given the matrices and vectors defined in the previous section, the estimates of x_m and ε may be calculated economically using

  x_m^E = S_m v_2 + µ_m,  and
  ε^E = C̃ Λ_o^T L_o^{−1} (ŵ_o + V̂(v_1 − v_2) + Λ̂_om S_m v_2)

where v_1 = L_Q^{−T} v and v_2 = L_R^{−T} (u + K v_1).
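The basic estimate x_m^E = S_mo S_o^{−1}(x_o − µ_o) + µ_m can be illustrated on a toy covariance matrix. The Python sketch below (hypothetical code, not the report's Matlab package) also confirms the remark that this conditional mean is the maximum likelihood estimate of x_m, by comparing it with the minimizer of the quadratic form (x − µ)^T S^{−1}(x − µ) over the missing components:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8                                  # total dimension (nr in the report)
miss = np.array([1, 4, 5])             # indices of missing components
obs = np.setdiff1d(np.arange(n), miss)

A = rng.standard_normal((n, n))
S = A @ A.T + n * np.eye(n)            # SPD covariance of the full vector
mu = rng.standard_normal(n)
x_obs = rng.standard_normal(obs.size)  # some observed data

# Conditional-mean estimate: x_m^E = S_mo S_o^{-1} (x_o - mu_o) + mu_m
S_o = S[np.ix_(obs, obs)]
S_mo = S[np.ix_(miss, obs)]
xmE = S_mo @ np.linalg.solve(S_o, x_obs - mu[obs]) + mu[miss]

# The same estimate maximizes the joint likelihood over the missing
# values: minimize (x - mu)^T S^{-1} (x - mu) over x_m with x_o fixed,
# whose stationarity condition involves the precision matrix S^{-1}.
P = np.linalg.inv(S)
P_mm = P[np.ix_(miss, miss)]
P_mo = P[np.ix_(miss, obs)]
xm_opt = mu[miss] - np.linalg.solve(P_mm, P_mo @ (x_obs - mu[obs]))

print(np.allclose(xmE, xm_opt))        # True
```

The agreement of the two routes is the standard block-inverse identity for Gaussian conditioning; the formulae with v_1 and v_2 above compute the same quantity without ever forming S_o^{−1}.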

3.3 Simplification for pure autoregressive models

If q is zero and there are no moving average terms, considerable simplification results, and it is worthwhile to review this case. Since y_t = ε_t for all t, the G_j and W_j matrices will all be zero apart from G_0 and W_0, which are both equal to Σ. The upper left S-partition of Ω in (2.11) will be unchanged, the G-partition will be zero and the lower right W-partition will be a block diagonal matrix where each block is equal to Σ. For the missing value case, Ω_o needs to be Cholesky factorized. It is obtained by removing rows and corresponding columns from Ω, so that its upper left partition is the same as in the general ARMA case, but the lower right partition is a block diagonal matrix

  ⎡ Σ_o1                      ⎤
  ⎢       Σ_o2                ⎥
  ⎢             ⋱            ⎥
  ⎣                  Σ_o,n−p  ⎦

where Σ_oi contains the rows and columns of Σ corresponding to the observed indices at time p + i. To obtain L_o it is therefore sufficient to Cholesky factorize Σ_oi for each missing pattern that occurs, which in all realistic cases will be much cheaper than Cholesky factorizing the entire Ω_o-matrix.

3.4 Operation count for missing value likelihood

Finding the C_j's, G_j's, W_j's and S_j's will be identical to the complete data case. The Cholesky factorization of Ω_o costs at most r²N(q²/2 + 7/6) multiplications (unless the upper left partition is unusually big). Forming ΛS costs about r³(2p + q)(n − p) multiplications. The cost of forming Λ̂_om and V̂ using back substitution depends on the missing value pattern. In the worst case, when all the missing values are at the end of the observation period, the cost is approximately rqNM multiplications for each, since the bandwidth of both is ≤ rq, but typically the missing values will be concentrated near the beginning and the cost will be much smaller. The cost of R_V, R_Λ and P also depends on the missing value pattern. In the worst case the symmetric R_V and R_Λ cost NM²/2 multiplications each and P costs NM², but the typical cost is again much smaller (for example, with the "miss-25" pattern of Table I the cost is 5 times smaller). Next follows a series of order M³ operations: S_mP costs M³, R costs 3M³/2, and K and Q cost M³/2 multiplications each. Finally, the Cholesky factorizations for each of L_R, L_Q and det S_m cost M³/6 multiplications. The multiplication count of other calculations is negligible by comparison unless M is very small. When n and M are large compared to p, q and r, the governing tasks will cost 2fNM² + 4M³ multiplications, where f is the savings factor of having the missing values early on.

In the pure autoregressive case the C_j's, G_j's and W_j's come for free, but solving the vector Yule-Walker equations costs the same as before. The cost of Cholesky factorizing Ω will usually be negligible, and much cheaper than when q > 0. When nothing is missing, it is the number rpN of multiplications to find w and the number rpN/2 of multiplications of the forward substitution for z that govern the computational cost. On the negative side, there will be no savings in the governing tasks when M and n are large.

4. DERIVATIVE OF THE LIKELIHOOD FUNCTION

Several different matrix operations that need to be differentiated may be identified. Matrix products are used in the calculation of w and the covariance matrices C_i, G_i and W_i; Cholesky factorization of Ω gives L; linear equations are solved to obtain the S_i matrices and z; and lastly one must differentiate log det L. In the missing value case, several more matrix products, Cholesky factorizations, linear equation solutions and determinants occur. Nel [1980] reviews and develops matrix differentiation methods for scalar- and matrix-valued functions with respect to scalars and matrices. He discusses three basic methods, and concludes that a method that he calls the element breakdown method is best for general purposes, and this is the approach we take.


For the change of variables described in Section 4.3 we also make use of his vector rearrangement method. Since there is no commonly used notation for differentiation with respect to matrices, we provide the needed notation and formulae in Appendix A for clarity and ease of reference.

4.1 Derivatives of the r × r covariance matrices

The matrices C_i, G_i and W_i are all simple matrix polynomials in the parameter matrices (the A_i's, B_i's and Σ), and it is not difficult to verify that they can all be obtained by applying a sequence of operations of the following types:

  F ← F + XY
  F ← F + XY^T
  F ← F + XG      (4.1)
  F ← F + XG^T

where F is the polynomial, X and Y are independent variables (parameter matrices), and G is also a polynomial obtained through such steps. Initialization can be either F ← O (the r × r zero matrix) or F ← X (one of the parameter matrices). The operations (4.1) can all be differentiated using (A.3) and (A.4) as detailed in the following table, where X, Y and Z are different parameter matrices:

  Change to F:   Corresponding change to
                 [dF/dX]_lc:                          [dF/dY]_lc:        [dF/dZ]_lc:
  + XY           + e_l e_c^T Y                        + X e_l e_c^T      0
  + XY^T         + e_l e_c^T Y^T                      + X e_c e_l^T      0
  + XG           + X [dG/dX]_lc + e_l e_c^T G         + X [dG/dY]_lc     + X [dG/dZ]_lc
  + XG^T         + X [dG/dX]_lc^T + e_l e_c^T G^T     + X [dG/dY]_lc^T   + X [dG/dZ]_lc^T

For the first few applications of (4.1) the derivatives will be sparse, and for small p, q and/or n it may be worthwhile to exploit this sparsity. There are 5 possible sparsity patterns for dF/dX:

  1) all elements are zero
  2) in the (i, j)-block only the (i, j)-element is nonzero
  3) only the i-th row in the (i, j)-block is nonzero
  4) only the j-th column in the (i, j)-block is nonzero
  5) the matrix is full

As an example, let p = 1, q = 2 and consider the differentiation of C_0, C_1 and C_2. These matrices are given by C_0 = Σ, C_1 = A_1Σ + B_1Σ (the first operation of (4.1) twice) and C_2 = A_1C_1 + B_2Σ (the third operation of (4.1) followed by the first operation). Treating Σ as non-symmetric to begin with, one obtains:

  dC_0/dA_1 = 0               [dC_1/dA_1]_lc = e_l e_c^T Σ             [dC_2/dA_1]_lc = A_1[dC_1/dA_1]_lc + e_l e_c^T C_1
  dC_0/dB_1 = 0               [dC_1/dB_1]_lc = e_l e_c^T Σ             [dC_2/dB_1]_lc = A_1[dC_1/dB_1]_lc
  dC_0/dB_2 = 0               dC_1/dB_2 = 0                            [dC_2/dB_2]_lc = e_l e_c^T Σ
  [dC_0/dΣ]_lc = e_l e_c^T    [dC_1/dΣ]_lc = (A_1 + B_1) e_l e_c^T     [dC_2/dΣ]_lc = A_1[dC_1/dΣ]_lc + B_2 e_l e_c^T

Here all the sparsity patterns are represented, and the only full matrices are the derivatives of C_2 with respect to A_1 and B_1. Finally, the derivatives with respect to the symmetric Σ are adjusted using (A.7). Now we turn attention to the vector Yule-Walker equations (2.10). Differentiating through these with respect to a parameter gives:

  S′_{j,lc} − (A_1 S′_{j−1,lc} + … + A_j S′_{0,lc}) − (A_{j+1} (S′_{1,lc})^T + … + A_p (S′_{p−j,lc})^T)
      = G′_{j,lc} + (A′_{1,lc} S_{j−1} + … + A′_{j,lc} S_0) + (A′_{j+1,lc} S_1^T + … + A′_{p,lc} S_{p−j}^T),  for j = 0,…, p.      (4.2)

This set of equations has exactly the same coefficient matrix as the original equations (2.10), but a different right hand side, which can be obtained using the formula for d(XA)/dX in (A.4) (sparsity can be exploited). It can therefore be solved to obtain the derivatives of S_0,…, S_p using the same factorization as was used to obtain the S_j.

4.2 Remaining steps in likelihood gradient calculation

It follows from (2.1) that the derivative of y_t (and thereby w_t) with respect to the B_j's and Σ is zero, and (A.5) gives its derivative with respect to the A_j's and µ. For complete data, the next needed derivative is that of L, the Cholesky factor of Ω. As the derivatives of all the submatrices of Ω have been found, this may be obtained using (A.10) and (A.11), making the necessary modifications to take advantage of the block-band structure of Ω. To finish the calculation of the gradient of l(θ) in (2.13), use Lz = w together with (A.8) to differentiate z, followed by (A.6) to differentiate z^T z, and finally use d(log l_ii)/dX = (1/l_ii) dl_ii/dX. In the missing value case, the operations that must be differentiated are the same: matrix products, Cholesky factorization, forward substitution, and determinants of lower triangular matrices, and there is no need to give details of all of them. They have been implemented in [Jonasson 2006] by writing functions that implement (A.3), (A.9), (A.10) and (A.11).

4.3 Operation count for gradient calculation and possible savings

Inspection of the formulae in Appendix A for the derivatives of the most costly operations, namely matrix products, Cholesky factorization and forward substitution, shows that they all cost approximately 2n_θ times more multiplications than the original operations being differentiated, where n_θ = r²(p + q) + r(r + 1)/2 is the total number of model parameters excluding µ, which does not enter the costly operations.
The gradient calculation will therefore usually dominate the total work needed for likelihood maximization, and this is confirmed by the numerical results of Section 5. One way of trying to reduce this work would be to use numerical gradients in the initial iterations, when the accuracy of the gradients is not as important as it is closer to the solution. Using forward differencing, (∂/∂θ_k) l(θ) ≈ (l(θ + δe_k) − l(θ))/δ, the gradient can be approximated with n_θ function calls, giving a potential saving of a factor of 2. However, judging by the results shown in Table II in the next section, it seems that this technique is not so useful. Another possibility of speeding up the computations exists when estimating seasonal models, structural models, or various models with constraints on the parameters, such as distributed lag models. Without entering too much into detail, such models may often be described by writing θ as a function of a reduced set of parameters, θ = g(φ), where φ ∈ ℝ^{n_φ} has (often much) fewer components than θ. The log-likelihood for a given set of parameters φ is l(g(φ)), and the corresponding gradient is l′(g(φ)) J_g(φ), where J_g is the n_θ × n_φ Jacobian of the transformation g. The parameter matrices may be sparse and it would be possible to exploit the sparsity, but big savings are also possible by multiplying with the Jacobian earlier in the computation of the gradient, instead of after evaluating l′(θ). A convenient place to make the change of variables is after the differentiation of w, the C_j's, G_j's and W_j's, and the g_j's in the right-hand side of (B.3). The costly derivatives come after this, so the potential saving approaches a factor of n_θ/n_φ. In [Jonasson 2006] this course of action has been implemented, and the likelihood routines have J_g as an optional parameter.
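The element breakdown derivatives of Section 4.1 are easy to sanity-check with forward differences of the kind just described. The following Python sketch (illustrative only, not the report's Matlab code) verifies [dC_1/dA_1]_lc = e_l e_c^T Σ for the example C_1 = A_1Σ + B_1Σ:

```python
import numpy as np

rng = np.random.default_rng(3)
r = 3
A1 = rng.standard_normal((r, r))
B1 = rng.standard_normal((r, r))
Sigma = rng.standard_normal((r, r))   # treated as non-symmetric here

def C1(A1):
    return A1 @ Sigma + B1 @ Sigma    # C_1 = A_1 Sigma + B_1 Sigma

# Element breakdown derivative: [dC_1/dA_1]_lc = e_l e_c^T Sigma
l, c = 1, 2
e = np.eye(r)
analytic = np.outer(e[l], e[c]) @ Sigma

# Forward-difference check, perturbing only element (l, c) of A_1
delta = 1e-7
A1p = A1.copy()
A1p[l, c] += delta
numeric = (C1(A1p) - C1(A1)) / delta

print(np.allclose(analytic, numeric, atol=1e-5))  # True
```

Since C_1 is linear in A_1 the forward difference is exact up to rounding; for the nonlinear quantities further along the chain (L, z, log det L) the same kind of check holds only to within the truncation error of the difference step.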


5. NUMERICAL EXPERIMENTS
The methods described in Sections 2−4 have been implemented in Matlab as described in the companion report [Jonasson 2006]. The Matlab package includes a function to simulate time series as described in Appendix C below. This function has been used to generate test data for several models, missing value patterns and dimensions, and these data have been used to test and time the likelihood evaluation functions. The tests were run using Matlab 7.1 with its default Intel Math Kernel Library (MKL) on a 1600 MHz Pentium M processor.

The primary use of likelihood evaluation is to estimate model parameters by maximizing the likelihood function. The authors have been experimenting with the BFGS method with line search. To take two examples, estimating a VMA(1) model with r = 4, n = 100 and missing value pattern "miss-5a" from Table I (giving 30 model parameters) took 205 function evaluations and 48 gradient evaluations, and a VAR(3) model with r = 4, n = 500 and missing value pattern "miss-5a" (giving 78 parameters) took 393 function evaluations and 73 gradient evaluations. These figures, together with the run times reported below, indicate what to expect in terms of run time for parameter estimation. For real problems of this size it is however likely that a reduced set of parameters would be used, and then the savings described at the end of Section 4.3 would come into effect.

5.1 Timing of function evaluations
Table I shows the run time in seconds required for one function evaluation for each combination of model, missing value pattern, and dimensions.

Table I — Run time in seconds per function evaluation. The missing value patterns shown in the "Data" column are: a) complete: complete data; b) miss-5a: 5% missing, scattered in the first quarter of each series; c) miss-5b: 5% missing, scattered throughout the entire series; d) miss-25: 25% missing — half the series have the first half missing.

                             r = 2              r = 4              r = 8
Model        Data       n = 100  n = 500   n = 100  n = 500   n = 100  n = 500
VAR(1)       complete      0.01     0.02      0.01     0.03      0.01     0.03
             miss-5a       0.02     0.10      0.03     0.24      0.05     0.85
             miss-5b       0.03     0.24      0.04     0.58      0.10     2.21
             miss-25       0.04     0.73      0.08     4.07      0.35    28.17
VMA(1)       complete      0.04     0.19      0.04     0.21      0.05     0.24
             miss-5a       0.06     0.29      0.06     0.47      0.09     1.07
             miss-5b       0.07     0.43      0.08     0.86      0.15     2.78
             miss-25       0.08     1.00      0.12     4.41      0.39    28.34
VAR(3)       complete      0.01     0.03      0.01     0.03      0.04     0.06
             miss-5a       0.03     0.11      0.04     0.24      0.08     0.88
             miss-5b       0.03     0.19      0.05     0.55      0.12     2.19
             miss-25       0.04     0.73      0.09     4.09      0.37    27.90
VARMA(2,2)   complete      0.05     0.25      0.06     0.27      0.08     0.34
             miss-5a       0.07     0.33      0.08     0.51      0.13     1.24
             miss-5b       0.08     0.57      0.10     1.02      0.19     2.98
             miss-25       0.09     1.02      0.14     4.47      0.44    28.64

For the pure VAR models the simplifications of Section 3.3 are realized, and for complete data the solution of the vector-Yule-Walker equations and the calculation of w and z govern the computation. For VMA and VARMA models these calculations still make up a portion of the total, but the factorization of Ω is now more expensive and accounts for most of the difference between the complete-data execution times of the VAR(1) and VMA(1) models shown in Table I. Missing values add gradually to the cost: when there are few missing values the execution time is only marginally greater than for complete data, and when more values are missing the savings in the VAR model are gradually eradicated. Then the approximately order-M³ operations (independent of p and q) involving the profile-sparse N × M matrices V̂ and Λ̂, and the full M × M matrices Sm, RΛ, RV, P, Q, Rᵒm and K, become more and more important. If these were the only computations one would expect a factor 125 difference between n = 100 and n = 500, but because of other calculations that do not depend on M the largest factor in the table is 80 (for VAR(1), miss-25, r = 8). Another feature shown by the table is the difference between miss-5a and miss-5b, corroborating the discussion between equations (3.5) and (3.6) in Section 3.1. This factor ranges from 1.26 to 2.61, with an average of 1.89.

5.2 Timing of gradient evaluations
Timing experiments for gradient evaluation were also carried out. It seems most relevant to compare with the cost of numerical differentiation, so Table II shows the ratio between the time of one gradient evaluation and the time of m function evaluations, where m is the number of model parameters. Where a table entry is less than one, the analytical gradients take less time than (possibly inaccurate) forward-difference numerical gradients, and where it is less than two the analytical gradients are cheaper than central-difference numerical gradients. The average of the 70 factors shown in the table is 0.74. The table is less extensive than Table I because the computer used did not have enough memory to time the largest models. The memory was sufficient to time some runs not shown in the table, and the results were comparable to the figures shown (the average factor for 11 cases not shown in the table was 0.41).
Table II — Execution time for one gradient evaluation divided by the time for m function evaluations, where m is the number of model parameters. See the caption of Table I for an explanation of the "Data" column.

                             r = 2              r = 4            r = 8
Model        Data       n = 100  n = 500   n = 100  n = 500   n = 100
VAR(1)       complete      0.40     0.47      0.15     0.17      0.09
             miss-5a       0.56     0.78      0.35     0.97      0.66
             miss-5b       0.58     0.66      0.50     1.02      0.77
             miss-25       0.83     2.04      1.33     2.22      2.11
VMA(1)       complete      1.16     1.57      1.06     1.08      1.12
             miss-5a       0.63     0.68      0.33     0.66      0.52
             miss-5b       0.67     0.76      0.42     0.74      0.65
             miss-25       0.71     1.39      1.02     2.01      1.92
VAR(3)       complete      0.23     0.26      0.10     0.11      0.05
             miss-5a       0.35     0.62      0.30     1.09      0.50
             miss-5b       0.38     0.82      0.42     1.05      0.72
VARMA(2,2)   complete      1.10     1.19      1.07     1.17      1.08
             miss-5a       0.34     0.45      0.27     0.66      0.55
             miss-5b       0.36     0.51      0.38     0.75      0.74

The relative cost of gradient evaluation is somewhat lower than expected at the outset, given that the derivatives of the basic linear algebra operations, computed with the formulae of Appendix A, cost 2m times more than the operations themselves. This could be because the evaluation of the gradient involves larger matrices, thus making better use of the Intel MKL. The varying effectiveness of the MKL partly explains the variability of the numbers in Table II; the rest of the disparity probably occurs because the different derivative routines exploit Matlab's strengths to different degrees.

APPENDICES

A. DIFFERENTIATION WITH RESPECT TO MATRICES
Many of the identities that follow may be found in [Nel 1980]; see also [Golub and Van Loan 1983]. If f is a differentiable function on the set of M × N matrices, f: ℝ^(M×N) → ℝ, then the N × M matrix with (i, j)-element ∂f/∂xij will be denoted by f′(X) or df/dX. If f is a vector-valued function of a matrix, f: ℝ^(M×N) → ℝ^m, then df/dX or f′(X) denotes the block matrix

[ ∂f/∂x11 ⋯ ∂f/∂x1N ]
[    ⋮         ⋮    ]
[ ∂f/∂xM1 ⋯ ∂f/∂xMN ]

where each block is an m-dimensional column vector (the k-th block row is actually the Jacobian matrix of f with respect to the k-th row of X). If F is matrix-valued, F: ℝ^(M×N) → ℝ^(m×n), then dF/dX or F′(X) denotes the M × N block-matrix

[ ∂F/∂x11 ⋯ ∂F/∂x1N ]
[    ⋮         ⋮    ]                                                     (A.1)
[ ∂F/∂xM1 ⋯ ∂F/∂xMN ]

The (l, c)-block of (A.1) will be denoted by F′lc or [dF/dX]lc; it is an m × n matrix with (i, j)-element equal to ∂fij(X)/∂xlc. It is now easy to verify that if a is a scalar and F̃ is another matrix function with the same dimensions as F, then d(aF + F̃)/dX = a dF/dX + dF̃/dX. We also have (where el is the l-th unit vector):

[dX/dX]lc = el ecᵀ,                                                       (A.2)

and, if G is another matrix function G: ℝ M × N → ℝ n×k , then

[d(FG)/dX]lc = FG′lc + F′lc G.                                            (A.3)

A.1 Differentiation of matrix products

The following special cases are all consequences of (A.2) and (A.3):

[dXᵀ/dX]lc = ec elᵀ
[d(AX)/dX]lc = A el ecᵀ
[d(XA)/dX]lc = el ecᵀ A
[d(AF)/dX]lc = A F′lc                                                     (A.4)
[d(FA)/dX]lc = F′lc A
[d(XF)/dX]lc = X F′lc + el ecᵀ F
[d(FX)/dX]lc = F′lc X + F el ecᵀ

where, in each case, A is a constant matrix with dimensions compatible with those of F and X. When A is actually a vector, A = a, we have:

           [ e1 ]
d(Xa)/dX = [ ⋮  ] aᵀ,                                                     (A.5)
           [ eM ]

and similarly, d(aᵀX)/dX = a [e1ᵀ ⋯ eNᵀ]. If n = 1 and F is vector-valued, F = f = [f1,…, fm]ᵀ, the l-th block-row of d(Xf)/dX is X[∂f/∂xl1 … ∂f/∂xlN] + el fᵀ and the c-th block-column of d(fᵀX)/dX is [∂f/∂x1c … ∂f/∂xMc]ᵀ X + f ecᵀ. Furthermore:

[d(fᵀf)/dX]lc = 2 fᵀ(∂f/∂xlc).                                            (A.6)
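These identities are easy to check numerically. A small NumPy sketch (illustrative only, not part of the Matlab package) verifying [d(AX)/dX]lc = A el ecᵀ from (A.4) against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
X = rng.standard_normal((3, 2))
l, c, h = 1, 0, 1e-7

# Analytic (l, c)-block of d(AX)/dX: A e_l e_c^T
e_l = np.zeros(3); e_l[l] = 1.0
e_c = np.zeros(2); e_c[c] = 1.0
analytic = A @ np.outer(e_l, e_c)

# Finite difference: perturb element (l, c) of X
Xp = X.copy()
Xp[l, c] += h
fd = (A @ Xp - A @ X) / h   # exact up to rounding, since AX is linear in X

print(np.allclose(analytic, fd, atol=1e-6))
```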

A.2 Derivative with respect to a symmetric matrix
When X is square and symmetric and its upper triangle duplicates its lower triangle, the correct derivatives are obtained by using the full X in (A.4), and assigning in the final result:

(l, c)-block ← (l, c)-block + (c, l)-block  (for all l, c with l > c)     (A.7)

(only the lower block-triangle is relevant). To take an example, let n = 2, x21 = x12 and consider the calculation of dX²/dx21. By (A.2) and (A.4),

dX²/dx21 = Xe2e1ᵀ + e2e1ᵀX = [ x12        0   ]
                             [ x11 + x22  x12 ]

and

dX²/dx12 = Xe1e2ᵀ + e1e2ᵀX = [ x21  x11 + x22 ]
                             [ 0    x21       ]

Adding these matrices and letting x denote the duplicated element of X (i.e. x = x12 = x21) gives the matrix

[ 2x         x11 + x22 ]
[ x11 + x22  2x        ]

which is easily verified to be the derivative of X² with respect to x. It would be possible to make the calculation of derivatives with respect to symmetric matrices more efficient by developing appropriate formulae analogous to (A.4), but the complications would probably be significant and the pay-back marginal in the present setting.

A.3 Derivative of the solution to linear equations

If the vector y is given by Ay = b then it follows from (A.3) that A(∂y/∂xlc) + A′lc y = ∂b/∂xlc, and ∂y/∂xlc is therefore given by solving the set of linear equations:

A(∂y/∂xlc) = ∂b/∂xlc − A′lc y.                                            (A.8)

We note that the factorization of A used to obtain y can be reused to obtain its derivative. Similarly, if the matrix F is given by AF = B then F′lc may be obtained by solving:

AF′lc = B′lc − A′lc F.                                                    (A.9)
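(A.8) can be checked with a short NumPy sketch (not from the report; np.linalg.solve stands in for reusing a stored factorization of A, and b is taken constant so ∂b/∂xlc = 0):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 4 * np.eye(4)   # well-conditioned example
b = rng.standard_normal(4)
y = np.linalg.solve(A, b)

# Differentiate y with respect to element (l, c) of A.  Here
# A'_lc = e_l e_c^T, so the right-hand side of (A.8) is -y[c] * e_l.
# In practice the factorization of A would be reused for this solve.
l, c = 2, 1
rhs = np.zeros(4)
rhs[l] = -y[c]
dy = np.linalg.solve(A, rhs)

# Finite-difference check
h = 1e-7
Ap = A.copy()
Ap[l, c] += h
fd = (np.linalg.solve(Ap, b) - y) / h

print(np.allclose(dy, fd, atol=1e-5))
```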

A.4 Derivative of Cholesky factorization
If S = LLᵀ is the Cholesky factorization of a symmetric matrix S it follows from (A.3) that S′lc = L(L′lc)ᵀ + L′lc Lᵀ. If S′lc, L and L′lc are partitioned as follows for a given k

       [ S1′          ]       [ L1          ]         [ L1′            ]
S′lc = [ s′ᵀ  s′kk    ],  L = [ uᵀ  lkk     ],  L′lc = [ u′ᵀ  l′kk      ]
       [ S2′  t   S3  ]       [ L3  v   L2  ]         [ L3′  v′   L2′  ]

then 2(uᵀu′ + lkk l′kk) = s′kk and L1u′ + L1′u = s′, so that

L1u′ = s′ − L1′u                                                          (A.10)

and

l′kk = (s′kk/2 − uᵀu′)/lkk.                                               (A.11)

These relations may be used iteratively for k = 1, 2, … to calculate L′lc line by line, with u′ obtained from (A.10) by forward substitution.
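A direct transcription of these recursions into NumPy (an illustrative sketch, verified against a finite difference; the Matlab code of [Jonasson 2006] is the reference implementation):

```python
import numpy as np

def chol_derivative(L, dS):
    """Given the Cholesky factor L of S = L L^T and a symmetric
    perturbation direction dS, compute dL row by row using the
    recursions corresponding to (A.10) and (A.11)."""
    n = L.shape[0]
    dL = np.zeros_like(L)
    for k in range(n):
        if k > 0:
            # (A.10): L1 u' = s' - L1' u, solved by forward substitution
            rhs = dS[k, :k] - dL[:k, :k] @ L[k, :k]
            dL[k, :k] = np.linalg.solve(L[:k, :k], rhs)
        # (A.11): l'_kk = (s'_kk / 2 - u^T u') / l_kk
        dL[k, k] = (dS[k, k] / 2 - L[k, :k] @ dL[k, :k]) / L[k, k]
    return dL

# Finite-difference check on a random SPD matrix (illustrative)
rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
S = B @ B.T + 4 * np.eye(4)
dS = rng.standard_normal((4, 4))
dS = dS + dS.T                     # symmetric perturbation direction
L = np.linalg.cholesky(S)
dL = chol_derivative(L, dS)

h = 1e-7
fd = (np.linalg.cholesky(S + h * dS) - L) / h
print(np.allclose(dL, fd, atol=1e-4))
```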

B. SOLUTION OF THE VECTOR YULE-WALKER EQUATIONS
In this appendix we consider the solution of the system of equations (2.10). Our approach closely resembles that given by [Mauricio 1997, eq. (6)], and in particular the system we solve is of the same order, namely r²p − r(r − 1)/2. However, Mauricio does not provide a derivation of the system, our notation is significantly different from his, and the system solved is not exactly the same (although it is equivalent). Therefore, we provide an explicit derivation in this appendix. Isolating Sp in the last equation of (2.10) and substituting into the first equation gives

S0 − (A1S1ᵀ + … + Ap−1Sp−1ᵀ) − (ApS0Apᵀ + ApS1ᵀAp−1ᵀ + … + ApSp−1ᵀA1ᵀ) = G0 + ApGpᵀ                    (B.1)

It is convenient to make use of the Kronecker product (A ⊗ B is a block matrix with (i, j)-block equal to aijB), the notation vec A for the vector consisting of all the columns of a matrix A placed one after another, and vech A for the columns of the lower triangle of A placed one after another. A useful property here is vec(ASBᵀ) = (B ⊗ A) vec S. Let si = vec Si, gi = vec Gi, and denote the k-th column of Ai by aik and the k-th unit vector by ek. Because S0 is symmetric, taking the transpose of (B.1) gives, with this notation,

(I − Ap ⊗ Ap)s0 − (A1 ⊗ I + Ap ⊗ Ap−1)s1 − … − (Ap−1 ⊗ I + Ap ⊗ A1)sp−1 = vec G0ᵀ + (Ap ⊗ I)gp          (B.2)
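The vec–Kronecker property on which (B.2) rests can be checked in a few lines of NumPy (illustrative only; note that vec stacks columns, i.e. Fortran order):

```python
import numpy as np

rng = np.random.default_rng(3)
A, B, S = (rng.standard_normal((3, 3)) for _ in range(3))

vec = lambda M: M.flatten(order="F")   # column-stacking vec operator

# Check vec(A S B^T) = (B kron A) vec S
lhs = vec(A @ S @ B.T)
rhs = np.kron(B, A) @ vec(S)
print(np.allclose(lhs, rhs))
```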

Furthermore, the equations with right hand sides G1,…, Gp−1 in (2.10) may be written as

s1 − Â1s0 − Ã2s1 − … − Ãpsp−1 = g1
s2 − Â1s1 − Â2s0 − Ã3s1 − … − Ãpsp−2 = g2
⋮
sp−1 − Â1sp−2 − … − Âp−1s0 − Ãps1 = gp−1                                  (B.3)

where Âi = I ⊗ Ai and Ãi is also an r² × r² sparse block-matrix (which cannot be represented using ⊗):

     [ ai1e1ᵀ ⋯ aire1ᵀ ]
Ãi = [   ⋮        ⋮    ]
     [ ai1erᵀ ⋯ airerᵀ ]

Together (B.2) and (B.3) provide pr² linear equations in the elements of s0,…, sp−1, but one can (and should) take into account that S0 is symmetric and that s0 therefore contains duplicated elements. Let Ŝ0 be a lower triangular matrix such that S0 = Ŝ0 + Ŝ0ᵀ (the diagonal elements of Ŝ0 are halved compared with S0) and let ŝ0 = vech S0 (the r(r + 1)/2-vector obtained by removing the duplicated elements from s0). Let also J be an r² × r(r + 1)/2 matrix such that post-multiplication with it removes columns r + 1, 2r + 1, 2r + 2, 3r + 1, 3r + 2, 3r + 3, …, r² − 1. Then a term of the type Âis0 in (B.3) may be rewritten:

Âis0 = (I ⊗ Ai)s0 = vec(AiS0Iᵀ) = vec(AiŜ0 + AiŜ0ᵀ) = (Âi + Ãi) vec Ŝ0 = (Âi + Ãi)Jŝ0.                 (B.4)

For (B.2) it is not difficult to verify that

(Ap ⊗ Ap)s0 = (Ap ⊗ Ap + ÃpÂp) vec Ŝ0 = (Ap ⊗ Ap + ÃpÂp)Jŝ0,

and furthermore that s0 = D vec Ŝ0, where D is diagonal with dii = 2 when i corresponds to a diagonal element of S0 (i.e. i = 1, r + 2, 2r + 3, …) and dii = 1 otherwise. The matrix ÃpÂp is a block-matrix with (i, j)-block equal to apjapiᵀ. Finally, the upper triangle of (B.2) should be removed. These modifications result in r²p − r(r − 1)/2 equations in the same number of unknowns, the elements of ŝ0, s1,…, sp−1; (B.2) becomes

Jᵀ(D − Ap ⊗ Ap − ÃpÂp)Jŝ0 − ∑i=1,…,p−1 Jᵀ(Ai ⊗ I + Ap ⊗ Ap−i)si = Jᵀ(vec G0ᵀ + (Ap ⊗ I)gp)             (B.5)

and (B.3) is modified using (B.4).

C. TIME SERIES SIMULATION
Simulation of VARMA time series has many applications, e.g. creating test data for modelling methods, analyzing such methods, and forecasting with fitted models. Given values of εt and xt for t = 1,…, h, where h = max(p, q), one may draw εt from N(0, Σ) for t = h + 1, h + 2,… and apply (2.2) and (2.1) to obtain simulated values of xt for t > h. If the starting values are not given, one may start with arbitrary values, for example zeros, and, after simulating, discard an initial segment to avoid spin-up effects. This is for example done in the routine arsim of [Schneider and Neumaier 2001]. For processes with short memory this procedure works well and the discarded segment need not be very long, but for processes that are nearly non-stationary it may take a long time before they reach their long-term behaviour, it is difficult to decide the required length of the initial segment, and the extra initial simulations may be costly. These drawbacks may be avoided by drawing the values used to start the simulation from the correct distribution.

Let x′ = (x1ᵀ,…, xhᵀ)ᵀ have mean µ′ and covariance matrix S′, let ε′ = (ε1ᵀ,…, εhᵀ)ᵀ have covariance matrix Σ′, and let C′ = cov(x′, ε′). S′, Σ′ and C′ are given by (2.7) and (2.8) and the solution of the vector-Yule-Walker equations (2.10), applying (3.5) if necessary, and µ′ is the rh-vector (µᵀ,…, µᵀ)ᵀ. Starting values for x′ may be drawn from N(µ′, S′), and starting values for ε′ (which are needed if there are moving average terms) may be drawn from the conditional distribution of ε′ | x′, which is normal with expectation C′ᵀS′⁻¹(x′ − µ′) and covariance matrix Σ′ − C′ᵀS′⁻¹C′. This conditional distribution may also be used to draw ε′ when x1,…, xh are given and ε1,…, εh are unknown, for example when forecasting with a moving average model. This procedure has been implemented in [Jonasson 2006].
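In outline, the two draws can be sketched as follows in Python/NumPy (randomly generated positive definite blocks stand in for S′, Σ′ and C′; the Matlab routines of [Jonasson 2006] are the actual implementation):

```python
import numpy as np

def draw_start(mu_p, S_p, Sig_p, C_p, rng):
    """Draw x' ~ N(mu', S'), then eps' from the conditional
    distribution eps' | x', which is
    N(C'^T S'^-1 (x' - mu'), Sig' - C'^T S'^-1 C')."""
    x = rng.multivariate_normal(mu_p, S_p)
    w = np.linalg.solve(S_p, x - mu_p)               # S'^-1 (x' - mu')
    cond_mean = C_p.T @ w
    cond_cov = Sig_p - C_p.T @ np.linalg.solve(S_p, C_p)
    eps = rng.multivariate_normal(cond_mean, cond_cov)
    return x, eps

# Made-up joint covariance [[S', C'], [C'^T, Sig']], positive definite
rng = np.random.default_rng(4)
m = 4
M = rng.standard_normal((2 * m, 2 * m))
J = M @ M.T + 0.1 * np.eye(2 * m)
S_p, C_p, Sig_p = J[:m, :m], J[:m, m:], J[m:, m:]
x, eps = draw_start(np.zeros(m), S_p, Sig_p, C_p, rng)
print(x.shape, eps.shape)
```

The conditional covariance is a Schur complement of a positive definite matrix, so it is itself a valid covariance matrix.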

D. DETERMINANT OF A LOW RANK UPDATE
The economical evaluation of the determinant of the covariance matrix of the observations in the missing value case, described at the end of Section 3.1, is based on the following theorem. As we have been unable to locate a proof of this useful fact in the published literature, we include it here for completeness. An immediate consequence of the theorem is that the determinant of a low rank update of an arbitrary matrix M may often be evaluated efficiently using det(M + UVᵀ) = det M · det(I + VᵀM⁻¹U), in this way complementing the Sherman-Morrison-Woodbury formula.

Theorem. If A is m × n, B is n × m, and Im and In are the m-th and n-th order identity matrices, then det(Im + AB) = det(In + BA).

Proof. Let C and D be m × m and n × n invertible matrices such that

CAD = [ Ik  0 ]
      [ 0   0 ]

and let

D⁻¹BC⁻¹ = [ B1  B2 ]
          [ B3  B4 ]

be a partitioning with B1 a k × k matrix. Then

det(Im + AB) = det( C⁻¹C + C⁻¹ [Ik 0; 0 0] D⁻¹D [B1 B2; B3 B4] C )
             = det(C⁻¹) det( Im + [Ik 0; 0 0][B1 B2; B3 B4] ) det C
             = det [ Ik + B1  B2    ]
                   [ 0        Im−k  ]
             = det(Ik + B1)
             = det [ Ik + B1  0     ]
                   [ B3       In−k  ]
             = det( In + D [B1 B2; B3 B4] C C⁻¹ [Ik 0; 0 0] D⁻¹ )
             = det(In + BA).

The matrices C and D may, for example, be obtained from the singular value decomposition of A [Golub and Van Loan 1983].
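A quick numerical illustration of the theorem and of the resulting determinant formula (NumPy, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
m, n = 5, 2
A = rng.standard_normal((m, n))
B = rng.standard_normal((n, m))

# det(I_m + A B) = det(I_n + B A): only an n x n determinant is needed
lhs = np.linalg.det(np.eye(m) + A @ B)
rhs = np.linalg.det(np.eye(n) + B @ A)
print(np.allclose(lhs, rhs))

# Low-rank update of a general (invertible) matrix M
M = rng.standard_normal((m, m)) + 5 * np.eye(m)
U = rng.standard_normal((m, n))
V = rng.standard_normal((m, n))
full = np.linalg.det(M + U @ V.T)
fast = np.linalg.det(M) * np.linalg.det(np.eye(n) + V.T @ np.linalg.solve(M, U))
print(np.allclose(full, fast))
```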

ACKNOWLEDGEMENT We wish to thank Jón Kr. Arason for help with proving the theorem of Appendix D.

REFERENCES
Ansley, C. F. 1979. An algorithm for the exact likelihood of a mixed autoregressive-moving average process. Biometrika 66, 1, 59–65.
Brockwell, P. J. and Davis, R. A. 1987. Time Series: Theory and Models. Springer-Verlag, New York.
Golub, G. H. and Van Loan, C. F. 1983. Matrix Computations. North Oxford Academic, Oxford.
Harvey, A. C. and Phillips, G. D. A. 1979. Maximum likelihood estimation of regression models with autoregressive-moving average disturbances. Biometrika 66, 1, 49–58.
Jonasson, K. 2006. Matlab programs for complete and incomplete data exact VARMA likelihood and its gradient. Report VHI-02-2006, Engineering Research Institute, University of Iceland.
Jones, R. H. 1980. Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics 22, 3, 389–395.
Ljung, G. M. 1989. A note on the estimation of missing values in time series. Communic. Statist. – Simul. Comput. 18, 2, 459–465.
Ljung, G. M. and Box, G. E. P. 1979. The likelihood function of stationary autoregressive-moving average models. Biometrika 66, 2, 265–270.
Luceño, A. 1994. A fast algorithm for the exact likelihood of stationary and nonstationary vector autoregressive-moving average processes. Biometrika 81, 3, 555–565.
Mauricio, J. A. 1997. Algorithm AS 311: The exact likelihood function of a vector autoregressive moving average model. Appl. Statist. 46, 1, 157–171.
Mauricio, J. A. 2002. An algorithm for the exact likelihood of a stationary vector autoregressive moving average model. J. Time Series Analysis 23, 4, 473–486.
Nel, D. G. 1980. On matrix differentiation in statistics. South African Statistical J. 15, 2, 137–193.
Neumaier, A. and Schneider, T. 2001. Estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans. Math. Softw. 27, 1, 27–57.


Penzer, J. and Shea, B. L. 1997. The exact likelihood of an autoregressive-moving average model with incomplete data. Biometrika 84, 4, 919–928.
Phadke, M. S. and Kedem, G. 1978. Computation of the exact likelihood function of multivariate moving average models. Biometrika 65, 3, 511–519.
Schneider, T. and Neumaier, A. 2001. Algorithm 808: ARfit – A Matlab package for the estimation of parameters and eigenmodes of multivariate autoregressive models. ACM Trans. Math. Softw. 27, 1, 58–65.
Shea, B. L. 1989. Algorithm AS 242: The exact likelihood of a vector autoregressive moving average model. Appl. Statist. 38, 1, 161–204.
Sherman, J. and Morrison, W. J. 1950. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann. Math. Statist. 21, 124–127.
Siddiqui, M. M. 1958. On the inversion of the sample covariance matrix in a stationary autoregressive process. Ann. Math. Statist. 29, 585–588.
Woodbury, M. A. 1950. Inverting modified matrices. Memorandum Report 42, Statistical Research Group, Princeton University, Princeton, NJ.
