DIVIDING BY 4
An efficient algorithm for the optimal disaggregation of annual data into quarterly data
Jan Jacobs
Siep Kroonenberg
Tom Wansbeek∗
May 1992
Abstract

This paper deals with the construction of quarterly variables when only annual observations are available. The pertaining method of Boot, Feibes and Lisman (1967) is further elaborated. They propose to minimize the sum of squared first or second differences of the unknown quarterly series subject to an appropriate constraint. Intuitively, their method comes down to finding the smoothest curve satisfying the constraint. Solving the minimization problem is the same as solving a system of linear equations for the values of the unknown quarterly observations. The solution, the vector of unknown quarterly observations, can be expressed as the product of a matrix and the vector of annual observations. An algorithm can be devised that computes the quarterly series in linear time and linear memory. An explicit expression for the matrix that converts annual totals into quarterly figures is presented too. We find a close connection between the method of Boot, Feibes and Lisman and Generalized Least Squares. Some examples give an indication of the method's quality.

∗ University of Groningen, Department of Economics, P.O. Box 800, 9700 AV Groningen, The Netherlands. Preliminary versions of this paper were presented at a seminar at the University of Groningen, ECOZOEK 1989, Wageningen and the 1989 European Meeting of the Econometric Society, Munich. The authors would like to thank Paul Bekker, Willem Buiter, Heinz Neudecker and Ken Wallis for their comments on earlier versions of the paper.
1 Introduction
In macroeconomic model building, the frequency at which data are collected is an important consideration. When one wants to build a quarterly model of an economy one needs quarterly observations for all variables of the theoretical model. Normally, it is not practical for the modeller to collect all data himself. Data sets are frequently obtained from official authorities, e.g., in the Dutch case, the Central Bureau of Statistics. For some key macroeconomic variables, only annual observations are available. For example, for the Netherlands no quarterly data are available on aggregate employment and aggregate labour supply. The unknown quarterly observations have to be constructed somehow before such a variable can be used in an empirical model.

There are several methods to approximate unknown quarterly time series. The most naive method is dividing the annual observations by 4. This method is only applicable in the case of flow variables, and even for this class of variable one can think of better ways of handling the problem of missing observations.

Two approaches can be distinguished in the literature. Firstly, in the data-based approach the missing quarterly observations are obtained from the annual totals of the variable by some kind of interpolation procedure, sometimes with the use of related series when available. Structural information on the generating process of the time series is used in a rudimentary way at best. Examples are Boot, Feibes and Lisman (1967), Denton (1971), Cohen, Müller and Padberg (1971) and, recently, Stram and Wei (1986b), who arrive at results closely related to ours, and Al-Osh (1989). Friedman (1962), Ginsburgh (1973) and Chow and Lin (1971 and 1976) use related series for the interpolation.

Secondly, the model-based approach consists of specifying a structural model containing both the variable(s) with missing observations and other variables.
The missing observations are estimated together with the parameters of the structural model. Examples of the model-based approach are Kmenta (1981) and Dempster, Laird and Rubin (1977). More references can be found in Nijman (1985). In general, one might say that using reliable extra information when available leads to better results in dynamic macroeconomic models in terms of efficiency (see Palm and Nijman (1984)).

In this paper the method of Boot, Feibes and Lisman (1967) is further elaborated. Below, this method will be referred to as BFL. The motivation for doing this is twofold. First, the model-based approach, though preferable, may simply not be operational when the required additional information is not available. Second, there are interesting algebraic structures behind BFL which have not been fully explored and which can be exploited to yield efficient computational procedures.

BFL proposed to minimize the sum of squared first or second differences of the unknown quarterly series of observations, subject to an appropriate constraint. Because they considered flows only, their constraint was that the sum of the quarterly observations in each year should be equal to the annual value. Intuitively, minimizing the sum of squared first or second differences comes down to finding the smoothest curve satisfying the constraint.

Regarding the method of BFL two aspects are important. In the first place, a distinction has to be made between stocks and flows. Flow variables are aggregated over time by summing; this leads to the above constraint. For stocks another constraint or aggregation criterion is applicable. Several criteria may be used. Sometimes the annual value of a stock is equal to the value either of the first quarter or of the last quarter. In other situations, the annual value is taken to be the average of the quarterly values, which would make the stock case very similar to the flow case. Gandolfo (1980) derives a solution for the stock case from the solution for the flow case. His method for stocks uses the fact that taking first differences transforms stocks into flows (ignoring price changes).

In the second place, one must decide which sum of squares to minimize. Minimizing the sum of squared first or of squared second differences (under the appropriate constraint) leads to different results. BFL suggest taking second differences.

We performed some experiments in order to shed some light on these points. Several known series of quarterly observations from the database of MORKMON, the quarterly macroeconometric model of the Dutch central bank, are aggregated by different methods into series of annual totals. The quarterly observations were reconstructed using the method of BFL with first, second and third differences. The series of computed values are compared with the original series both graphically and statistically. We found flows to be more difficult to approximate than stocks.
Furthermore, the evidence suggests that it might be even better to minimize the sum of squared third differences, although the gain is slight.

This paper also serves another purpose. The minimization problem of the method of BFL can be described as a system of linear equations with the quarterly observations as unknowns. The solution, i.e., the vector of unknown quarterly observations, can be expressed as the product of a matrix and the vector of annual observations. Solving a linear system of equations or performing a matrix multiplication generally takes cubic time: if the vector of annual totals increases by a factor $t$, computing time increases by a factor $t^3$. (The theoretical lower bound on the exponent, as yet unknown, lies somewhere between 2 and 3.) In addition, quadratic memory is required. Due to the special structure of the minimization problem one can construct an algorithm that computes the series of quarterly observations in linear time and linear memory. The advantages of computations only taking linear time and linear memory are evident. We shall also present an explicit expression for the matrix that converts the annual totals into quarterly observations. Unfortunately, when implementing the algorithm on a computer, we found a numerical fly in the ointment. This will be pointed out in the appendix.

The BFL method to obtain series of quarterly observations has been widely used, see e.g. Driehuis (1972) or Jager (1981). The results of the method are considered quite satisfactory. However, the structure of the matrix that converts annual into quarterly series has not received full attention. We shall find a connection between the BFL approximation method and Generalized Least Squares.

There are two disadvantages to the method of BFL. Firstly, BFL assumes that the $d$-th difference of the series with missing observations follows a white noise process. In other words, the series to be disaggregated is assumed to be autoregressive. Stram and Wei (1986b), who take GLS as their starting point, allow more general processes. Secondly, the entire series is used for each disaggregated value. Al-Osh (1989) uses Kalman filtering techniques to circumvent this problem.

The organization of the paper is as follows. Section 2 gives a formal definition of the optimization problem in terms of matrices. The optimization problem is expressed as a system of linear equations with the missing quarterly data as unknowns. Section 3 presents the solution matrix, i.e. the matrix that converts series of annual aggregates into series of quarterly observations. Computational details can be found in the appendix. In order to give an impression of the performance of the BFL method, some examples are given in section 4. The final section of the paper contains concluding remarks.
2 The optimization problem
Let $v = (v_i)_{i=0}^{T-1}$ be a time series of annual data. For a time series $u = (u_i)_{i=0}^{qT-1}$ with $q$ evenly spaced observations per year, the derived series of annual data can be defined as

\[ \Bigl( \sum_{j=0}^{q-1} x_j u_{iq+j} \Bigr)_{i=0}^{T-1}, \]

where $x$ is a suitable $q$-vector satisfying $x'\iota_q \neq 0$, e.g., $x = \iota_q = (1\ 1\ 1\ \cdots)'$ if $v$ is a flow variable, or $x = e_1 = (1\ 0\ 0\ \cdots)'$ if $v$ is a stock variable and the annual value of the stock equals the stock in the first subperiod. Note that if the annual value of a stock variable is defined as the average of the quarterly figures, the flow form may be used. Maximizing smoothness means minimizing the sum of squared $d$-th differences, or

\[ \sum_{i=d}^{qT-1} \bigl( (1-L)^d u_i \bigr)^2 \tag{2.1} \]
where $L$ is the lag operator. The disaggregated series is determined on the basis of an autoregressive model for the disaggregate series: $(1-L)^d u_i = a_i$, where $a_i$ is a series of white noise errors. Estimates of the disaggregates are based on the entire aggregate series. Rewritten in matrix notation the minimization problem becomes:

\[ \min_u \; u'RR'u \quad \text{subject to} \quad Z'u = v \tag{2.2} \]

where $R'$ is the $(qT-d) \times qT$ matrix which maps the $qT$-vector $u$ to the $(qT-d)$-vector of its $d$-th differences, e.g., for $d = 1$

\[ R' = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix}, \]

and $Z = I_T \otimes x$ for a given $q$-vector $x$ with $x'\iota \neq 0$. To give an example, for $q = 4$ and $x = \iota_q = (1, 1, \cdots)'$ (the flow case), we have

\[ Z' = \begin{pmatrix} 1\ 1\ 1\ 1 & & & \\ & 1\ 1\ 1\ 1 & & \\ & & \ddots & \\ & & & 1\ 1\ 1\ 1 \end{pmatrix}. \]

The Lagrange function following from (2.2) is $u'RR'u + \mu'(Z'u - v)$ or, written out for $d = 1$:

\[ \sum_{i=1}^{qT-1} (u_i - u_{i-1})^2 + \sum_{i=0}^{T-1} \mu_i \Bigl( \sum_{j=0}^{q-1} x_j u_{iq+j} - v_i \Bigr). \]
Setting the first derivatives with respect to $u_i$, $i = 1, \cdots, qT$ and with respect to $\mu_i$, $i = 1, \cdots, T$ equal to 0, dividing the result for the $u$-derivatives by 2 and replacing $\mu$ with $\lambda = \mu/2$ we get

\[ \begin{pmatrix} D & Z \\ Z' & 0 \end{pmatrix} \begin{pmatrix} u \\ \lambda \end{pmatrix} = \begin{pmatrix} 0 \\ v \end{pmatrix} \tag{2.3} \]

where $D = RR'$, e.g., for $d = 1$

\[ D = \begin{pmatrix} 1 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 1 \end{pmatrix}. \]

Note that $D = RR'$ is symmetric positive semidefinite and that $Z$ is of full column rank $T$. The system of equations (2.3) has to be solved for the $qT$ unknown values of the series $u = (u_i)_{i=0}^{qT-1}$.
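As an illustration only, system (2.3) can be assembled and solved densely for a small example. The sketch below is not the linear-time algorithm of the appendix: it takes cubic time and quadratic memory. The function names `diff_matrix` and `bfl_dense` and the sample annual figures are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np

def diff_matrix(n, d):
    """Return the (n - d) x n matrix that maps a series to its d-th differences."""
    m = np.eye(n)
    for _ in range(d):
        m = m[1:] - m[:-1]          # one more round of first differencing
    return m

def bfl_dense(v, x, d=2):
    """Solve system (2.3) directly; for small illustrations only."""
    v = np.asarray(v, dtype=float)
    x = np.asarray(x, dtype=float)
    T, q = len(v), len(x)
    n = q * T
    Rt = diff_matrix(n, d)                      # differencing matrix
    D = Rt.T @ Rt                               # smoothness penalty matrix
    Z = np.kron(np.eye(T), x.reshape(-1, 1))    # aggregation matrix, n x T
    H = np.block([[D, Z], [Z.T, np.zeros((T, T))]])
    rhs = np.concatenate([np.zeros(n), v])
    return np.linalg.solve(H, rhs)[:n]          # drop the multipliers

annual = [100.0, 120.0, 130.0]                  # made-up annual figures
flow = bfl_dense(annual, np.ones(4), d=2)       # quarters sum to annual totals
stock = bfl_dense(annual, [1, 0, 0, 0], d=2)    # first quarter equals annual value
```

In the flow case the four quarters of each year add up to the annual figure; in the first-period stock case every fourth value reproduces it.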
3 The solution matrix
3.1 Version 1
One can solve the linear system of equations (2.3) by inverting the $(q+1)T \times (q+1)T$ matrix

\[ H = \begin{pmatrix} D & Z \\ Z' & 0 \end{pmatrix}. \]

Standard partitioning techniques transform the problem into the inversion of a $(T+d) \times (T+d)$ matrix (see Cohen, Müller and Padberg (1971)). Using the typical structure of $H$, we shall find a matrix expression for the upper right-hand corner of $H^{-1}$, which is all we need from $H^{-1}$. For the case $d = 2$, the appendix gives an algorithm for the corresponding matrix multiplication, applied to a vector, which requires linear time and space. A large part of the argument is valid for general $d$. Quite possibly, the entire argument can be generalized to arbitrary $d$. However, we expect that for higher $d$ a linear algorithm would become excessively complex. Balestra (1978) provides the key for finding an explicit expression for the upper right-hand corner of the matrix $H^{-1}$:
Theorem 3.1 Let

\[ H = \begin{pmatrix} D & Z \\ Z' & 0 \end{pmatrix} \]

where $D$ is a symmetric positive semidefinite $n \times n$ matrix and $Z$ an $n \times r$ matrix of full column rank (fcr). Let $X$ be an $n \times (n-r)$ matrix satisfying $X'X = I_{n-r}$ and $XX' = I_n - Z(Z'Z)^{-1}Z'$. $X$ can be thought of as the orthogonal complement of $Z$. Let $D = RR'$ with $R$ of fcr. Assume that $[R, Z]$ is of fcr. Then $(X'DX)^{-1}$ and $H^{-1}$ exist. Writing

\[ H^{-1} = \begin{pmatrix} H_1 & H_2 \\ H_2' & H_4 \end{pmatrix}, \qquad V = X(X'DX)^{-1}X', \qquad Q = Z(Z'Z)^{-1}, \]

we have

\[ H_1 = V, \qquad H_2 = -VDQ + Q, \qquad H_4 = -Q'DQ + Q'DVDQ. \]
Proof It is easily seen that the conditions of $[R, Z]$ having fcr, $R'X$ having fcr and $X'DX$ being non-singular are all equivalent. Observing that $Z'V = 0$, $Z'Q = I$ and $ZQ' = QZ' = I - XX'$, one can easily verify the expressions for $H_1$, $H_2$ and $H_4$. ✷

Geometrically, both $Z(Z'Z)^{-1}Z'$ and $XX' = I - Z(Z'Z)^{-1}Z'$ are projections. In our case, $X$ is a $qT \times (q-1)T$ matrix. For flows, the linear combinations of the columns of $X$ are exactly the series with zero year totals; for stocks, the series with zeroes in, for example, each first subperiod. Note that $X$ is not unique. Now we shall show that in our problem $[R, Z]$ has fcr and that theorem 3.1 applies. This will lead to the first version of the solution matrix, which is the main result of this subsection.

Proposition 3.2 Let $R'$ be the $(n-d) \times n$ matrix which maps an $n$-vector $v$ onto the $(n-d)$-vector $\bigl((1-L)^d v_i\bigr)_{i=d}^{n-1}$ of $d$-th differences of $v$, where $d < n$. Let, for $0 \le k \le d-1$, $w^k = (i^k)_{i=0}^{n-1}$. Then $R'u = 0$ iff $u$ is a linear combination of $w^0, \ldots, w^{d-1}$.

Proof Linear combinations of $w^0, \ldots, w^{d-1}$ correspond to polynomials in $i$ of degree $< d$. Therefore, they are zero (have coefficients 0) whenever they have $d$ or more zeroes. This proves that $w^0, \ldots, w^{d-1}$ are linearly independent. Since $R$ has $n-d$ linearly independent columns, the subspace of vectors $\perp R$ has dimension $d$. Therefore, it suffices to prove that, for $0 \le k \le d-1$, $w^k \perp R$, or that $(1-L)^d w^k = 0$. This is easily done by induction on $d$. ✷

Lemma 3.3 Let $d \le T$. Let $n = qT$ and let the $n \times (n-r)$ matrix $X$ (with $r = T$) be an orthogonal complement of $Z$, i.e. $X$ satisfies $X'X = I_{n-r}$ and $XX' = I_n - Z(Z'Z)^{-1}Z'$. Then $X'DX$ is non-singular and Theorem 3.1 applies.

Proof It suffices to show that $[R, Z]$ is of fcr. Assume that for some $n$-vector $\mu$, $R'\mu = 0$ and $Z'\mu = 0$. We must show that $\mu = 0$. Since $R'\mu = 0$, according to the previous proposition we can write

\[ \mu = (\mu_i)_{i=0}^{n-1} = \bigl( c_0 + c_1 i + \cdots + c_{d-1} i^{d-1} \bigr)_{i=0}^{n-1}. \]

Define

\[ C(\tau) = \sum_{i=0}^{q-1} x_i \bigl( c_0 + c_1(\tau+i) + \cdots + c_{d-1}(\tau+i)^{d-1} \bigr). \]

Then $Z'\mu = 0$ is equivalent to $C(\tau) = 0$, $\tau = 0, q, 2q, \cdots, (T-1)q$. $C(\tau)$ is a polynomial in $\tau$ of degree $d-1$, with $T$ different zeroes, and therefore all coefficients must be 0. Solving successively for $c_{d-1}, c_{d-2}, \cdots, c_0$, using that $x'\iota = x_0 + \cdots + x_{q-1} \neq 0$, one finds that all $c_j$ must be 0. This proves that $\mu = 0$. ✷

For the sake of elegance, Theorem 3.4 presents not the solution matrix, but a simple transformation thereof. The actual solution matrix can be obtained by postmultiplying the matrix expression given in Theorem 3.4 by an arbitrary right inverse of $Z'$.
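The rank condition of Lemma 3.3 can be checked numerically for small cases before it is used. The following is a minimal sketch, assuming NumPy; the helper names `diff_matrix` and `rank_of_R_and_Z` are hypothetical. It stacks the transposed differencing matrix next to the aggregation matrix and reports the numerical rank, which should equal $n = qT$ whenever $d \le T$.

```python
import numpy as np

def diff_matrix(n, d):
    """Return the (n - d) x n matrix that maps a series to its d-th differences."""
    m = np.eye(n)
    for _ in range(d):
        m = m[1:] - m[:-1]
    return m

def rank_of_R_and_Z(x, T, d):
    """Numerical rank of [R, Z] together with n = qT."""
    x = np.asarray(x, dtype=float)
    q = len(x)
    n = q * T
    R = diff_matrix(n, d).T                      # n x (n - d)
    Z = np.kron(np.eye(T), x.reshape(-1, 1))     # n x T
    return np.linalg.matrix_rank(np.hstack([R, Z])), n

# Flow and first-period stock aggregation, d = 1, 2, 3 with T = 5 years.
results = {}
for name, x in (("flow", [1, 1, 1, 1]), ("stock", [1, 0, 0, 0])):
    for d in (1, 2, 3):
        results[(name, d)] = rank_of_R_and_Z(x, T=5, d=d)
```

With $d > T$ (for instance $d = 3$, $T = 2$) the stacked matrix has fewer than $n$ columns, so the rank necessarily falls short of $n$ and the lemma's condition cannot hold.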
Theorem 3.4 The matrix $I - X(X'DX)^{-1}X'D$ transforms a series $y$ of quarterly data into an 'optimal' series $u$ of quarterly data satisfying $Z'u = Z'y$.

Proof Let $v = Z'y$ be a series of annual data and $u$ the 'optimal' series of constructed quarterly data. From (2.3) and theorem 3.1 we have

\[ \begin{aligned} u &= H_2 Z'y \\ &= \bigl[ I - X(X'DX)^{-1}X'D \bigr] Z(Z'Z)^{-1}Z'y \\ &= \bigl[ I - X(X'DX)^{-1}X'D \bigr] (I - XX')\, y \\ &= \bigl[ I - X(X'DX)^{-1}X'D \bigr] y \end{aligned} \tag{3.1} \]

✷

Note that proposition 3.2, lemma 3.3 and theorem 3.4 impose no restriction on $q$. As regards the interpretation of expression (3.1) in the context of Generalized Least Squares, we note that

\[ D^+ = R(R'R)^{-2}R' \]

with $D^+$ the Moore-Penrose inverse of $D$. So equation (3.1) implies that the solution $u$ is the vector of residuals in a Generalized Least Squares model $y = X\beta + \varepsilon$, with $E\varepsilon\varepsilon' = \sigma^2 D^+$.
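Theorem 3.4 lends itself to a quick numerical cross-check on a small flow example. The sketch below is illustrative only and assumes NumPy; the helper `diff_matrix`, the variable names and the random test series are not from the paper. It builds one orthonormal complement $X$ of $Z$ from a complete QR factorisation, applies the conversion matrix to an arbitrary quarterly series, and compares the result with a direct solution of system (2.3).

```python
import numpy as np

def diff_matrix(n, d):
    """Return the (n - d) x n matrix that maps a series to its d-th differences."""
    m = np.eye(n)
    for _ in range(d):
        m = m[1:] - m[:-1]
    return m

q, T, d = 4, 5, 2
n = q * T
x = np.ones(q)                                   # flow case
Z = np.kron(np.eye(T), x.reshape(-1, 1))         # n x T
Rt = diff_matrix(n, d)
D = Rt.T @ Rt

# Orthonormal basis X of the orthogonal complement of Z, taken from a
# complete QR factorisation (one arbitrary choice; X is not unique).
Q_full, _ = np.linalg.qr(Z, mode="complete")
X = Q_full[:, T:]                                # n x (n - T)

# Conversion matrix of Theorem 3.4: I - X (X'DX)^{-1} X'D
P = np.eye(n) - X @ np.linalg.solve(X.T @ D @ X, X.T @ D)

rng = np.random.default_rng(0)
y = rng.normal(size=n)                           # arbitrary quarterly series
u = P @ y

# Reference: solve system (2.3) with v = Z'y and keep the u-part.
H = np.block([[D, Z], [Z.T, np.zeros((T, T))]])
u_direct = np.linalg.solve(H, np.concatenate([np.zeros(n), Z.T @ y]))[:n]
```

Both routes should give the same series up to rounding, and the annual totals of `u` should match those of `y`.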
3.2 Version 2
We would rather rephrase Theorem 3.4 in terms of the matrix $Z = I_T \otimes x$, where $x$ is the aggregation criterion; in other words, we want to eliminate the matrix $X$ from Theorem 3.4. We shall succeed in this only up to a low-rank matrix. As a first step, we compute the rank of $R'Z$.

Proposition 3.5 If $d \le q$ then
(i) Rank$(R'Z) = T$ for non-flow variables;
(ii) Rank$(R'Z) = T - 1$ for flow variables.

Proof If $R'Z$ were fcr, then it would have rank $T$. Thus, it is sufficient to prove that $R'Zw = 0$ iff both $x$ and $w$ are constant vectors. The proof consists of a polynomial argument as in the proofs of proposition 3.2 and lemma 3.3. Assume $R'Zw = 0$. By proposition 3.2, there exist $\alpha_0, \cdots, \alpha_{d-1}$ such that

\[ (Zw)_i = \sum_{j=0}^{d-1} \alpha_j i^j \quad \text{for } i = 0, \cdots, qT-1 \]

or

\[ x_l w_i = \sum_{j=0}^{d-1} \alpha_j (qi+l)^j \quad \text{for } l = 0, \cdots, q-1 \text{ and } i = 0, \cdots, T-1. \]

Summation over $l$ gives

\[ \Bigl( \sum_{l=0}^{q-1} x_l \Bigr) w_i = \sum_{l=0}^{q-1} \sum_{j=0}^{d-1} \alpha_j (qi+l)^j \]

or

\[ w_i = \sum_{j=0}^{d-1} \beta_j i^j \]

for suitable $\beta_0, \cdots, \beta_{d-1}$ (recall that $\sum_{l=0}^{q-1} x_l \neq 0$). Summing over $i$ instead of $l$ we get

\[ \Bigl( \sum_{i=0}^{T-1} w_i \Bigr) x_l = \sum_{i=0}^{T-1} \sum_{j=0}^{d-1} \alpha_j (qi+l)^j \tag{3.2} \]

from which follows either

\[ \sum_{i=0}^{T-1} w_i = 0 \]

or

\[ x_l = \sum_{j=0}^{d-1} \gamma_j l^j \]

for suitable $\gamma_0, \cdots, \gamma_{d-1}$. In the first case, the right-hand side of (3.2) is a polynomial in $l$ of order $< q$ with at least $q$ zeroes. Therefore, all coefficients must be 0 and it follows that $\alpha_j = 0$ for $j = 0, \cdots, d-1$, and also that $Zw = 0$. Since $Z$ is fcr, $w$ is 0 too. In the second case we have

\[ (Zw)_{qi+l} = x_l w_i = \Bigl( \sum_{j=0}^{d-1} \gamma_j l^j \Bigr) \Bigl( \sum_{j=0}^{d-1} \beta_j i^j \Bigr) = \sum_{j=0}^{d-1} \alpha_j (qi+l)^j \tag{3.3} \]

Keeping $l$ fixed, the latter two expressions are polynomial expressions in $i$ which are equal for $i = 0, \cdots, T-1$. Therefore, their coefficients must be equal. Let $k < d$ be the highest index with $\alpha_k \neq 0$. Then the coefficient of $i^k$ is equal to both $\beta_k \sum_{j=0}^{d-1} \gamma_j l^j$ and $\alpha_k q^k$. Allowing $l$ to vary again, we have on the left side a polynomial in $l$ of order $d-1 < q$, on the right side a constant. These must be equal for $q$ different values of $l$ and are therefore identical. We conclude that $\gamma_j = 0$ for $j > 0$ and that $x$ is of the form $\gamma (1, 1, \cdots)'$: the flow case. Thus, (3.3) becomes

\[ x_l w_i = \gamma w_i = \gamma \sum_{j=0}^{d-1} \beta_j i^j = \sum_{j=0}^{d-1} \alpha_j (qi+l)^j \]

or

\[ w_i = \sum_{j=0}^{d-1} \beta_j i^j = \sum_{j=0}^{d-1} \alpha_j (qi+l)^j / \gamma \]

for $i = 0, \cdots, T-1$ and $l = 0, \cdots, q-1$. For fixed $i$, the left and middle expressions are constants, whereas the rightmost expression is a polynomial in $l$ of order $< d$. They coincide on at least $q$ different values for $l$. We conclude that the polynomial cannot have higher-order terms and that $\alpha_j = 0$ for $j > 0$. Therefore both $w$ and $x$ must be constant vectors. It is easily verified that for constant $w$ and $x$ we do indeed have $R'Zw = 0$. This completes the proof. ✷

$Z$ and $X$ are each other's orthogonal complement; $R'Z$ and $R'X$ are not. Instead, we define $S = (R'R)^{-1}R'Z$, which has a more tractable relation to $R'X$, as will be proved in the following lemma. Observe that rank$(S)$ = rank$(R'Z)$.

Lemma 3.6 $S = (R'R)^{-1}R'Z$ and $R'X$ contain each other's 'orthogonal complement', i.e., a vector which is orthogonal to all columns of one matrix is a linear combination of columns of the other.

Proof

\[ S'w = 0 \;\Leftrightarrow\; Z'R(R'R)^{-1}w = 0 \;\Rightarrow\; R(R'R)^{-1}w = Xz \;\Rightarrow\; w = R'Xz \]
\[ X'Rw = 0 \;\Rightarrow\; Rw = Zz \;\Rightarrow\; w = (R'R)^{-1}R'Zz = Sz \]

✷

The geometric relation between $S$ and $R'X$ is expressed in the following lemma.

Lemma 3.7 Let $W$ be an orthonormal matrix spanning the intersection of the spaces of $R'X$ and $S$. Then

\[ I_{qT-d} - R'X(X'DX)^{-1}X'R = S(S'S)^+S' - WW' \]

where $W$ is of rank $d$ in the non-flow case and $d-1$ in the flow case.

Proof Lemma 3.6 implies that $R'X$ and $S$ together span $\mathbb{R}^{qT-d}$. The rank of $R'X$ is $(q-1)T$; the rank of $S$ is $T$ in the non-flow case and $T-1$ in the flow case. The rank of $W$ equals the sum of their ranks minus $qT-d$. It is easily seen that both the left-hand side and the right-hand side are projections, i.e., are symmetric and idempotent. We must show that $S(S'S)^+S'R'X = WW'R'X$: let $u$ be a linear combination of the columns of $R'X$. Then $u$ can be written as $v + w$, where $v$ is a linear combination of columns of $S$, and $w \perp S$. Since $R'X$ contains the orthogonal complement of $S$, $w$ is in the space of $R'X$ and therefore $v = u - w$ is also in the space of $R'X$, and therefore in the space of $W$. Thus, $S(S'S)^+S'u = WW'u = v$. ✷
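The ranks stated in Proposition 3.5, on which the rank count in the proof above relies, can be inspected numerically for small cases. This is a rough sketch assuming NumPy; `diff_matrix` and `rank_RtZ` are hypothetical helper names, and the numerical rank returned by `numpy.linalg.matrix_rank` depends on its default tolerance.

```python
import numpy as np

def diff_matrix(n, d):
    """Return the (n - d) x n matrix that maps a series to its d-th differences."""
    m = np.eye(n)
    for _ in range(d):
        m = m[1:] - m[:-1]
    return m

def rank_RtZ(x, T, d):
    """Numerical rank of the d-th differences of the columns of Z = I_T (x) x."""
    x = np.asarray(x, dtype=float)
    n = len(x) * T
    Z = np.kron(np.eye(T), x.reshape(-1, 1))
    return np.linalg.matrix_rank(diff_matrix(n, d) @ Z)

T = 6
flow_ranks = [rank_RtZ([1, 1, 1, 1], T, d) for d in (1, 2, 3)]    # expected T - 1
stock_ranks = [rank_RtZ([1, 0, 0, 0], T, d) for d in (1, 2, 3)]   # expected T
```

The flow case loses exactly one dimension because a constant quarterly series, which is what a constant `w` produces there, has vanishing differences of every order.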
Next, the basic theorem of this subsection is presented, in which the expression $I - X(X'DX)^{-1}X'D$, the conversion matrix from theorem 3.4, is replaced by the sum of an expression in $Z$ and two low-rank terms.

Theorem 3.8 Let $T \ge d$ and $q \ge d$; let $A = R(R'R)^{-1}R'$ be the projection matrix associated with $R$ and let the orthonormal matrix $B$ satisfy $BB' = I - A$. Then the transformation of a series $y$ of quarterly data into an 'optimal' series $u$ satisfying $Z'u = Z'y$ can be written as

\[ u = \Bigl\{ D^+ \bigl[ Z(Z'D^+Z)^+ Z'A - RWW'R' \bigr] + BM \Bigr\}\, y \tag{3.4} \]

where $W$ is a $(qT-d) \times (d-1)$ matrix in the flow case and a $(qT-d) \times d$ matrix otherwise, and $M$ is a $d \times n$ matrix.

Proof In lemma 3.7 above, premultiply both sides of the equation with $R(R'R)^{-1}$ and postmultiply with $R'$. Write $D$ for $RR'$, $D^+$ for $R(R'R)^{-2}R'$, $S'S$ for $Z'D^+Z$, and $A$ for $R(R'R)^{-1}R'$. Then we get

\[ A \bigl[ I - X(X'DX)^{-1}X'D \bigr] = D^+ \bigl[ Z(Z'D^+Z)^+ Z'A - RWW'R' \bigr] \]

or, writing $I - BB'$ for $A$,

\[ I - X(X'DX)^{-1}X'D = D^+ \bigl[ Z(Z'D^+Z)^+ Z'A - RWW'R' \bigr] + BB' \bigl[ I - X(X'DX)^{-1}X'D \bigr]. \]

The last term can be rewritten as $BM$. Together with (3.1) this proves the theorem. ✷

In the appendix we shall derive an algorithm from this theorem which computes $u$ in linear time and memory from $y$.
4 BFL in practice
In this section we explore the quality of the BFL method in some practical experiments. Eight series were selected from the database of MORKMON, the quarterly model of the Dutch economy of the Dutch central bank (De Nederlandsche Bank (1984a)). The quarterly series are aggregated in different ways to obtain series of annual observations. From these annual values, quarterly series are computed using four smoothing criteria: parameter $d$ of equation (2.1) varies from 1 to 3. In addition, $d = 0$ is associated with dividing the annual value by 4 in the flow case, and repeating the annual value each quarter in the stock case. The derived series of quarterly figures are then compared with the original series.

The quality of the approximations can be expressed by means of goodness-of-fit statistics frequently used in the analysis of simulation results of macroeconometric models: Theil's inequality coefficient and the mean relative absolute error. The coefficient of determination which expresses the relationship between the constructed series and the original series is taken into consideration too. Theil's inequality coefficient, $U$, is defined as the positive square root of

\[ U^2 = \frac{\sum_{t=1}^{T} \bigl\{ (1-L)\hat{X}_t - (1-L)X_t \bigr\}^2}{\sum_{t=1}^{T} \bigl\{ (1-L)\hat{X}_t \bigr\}^2 + \sum_{t=1}^{T} \bigl\{ (1-L)X_t \bigr\}^2} \tag{4.1} \]

see Theil (1966), p. 28. $U$ takes on values between 0 and 1. Better approximations lead to lower values of $U$. Systematic forecast errors are not accounted for by $U$. For that reason we have included the Mean Relative Absolute Error, MRAE, which is defined as

\[ \mathrm{MRAE} = \frac{1}{T} \sum_{t=1}^{T} \left| \frac{\hat{X}_t - X_t}{X_t} \right| \tag{4.2} \]

When the constructed series of quarterly data is equal to the original series, the value of MRAE is 0.

In table 4.1 the definition of the variables is given. The series can be found in De Nederlandsche Bank (1984b). The sample period is the first quarter of 1970 up to and including the last quarter of 1979, i.e. 40 quarters or 10 years. The first two variables, c and ynnpm, are flows, the next four, Vps, Vcb, FA and Sb, are stocks and two series, al and ap, are constructed, the former by the method of Ginsburgh (1973) with government employment as related series and the latter by applying the method of BFL.
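The two goodness-of-fit statistics (4.1) and (4.2) can be sketched in a few lines. This is a hedged illustration assuming NumPy, not the code used for the experiments: the function names `theil_u` and `mrae` are hypothetical, the denominator of (4.1) is assumed to be the sum of the two sums of squared first differences, and the first differences are taken with `numpy.diff`, so the first observation has no difference of its own.

```python
import numpy as np

def theil_u(x_hat, x):
    """Inequality coefficient on first differences, as assumed for (4.1)."""
    d_hat = np.diff(np.asarray(x_hat, dtype=float))   # (1 - L) applied to X-hat
    d_obs = np.diff(np.asarray(x, dtype=float))       # (1 - L) applied to X
    num = np.sum((d_hat - d_obs) ** 2)
    den = np.sum(d_hat ** 2) + np.sum(d_obs ** 2)     # assumed form of the denominator
    return np.sqrt(num / den)

def mrae(x_hat, x):
    """Mean relative absolute error, equation (4.2)."""
    x_hat = np.asarray(x_hat, dtype=float)
    x = np.asarray(x, dtype=float)
    return np.mean(np.abs((x_hat - x) / x))

original = np.array([100.0, 102.0, 105.0, 103.0])     # made-up quarterly values
perfect_u = theil_u(original, original)
perfect_mrae = mrae(original, original)
scaled_mrae = mrae(1.1 * original, original)
```

A perfect reconstruction gives 0 for both statistics, while a reconstruction that is uniformly 10 per cent too high gives an MRAE of 0.1 as a fraction.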
Table 4.1 - Definition of series

series                                               symbol
private consumption                                  c
net disposable national income                       ynnpm
wealth of private sector                             Vps
wealth of central bank                               Vcb
financial assets (owned by the private sector)       FA
time and savings deposits                            Sb
employment in the enterprise sector                  al
labour force (population between 14 and 65 years)    ap
For all eight variables quarterly series are reconstructed in 8 different ways. The first four reconstructions assume that the variable is a flow. The series of annual values is calculated as the sum of the quarterly values. The series are constructed using d = 0, d = 1, d = 2 and d = 3 in equation (2.1) above. The second four values are constructed under the assumption that the variable is a stock and that the annual value of the variable equals the value of the first quarter. Here the value of d in equation (2.1) is varied too. The constructed series are compared with the original series.

Table 4.2 shows the values of the coefficient of determination (R-SQ), Theil's inequality coefficient (U) and the mean relative absolute error (MRAE) for the different approximations. The following notation is used in the table. The capital denotes the type of aggregation: F stands for flows and S for (first-period) stocks. The number indicates the value of parameter d in equation (2.1).

Table 4.2 - Quality of the approximations

variable               c                          ynnpm
approximation   R-SQ    U      MRAE        R-SQ    U      MRAE
1 F0            0.975   0.790  4.453       0.866   0.843  3.411
2 F1            0.993   0.576  2.629       0.910   0.790  2.631
3 F2            0.993   0.562  2.533       0.912   0.782  2.584
4 F3            0.994   0.556  2.519       0.913   0.779  2.572
5 S0            0.930   0.789  7.454       0.569   0.808  5.797
6 S1            0.978   0.595  3.601       0.709   0.780  4.574
7 S2            0.984   0.571  3.463       0.703   0.774  4.554
8 S3            0.984   0.565  3.444       0.700   0.771  4.555

Table 4.2 (continued)

variable               Vps                        Vcb
approximation   R-SQ    U      MRAE        R-SQ    U      MRAE
1 F0            0.994   0.435  1.758       0.738   0.641  80.687
2 F1            0.995   0.466  1.448       0.860   0.626  74.726
3 F2            0.995   0.471  1.434       0.848   0.610  77.854
4 F3            0.995   0.473  1.438       0.838   0.608  80.877
5 S0            0.979   0.439  3.194       0.403   0.612  66.735
6 S1            0.994   0.465  1.458       0.848   0.537  65.308
7 S2            0.995   0.466  1.374       0.517   0.528  97.140
8 S3            0.994   0.469  1.452       UNDEF   0.573  90.175

Table 4.2 (continued)

variable               FA                         Sb
approximation   R-SQ    U      MRAE        R-SQ    U      MRAE
1 F0            0.992   0.524  2.564       0.989   0.514  2.731
2 F1            0.999   0.184  0.679       0.996   0.351  1.645
3 F2            1.000   0.167  0.555       0.996   0.347  1.704
4 F3            1.000   0.167  0.569       0.996   0.342  1.774
5 S0            0.974   0.517  4.435       0.962   0.538  4.647
6 S1            0.994   0.254  0.914       0.995   0.338  1.670
7 S2            1.000   0.168  0.544       0.996   0.332  1.386
8 S3            0.999   0.169  0.562       0.990   0.361  1.721

Table 4.2 (continued)

variable               al                         ap
approximation   R-SQ    U      MRAE        R-SQ    U      MRAE
1 F0            0.844   0.604  0.341       0.990   0.577  0.344
2 F1            0.926   0.495  0.244       1.000   0.104  0.037
3 F2            0.949   0.445  0.204       1.000   0.006  0.002
4 F3            0.944   0.422  0.217       1.000   0.006  0.003
5 S0            0.509   0.613  0.537       0.973   0.576  0.513
6 S1            0.717   0.537  0.408       0.997   0.161  0.063
7 S2            0.798   0.473  0.362       1.000   0.010  0.005
8 S3            0.828   0.452  0.337       1.000   0.004  0.003
For one of the variables, the wealth of the private sector Vps, we compare the original series and the constructed ones graphically. The order and the notation of the graphs correspond to table 4.2. First, the approximations based on the assumption of flow variables are given and then the results of assuming beginning-of-period stocks are shown. From the graphs it is obvious that as the parameter d increases, the constructed series becomes smoother. Furthermore, it becomes clear that the assumption on the type of variable to be interpolated definitely influences the outcomes.

Which conclusions can be drawn from table 4.2 and the graphs? Firstly, stocks can be better approximated than flows. Of course, this is not really surprising: considering that a stock variable can be regarded as the cumulated version of a flow variable, and conversely a flow variable can be regarded as the first difference of a stock variable, it is only natural that fluctuations show up more dramatically in the flow version. Secondly, we come to the same recommendation regarding the choice of d as BFL, viz. d = 2 is to be preferred to d = 1. Based on our experiments, we conclude that d = 3 might be even slightly better. Theory forbids values of d exceeding the number of subperiods. Thirdly, although we know that the series ap is constructed with BFL's method, it is not clear which constraint has been used. Several possibilities lead to excellent results. Finally, it is a reassuring thought that simply dividing by 4 is not the best method.
Figure 1: Approximation of wealth private sector: ‘flow data’
Figure 2: Approximation of wealth private sector: ‘stock data’
5 Concluding remarks
This paper deals with the construction of quarterly data when only annual observations are available. The method of Boot, Feibes and Lisman (1967), denoted by BFL, has been further elaborated. BFL propose to minimize the sum of squared first or second differences of the unknown quarterly series, subject to an appropriate constraint. Intuitively, their method comes down to finding the smoothest curve satisfying the constraint. Regarding the method two aspects are important. Firstly, stocks have to be distinguished from flows. Flow variables can be aggregated over time by summing. For stocks several possibilities exist. The annual value of a stock may be equal to its value in the first or the last quarter, or the average of the quarterly values. The last case is mathematically equivalent to the flow case. The constraint of the minimization problem depends on the aggregation criterion used. Secondly, one must decide which squared sum to minimize. We performed some experiments to shed light on these two points. Known quarterly series were compared with constructed data. We see that flows are more difficult to approximate than stocks, as would be expected. Third differences might lead to even slightly better results than second differences. Second differences perform better than first differences, as proposed by BFL. Solving the minimization problem of the method of BFL is the same as solving a system of linear equations for the values of the unknown quarterly observations. The solution, the vector of unknown quarterly observations, can be expressed as the product of a matrix and the vector of annual observations. An explicit expression for the matrix that converts annual totals into quarterly data has been presented. The appendix describes an algorithm, based on this analysis, that computes the quarterly series in linear time and linear memory. A close connection between the method of BFL and Generalized Least Squares has been found. 
Our algebra must look familiar to readers acquainted with seasonal adjustment methods. Series of quarterly observations are transformed into other quarterly series in order to get rid of seasonal patterns. Den Butter and Fase (1988) present an overview (in Dutch) of applicable methods. In connection with our analysis, Henderson moving averages are interesting. Henderson moving averages, applied in the well-known Census X-11 method, are designed to reproduce a cubic polynomial trend. Three alternative smoothing criteria are equivalent (see the appendix of Kenny and Durbin (1982)):
• minimization of the variance of the third differences of the output series,
• minimization of the sum of squares of the third differences of the moving average coefficients,
• fitting a cubic by weighted least squares, with the sum of squares of third differences of the weights a minimum.

Until this moment we have not paid much attention to the relationship between seasonal adjustment, especially Henderson moving averages, and optimal disaggregation in time to construct data for missing observations. We hope to deal with this phenomenon in a subsequent study.

Finally, we mention two other topics for future research. Firstly, since the structure of the matrix that converts series of annual observations into quarterly data is known, the effects of truncation can be studied. A comparison can be made with seasonal adjustment in which filters with fixed length are used. Secondly, one might consider variables with more general time series properties. In this respect a paper of Stram and Wei (1986a) on the aggregation of ARIMA processes might be interesting. They state that aggregation, which takes basic-series ARIMA models to aggregate-series ARIMA models, induces a many-to-one mapping. This type of mapping makes it impossible, in general, to find a unique basic ARIMA process when the model for the aggregates is known.
A
A linear algorithm for optimal disaggregation
Below, we shall restrict ourselves to the case d = 2. The case d = 1 is left as an exercise. It is quite possible that this part of the analysis can also be generalized to higher d. We refer to the notation introduced in sections 2 and 3.
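As a point of reference for what follows, the d = 2 problem can also be solved directly as a dense linear system. The sketch below is not the linear-time algorithm of this appendix: it takes O(n^3) time and is meant only as a baseline against which a faster implementation could be checked. The function names, the default aggregation vector (ones, i.e. a flow variable) and the use of NumPy are assumptions made for illustration.

```python
import numpy as np


def second_difference_matrix(n):
    """n x (n-2) matrix R whose columns are shifted copies of (1, -2, 1),
    so that R.T @ u holds the second differences of u."""
    R = np.zeros((n, n - 2))
    for k in range(n - 2):
        R[k:k + 3, k] = (1.0, -2.0, 1.0)
    return R


def disaggregate_dense(annual, q=4, x=None):
    """Dense reference solution: minimise u'RR'u subject to Z'u = annual.

    x is the within-year aggregation vector (assumed default: ones, so the
    quarters of a flow variable sum to the annual total).
    At least two years are needed, otherwise the system is singular.
    """
    annual = np.asarray(annual, dtype=float)
    T = annual.size
    n = q * T
    x = np.ones(q) if x is None else np.asarray(x, dtype=float)
    Z = np.kron(np.eye(T), x.reshape(-1, 1))     # n x T aggregation matrix
    R = second_difference_matrix(n)
    D = R @ R.T
    # First-order conditions: 2 D u + Z lam = 0 together with Z'u = annual.
    kkt = np.block([[2.0 * D, Z], [Z.T, np.zeros((T, T))]])
    rhs = np.concatenate([np.zeros(n), annual])
    return np.linalg.solve(kkt, rhs)[:n]


quarterly = disaggregate_dense([4.0, 8.0, 12.0, 16.0])
```

With annual totals that lie on a straight line, as in the call above, the constructed quarterly series is itself a straight line, because a linear series has zero second differences and therefore attains the minimum.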
A.1
Breaking up the problem
In theorem 3.8, we derived the matrix expression
$$u = D^{+}[Z(Z'D^{+}Z)^{+}Z'A - RWW'R'] + BMy$$
where $y$ is an arbitrary series of quarterly data and $u$ an ‘optimal’ series satisfying $Z'u = Z'y$. $A$ equals the projection matrix $R(R'R)^{-1}R'$. It follows from proposition 3.2 that the orthogonal complement of $A$ (or of $R$) is spanned by the vectors $\iota = (1, 1, 1, \ldots)'$ and $\delta = (0, 1, 2, \ldots)'$. Thus, $A = I - BB'$, where $B$ is obtained by orthonormalizing $(\iota\ \delta)$, which can be done in linear time. The latter equality gives a linear algorithm for left multiplying a vector by $A$. Left multiplication of a vector by $Z$ or $Z'$ can trivially be done in linear time.
We shall demonstrate that the other components of this expression also require only linear time to compute: in section A.2 we shall see that $D^{+}$ has a simple structure which leads to a linear algorithm for $D^{+}$; section A.3 decomposes $Z'D^{+}Z$ into a high- and a low-rank part such that the (Moore-Penrose) inverse of $Z'D^{+}Z$ can be derived from the (true) inverse of the high-rank part. The high- and low-rank parts are analyzed in A.4 and A.5 respectively. Section A.6 contains an algorithm for $W$ and section A.7 shows how the term $BMy$ can be solved in linear time from the restriction $Z'y = Z'u$. Section A.8 discusses a numerical problem.
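The linear-time multiplication by A described above can be sketched as follows. This is a minimal illustration, assuming NumPy; the function names are hypothetical. B is obtained from (ι δ) by a single Gram-Schmidt step, and A v is formed as v − B(B′v) without ever building the n × n matrix A.

```python
import numpy as np


def orthonormal_trend_basis(n):
    """Orthonormalise (iota, delta) by one Gram-Schmidt step; O(n) work."""
    iota = np.ones(n)
    delta = np.arange(n, dtype=float)
    b1 = iota / np.sqrt(n)
    centred = delta - delta.mean()           # remove the component along iota
    b2 = centred / np.linalg.norm(centred)
    return np.column_stack([b1, b2])         # n x 2 matrix B with B'B = I


def apply_A(v, B):
    """Return A v = v - B (B'v) without forming the n x n matrix A."""
    return v - B @ (B.T @ v)


B = orthonormal_trend_basis(8)
residual = apply_A(np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]), B)
```

Both products involve only the n × 2 matrix B, so time and memory grow linearly with n. Constant and linear series are mapped onto zero, as required of the orthogonal complement of A.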
A.2
Computing $D^{+}$
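Before computing the MP-inverse $D^{+}$ of the matrix $D$, we introduce additional notation. Let, for a positive integer $n$,
$$N \equiv N_n \equiv \begin{pmatrix} 0 & & & & \\ 1 & 0 & & & \\ 1 & 1 & 0 & & \\ 1 & 1 & 1 & 0 & \\ \vdots & \vdots & \vdots & \vdots & \ddots \end{pmatrix},$$
an $n \times n$ matrix, and
$$L_n \equiv N_n^2 \equiv \begin{pmatrix} 0 & & & & \\ 0 & 0 & & & \\ 1 & 0 & 0 & & \\ 2 & 1 & 0 & 0 & \\ 3 & 2 & 1 & 0 & 0 \\ \vdots & \vdots & \vdots & \vdots & & \ddots \end{pmatrix}$$
Below, we shall often drop the subscript $n$ when we mean $n = qT$. In section 2 we introduced the matrix $R_n$. In the case $d = 2$ we have
$$R \equiv R_n \equiv \begin{pmatrix} 1 & & & \\ -2 & 1 & & \\ 1 & -2 & 1 & \\ & 1 & -2 & \ddots \\ & & 1 & \ddots \\ & & & \ddots \end{pmatrix},$$
an $n \times (n - 2)$ matrix.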
Lemma A.1 $[RR']^{+} = ALL'A$
Proof A necessary and sufficient condition for A.1 to hold is:
$$ALL'ARR' = A$$
Since
$$L'AR = L'(I - BB')R = L'R = \begin{pmatrix} I_{n-2} \\ 0 \\ 0 \end{pmatrix}$$
we have
$$ALL'ARR'y = AL\begin{pmatrix} I_{n-2} \\ 0 \\ 0 \end{pmatrix}R'y = AL\begin{pmatrix} R_n' \\ 0 \\ 0 \end{pmatrix}y$$
$L[R_n, 0, 0]'y$ turns out to be equal to $y$ up to a series with constant first differences. Since $A$ maps such series onto 0, this proves that $ALL'ARR'y = Ay$. Now it is easy to show that all four conditions of the Moore-Penrose inverse are satisfied. ✷
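Lemma A.1 translates into a linear-time routine, because multiplying by N or N′ is a shifted cumulative sum and L = NN. The sketch below is one possible reading, assuming NumPy; the helper names are hypothetical. It applies A, then L′, then L, then A again, and never forms an n × n matrix.

```python
import numpy as np


def cumsum_lower(z):
    """(N z)_i = sum of z_j over j < i, N being the strictly lower-triangular ones matrix."""
    return np.concatenate(([0.0], np.cumsum(z)[:-1]))


def cumsum_upper(z):
    """(N'z)_j = sum of z_i over i > j."""
    return np.concatenate((np.cumsum(z[::-1])[::-1][1:], [0.0]))


def project_A(z):
    """A z = z - B B'z, with B an orthonormal basis of span(iota, delta)."""
    n = z.size
    b1 = np.ones(n) / np.sqrt(n)
    b2 = np.arange(n, dtype=float) - (n - 1) / 2.0
    b2 /= np.linalg.norm(b2)
    return z - b1 * (b1 @ z) - b2 * (b2 @ z)


def apply_D_pinv(v):
    """Return (RR')^+ v = A L L' A v in O(n), using L = N N and L' = N'N'."""
    z = project_A(np.asarray(v, dtype=float))
    z = cumsum_upper(cumsum_upper(z))    # L'z
    z = cumsum_lower(cumsum_lower(z))    # L z
    return project_A(z)


smoothed = apply_D_pinv(np.array([1.0, -2.0, 0.5, 3.0, -1.0, 2.0]))
```

For small n the result can be compared with a dense Moore-Penrose inverse of RR′. The intermediate cumulative sums grow roughly like n^4, so for long series some loss of precision is to be expected.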
A.3
Computing $(Z'D^{+}Z)^{+}$
Above, we saw that $D_n^{+} = A_nL_nL_n'A_n$ and $A_n = I - BB'$. Thus, we can write
$$Z'D^{+}Z = Z'L_nL_n'Z - Z'(BB'LL' + LL'BB')Z + Z'BB'LL'BB'Z$$
The first term has rank $n - 2$ and can be written as the sum of a full rank matrix $U$ and of a matrix of rank 2 such that $U^{-1} = V'V$, where left multiplication with $V$ requires linear time; the remaining terms, including this rank 2 matrix, will be shown to be of the form $XYX'$ with $X$ a $T \times 4$ matrix of full column rank (fcr) and $Y$ a non-singular $4 \times 4$ matrix.
Theorem A.2 Let $U$ be $n \times n$ symmetric and non-singular, $Y$ $k \times k$ symmetric and non-singular and $X$ $n \times k$ of fcr. Then $(U + XYX')^{-1}$ exists iff $(X'U^{-1}X + Y^{-1})^{-1}$ exists and
$$(U + XYX')^{-1} = U^{-1} - U^{-1}X(X'U^{-1}X + Y^{-1})^{-1}(U^{-1}X)'$$
Proof Straightforward. ✷
Assume $U^{-1} = V'V$ and $VX = W$ in the above expression. Then it can be rewritten as
$$(U + XYX')^{-1} = V'(I_n - W(W'W + Y^{-1})^{-1}W')V$$
Thus in the non-flow case, this gives us an algorithm for the inverse of $Z'D^{+}Z$. In the flow case, we first invert $Z'D^{+}Z + c\iota\iota'$ for some constant $c$, with $\iota$ a vector of ones, and then apply the following
Theorem A.3 Let $U$ be $n \times n$ positive semidefinite of rank $n - 1$ and $Uv = 0$ with $v'v = 1$. Then, for any non-zero scalar $c$, $U + cvv'$ is non-singular, and the Moore-Penrose inverse of $U$ can be written as
$$U^{+} = (U + cvv')^{-1} - \frac{1}{c}vv'$$
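Both theorems are easy to check numerically on small dense matrices. The sketch below assumes NumPy and uses hypothetical function names; it works with explicit inverses for clarity and is therefore not linear in time, unlike the algorithm this appendix aims at.

```python
import numpy as np


def woodbury_inverse(U_inv, X, Y):
    """(U + X Y X')^{-1} as in Theorem A.2, given the inverse of the symmetric U."""
    UX = U_inv @ X
    core = np.linalg.inv(X.T @ UX + np.linalg.inv(Y))
    return U_inv - UX @ core @ UX.T


def pinv_rank_deficient(U, v, c=1.0):
    """Moore-Penrose inverse of a positive semidefinite U whose null space is
    spanned by the unit vector v, as in Theorem A.3."""
    vv = np.outer(v, v)
    return np.linalg.inv(U + c * vv) - vv / c


# Small illustration: a 3 x 3 matrix with null vector iota / sqrt(3).
v3 = np.ones(3) / np.sqrt(3.0)
U3 = np.array([[2.0, -1.0, -1.0], [-1.0, 2.0, -1.0], [-1.0, -1.0, 2.0]])
U3_pinv = pinv_rank_deficient(U3, v3, c=2.0)
```

In the flow case the role of v is played by the normalized vector of ones, as in the small illustration; the value of c does not affect the result in exact arithmetic, though in practice a value of the same order of magnitude as the non-zero eigenvalues of U seems a sensible choice.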
A.4
The high rank term $Z'LL'Z$
For $n = T \times q$ we can write
$$N_n = I_T \otimes N_q + N_T \otimes J_q$$
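The block structure behind this Kronecker decomposition can be checked directly. The sketch below assumes NumPy and hypothetical function names: within a year the diagonal block is N_q, and every block below the diagonal is the all-ones matrix J_q.

```python
import numpy as np


def strict_lower_ones(n):
    """N_n: ones strictly below the diagonal, zeros elsewhere."""
    return np.tril(np.ones((n, n)), k=-1)


def kron_N(T, q):
    """Assemble N_{Tq} from year-level and quarter-level blocks."""
    return (np.kron(np.eye(T), strict_lower_ones(q))
            + np.kron(strict_lower_ones(T), np.ones((q, q))))


N12 = kron_N(3, 4)   # three years of four quarters
```

The assembled matrix coincides with the 12 × 12 strictly lower-triangular matrix of ones built in one piece.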
Observing that $N_q\iota_q = \delta_q$ and $N_q'\iota_q = \bar\delta_q$, where $\delta_q = (i)_{i=0}^{q-1}$ and $\bar\delta_q = (q - 1 - i)_{i=0}^{q-1}$, we can also write
$$L_n = N_nN_n = I_T \otimes L_q + N_T \otimes (\iota_q\bar\delta_q' + \delta_q\iota_q') + L_T \otimes (qJ_q)$$
Recall that $Z$ was defined as a Kronecker product $I_T \otimes x$. Thus, $Z'LL'Z$ can be written as a sum of 9 Kronecker products, in which the second Kronecker factor in each case reduces to a constant of the form $x'M_1M_2x$, where $M_1$ and $M_2$ are $q \times q$ matrices:
$$Z'L_nL_n'Z = c_0L_TL_T' + c_1(L_T + L_T') + c_2(L_TN_T' + N_TL_T') + c_3(N_T + N_T') + c_4I_T + c_5N_TN_T'$$
This can be rewritten as
$$Z'L_nL_n'Z = c_6(I_T + c_7N_T + c_8L_T)(I_T + c_7N_T' + c_8L_T') + c_9J_T + c_{10}N_TJ_TN_T' + c_{11}(J_TN_T' + N_TJ_T)$$
The coefficients $c_6 \ldots c_{11}$ can be found as follows: using the identities
$$\begin{aligned} N + N' &= J - I \\ LN' + NL' &= NJN' - NN' \\ L + L' &= NN + N'N' = N(J - I - N') + (J - I - N)N' = I - 2NN' + NJ + JN' - J \end{aligned}$$
we can write a system of six equations for the six unknowns $c_6 \ldots c_{11}$. This system can be reduced to a fourth order polynomial equation in $\sqrt{c_6}$ from the solution of which the other constants can be derived. We do not know whether this system always has a real solution; in all cases we encountered, including the flow and pure stock cases, it did. Writing $\Delta = (I + c_7N + c_8L)^{-1}$, we must show now that $w = \Delta v$ can be computed in linear time. Observe that
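$$\Delta = \begin{pmatrix} d_0 & & & \\ d_1 & d_0 & & \\ d_2 & d_1 & d_0 & \\ \vdots & \vdots & \vdots & \ddots \end{pmatrix}$$
where
$$d_0 = 1, \qquad d_1 = -c_7, \qquad d_i = -c_8(d_0 + \cdots + d_{i-2}) + (1 - c_7)d_{i-1} \quad \text{for } i > 1$$
Thus, $w_i = \sum_{j \le i} d_{i-j}v_j$, where, for $i + 1 > 1$, $w_{i+1}$ can be rewritten as
$$w_{i+1} = \sum_{j<i} d_{i+1-j}v_j + d_1v_i + d_0v_{i+1}$$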