Changing sets of parameters by partial mappings

Michael Wiedenbeck and Nanny Wermuth

Center of Survey Research, Mannheim, Germany, and Department of Mathematical Statistics, Chalmers/Göteborgs Universitet, Sweden

Abstract: Changes between different sets of parameters are needed in multivariate statistical modeling. There may, for instance, be specific inference questions based on subject matter interpretations, alternative well-fitting constrained models, compatibility judgements of seemingly distinct constrained models, and different reference priors under alternative parameterizations. We introduce and discuss a partial mapping, called partial replication, and derive from it a more complex mapping, called partial inversion. Both operations are shown to be tools for decomposing matrix operations, for explaining recursion relations among sets of linear parameters, for deriving effects of omitted variables in alternative model formulations, for changing between different types of linear chain graph models, for approximating maximum-likelihood estimates in exponential family models under independence constraints, and for switching partially between sets of canonical and moment parameters in distributions of the exponential family, or between sets of corresponding maximum-likelihood estimates.

Key words and phrases: Exponential family, graphical Markov models, independence constraints, partial replication, partial inversion, reduced model estimates, REML-estimates, sandwich estimates, structural equations.
1. Definitions and properties of two operators

1.1. Partial inversion

We start with a linear function connecting two real-valued column vectors y and x via a square matrix M of dimension d for which all principal submatrices are invertible,

\[
M y = x. \tag{1.1}
\]
The index sets of rows and columns of M coincide and are ordered as V = (1, ..., d). For an arbitrary subset a of V we consider first a split of V as (a, b) with b = V \ a. A linear operator for equation (1.1), called partial inversion, may be used to invert M in a sequence of steps and forms a starting point for proving properties of graphical Markov models, see Wermuth and Cox (2004), Wermuth, Wiedenbeck and Cox (2006), and Marchetti and Wermuth (2007). Partial inversion is a minimally changed version of Gram-Schmidt orthogonalisation and of the sweep operator, the changes being such that an operator with attractive features results.

After partial inversion applied to rows and columns a of M, denoted by inv_a M, the argument and image in equation (1.1), relating to a, are exchanged,

\[
\operatorname{inv}_a M \begin{pmatrix} x_a \\ y_b \end{pmatrix} = \begin{pmatrix} y_a \\ x_b \end{pmatrix},
\qquad
\operatorname{inv}_a M =
\begin{pmatrix} M_{aa}^{-1} & -M_{aa}^{-1} M_{ab} \\ M_{ba} M_{aa}^{-1} & M_{bb.a} \end{pmatrix}. \tag{1.2}
\]
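As a small numerical illustration of definition (1.2), the following sketch implements partial inversion with NumPy. The function name, the index conventions and the example usage are our own and not part of the paper; it is a minimal sketch, not a definitive implementation.

```python
import numpy as np

def partial_inversion(M, a):
    """Partially invert the square matrix M on the index set a, cf. (1.2).

    Returns the matrix with blocks
        [ M_aa^{-1}          -M_aa^{-1} M_ab ]
        [ M_ba M_aa^{-1}      M_bb.a         ],
    where M_bb.a = M_bb - M_ba M_aa^{-1} M_ab.
    """
    d = M.shape[0]
    a = np.asarray(a)
    b = np.setdiff1d(np.arange(d), a)           # complementary index set
    Maa_inv = np.linalg.inv(M[np.ix_(a, a)])
    Mab, Mba, Mbb = M[np.ix_(a, b)], M[np.ix_(b, a)], M[np.ix_(b, b)]
    out = np.empty_like(M, dtype=float)
    out[np.ix_(a, a)] = Maa_inv
    out[np.ix_(a, b)] = -Maa_inv @ Mab
    out[np.ix_(b, a)] = Mba @ Maa_inv
    out[np.ix_(b, b)] = Mbb - Mba @ Maa_inv @ Mab
    return out
```

Applying the same function to the complementary set b yields inv_b M, and applying it twice on a recovers M, in line with property (1.3)(i) below.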
The matrix M_{bb.a} = M_{bb} − M_{ba} M_{aa}^{-1} M_{ab} is often called the Schur complement of M_{bb}, after Issai Schur (1875-1941), and M_{aa}^{-1} denotes the inverse of the submatrix M_{aa} of M. Partial inversion with respect to b applied to inv_a M is denoted in several equivalent ways depending on the context,

\[
\operatorname{inv}_b(\operatorname{inv}_a M) = \operatorname{inv}_b \circ \operatorname{inv}_a M
= \operatorname{inv}_V M = \operatorname{inv}_{ab} M,
\]

where inv_V M yields the inverse of M. Some of the basic properties of partial inversion are

\[
\text{(i)}\;\; \operatorname{inv}_a \circ \operatorname{inv}_a M = M, \qquad
\text{(ii)}\;\; (\operatorname{inv}_a M)^{-1} = \operatorname{inv}_b M, \qquad
\text{(iii)}\;\; \operatorname{inv}_a M = \operatorname{inv}_b(M^{-1}). \tag{1.3}
\]

The operator is commutative, can be undone, and gives the same result when applied to an index set within a submatrix of M as when applied to M first and the submatrix selected afterwards. More precisely, for three disjoint subsets α, β, γ of V, with a = α ∪ β,

\[
\text{(i)}\;\; \operatorname{inv}_\alpha \circ \operatorname{inv}_\beta M = \operatorname{inv}_\beta \circ \operatorname{inv}_\alpha M, \qquad
\text{(ii)}\;\; \operatorname{inv}_{\alpha\beta} \circ \operatorname{inv}_{\beta\gamma} M = \operatorname{inv}_{\alpha\gamma} M, \qquad
\text{(iii)}\;\; \operatorname{inv}_\alpha M_{aa} = [\operatorname{inv}_\alpha M]_{a,a}. \tag{1.4}
\]

Indeed, all properties given so far follow from equations (1.8) below, in which partial inversion is expressed in terms of what will be defined as partial replication. Further properties, not used in this paper, may be derived by viewing partial inversion as a group isomorphism.
From this point of view, partial inversion operates on the power set of V, which has group structure with the symmetric difference, denoted by △, as law of composition. The neutral element is the empty set and the inverse of a subset a of V is a itself. Then, the image space of partial inversion is the set of transformations {inv_a, a ⊂ V}, a group under composition of transformations with the identity as neutral element. Again, for every subset a of V, inv_a is its own inverse. Both groups are Abelian.

1.2. Partial replication

We now introduce another operator for equation (1.1), called partial replication, which represents a partial mapping. After partial replication applied to rows and columns a of M, denoted by rep_a M, the argument relating to a, denoted by y_a, is replicated, while the relation for the argument of b = V \ a is preserved as in equation (1.1),

\[
\operatorname{rep}_a M \begin{pmatrix} y_a \\ y_b \end{pmatrix} = \begin{pmatrix} y_a \\ x_b \end{pmatrix},
\qquad
\operatorname{rep}_a M = \begin{pmatrix} I_{aa} & 0_{ab} \\ M_{ba} & M_{bb} \end{pmatrix}. \tag{1.5}
\]
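A companion sketch of partial replication, again in NumPy with our own naming conventions; as before this is only an illustrative implementation under the stated block structure of (1.5).

```python
import numpy as np

def partial_replication(M, a):
    """Partial replication of M on the index set a, cf. (1.5):
    rows a are replaced by rows of the identity, rows of b = V \\ a are kept."""
    out = M.astype(float)        # work on a float copy of M
    out[a, :] = 0.0              # clear rows a
    out[a, a] = 1.0              # ones on the diagonal positions of a: block I_aa
    return out
```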
Partial replication with respect to b applied to rep_a M is denoted in a way analogous to partial inversion,

\[
\operatorname{rep}_b(\operatorname{rep}_a M) = \operatorname{rep}_b \circ \operatorname{rep}_a M
= \operatorname{rep}_V M = \operatorname{rep}_{ab} M,
\]

where rep_V M yields the identity matrix I. A basic property of partial replication is

\[
\operatorname{rep}_a \circ \operatorname{rep}_a M = \operatorname{rep}_a M. \tag{1.6}
\]

The following matrix forms relate directly to components of inv_b M:

\[
\text{(i)}\;\; (\operatorname{rep}_a M)^{-1} =
\begin{pmatrix} I_{aa} & 0_{ab} \\ -M_{bb}^{-1} M_{ba} & M_{bb}^{-1} \end{pmatrix},
\qquad
\text{(ii)}\;\; M(\operatorname{rep}_a M)^{-1} =
\begin{pmatrix} M_{aa.b} & M_{ab} M_{bb}^{-1} \\ 0_{ba} & I_{bb} \end{pmatrix},
\]

and the last matrix product has been used in the numerical-analytic technique of block Gaussian elimination as one special form in which the Schur complement M_{aa.b} of M_{aa} occurs.

The operator is commutative, but cannot be undone. It gives the same result when applied to an index set within a submatrix of M as when applied to M first and the submatrix selected afterwards. For three disjoint subsets α, β, γ of V and a = α ∪ β,

\[
\text{(i)}\;\; \operatorname{rep}_\alpha \circ \operatorname{rep}_\beta M = \operatorname{rep}_\beta \circ \operatorname{rep}_\alpha M, \qquad
\text{(ii)}\;\; \operatorname{rep}_{\alpha\beta} \circ \operatorname{rep}_{\beta\gamma} M = \operatorname{rep}_{\alpha\beta\gamma} M, \qquad
\text{(iii)}\;\; \operatorname{rep}_\alpha M_{aa} = [\operatorname{rep}_\alpha M]_{a,a}. \tag{1.7}
\]

Thus, partial replication shares commutativity (i) and exchangeability (iii) with partial inversion, but differs regarding property (ii). Nevertheless, partial inversion can be expressed in terms of partial replication. By direct computation and with property (1.3)(ii),

\[
\operatorname{inv}_b M = (\operatorname{rep}_b M)(\operatorname{rep}_a M)^{-1},
\qquad
\operatorname{inv}_a M = (\operatorname{rep}_a M)(\operatorname{rep}_b M)^{-1}. \tag{1.8}
\]
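A quick numerical check of (1.8), reusing the partial_inversion and partial_replication sketches given above; the test matrix and the split of V = {0, 1, 2, 3} into a and b are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)   # principal submatrices invertible here
a, b = [0, 1], [2, 3]

lhs = partial_inversion(M, b)                                        # inv_b M
rhs = partial_replication(M, b) @ np.linalg.inv(partial_replication(M, a))
assert np.allclose(lhs, rhs)                                         # (1.8)
```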
1.3. Partial inversion combined with partial replication

Basic properties of the two operators combined are

\[
\text{(i)}\;\; \operatorname{inv}_a \circ \operatorname{rep}_a M = \operatorname{rep}_a M, \qquad
\text{(ii)}\;\; \operatorname{inv}_b \circ \operatorname{rep}_a M = (\operatorname{rep}_a M)^{-1} = \operatorname{rep}_a \circ \operatorname{inv}_b M, \qquad
\text{(iii)}\;\; \operatorname{rep}_b \circ \operatorname{inv}_b M = M(\operatorname{rep}_a M)^{-1}. \tag{1.9}
\]
For a partition of V into α, β, γ, δ and s = V \ δ, some of the derived properties are

\[
\text{(i)}\;\; \operatorname{inv}_\alpha \circ \operatorname{rep}_\beta M = \operatorname{rep}_\beta \circ \operatorname{inv}_\alpha M, \qquad
\text{(ii)}\;\; \operatorname{inv}_{\alpha\beta} \circ \operatorname{rep}_{\beta\gamma} M = \operatorname{inv}_\alpha \circ \operatorname{rep}_{\beta\gamma} M, \qquad
\text{(iii)}\;\; \operatorname{rep}_\alpha \circ \operatorname{inv}_\beta M_{ss} = [\operatorname{rep}_\alpha \circ \operatorname{inv}_\beta M]_{s,s}. \tag{1.10}
\]

Thus, the order in which the operators are combined matters unless they are applied to disjoint index sets; for partial inversion applied after partial replication, overlapping index sets are ignored, a contraction instead of the expansion property (1.7)(ii) of partial replication. Partial replication applied after partial inversion on an overlapping index set coincides, just like (1.8), with a matrix product involving two partially replicated matrices,

\[
\operatorname{rep}_{\beta\gamma} \circ \operatorname{inv}_{\alpha\beta} M
= (\operatorname{rep}_{\alpha\gamma} M)(\operatorname{rep}_{\gamma\delta} M)^{-1}. \tag{1.11}
\]
For the first component on the right-hand side, partial replication is with respect to the symmetric difference of c = β ∪ γ and a = α ∪ β. The complement of the argument of partial inversion, b = V \ a = γ ∪ δ, defines the partial replication in the second component. Conversely, given the partially replicated matrices on the right-hand side, partial inversion of M on the left-hand side is with respect to V \ (γ ∪ δ) = a, and the result is partially replicated with respect to β ∪ γ, the symmetric difference of a and a△c.

Proof. For the proof of equation (1.11), we obtain vectors from the basic equation (1.1), partitioned in the order (α, β, γ, δ), with as before a = α ∪ β and b = γ ∪ δ, by

\[
y = (y_a; y_b) = \begin{pmatrix} y_\alpha \\ y_\beta \\ y_\gamma \\ y_\delta \end{pmatrix},
\qquad
(x_a; y_b) = \begin{pmatrix} x_\alpha \\ x_\beta \\ y_\gamma \\ y_\delta \end{pmatrix},
\qquad
(x_{\beta\delta}; y_{\alpha\gamma}) = \begin{pmatrix} y_\alpha \\ x_\beta \\ y_\gamma \\ x_\delta \end{pmatrix}.
\]

From the definitions of partial replication (1.5) and partial inversion (1.2) we use for (1.11)

\[
\operatorname{rep}_b M\, y = (x_a; y_b), \qquad \operatorname{inv}_a M\,(x_a; y_b) = (y_a; x_b),
\]

and let N = inv_a M. Partial replication of N with respect to c = β ∪ γ modifies the vector (x_a; y_b) into (x_{βδ}; y_{αγ}), so that (rep_c N) rep_b M y = (x_{βδ}; y_{αγ}) and rep_{αγ} M y = (x_{βδ}; y_{αγ}), thus completing the proof.

It is instructive to see also a direct matrix proof. For this, we denote the matrix obtained by partial replication with respect to a△c by Q and that with respect to b by R,
\[
Q = \operatorname{rep}_{a\triangle c} M =
\begin{pmatrix}
I_{\alpha\alpha} & 0_{\alpha\beta} & 0_{\alpha\gamma} & 0_{\alpha\delta} \\
M_{\beta\alpha} & M_{\beta\beta} & M_{\beta\gamma} & M_{\beta\delta} \\
0_{\gamma\alpha} & 0_{\gamma\beta} & I_{\gamma\gamma} & 0_{\gamma\delta} \\
M_{\delta\alpha} & M_{\delta\beta} & M_{\delta\gamma} & M_{\delta\delta}
\end{pmatrix},
\qquad
R = \operatorname{rep}_b M =
\begin{pmatrix}
M_{\alpha\alpha} & M_{\alpha\beta} & M_{\alpha\gamma} & M_{\alpha\delta} \\
M_{\beta\alpha} & M_{\beta\beta} & M_{\beta\gamma} & M_{\beta\delta} \\
0_{\gamma\alpha} & 0_{\gamma\beta} & I_{\gamma\gamma} & 0_{\gamma\delta} \\
0_{\delta\alpha} & 0_{\delta\beta} & 0_{\delta\gamma} & I_{\delta\delta}
\end{pmatrix}.
\]

For equation (1.11) to hold, one needs to show that Q R^{-1} = rep_c N, where

\[
R^{-1} = \begin{pmatrix} M_{aa}^{-1} & -M_{aa}^{-1} M_{ab} \\ 0_{ba} & I_{bb} \end{pmatrix}
       = \begin{pmatrix} N_{aa} & N_{ab} \\ 0_{ba} & I_{bb} \end{pmatrix}.
\]

The rows of components α and γ in Q R^{-1} coincide directly with those of rep_c N. The rows of component β in rep_c N result since, for Q and R, the rows of β coincide, so that the product Q_{βV} R^{-1} gives zeros whenever the row index within β for Q_{βV} differs from the column index in R^{-1}, and equals one otherwise. Finally, for the rows of component δ we have for Q_{δV} R^{-1}, by the defining equation for partial inversion (1.2),

\[
\bigl(M_{\delta a} M_{aa}^{-1},\;\; M_{\delta\gamma} - M_{\delta a} M_{aa}^{-1} M_{a\gamma},\;\; M_{\delta\delta} - M_{\delta a} M_{aa}^{-1} M_{a\delta}\bigr)
= \bigl(N_{\delta a},\; N_{\delta b}\bigr),
\]

which completes the proof.

The following use of equation (1.11) represents a change of basis from (x_a; y_b) to (x_d; y_c),

\[
\operatorname{rep}_{a\triangle c} N\,(x_a; y_b) = (x_d; y_c) = \operatorname{inv}_c M\,(x_c; y_d). \tag{1.12}
\]

Proof. For equation (1.12), note that, by the definition of partial replication (1.5) with respect to a△c = α ∪ γ, the components x_α and y_γ of (x_a; y_b) are replicated. By the definition of partial inversion of M with respect to a in (1.2), one knows further that

\[
N_{\beta a} x_a + N_{\beta b} y_b = y_\beta, \qquad N_{\delta a} x_a + N_{\delta b} y_b = x_\delta.
\]
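A quick numerical check of (1.11) with singleton blocks α = {0}, β = {1}, γ = {2}, δ = {3}, again reusing the two function sketches given above; the test matrix is an arbitrary choice of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.normal(size=(4, 4)) + 4.0 * np.eye(4)
alpha, beta, gamma, delta = [0], [1], [2], [3]

lhs = partial_replication(partial_inversion(M, alpha + beta), beta + gamma)   # rep_{bg}(inv_{ab} M)
rhs = partial_replication(M, alpha + gamma) @ np.linalg.inv(
    partial_replication(M, gamma + delta))
assert np.allclose(lhs, rhs)                                                  # (1.11)
```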
In the following sections, partial replication is applied to quite distinct statistical themes. In Section 2, we simplify proofs of a number of results known for different aspects of linear models. In Sections 3 and 4, linear relations are obtained for sets of parameters and for sets of estimates in nonlinear models within the exponential family of distributions.

2. Decompositions of partial inversion

2.1. A partially inverted covariance matrix

Partial replication provides with (1.8) a decomposition of partial inversion. Let Σ denote the joint covariance matrix of mean-centred random vector variables Y_a, Y_b. Then, for instance,

\[
\operatorname{inv}_b \Sigma
= \begin{pmatrix} \Sigma_{aa|b} & \Pi_{a|b} \\ -\Pi_{a|b}^T & \Sigma_{bb}^{-1} \end{pmatrix}
= \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ 0_{ba} & I_{bb} \end{pmatrix}
  \begin{pmatrix} I_{aa} & 0_{ab} \\ -\Sigma_{bb}^{-1}\Sigma_{ba} & \Sigma_{bb}^{-1} \end{pmatrix}
= \operatorname{rep}_b \Sigma\,(\operatorname{rep}_a \Sigma)^{-1}
\]

gives, with

\[
\Sigma_{aa|b} = \Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}, \qquad
\Pi_{a|b} = \Sigma_{ab}\Sigma_{bb}^{-1}, \qquad \Sigma_{bb}^{-1},
\]

a one-to-one transformation of marginal covariances and variances, namely the elements of Σ, into parameters in the concentration matrix Σ_{bb}^{-1} of Y_b and into linear least-squares regression parameters of Y_a on Y_b, i.e. in the model defined by the equations

\[
Y_a - \Pi_{a|b} Y_b = \epsilon_a, \qquad E(\epsilon_a) = 0, \qquad
\operatorname{cov}(\epsilon_a, Y_b) = 0, \qquad \operatorname{cov}(\epsilon_a) = \Sigma_{aa|b},
\]

where the interpretation of Π_{a|b} as a matrix of regression coefficients results by post-multiplication with Y_b^T and taking expectations.

Zero constraints on (Π_{a|b}, Σ_{aa|b}) have been studied especially in econometrics. The possible existence of multiple solutions of estimating equations, derived by maximizing the Gaussian likelihood function in a Zellner model, also called seemingly unrelated regressions (Zellner, 1962), has more recently been demonstrated by Drton and Richardson (2004). Zero constraints on concentrations, such as in Σ_{bb}^{-1}, had been introduced as a tool for parsimonious estimation of covariances, called covariance selection (Dempster, 1972). The Gaussian likelihood function has a unique maximum for all possible sets of zero concentrations, though for some models iterative algorithms are needed to find the solution. Zero constraints on covariances, such as in Σ_{aa|b}, had been studied by Anderson (1969, 1973). Whether zero constraints on Σ^{-1} are preserved for all variable pairs under the transformation inv_V Σ^{-1} = Σ is one typical problem of independence equivalence, studied within the field of graphical Markov models. Similarly, one investigates in this context when zero constraints on Σ^{-1} are preserved in inv_a Σ^{-1} = inv_b Σ.
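A small numerical sketch of this one-to-one transformation, using the partial_inversion function introduced after (1.2); the 3 × 3 covariance matrix and the split a = {0}, b = {1, 2} are illustrative choices of ours.

```python
import numpy as np

Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])
a, b = [0], [1, 2]

N = partial_inversion(Sigma, b)          # inv_b Sigma
Pi_a_given_b   = N[np.ix_(a, b)]         # regression coefficients, Sigma_ab Sigma_bb^{-1}
Sigma_aa_dot_b = N[np.ix_(a, a)]         # residual covariance Sigma_{aa|b}
conc_bb        = N[np.ix_(b, b)]         # concentration matrix Sigma_bb^{-1} of Y_b

assert np.allclose(Pi_a_given_b,
                   Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)]))
```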
2.2. Three recursion relations among linear parameter matrices

Let a set of variables be partitioned according to V = (a, γ, δ), again with b = γ ∪ δ but now with a = α, and let an estimate of inv_b Σ be given, so that implicitly Y_a is regressed on Y_b. Assume that inv_δ Σ has been estimated in another study. Which systematic changes are then to be expected between the two sets of parameters and the corresponding estimates? Written explicitly, the two sets of parameters are

\[
\operatorname{inv}_b \Sigma =
\begin{pmatrix}
\Sigma_{aa|b} & \Pi_{a|\gamma.\delta} & \Pi_{a|\delta.\gamma} \\
\sim & \Sigma^{\gamma\gamma.a} & \Sigma^{\gamma\delta.a} \\
\sim & \cdot & \Sigma^{\delta\delta.a}
\end{pmatrix},
\qquad
\operatorname{inv}_\delta \Sigma =
\begin{pmatrix}
\Sigma_{aa|\delta} & \Sigma_{a\gamma|\delta} & \Pi_{a|\delta} \\
\cdot & \Sigma_{\gamma\gamma|\delta} & \Pi_{\gamma|\delta} \\
\sim & \sim & \Sigma^{\delta\delta.a\gamma}
\end{pmatrix},
\]

where · indicates an entry in a symmetric matrix and ∼ an entry in a matrix that is symmetric except for the sign. The matrices Σ_{aa|b}, Π_{a|b} are as defined before, but the latter is split into two components corresponding to the explanatory variables Y_γ, Y_δ as Π_{a|b} = (Π_{a|γ.δ}, Π_{a|δ.γ}), and Σ^{γγ.a} denotes the submatrix of components γ in the concentration matrix of Y_b, that is, after having marginalized the overall concentration matrix Σ^{-1} over Y_a.

Since inv_δ Σ = inv_{γa} Σ^{-1}, one knows from (1.8) that inv_δ Σ = (rep_γ ∘ inv_b Σ)(rep_{aδ} ∘ inv_b Σ)^{-1}, or

\[
\operatorname{inv}_\delta \Sigma =
\begin{pmatrix}
\Sigma_{aa|b} & \Pi_{a|\gamma.\delta} & \Pi_{a|\delta.\gamma} \\
0_{\gamma a} & I_{\gamma\gamma} & 0_{\gamma\delta} \\
-\Pi^T_{a|\delta.\gamma} & \Sigma^{\delta\gamma.a} & \Sigma^{\delta\delta.a}
\end{pmatrix}
\begin{pmatrix}
I_{aa} & 0_{a\gamma} & 0_{a\delta} \\
\Sigma_{\gamma a|\delta} & \Sigma_{\gamma\gamma|\delta} & \Pi_{\gamma|\delta} \\
0_{\delta a} & 0_{\delta\gamma} & I_{\delta\delta}
\end{pmatrix},
\tag{2.1}
\]

by using the following implications of (1.3)(iii):

\[
\Sigma_{\gamma\gamma|\delta} = (\Sigma^{\gamma\gamma.a})^{-1}, \qquad
\Pi_{\gamma|\delta} = -(\Sigma^{\gamma\gamma.a})^{-1}\Sigma^{\gamma\delta.a}, \qquad
\Sigma_{\gamma\gamma|\delta}\,\Pi^T_{a|\gamma.\delta} = \Sigma_{\gamma a|\delta}.
\]

By the matrix product (2.1), one obtains directly several known recursion relations: one for covariances (Anderson, 1958, section 2.5) in position (a, a),

\[
\Sigma_{aa|\delta} = \Sigma_{aa|\gamma\delta} + \Pi_{a|\gamma.\delta}\,\Sigma_{\gamma a|\delta},
\]

one for concentrations (Dempster, 1969, chapter 4) in position (δ, δ),

\[
\Sigma^{\delta\delta.a\gamma} = \Sigma^{\delta\delta.a} + \Sigma^{\delta\gamma.a}\,\Pi_{\gamma|\delta},
\]

and one for linear least-squares regression coefficients (Cochran, 1938) in position (a, δ),

\[
\Pi_{a|\delta} = \Pi_{a|\delta.\gamma} + \Pi_{a|\gamma.\delta}\,\Pi_{\gamma|\delta}. \tag{2.2}
\]
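As a quick numerical check of the third of these relations, Cochran's recursion for regression coefficients, the following sketch computes the coefficients directly from a covariance matrix; the matrix and the singleton blocks a = {0}, γ = {1}, δ = {2} are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(2)
L = rng.normal(size=(3, 3))
Sigma = L @ L.T + np.eye(3)                 # a positive definite covariance matrix
a, g, d = [0], [1], [2]

def coef(Sigma, resp, pred):                # Pi_{resp|pred} = Sigma_{resp,pred} Sigma_{pred,pred}^{-1}
    return Sigma[np.ix_(resp, pred)] @ np.linalg.inv(Sigma[np.ix_(pred, pred)])

Pi_a_delta             = coef(Sigma, a, d)              # overall coefficient of Y_delta
Pi_a_gamma_given_delta = coef(Sigma, a, g + d)[:, :1]   # coefficient of Y_gamma, given Y_delta
Pi_a_delta_given_gamma = coef(Sigma, a, g + d)[:, 1:]   # coefficient of Y_delta, given Y_gamma
Pi_gamma_delta         = coef(Sigma, g, d)

assert np.allclose(Pi_a_delta,
                   Pi_a_delta_given_gamma + Pi_a_gamma_given_delta @ Pi_gamma_delta)
```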
Each of the three equations relates a marginal to a conditional parameter matrix. There will be no change in each of the three types of parameter matrices if and only if at least one of the terms in the added matrix product is zero. For the regression coefficient matrices in (2.2), this can be achieved by study design. If, for instance, Y_δ represents treatments, Y_γ factors influencing treatments, and there is randomized allocation of individuals to treatments, then the resulting independence of Y_δ and Y_γ implies Π_{γ|δ} = 0. The same result is achieved by an intervention in which all levels y_γ of the variable Y_γ can be fixed. By contrast, a purely notional intervention is assumed in the literature on causal models (Rubin, 1974; Pearl, 2000; Lauritzen, 2000) to describe the special situation in which overall effects (Π_{a|δ} y_δ) coincide with conditional effects (Π_{a|δ.γ} y_δ). For a more detailed discussion of assumptions used in the literature on causal models,
especially of the assumed partial order, here V = (a, γ, δ), see Lindley (2002) and Cox and Wermuth (2004); for extensions of the recursion relation for regression coefficients to nonlinear relations, see Cox and Wermuth (2003), Ma, Xie and Geng (2006), and Cox (2007); and for conditions under which other types of association measures remain unchanged, see Xie, Ma and Geng (2007). Recursion relations are important for comparing the results of any two empirical studies in which similar research questions are investigated on a core set of common variables, but which differ with respect to some variables that are explicitly included in one study and not in the other.

2.3. Reduced form equations and omitted variables

Suppose that, on subject matter grounds, a full ordering can be given to mean-centred random variables Y_i, for i = 1, ..., d, and that linear equations are generated which have uncorrelated residuals. Then this system can be described with the random vector Y, in which element i corresponds to Y_i, and

\[
A Y = \varepsilon, \qquad \operatorname{cov}(\varepsilon) = \Delta \text{ diagonal}, \tag{2.3}
\]
where A is a unit upper-triangular matrix for which the negative of the element in position (i, j) is a least-squares regression coefficient, −A_{ij} = β_{i|j.r(i)\setminus j}, with r(i) = (i + 1, ..., d). Let there be some zero constraints on the upper off-diagonal elements of A, and let the effects of the generating process on the joint dependence of each component of Y_a on Y_b be of interest, where a = (1, ..., d_a) and b = (d_a + 1, ..., d); we then call V = (a, b) an order-compatible split of V = (1, ..., d). From

\[
A = \begin{pmatrix} A_{aa} & A_{ab} \\ 0_{ba} & A_{bb} \end{pmatrix},
\qquad
\operatorname{inv}_a A = \begin{pmatrix} A_{aa}^{-1} & -A_{aa}^{-1}A_{ab} \\ 0_{ba} & A_{bb} \end{pmatrix},
\]

and the mapping

\[
\operatorname{inv}_a A \begin{pmatrix} \varepsilon_a \\ Y_b \end{pmatrix} = \begin{pmatrix} Y_a \\ \varepsilon_b \end{pmatrix},
\]

the resulting equation in Y_a is

\[
Y_a = -A_{aa}^{-1}A_{ab}\,Y_b + \eta_a, \qquad \eta_a = A_{aa}^{-1}\varepsilon_a. \tag{2.4}
\]

In the context of structural equations, this type of equation has been called a reduced form equation. For it, the starting equation (2.3) implies cov(η_a, Y_b) = 0, so that each component Y_i of Y_a is related by a linear least-squares regression to Y_b, and

\[
\Pi_{a|b} = -A_{aa}^{-1}A_{ab}, \qquad \Sigma_{aa|b} = A_{aa}^{-1}\Delta_{aa}A_{aa}^{-T}.
\]
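A small numerical sketch of (2.3)–(2.4): a unit upper-triangular generating system with d = 4 and the order-compatible split a = {0, 1}, b = {2, 3}; the matrix entries are illustrative choices of ours.

```python
import numpy as np

A = np.array([[1.0, -0.5,  0.0, -0.3],
              [0.0,  1.0, -0.7,  0.0],
              [0.0,  0.0,  1.0, -0.4],
              [0.0,  0.0,  0.0,  1.0]])
Delta = np.diag([1.0, 2.0, 0.5, 1.5])
a, b = [0, 1], [2, 3]

Sigma = np.linalg.inv(A) @ Delta @ np.linalg.inv(A).T       # implied covariance of Y
Aaa, Aab = A[np.ix_(a, a)], A[np.ix_(a, b)]

Pi_reduced = -np.linalg.inv(Aaa) @ Aab                      # Pi_{a|b} from the reduced form (2.4)
Pi_direct  = Sigma[np.ix_(a, b)] @ np.linalg.inv(Sigma[np.ix_(b, b)])
assert np.allclose(Pi_reduced, Pi_direct)
```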
The equations imply that, typically, some zero constraints of the generating equations (2.3) will carry over directly to the reduced form equations (2.4), while others take on a more complex form than zero constraints. For more general types of structural equations than those of a triangular system (2.3), questions of identifiability of parameters arise; early results (Fisher, 1966) provided answers for models in which the residual covariance matrix in the reduced form equation is unconstrained, see also Eusebi (2007), or in which it is still a diagonal matrix. Conditions under which such assumptions are satisfied are provided by more recent results on Markov equivalence of models.

Let now a = (α, β) be a further order-compatible split of a and let Y_β be unavailable in a new study. Which effects are to be expected for the dependence of Y_α on Y_b? By equation (1.8), inv_β A = (rep_β A)(rep_{αb} A)^{-1}, which gives

\[
\operatorname{inv}_\beta A =
\begin{pmatrix}
A_{\alpha\alpha} & A_{\alpha\beta} & A_{\alpha b} \\
0_{\beta\alpha} & I_{\beta\beta} & 0_{\beta b} \\
0_{b\alpha} & 0_{b\beta} & A_{bb}
\end{pmatrix}
\begin{pmatrix}
I_{\alpha\alpha} & 0_{\alpha\beta} & 0_{\alpha b} \\
0_{\beta\alpha} & A_{\beta\beta}^{-1} & -A_{\beta\beta}^{-1}A_{\beta b} \\
0_{b\alpha} & 0_{b\beta} & I_{bb}
\end{pmatrix}
=
\begin{pmatrix}
A_{\alpha\alpha} & A_{\alpha\beta}A_{\beta\beta}^{-1} & A_{\alpha b.\beta} \\
0_{\beta\alpha} & A_{\beta\beta}^{-1} & -A_{\beta\beta}^{-1}A_{\beta b} \\
0_{b\alpha} & 0_{b\beta} & A_{bb}
\end{pmatrix},
\]

with A_{αb.β} = A_{αb} − A_{αβ}A_{ββ}^{-1}A_{βb}. Thus, the equation for Y_α in (2.3) is changed via (1.2) to

\[
A_{\alpha\alpha}Y_\alpha + A_{\alpha b.\beta}Y_b = \xi_\alpha, \qquad
\xi_\alpha = \varepsilon_\alpha - A_{\alpha\beta}A_{\beta\beta}^{-1}\varepsilon_\beta, \qquad
\operatorname{cov}(\xi_\alpha) = K,
\]
with the residuals for some components Y_i of Y_α typically correlated with the regressors for Y_i, so that the residual covariance matrix K is no longer diagonal. Then the equation parameters A_{αα}, A_{αb.β} need not define least-squares regression coefficients in the variables Y_α, Y_b, so that, in particular, a zero equation parameter for Y_i, Y_j need not correspond to Y_i and Y_j being uncorrelated given any subset of the remaining variables. In econometrics, one then speaks of an endogenous response variable Y_i in a structural equation and of the endogeneity problem.

The change in the dependence of Y_α on Y_b after ignoring the intermediate variable Y_β may be large whenever A_{αb} deviates much from A_{αb.β}. There is no change if in the starting system (2.3) either A_{αβ} = 0, i.e. the final response Y_α is unaffected by the intermediate variable, or A_{βb} = 0, i.e. the intermediate variable does not depend on Y_b. By contrast, the reduced form equations after omitting Y_β,

\[
Y_\alpha = -A_{\alpha\alpha}^{-1}A_{\alpha b.\beta}\,Y_b + \eta_\alpha,
\]

are the subset of equations (2.4) corresponding to Y_α, with Π_{α|b} = −A_{αα}^{-1}A_{αb.β}. Thus, the structural equations after omitting Y_β also make explicit the changes in the dependence of Y_α on Y_b when moving from the least-squares coefficients in (2.3) to the least-squares coefficients in the reduced form equations (2.4).

It may also be verified for the given order V = (α, β, b) that marginalizing over Y_α has no effect on the remaining parameters in (2.3), but that marginalizing over the variables Y_b leads, in general, to a non-diagonal residual covariance matrix. As a consequence of a non-diagonal K, not only may the interpretation of equation parameters as least-squares regression coefficients in the observed variables be lost, but an identification problem may result as well. However, a large identified subclass has been characterized recently, see Brito and Pearl (2002), Wermuth and Cox (2007), Drton, Eichler and Richardson (2007). It has K_{ij} ≠ 0 only if A_{ij} = 0. Expressed differently, the two sets of variable pairs with constraints in A and K may overlap, but the two unconstrained sets of variable pairs within A and within K are disjoint. This class of models helps to detect a type of confounding, called indirect, that had so far essentially been overlooked, see Wermuth and Cox (2007). Indirect confounding poses a problem even for intervention studies that use randomized allocation of treatments to individuals, i.e. for what is regarded as the "gold standard" of designed studies.

2.4. Changing to a different split of three types of linear parameters

Let V = (α, β, γ, δ), again with a = α ∪ β and b = γ ∪ δ, and let another split of V be c = β ∪ γ and d = α ∪ δ. Suppose that inv_b Σ has been studied and estimated in one study, but inv_d Σ in another. Then the change is between the following two sets of parameters:
\[
\operatorname{inv}_b \Sigma =
\begin{pmatrix}
\Sigma_{\alpha\alpha|b} & \Sigma_{\alpha\beta|b} & \Pi_{\alpha|\gamma.\delta} & \Pi_{\alpha|\delta.\gamma} \\
\cdot & \Sigma_{\beta\beta|b} & \Pi_{\beta|\gamma.\delta} & \Pi_{\beta|\delta.\gamma} \\
\sim & \sim & \Sigma^{\gamma\gamma.a} & \Sigma^{\gamma\delta.a} \\
\sim & \sim & \cdot & \Sigma^{\delta\delta.a}
\end{pmatrix},
\qquad
\operatorname{inv}_d \Sigma =
\begin{pmatrix}
\Sigma^{\alpha\alpha.c} & -\Pi^T_{\beta|\alpha.\delta} & -\Pi^T_{\gamma|\alpha.\delta} & \Sigma^{\alpha\delta.c} \\
\sim & \Sigma_{\beta\beta|d} & \Sigma_{\beta\gamma|d} & \Pi_{\beta|\delta.\alpha} \\
\sim & \cdot & \Sigma_{\gamma\gamma|d} & \Pi_{\gamma|\delta.\alpha} \\
\cdot & \sim & \sim & \Sigma^{\delta\delta.c}
\end{pmatrix}.
\]

If we take E(Y) = 0 and let

\[
Z = \Sigma^{-1} Y, \tag{2.5}
\]

then the implicit corresponding change in linear systems is between

\[
\operatorname{inv}_a \Sigma^{-1}
\begin{pmatrix} Z_\alpha \\ Z_\beta \\ Y_\gamma \\ Y_\delta \end{pmatrix}
= \begin{pmatrix} Y_\alpha \\ Y_\beta \\ Z_\gamma \\ Z_\delta \end{pmatrix},
\qquad
\operatorname{inv}_c \Sigma^{-1}
\begin{pmatrix} Y_\alpha \\ Z_\beta \\ Z_\gamma \\ Y_\delta \end{pmatrix}
= \begin{pmatrix} Z_\alpha \\ Y_\beta \\ Y_\gamma \\ Z_\delta \end{pmatrix}. \tag{2.6}
\]
Proof. For equation (2.6), we start with Σ^{-1}Y = Z in mean-centred Y, let σ^{ij} denote the overall concentration of Y_i and Y_j, and let Y_{a|b} = Y_a − Π_{a|b}Y_b denote the variable Y_a corrected for linear least-squares prediction on Y_b. Then

\[
E(Z Z^T) = \Sigma^{-1}, \qquad E(Y Z^T) = I,
\]

so that Y_i corrected for least-squares regression on all remaining variables is

\[
Y_{i|j\neq i} = Y_i + \sum_{j\neq i} (\sigma^{ij}/\sigma^{ii})\,Y_j = Z_i/\sigma^{ii}.
\]

In matrix notation, with D^{-1} a diagonal matrix having elements 1/σ^{ii} = σ_{ii|j≠i} and U = D^{-1}Z, the equation Σ^{-1}Y = Z is modified first into D^{-1}Σ^{-1}Y = U and, after partial inversion of D^{-1}Σ^{-1} with respect to a, into the left equation in (2.6), i.e. into

\[
\operatorname{inv}_a \Sigma^{-1}\,(Z_a; Y_b) = (Y_a; Z_b). \tag{2.7}
\]
For the final step of the proof of (2.6), we use the following result: for any square matrix M, defined as before, and for any positive definite diagonal matrix D*,

\[
\operatorname{inv}_a(D^* M) = (\operatorname{rep}_a D^*)(\operatorname{inv}_a M)(\operatorname{rep}_b D^*)^{-1}.
\]

Equation (2.7) can be rewritten as

\[
Y_{a|b} = \Sigma_{aa|b}\, Z_a, \qquad Y_b = \Sigma_{ba} Z_a + \Sigma_{bb} Z_b, \tag{2.8}
\]

so that, e.g., the linear independence of Y_{a|b} and Y_b in the given form is verified directly with

\[
\operatorname{cov}(\Sigma_{ba} Z_a + \Sigma_{bb} Z_b,\, Z_a) = \Sigma_{ba}\Sigma^{aa} + \Sigma_{bb}\Sigma^{ba} = 0.
\]

After writing (2.8) as

\[
Y_a - \Pi_{a|b} Y_b = \epsilon_a, \qquad \operatorname{cov}(\epsilon_a) = \Sigma_{aa|b}, \qquad
\Sigma^{bb.a}\,Y_b = Z_b + \Pi^T_{a|b} Z_a, \tag{2.9}
\]
the change in parameter sets for equations (2.6) is seen to be from (Π_{a|b}, Σ_{aa|b}, Σ^{bb.a}) to (Π_{c|d}, Σ_{cc|d}, Σ^{dd.c}), and the change in the equations (2.6) is defined by the following three reversible transformations:

\[
\operatorname{inv}_c \Sigma^{-1} = \operatorname{inv}_{a\triangle c} \circ \operatorname{inv}_a \Sigma^{-1},
\qquad
(Z_c; Y_d) = (\operatorname{rep}_{a\triangle d} \circ \operatorname{inv}_a \Sigma^{-1})(Z_a; Y_b),
\qquad
(Z_d; Y_c) = (\operatorname{rep}_{a\triangle c} \circ \operatorname{inv}_a \Sigma^{-1})(Z_a; Y_b). \tag{2.10}
\]

Proof. The transformations in (2.10) result from equations (1.8) and (1.12), and they lead, again with (1.8), to inv_c Σ^{-1}(Z_c; Y_d) being defined in terms of (rep_{a△c} ∘ inv_a Σ^{-1})(Z_a; Y_b).

2.5. Changing from concentrations to parameters of linear chain graphs

Parameter sets in linear multivariate regression chains are obtained when, starting with equations (2.5) and (2.9), the equations on the right of (2.9), involving the marginalized concentration matrix, are repeatedly modified in the same way as Σ^{-1}Y = Z is transformed into equations (2.9). Multivariate regression chains are used for the study of panel data and are simplified using zero constraints, see Cox and Wermuth (1993, 1996).

By using the same type of repeated transformations as for multivariate regression chains, but selecting different sets of parameters, distinct types of linear chain graph models are defined, see Wermuth, Wiedenbeck and Cox (2006). For instance, models result that have been called blocked concentration chains or LWF-chains (for Lauritzen and Wermuth (1989) and Frydenberg (1990)), and those called mixed concentration-regression chains or AMP-chains (for Andersson, Madigan and Perlman (2001)). The interpretation of zero constraints depends on the selected type of parameter, so that the different sets of zero constraints permit formulating various types of research hypotheses that may be of interest in different subject-matter contexts. Some direct extensions are to constraints in nonlinear chain graph models within the context of the exponential family models discussed in the next sections. Typically, any conditional independence hypothesis then involves several parameters that are constrained to be zero.
3. Estimation in reduced exponential families

For a full exponential family, we take the log likelihood, after disregarding terms that do not depend on the unknown parameter, in the form s^T φ − K(φ), where φ is the canonical parameter and s the sufficient statistic, a realization of the random variable S. The cumulant generating function of S under the full exponential family, K(φ + t) − K(φ), gives the mean or moment parameter, η = ∇K(φ), as the gradient of K(φ) with respect to φ, i.e. as a vector of first derivatives. The maximum-likelihood estimate of η is η̂ = s. The gradient of η with respect to φ gives the covariance matrix of S but also the concentration matrix of the maximum-likelihood estimate φ̂ of the canonical parameter, that is

\[
\operatorname{cov}(S) = \nabla\nabla^T K(\phi) = \operatorname{con}(\hat\phi),
\]

where φ̂ denotes for simplicity both a maximum-likelihood estimate of φ and the corresponding random variable. The notation I = ∇∇^T K(φ) reminds one that R. A. Fisher interpreted minus the second derivative of the log likelihood as the information about the canonical parameter contained in a single observation.

Given η̂ and Ī, the observed information matrix, i.e. I evaluated at η̂ = s, studentized statistics for testing that an individual component η_i of η is zero are obtained for a large sample size n as η̂_i/√Ī_{ii}. Similarly, since the random variable I^{-1}(η̂ − η) has mean zero and covariance matrix I^{-1}, and hence the same mean and variance as (φ̂ − φ), a studentized statistic for testing that an individual component φ_i of φ is zero may for large n be computed as φ̂_i/√(Ī^{-1})_{ii}.

At a maximum of the likelihood function under a full exponential model, often also called the saturated or the largest covering model, it holds that

\[
\bar I^{-1}\hat\eta = z, \qquad z = \bar I^{-1} s, \tag{3.1}
\]

which is in the form of (2.5), so that the results of the previous section apply to it, especially as used below for (3.4). We now consider special reduced exponential models, those given by the constraints

\[
\eta_c = 0
\]
for some subset of elements of η, writing η = (η_u, η_c). By differentiating the Lagrangian s^T φ − K(φ) − λ^T η_c with respect to φ, the maximum-likelihood estimating equations are obtained as

\[
\hat\eta_u = s_u - \hat I_{uc}\,\hat I_{cc}^{-1} s_c, \qquad \hat\eta_c = 0, \tag{3.2}
\]
where the matrix I is evaluated at the maximum-likelihood estimate in the reduced model. Since the solution of maximum-likelihood equations in general requires iterative algorithms and may not be unique, an efficient closed-form approximation is useful, see Cox and Wermuth (1990), Wermuth, Cox and Marchetti (2006); it has been called the reduced model estimate of η_u. Equation (3.2) is thereby modified into

\[
\tilde\eta_u = s_u - \tilde I_{uc}\,\tilde I_{cc}^{-1} s_c, \qquad \tilde\eta_c = 0, \tag{3.3}
\]
where the (u, c) and (c, c) components of Î in (3.2) have been replaced by the corresponding components of the asymptotic covariance matrix Ĩ of S, which do not involve unknown parameters. Equations (3.2) and (3.3) also result, by using (2.9) and inv_u I^{-1} = inv_c I in (3.1), when the (u, c) and (c, c) components of the matrix I are replaced by Î and by Ĩ, respectively, and the (u, u) component is evaluated at η_c = 0.

Since the use of reduced model estimates is recommended only in situations in which the constraints agree well with the observed data, choosing the full observed matrix Ī of φ̂ under the saturated model and then partially inverting it on c should not give estimates which differ much from those in equations (3.2) and (3.3). By equations (2.9) and (3.1), we then have

\[
\bar\eta_u = s_u - \bar I_{uc}\,\bar I_{cc}^{-1} s_c,
\qquad
\bar I^{cc.u}\,\bar\eta_c = s_c - (\bar I_{uc}\,\bar I_{cc}^{-1})^T s_u, \tag{3.4}
\]

with η̄_u = η̃_u, cov(η̄_u) = Ī_{uu|c} and cov(η̄_c) = Ī_{cc}. In the case of a poor fit to the hypothesis η_c = 0, i.e. with some components of η̄_c deviating much from zero, equations (2.10) permit a direct change to the fit under an alternative hypothesis. Since Ī_{uc}Ī_{cc}^{-1} can be viewed as the observed coefficient of S_c in linear least-squares regression of S_u on S_c, standard results apply for testing that I_{uc}I_{cc}^{-1} = 0, given an estimate of the appropriate covariance matrix. Expansions into relevant interaction parameters lead to studentized statistics of interaction terms and provide insight into where a possibly poor fit is located.
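As a minimal numerical sketch of the reduced model estimate (3.3), consider a trivariate mean parameter with a known asymptotic covariance matrix of S and the constraint that the last component of η is zero; the matrix, the observed statistic s and the split into u and c are illustrative choices of ours.

```python
import numpy as np

I_tilde = np.array([[1.0, 0.4, 0.2],
                    [0.4, 1.0, 0.5],
                    [0.2, 0.5, 1.0]])          # assumed asymptotic covariance of S
s = np.array([0.9, -0.3, 0.1])                 # observed sufficient statistic
u, c = [0, 1], [2]

eta_u = s[u] - I_tilde[np.ix_(u, c)] @ np.linalg.solve(I_tilde[np.ix_(c, c)], s[c])
eta = np.zeros_like(s)
eta[u] = eta_u                                  # reduced model estimate, with eta_c = 0
cov_eta_u = I_tilde[np.ix_(u, u)] - I_tilde[np.ix_(u, c)] @ np.linalg.solve(
    I_tilde[np.ix_(c, c)], I_tilde[np.ix_(c, u)])   # conditional covariance, cf. (3.4)
```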
See Cox and Wermuth (1990) for examples, and Lauritzen and Wermuth (1989) for a discussion of interaction parameters in the case of Conditional-Gaussian (CG) distributions and of CG-regressions. The latter contain, for instance, logistic regression as a special case, for which in general iterative fitting algorithms are needed to give η̂ even under the saturated model, see Edwards and Lauritzen (2001).

The closed-form estimates of η_u in (3.4) provide a new justification for the reduced model estimates: they result by partially inverting equations (3.1) with respect to the subset given by the set of unconstrained parameters. These estimates can also be viewed as the generalized least-squares estimates of Aitken (1935), which, for instance for the multinomial distribution, turn into the estimates of Grizzle, Starmer and Koch (1969), see Cox and Snell (1981, appendix 1), Cox and Wermuth (1990, section 7). Under some general regularity conditions, equations (3.1), and hence the estimates of η_u in (3.4), arise from asymptotic theory in more general settings than the exponential family, see Cox (2006, chapter 6). They are then called sandwich estimates, which had been introduced in a special context by Huber (1964), or, in the context of generalized linear models, approximate residual maximum-likelihood (REML) estimates, which had been derived by Patterson and Thompson (1971).

By arguments similar to those above, the maximum-likelihood equations with zero constraints on canonical parameters are obtained in a form comparable to equations (3.2). From a theoretical viewpoint, these may be more attractive than zero constraints on moment parameters. Firstly, if the constrained canonical parameters exist, then the maximum-likelihood equations have a unique solution. Secondly, the sets of minimal sufficient statistics are of reduced size. Thirdly, estimates are available in closed form for Gaussian and for multinomial distributions provided the model is decomposable, that is, the associated independence graph can be arranged in a sequence of possibly overlapping but complete prime graphs, see Cox and Wermuth (1999). For non-decomposable models, it is in addition often simple to find a not much larger, decomposable covering model.

4. Switching partially between canonical and moment parameters

So far, we have considered only linear mappings, even though these were relevant both for nonlinear model formulations and for correlated data. We shall now illustrate how changes between moment and canonical parameters in exponential families may be obtained in terms of partial replication. A general formulation has been given by Cox (2006, section 6.4), including a short proof of orthogonality, i.e. uncorrelatedness, of canonical and moment parameters and of the asymptotic independence of the corresponding maximum-likelihood estimates, see also Barndorff-Nielsen (1978, section 9.8). These general results have often been proven
for specific members of exponential families, involving the then necessary lengthy, detailed arguments.

In the notation of the present paper, for the moment parameter η = ∇K(φ) and the canonical parameter φ with maximum-likelihood estimate denoted by φ̂, let s = (s_a, s_b) represent a split of the sufficient statistic into two column vectors, and let φ_b be replaced by η_b, the corresponding component of η, to give the mixed parameter vector ψ = (φ_a, η_b). Then, with

\[
I = \frac{\partial \eta^T}{\partial \phi},
\qquad
\operatorname{rep}_a I = \frac{\partial \psi^T}{\partial \phi}
= \begin{pmatrix} I_{aa} & 0_{ab} \\ I_{ba} & I_{bb} \end{pmatrix},
\]

one obtains, given φ̂, a maximum-likelihood estimate of ψ under the saturated model as

\[
\hat\psi = \operatorname{rep}_a \bar I\, \hat\phi,
\qquad
\operatorname{cov}(\hat\psi) = (\operatorname{rep}_a \bar I)\,\bar I^{-1} (\operatorname{rep}_a \bar I)^T
= \begin{pmatrix} \bar I_{aa|b}^{-1} & 0_{ab} \\ 0_{ba} & \bar I_{bb} \end{pmatrix}. \tag{4.1}
\]

The two random variables corresponding to φ̂_a and η̂_b are uncorrelated and, given their asymptotic joint Gaussian distribution, they are hence also asymptotically independent. One simple example is a joint Gaussian distribution with η̂_b the observed mean of Y_b and φ̂_a the observed overall concentration matrix of Y_a. Another is for two dichotomous variables with η̂_b the difference in observed frequencies for Y_b and φ̂_a the observed log-odds ratios of Y_a.

Two complementary mappings may, respectively, be represented by

\[
(\phi_a; \eta_b) = \operatorname{rep}_a I\, \phi, \qquad (\eta_a; \phi_b) = \operatorname{rep}_b I\, \phi,
\]

so that for such mappings, with the specific choice M = I, we have as in equations (1.2) and (1.8)

\[
(\eta_a; \phi_b) = \operatorname{inv}_a I\,(\phi_a; \eta_b),
\qquad
\operatorname{inv}_a I = (\operatorname{rep}_a I)(\operatorname{rep}_b I)^{-1}. \tag{4.2}
\]

One important consequence of (4.2) is that, given (4.1), the change from a split (a, b) to another split (c, d), with canonical parameters for c and moment parameters for d, is possible by using (2.10) after just replacing Σ by I. Another application is to constrained chain graph models of different types when the conditional distributions are members of the exponential family. For the computation of the observed information matrix in the case of both discrete and continuous variables, see Dempster (1973) and Cox and Wermuth (1990). Whenever the population matrix I is replaced by its observed counterpart Ī, the linear relation in
equation (4.2) between sets of parameters in possibly nonlinear models turns into a linear relation between the corresponding maximum-likelihood estimates.

Acknowledgement

We thank the Swedish Research Society for supporting our cooperation via the Gothenburg Stochastic Center, and D. R. Cox and G. M. Marchetti for their most helpful comments.

References

Aitken, A. C. (1935). On least squares and linear combination of observations. Proc. Roy. Soc. Edin. A 55, 42-48.
Anderson, T. W. (1958). An Introduction to Multivariate Statistical Analysis. Wiley, New York.
Anderson, T. W. (1969). Statistical inference for covariance matrices with linear structure. In: Multivariate Analysis II (Edited by P. R. Krishnaiah), 55-66. Academic Press, New York.
Anderson, T. W. (1973). Asymptotically efficient estimation of covariance matrices with linear structure. Ann. Statist. 1, 135-141.
Andersson, S., Madigan, D. and Perlman, M. D. (2001). Alternative Markov properties for chain graphs. Scand. J. Statist. 28, 33-86.
Barndorff-Nielsen, O. E. (1978). Information and Exponential Families in Statistical Theory. Wiley, Chichester.
Brito, C. and Pearl, J. (2002). A new identification condition for recursive models with correlated errors. Struct. Equation Mod. 9, 459-474.
Cochran, W. G. (1938). The omission or addition of an independent variate in multiple linear regression. Suppl. J. Roy. Statist. Soc. 5, 171-176.
Cox, D. R. (2006). Principles of Statistical Inference. Cambridge University Press, Cambridge.
Cox, D. R. (2007). On a generalization of a result of W. G. Cochran. Biometrika 94. To appear.
Cox, D. R. and Wermuth, N. (1990). An approximation to maximum-likelihood estimates in reduced models. Biometrika 77, 747-761.
Cox, D. R. and Wermuth, N. (1993). Linear dependencies represented by chain graphs (with discussion). Statist. Science 8, 204-218; 247-277.
Cox, D. R. and Wermuth, N. (1996). Multivariate Dependencies – Models, Analysis and Interpretation. Chapman and Hall, London.
Cox, D. R. and Wermuth, N. (1999). Likelihood factorizations for mixed discrete and continuous variables. Scand. J. Statist. 26, 209-220.
Cox, D. R. and Wermuth, N. (2003). A general condition for avoiding effect reversal after marginalization. J. Roy. Statist. Soc. B 65, 937-941.
Cox, D. R. and Wermuth, N. (2004). Causality: a statistical view. Int. Statist. Rev. 72, 285-305.
Dempster, A. P. (1969). Elements of Continuous Multivariate Analysis. Addison-Wesley, Reading.
Dempster, A. P. (1972). Covariance selection. Biometrics 28, 157-175.
Dempster, A. P. (1973). Aspects of the multinomial logit model. In: Proc. 3rd Symp. Mult. Anal. (Edited by P. R. Krishnaiah), 129-142. Academic Press, New York.
Drton, M. and Richardson, T. S. (2004). Multimodality of the likelihood in the bivariate seemingly unrelated regression model. Biometrika 91, 383-392.
Drton, M., Eichler, M. and Richardson, T. S. (2007). Identification and likelihood inference for recursive linear models with correlated errors. Submitted.
Edwards, D. and Lauritzen, S. L. (2001). The TM algorithm for maximising a conditional likelihood function. Biometrika 88, 961-972.
Eusebi, P. (2007). A graphical method for assessing the identification of linear structural equation models. Struct. Equation Mod. To appear.
Fisher, F. M. (1966). The Identification Problem in Econometrics. McGraw-Hill, New York.
Frydenberg, M. (1990). The chain graph Markov property. Scand. J. Statist. 17, 333-353.
Grizzle, J. E., Starmer, C. F. and Koch, G. C. (1969). Analysis of categorical data by linear models. Biometrics 28, 137-156.
Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35, 73-101.
Lauritzen, S. L. (2000). Causal inference from graphical models. In: Complex Stochastic Systems (Edited by O. E. Barndorff-Nielsen, D. R. Cox and C. Klüppelberg), 63-107. Chapman and Hall, London.
Lauritzen, S. L. and Wermuth, N. (1989). Graphical models for associations between variables, some of which are qualitative and some quantitative. Ann. Statist. 17, 31-54.
Lindley, D. V. (2002). Seeing and doing: the concept of causation. Int. Statist. Rev. 70, 191-214.
Ma, Z., Xie, X. and Geng, Z. (2006). Collapsibility of distribution dependence. J. Roy. Statist. Soc. B 68, 127-133.
Marchetti, G. M. and Wermuth, N. (2007). Matrix representations and independencies in directed acyclic graphs. Submitted.
Patterson, H. D. and Thompson, R. (1971). Recovery of interblock information when block sizes are unequal. Biometrika 58, 545-554.
Pearl, J. (2000). Causality. Cambridge University Press.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. J. Educational Psychol. 66, 688-701.
Wermuth, N. and Cox, D. R. (2004). Joint response graphs and separation induced by triangular systems. J. Roy. Statist. Soc. B 66, 687-717.
Wermuth, N. and Cox, D. R. (2007). Distortion of effects caused by indirect confounding. Biometrika. To appear.
Wermuth, N., Cox, D. R. and Marchetti, G. M. (2006). Covariance chains. Bernoulli 12, 841-862.
Wermuth, N., Wiedenbeck, M. and Cox, D. R. (2006). Partial inversion for linear systems and partial closure of independence graphs. BIT Num. Math. 46, 883-901.
Xie, X., Ma, Z. and Geng, Z. (2007). Some association measures and their collapsibility. Statistica Sinica. To appear.
Zellner, A. (1962). An efficient method of estimating seemingly unrelated regressions and tests for aggregation bias. J. Amer. Statist. Assoc. 57, 348-368.
Center of Survey Research, Mannheim, Germany
Email: [email protected]

Department of Statistics, Chalmers/Göteborgs Universitet, Sweden
Email: [email protected]