Biometrika (2008), 94, 2, pp. 1–7
© 2008 Biometrika Trust. Printed in Great Britain
Advance Access publication on 31 July 2008
On assessing vector valued parameters

BY A. C. DAVISON
Institute of Mathematics, Ecole Polytechnique Fédérale de Lausanne, Station 8, 1015 Lausanne, Switzerland
[email protected]

D. A. S. FRASER AND N. REID
Department of Statistics, University of Toronto, Toronto, Canada M5S 3G3
[email protected]  [email protected]

N. SARTORI
Dipartimento di Statistica, University of Venice, 30121 Venice, Italy
[email protected]

1. INTRODUCTION

We consider the assessment of a vector valued parameter for a regular statistical model with data. The approach involves the Taylor expansion of the log-model, the separation of terms by asymptotic accuracy O(1), O(n^{−1/2}) and O(n^{−1}), and the subsequent recombination of terms in a coordinate-free form. Our particular interest centers on vector valued parameters in the presence of nuisance parameters, with special concern for discrete data.

The model f(y; θ) with observed data y⁰ leads immediately to the observed likelihood L(θ) = cf(y⁰; θ), which can often be examined directly. In turn the log-likelihood ℓ(θ) = log L(θ) gives the maximum likelihood value θ̂ and the observed information ĵ_θθ = −(∂²/∂θ∂θ′)ℓ(θ)|_θ̂ recording curvature, a second-derivative value or matrix at the maximum ℓ(θ̂).

The Taylor expansion of the model as demonstrated below gives an approximate Normal(θ̂; ĵ_θθ^{−1}) distribution for the maximizing value θ̂. The departure of θ̂ from some hypothesis H = {ψ(θ) = ψ} can be put in a conventional coordinate-free form as twice the drop in likelihood from the overall maximum ℓ(θ̂) to the constrained maximum ℓ(θ̂_ψ),

    r_ψ² = 2{ℓ(θ̂) − ℓ(θ̂_ψ)},

where the constrained θ̂_ψ gives the maximum subject to ψ(θ) = ψ. The first-order reference distribution for r_ψ² is the chi-square distribution with degrees of freedom given as the number of constrained parameter coordinates in ψ(θ) = ψ. Our focus is on the higher-order separation of information concerning component parameters of interest, free of related nuisance parameters, and using directed measures of departure.

First consider the case where the variable y and the parameter θ have the same dimension, and suppose the model has asymptotic properties so that ℓ(θ; y) = log f(y; θ) has O(n) dependence on some antecedent sample size. This can arise if the model is conditional or marginal or a mix of such from some larger background model with increasing data size n. We consider the Taylor expansion of the log-model ℓ(θ; y) in terms of both parameter θ and variable y about the observed (θ̂⁰, y⁰), using centered and scaled coordinates.
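The first-order recipe just described, twice the likelihood drop referred to a chi-square distribution, can be sketched in a few lines of code; the normal model, the data and the tested value below are illustrative assumptions only, not taken from the paper.

import numpy as np
from scipy import stats

y = np.array([1.2, 0.8, 1.9, 2.3, 1.1, 0.4, 1.7, 2.0])   # hypothetical data
n = len(y)

def loglik(mu, sigma):
    # observed log-likelihood of the normal model
    return np.sum(stats.norm.logpdf(y, loc=mu, scale=sigma))

# overall maximum likelihood value (closed form for this model)
mu_hat = y.mean()
sigma_hat = np.sqrt(np.mean((y - mu_hat) ** 2))

# constrained maximum under the hypothesis psi(theta) = mu = mu0
mu0 = 1.0
sigma_hat_psi = np.sqrt(np.mean((y - mu0) ** 2))

# twice the drop in log-likelihood, referred to chi-square with 1 degree of freedom
r2_psi = 2 * (loglik(mu_hat, sigma_hat) - loglik(mu0, sigma_hat_psi))
p_first_order = stats.chi2.sf(r2_psi, df=1)
print(r2_psi, p_first_order)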
We begin by examining the scalar y and scalar θ case. Let a new θ designate the departure of the initial θ from θ̂⁰ as standardized by observed information, (θ − θ̂⁰)ĵ_θθ^{1/2}, and then let a new y designate the departure of the initial y from y⁰, scaled so that the new cross Hessian ∂²ℓ(θ; y)/∂θ∂y = 1 at the expansion point (θ̂⁰, y⁰). This gives the following form for the model, neglecting terms of order O(n^{−1/2}):

    f₁(y; θ) = (2π)^{−1/2} exp{−y²/2 + yθ − θ²/2};   (1)

the coefficient of the y² term follows from the norming property. The observed p-value function p(θ) = F₁(y⁰; θ) can then be written p(θ) = Φ{r(θ)}, where

    r(θ) = sgn(θ̂⁰ − θ)[2{ℓ(θ̂⁰) − ℓ(θ)}]^{1/2}   (2)

is the signed likelihood root.
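As a concrete illustration of the first-order p-value function p(θ) = Φ{r(θ)}, and not one of the paper's standardized expansions, the following sketch uses a hypothetical exponential(rate θ) sample for which ℓ(θ) is available in closed form.

import numpy as np
from scipy import stats

y0 = np.array([0.9, 2.1, 0.3, 1.4, 0.7])      # hypothetical observed data
n, s = len(y0), y0.sum()

def ell(theta):
    # observed log-likelihood for the exponential(rate theta) sample
    return n * np.log(theta) - theta * s

theta_hat = n / s                              # maximum likelihood value

def r(theta):
    # signed likelihood root as in (2)
    return np.sign(theta_hat - theta) * np.sqrt(2 * (ell(theta_hat) - ell(theta)))

def p(theta):
    # first-order p-value function p(theta) = Phi{r(theta)}
    return stats.norm.cdf(r(theta))

print(p(0.5 * theta_hat), p(2.0 * theta_hat))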
We can however further expand the log-model to include cubic terms and then neglect terms of order O(n^{−1}). To obtain a familiar form for the model we reexpress the variable and the parameter in accord with exponential model form, thus eliminating the θ²y/2n^{1/2} and θy²/2n^{1/2} terms; this gives

    f₂(y; θ) = f₁(y; θ) exp{α₃y³/6n^{1/2} − α₃y/2n^{1/2} − α₃θ³/6n^{1/2}},
where α₃ is a mathematical parameter partly describing the observed likelihood, and the coefficients of y and y³ are determined by the norming constant using E{y³ − 3y − θ³; θ} = 0 for the Normal(θ; 1) distribution indicated by f₁(y; θ); for some related detail see Cakmak et al. (1998) and Andrews et al. (2005). Now suppose we expand the model to include quartic terms and then neglect terms of order O(n^{−3/2}); and suppose we also reexpress the parameter and the variable towards exponential model form. This reexpression forces the coefficients of θ²y/2n^{1/2}, θ³y/6n, θy²/2n^{1/2} and θy³/6n to be zero, but cannot do so for the quadratic-quadratic term γθ²y²/4n, where γ is a coefficient that reflects non-exponentiality in the model form. If γ = 0, then the model is exponential to order O(n^{−3/2}) and can be written
    f₃(y; θ) = f₂(y; θ) exp{−α₄y⁴/24n + (α₄ − 3α₃²)y²/24n − (α₄ − 2α₃²)/4n + (3α₄ − 5α₃²)θ⁴/24n},
where α₄ is a mathematical parameter also describing an aspect of the observed likelihood. The coefficients and the norming constant are available from Normal(θ, 1) integrals; see Andrews et al. (2005). As an exponential model, the observed p-value p(θ) = F₃(y⁰; θ) is available from Lugannani & Rice (1980), Barndorff-Nielsen (1986), Daniels (1987), or by Normal(θ, 1) integrals as in Andrews et al. (2005). It can be written

    p(θ) = Φ{r − r⁻¹ log(r/q)},   (3)

where r is the signed likelihood root (SLR) and q is the standardized maximum likelihood departure on a scale provided by the canonical parameter of the exponential model,

    ϕ(θ) = (∂/∂y) log f(y; θ)|_{y⁰};   (4)
thus

    q = {ϕ(θ̂⁰) − ϕ(θ)} ĵ_ϕϕ^{1/2}.   (5)
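A hedged sketch of how (2)-(5) combine for a scalar parameter follows; the exponential-rate model and data are again illustrative assumptions, and in this scalar case ĵ_ϕϕ^{1/2} is computed as the signed quantity ĵ_θθ^{1/2}/ϕ_θ(θ̂⁰), a convention that keeps r and q pointing in the same direction.

import numpy as np
from scipy import stats

y0 = np.array([0.9, 2.1, 0.3, 1.4, 0.7])       # hypothetical observed data
n, s = len(y0), y0.sum()

def ell(theta):                                 # observed log-likelihood
    return n * np.log(theta) - theta * s

def phi(theta):
    # (4): gradient of the log-likelihood in y at y0, summed over the sample direction
    return -n * theta

theta_hat = n / s
j_theta = n / theta_hat ** 2                    # observed information in theta
dphi = -n                                       # phi'(theta_hat)

def p3(theta):
    # third-order p-value (3) built from r and q
    r = np.sign(theta_hat - theta) * np.sqrt(2 * (ell(theta_hat) - ell(theta)))
    q = (phi(theta_hat) - phi(theta)) * np.sqrt(j_theta) / dphi
    return stats.norm.cdf(r - np.log(r / q) / r)

print(p3(0.5 * theta_hat), p3(2.0 * theta_hat))

The same recipe applies whenever ℓ(θ) and ϕ(θ) are available, in closed form or numerically, away from the singular point θ = θ̂⁰ where r = 0.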
The full O(n^{−3/2}) expansion of the log-model as log f₃(y; θ) = Σ_{i,j} a_ij θ^i y^j / i!j! can be summarized by recording the matrix array A = {a_ij} of coefficients to order O(n^{−3/2}), with i indexing the power of θ and j the power of y, and including contributions from the term γθ²y²/4n omitted above. The norming property gives all terms by using E{y²θ² − y⁴ + 5y² − 2; θ} = 0 for the Normal(θ; 1) distribution:
A=
a+
3α4 −5α23 −12γ 24n
0 −1 3 − nα1/2 α4 −n
−α3 2n1/2
−{1 +
1 0 0 −
α4 −2α23 −5γ } 2n
0 γ/n − −
α3 n1/2
α4 −3α23 −6γ 24n
− − − −
0 − − −
where a = −(1/2) log(2π) and omitted terms are O(n^{−3/2}). The observed p-value p(θ) = F₃(y⁰; θ) for the general case that includes γ is immediately available from the same expression (3) above, as a consequence of the extraordinary fact that the p-value does not depend on γ to the third order; this can be verified by direct computer integration (Andrews et al., 2005) or by direct integration by parts, although it is implicit in many of the references cited above. If the general density f₄(y; θ) acquires an appended factor a(y; θ) that is constant to first order and comes, say, from the elimination of nuisance parameters, then the observed distribution function is again given to third order by (3), but with the following modification of (5):

    q = {ϕ(θ̂⁰) − ϕ(θ)} ĵ_ϕϕ^{1/2} · a(y⁰; θ̂⁰)/a(y⁰; θ);
for details, see Cheah et al. (1995). The third order p-value (3) uses only ℓ(θ) and ϕ(θ), the observed likelihood and the observed likelihood gradient.

One of the possible forms for the initial model is the exponential, which can be written

    f(y; θ) dy = c exp[ℓ(θ) + s′{ϕ(θ) − ϕ̂⁰}] h(s) ds   (6)

and then rewritten (Daniels, 1954) in the saddlepoint form as

    f(y; θ) dy = {e^{k/n}/(2π)^{p/2}} exp{−r²(s; θ)/2} |ĵ_ϕϕ(s)|^{−1/2} ds   (7)

with observed value say s⁰ = 0, where s designates a score variable and k is constant to third order. We have written (7) for the vector y and θ case of dimension p, although the preceding discussion focused on the scalar y and θ case. As (6) and (7) involve only likelihood and first derivative likelihood at the data, the exponential approximation is reasonably called the tangent exponential model at the observed data.

In Section 2, for an interest parameter ψ(θ), we record the various levels of accuracy that are available for the observed likelihood ℓ(ψ) and also for the observed likelihood gradient ϕ(ψ), depending on the available model information. In Section 3 we show how the related density function can be determined on a score line that joins the expected score, say s₀, under the hypothesis ψ(θ) = ψ with the observed score s⁰, and then how a directed p-value for testing ψ(θ) = ψ can be determined. Examples including discrete data models are examined in Section 4; Section 5 presents some discussion.
2. ACCURACY OF LIKELIHOOD AND LIKELIHOOD GRADIENT

(i) Accuracy of p-values. In Section 1 we have indicated how p-values for a scalar parameter can be obtained from ℓ(θ) and ϕ(θ), and thus from the nominal exponential model f(s; θ) ds = exp{ℓ(θ) + s′ϕ(θ)} h(s) ds, with data s⁰ = 0. The accuracy of the p-value is the lesser of the accuracies of ℓ(θ) and ϕ(θ); functions such as ϕ̂(s) and ĵ(s) are defined as functions of the score variable s of the nominal exponential model.

(ii) Accuracy concerning the full parameter θ. The observed likelihood ℓ(θ) = ℓ(θ; y⁰) is obtained by substituting y⁰ in the log-model log f(y; θ) and would then typically be available with full accuracy. For discrete models however we will use a continuous approximation to the model; the usual third order accuracy for the continuous model makes available just O(n^{−1}) accuracy for the discrete model; for some discussion see Davison et al. (2006).

The observed likelihood gradient ϕ₃(θ) = (d/dV)ℓ(θ; y)|_{y⁰} is the gradient of the likelihood in critically chosen directions V, known to be tangent to a second order ancillary (Fraser et al., 1999; Fraser & Reid, 1995; Fraser et al., 2009). The individual directions recorded in V provide exact first derivative ancillary directions for one-dimensional departures from the observed maximum likelihood value (Fraser et al., 2009), but can often be difficult to calculate or determine. The resulting ϕ₃(θ) however does lead to third order accuracy for inference concerning a scalar parameter.

Several alternatives are available that lead to a ϕ₂(θ) having second order accuracy. Barndorff-Nielsen (1986) works within a finite dimensional conditioned model and then further conditions iteratively on likelihood ratio values within an embedding exponential model; this gives directions Ṽ_BN. Fraser & Reid (1998, 2001) and Davison et al. (2006) replace the given variable y by the score variable s = ℓ_;θ(θ̂⁰; y) and examine directions Ṽ_S = (d/dθ)E{s; θ}|_{θ̂⁰}. Skovgaard (1996) estimates directions for conditioning and derives p-values for scalar parameters. Reid & Fraser (2009) note that the resulting Skovgaard directions depend on just the maximum likelihood value and thus in effect can be obtained from the marginal model f(θ̂; θ) for θ̂. In this latter case a second order accurate version of ϕ(θ) can be obtained as ϕ₂(θ) = (∂/∂θ̂) log f(θ̂; θ)|_{θ̂=θ̂⁰}; and a variation on this is given as

    ϕ̃₂(θ) = (∂/∂θ₀) E{log f(y; θ); θ₀}|_{θ₀=θ̂⁰}.   (8)
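As a small worked instance of (8), an illustrative assumption rather than an example from the paper, take y₁, ..., y_n independent exponential with rate θ, so that log f(y; θ) = n log θ − θΣy_i. Then E{log f(y; θ); θ₀} = n log θ − nθ/θ₀, and

    ϕ̃₂(θ) = (∂/∂θ₀)(n log θ − nθ/θ₀)|_{θ₀=θ̂⁰} = nθ/(θ̂⁰)²,

a linear function of θ and hence affinely equivalent to the exact canonical parameter of this exponential family; since ϕ is used only up to affine equivalence, ϕ̃₂ is here as good as exact.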
The use of ϕ̃₂(θ) reproduces Skovgaard's second order p-values; in effect the calculation of ϕ(θ) replaces an observed log-likelihood by an observed average log-likelihood, E{log f(y; θ); θ₀}|_{θ₀=θ̂⁰}, which is called an information by Kent (1982).

(iii) Accuracy concerning a component ψ(θ). Now consider inference for a component parameter of dimension d. A first-order accurate likelihood is available immediately as the profile,

    ℓ₁(ψ) = sup_λ ℓ(ψ, λ) = ℓ(ψ, λ̂_ψ) = ℓ(ψ),
where λ̂_ψ is the constrained maximum likelihood value of the complementary nuisance parameter λ.

A second-order accurate version of marginal likelihood is available (Fraser, 2003), using a nuisance parameter correction based on a second order accurate likelihood gradient ϕ(θ):

    ℓ₂(ψ) = ℓ(ψ, λ̂_ψ) + (1/2) log |ĵ_(λλ)(θ̂_ψ)|,
where ĵ_(λλ)(θ̂_ψ) is the estimated nuisance information ĵ_λλ(θ̂_ψ) rescaled to the ϕ(θ) parameterization, giving

    |ĵ_(λλ)(θ̂_ψ)| = |ĵ_λλ(θ̂_ψ)| · |ϕ_λ(θ̂_ψ)|^{−2},

where |X| = |X′X|^{1/2} uses the p × (p − d) array of partial derivatives

    X = ∂ϕ(ψ, λ)/∂λ |_{λ=λ̂_ψ}.
A third-order accurate likelihood can be obtained by using a third order ϕ(θ) and making the result sample space invariant; this is achieved by using a third order version of ϕ(θ) that has observed information equal to the identity,

    ϕ̄(θ) = ĵ_ϕϕ^{1/2} ϕ(θ),   (9)

where ĵ_ϕϕ^{1/2} is a right square root of ĵ_ϕϕ = (ĵ_ϕϕ^{1/2})′ ĵ_ϕϕ^{1/2}. This gives

    ℓ₃*(ψ) = ℓ(ψ, λ̂_ψ) + (1/2) log |ĵ_(λ̄λ̄)(θ̂_ψ)|,

where

    |ĵ_(λ̄λ̄)(θ̂_ψ)| = |ĵ_λλ(θ̂_ψ)| · |ϕ′_λ(θ̂_ψ) ĵ_ϕϕ ϕ_λ(θ̂_ψ)|^{−1}.

The likelihood gradient parameterization to use with the component ψ(θ) is available as rotated coordinates of ϕ̄(θ) perpendicular to the λ = λ̂_ψ contour at ϕ̄(θ̂_ψ). The likelihood is modified by the transformation, since it is typically calculated in different subspaces for different ψ values (Fraser, 2003).
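A numerical sketch of the second-order adjustment ℓ₂(ψ) follows; the normal model, the data, the choice of ϕ as the exact canonical parameter, and the finite-difference settings are our illustrative assumptions, not the paper's. The nuisance parameter is profiled out numerically and the nuisance information is rescaled through X = ∂ϕ/∂λ.

import numpy as np
from scipy import optimize, stats

y = np.array([4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14])  # hypothetical data
n = len(y)

def ell(psi, lam):
    # log-likelihood; interest psi = sigma, nuisance lam = mu
    return np.sum(stats.norm.logpdf(y, loc=lam, scale=psi))

def phi(psi, lam):
    # exact canonical parameter of the normal model, used as the likelihood gradient
    return np.array([lam / psi**2, -0.5 / psi**2])

def ell2(psi, eps=1e-4):
    # constrained maximum over the nuisance parameter
    lam_hat = optimize.minimize_scalar(lambda lam: -ell(psi, lam)).x
    # observed nuisance information j_{lambda lambda} by central differences
    j_ll = -(ell(psi, lam_hat + eps) - 2 * ell(psi, lam_hat) + ell(psi, lam_hat - eps)) / eps**2
    # X = d phi / d lambda at the constrained maximum, with |X| = |X'X|^{1/2}
    X = (phi(psi, lam_hat + eps) - phi(psi, lam_hat - eps)) / (2 * eps)
    detX = np.sqrt(X @ X)
    # l2(psi) = profile log-likelihood + (1/2) log of the rescaled nuisance information
    return ell(psi, lam_hat) + 0.5 * np.log(j_ll / detX**2)

# For this model l2(sigma) reproduces the restricted (marginal) likelihood of sigma,
# -(n - 1) log sigma - sum (y_i - ybar)^2 / (2 sigma^2), up to a constant.
for sigma in (0.5, 0.7, 1.0):
    print(sigma, ell2(sigma))

The closed-form check in the final comment is a convenient way to see that the rescaling removes the familiar bias of the profile likelihood for a variance-type parameter.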
3. ASSESSING A VECTOR INTEREST PARAMETER

Consider a d-dimensional interest parameter ψ(θ), and suppose that a corresponding observed log-likelihood ℓ(ψ) and an observed log-likelihood gradient ϕ(ψ) are chosen, each with specified accuracy as described in the preceding section. To be of particular interest here the accuracies would need to be second order or higher. Both of third order is possible in some situations, and just one of third order can have advantages in removing certain biases. We assess the parameter value ψ(θ) = ψ₀ in the context of the exponential model based on {ℓ(ψ), ϕ(ψ)}, using a nominal canonical variable s of dimension d and observed value s⁰ = 0; from (6) and (7) we then have

    f(s; ψ) ds = c exp{ℓ(ψ) + s′ϕ(ψ)} h(s) ds   (10)
               = {e^{k/n}/(2π)^{d/2}} exp{−r²(ψ; s)/2} |ĵ_ϕϕ(s)|^{−1/2} ds,   (11)

with subsequent third-order accuracy relative to the model (10), using the observed value s⁰ = 0. The specified value ψ = ψ₀ has a corresponding mean score value s₀ = E{s; ψ₀} = −ℓ_ϕ(ψ₀). The line L⁺(s⁰ − s₀) radiating from s₀ to and beyond the observed s⁰ makes available a vector space departure of the data s⁰ from the expectation s₀, and defines a conditional p-value using the related conditional distribution. This gives the directional test of ψ₀ following Fraser & Massam (1985),
Skovgaard (1988), Cheah et al. (1994) and Fraser & Reid (2006):

    p(ψ₀) = ∫₀¹ t^{d−1} h{s₀ + t(s⁰ − s₀)} dt  /  ∫₀^∞ t^{d−1} h{s₀ + t(s⁰ − s₀)} dt,   (12)

where h(s) is the density (10)-(11) with ψ = ψ₀. The preceding references focus on approximations of the type (3), but here we work with direct numerical integration of the scalar variable integrals in (12); this bypasses various instabilities that arise with the approximation formulas, coming from the Jacobian factor t^{d−1} when the dimension d is greater than one.

The density function h(s) in the integrals for the p-value (12) is available from (11) with third-order relative accuracy: computationally, all that is needed is the maximum likelihood value ϕ̂(s) and the observed information value ĵ_ϕϕ(s), as calculated from the log-model ℓ(ψ; s) = ℓ(ψ) + s′ϕ(ψ) using the available {ℓ(ψ), ϕ(ψ)}; and the values ϕ̂(s) and ĵ_ϕϕ(s) are needed only for points s on the line s₀ + L(s⁰ − s₀). If we take a = (a₁, ..., a_d)′ to be orthogonal to s⁰ − s₀ and, for convenience, of unit length, then the scalar parameter X(ψ) = a′ϕ(ψ) provides the basic tilting and the corresponding profile likelihood for the computation.

The directional test can reveal important aspects of the departure of the data from the value ψ = ψ₀ for a vector parameter. In particular the upper limit for t in the p-value (12) may be less than infinity, as forced by the density h{s₀ + t(s⁰ − s₀)}. This can arise with continuous and with discrete models and would be important additional information to record alongside a likelihood ratio r²(ψ, y), or alongside the related chi-square p-value; see Examples . . .. In addition a likelihood ratio value may be associated with a longer tail or a shorter tail, important additional information that would emerge only as part of the p-value calculation (12); this would typically be of order O(n^{−1/2}) and would inappropriately be removed by averaging over the conditional distribution of the likelihood ratio itself. Similar aspects arise with the Bartlett correction, when varying types of departure are hidden by the key marginalization step.
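The directional computation in (11)-(12) can be sketched numerically. The two-sample exponential-rate model, the hypothetical sums and the tested value ψ₀ below are our illustrative assumptions, not one of the paper's examples; for this model ϕ(ψ) = (−ψ₁, −ψ₂) is the exact canonical parameter, the overall maximum along the line has closed form, and the finite upper limit mentioned above appears as the point where the line leaves the domain of the model.

import numpy as np
from scipy.integrate import quad

n1, S1 = 10, 12.3            # sample size and observed sum, first sample (hypothetical)
n2, S2 = 10, 7.1             # sample size and observed sum, second sample (hypothetical)
d = 2                        # dimension of the interest parameter

def tem_loglik(psi, s):
    # l(psi) + s' phi(psi), with phi(psi) = (-psi1, -psi2); the score s shifts the sums
    u1, u2 = S1 + s[0], S2 + s[1]
    return n1 * np.log(psi[0]) - psi[0] * u1 + n2 * np.log(psi[1]) - psi[1] * u2

def h(s, psi0):
    # saddlepoint form (11), up to a constant factor that cancels in (12)
    u1, u2 = S1 + s[0], S2 + s[1]
    if u1 <= 0 or u2 <= 0:
        return 0.0                                           # outside the domain of the model
    psi_hat = np.array([n1 / u1, n2 / u2])                   # maximizer of the tilted log-likelihood
    r2 = 2 * (tem_loglik(psi_hat, s) - tem_loglik(psi0, s))  # squared likelihood root
    det_j = (n1 / psi_hat[0]**2) * (n2 / psi_hat[1]**2)      # |j_phiphi(s)|, here phi = -psi
    return np.exp(-0.5 * r2) / np.sqrt(det_j)

def directional_p(psi0):
    s_obs = np.zeros(d)                                          # observed score s0 = 0
    s_mean = np.array([n1 / psi0[0] - S1, n2 / psi0[1] - S2])    # mean score under psi0
    line = lambda t: s_mean + t * (s_obs - s_mean)
    integrand = lambda t: t ** (d - 1) * h(line(t), psi0)
    # upper limit for t: where the shifted sums leave the positive domain; this is the
    # finite upper limit discussed in the text (infinite if it is never reached)
    S, drop = np.array([S1, S2]), s_mean > 0
    t_max = np.min(1 + S[drop] / s_mean[drop]) if drop.any() else np.inf
    num, _ = quad(integrand, 0.0, 1.0)
    den, _ = quad(integrand, 0.0, t_max)
    return num / den

print(directional_p(np.array([0.9, 1.2])))

In less tractable models ψ̂(s) and ĵ_ϕϕ(s) would be obtained by numerical maximization and differentiation of ℓ(ψ) + s′ϕ(ψ) along the line, but the structure of the calculation is the same.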
4. EXAMPLES

Some possibilities:
1. A simple continuous exponential model with say 2 coordinates and a finite boundary in one or two directions.
2. A continuous generalized exponential model with a non-canonical link function.
3. A discrete example, possibly from Davison et al. (2006), where the directional value could be compared with the usual p-value for some scalar parameter.
4. Contingency tables, testing say independence or symmetry.
REFERENCES

Andrews, D. F., Fraser, D. A. S. and Wong, A. (2005). Computation of distribution functions from likelihood information near observed data. J. Statist. Plann. Inference 134, 180-193.
Barndorff-Nielsen, O. E. (1986). Inference on full or partial parameters based on the standardised signed log likelihood ratio. Biometrika 73, 307-22.
Cakmak, S., Fraser, D. A. S., McDunnough, P., Reid, N. and Yuan, X. (1998). Likelihood centered asymptotic model exponential and location model versions. J. Statist. Plann. Inference 66, 211-222.
Cheah, P. K., Fraser, D. A. S. and Reid, N. (1994). Multiparameter testing in exponential models: third order approximations from likelihood. Biometrika 81, 271-278.
Cheah, P. K., Fraser, D. A. S. and Reid, N. (1995). Adjustments to likelihood and densities: calculating significance. J. Statist. Res. 29, 1-13.
Daniels, H. E. (1954). Saddlepoint approximations in statistics. Ann. Math. Statist. 25, 631-650.
Daniels, H. E. (1987). Tail probability approximations. Int. Statist. Rev. 54, 37-48.
Davison, A. C., Fraser, D. A. S. and Reid, N. (2006). Improved likelihood inference for discrete data. J. R. Statist. Soc. B 68, 495-508.
Fraser, D. A. S. (2003). Likelihood for component parameters. Biometrika 90, 327-339.
Fraser, A. M., Fraser, D. A. S. and Staicu, A.-M. (2009). The second order ancillary: a differential view with continuity. Bernoulli, in revision.
Fraser, D. A. S. and Massam, H. (1985). Conical tests: observed levels of significance and confidence regions. Statistische Hefte 26, 1-17.
Fraser, D. A. S. and Reid, N. (1995). Ancillaries and third order significance. Utilitas Mathematica 47, 33-53.
Fraser, D. A. S. and Reid, N. (2001). Ancillary information for statistical inference. In S. E. Ahmed and N. Reid (eds), Empirical Bayes and Likelihood Inference, 185-209. New York: Springer-Verlag.
Fraser, D. A. S. and Reid, N. (2006). Assessing a vector parameter. Student 5, 247-256.
Fraser, D. A. S., Reid, N. and Wu, J. (1999). A simple general formula for tail probabilities for frequentist and Bayesian inference. Biometrika 86, 249-264.
Kent, J. T. (1982). Robust properties of likelihood ratio tests. Biometrika 69, 19-27.
Lugannani, R. and Rice, S. (1980). Saddlepoint approximation for the distribution of the sum of independent random variables. Adv. Appl. Prob. 12, 475-490.
Reid, N. and Fraser, D. A. S. (2009). Mean likelihood and higher order inference. Biometrika, in review.
Skovgaard, I. M. (1988). Saddlepoint expansions for directional test probabilities. J. R. Statist. Soc. B 50, 269-280.
Skovgaard, I. M. (1996). An explicit large-deviation approximation to one-parameter tests. Bernoulli 2, 145-165.

[Received January 2008. Revised March 2009]