On Fiducial Inference – the good, the bad and the ugly

Jan Hannig∗
Department of Statistics, Colorado State University

February 23, 2006
Abstract

R. A. Fisher's fiducial inference has been the subject of many discussions and controversies ever since he introduced the idea during the 1930s. The idea experienced a bumpy ride, to say the least, during its early years and one can safely say that it eventually fell into disfavor among mainstream statisticians. However, it appears to have made a resurgence recently under the label of generalized inference. In this new guise fiducial inference has proved to be a useful tool for deriving statistical procedures for problems where frequentist methods with good properties were previously unavailable. Therefore we believe that the fiducial argument of R. A. Fisher deserves a fresh look from a new angle. In this paper we first generalize Fisher's fiducial argument and obtain a fiducial recipe applicable in virtually any situation. We demonstrate this fiducial recipe on many examples of varying complexity. We also investigate, by simulation and by theoretical considerations, some properties of the statistical procedures derived by the fiducial recipe. In particular, we compare the properties of fiducial inference to the properties of Bayesian inference and observe that the two share many common strengths and weaknesses.

In addition to the theoretical considerations mentioned above we also derive the fiducial distribution and verify its viability by simulations for several examples that are of independent interest. In particular we derive fiducial distributions for the parameters of a multinomial distribution, for the means, variances, and the mixing probability of a mixture of two normal distributions, and for the variance components in a simple one-way random linear model.

∗ Jan Hannig's research is supported in part by the National Science Foundation under grant DMS-0504737.
Key words: Fiducial inference, structural inference, generalized inference, asymptotics, multinomial distribution, mixture of normal distributions, MCMC.
1 Introduction
R. A. Fisher introduced the idea of fiducial probability and fiducial inference (Fisher 1930) in an attempt to overcome what he saw as a serious deficiency of the Bayesian approach to inference – use of a prior distribution on model parameters even when no information was available regarding their values. Although he discussed fiducial inference in several subsequent papers, there appears to be no rigorous definition of a fiducial distribution for a vector parameter θ based on sample observations. In the case of a one-parameter family of distributions, Fisher gave the following definition for a fiducial density f(θ|x) of the parameter based on a single observation x for the case where the cdf F(x|θ) is a monotonically decreasing function of θ:

f(\theta|x) \propto -\frac{\partial F(x|\theta)}{\partial \theta}.  (1)
Fisher illustrated the application of fiducial probabilities by means of a numerical example consisting of four pairs of observations from a bivariate normal distribution with unknown mean vector and covariance matrix. For this example he derived fiducial limits (one-sided interval estimates) for the population correlation coefficient ρ. Fisher proceeded to refine the concept of fiducial inference in several subsequent papers (Fisher 1935a). In his 1935 paper titled "The Fiducial Argument in Statistical Inference" Fisher explained the notion of fiducial inference for µ based on a random sample from a N(µ, σ²) distribution where σ is unknown. The process of obtaining a fiducial distribution for µ was based on the availability of Student's t-statistic, which served as a pivotal quantity for µ. In this same 1935 paper, Fisher discussed the notion of a fiducial distribution for a single future observation x
from the same N(µ, σ²) distribution based on a random sample x₁, …, xₙ. For this he used the fact that

T = \frac{x - \bar{x}}{s/\sqrt{n}}

is a pivotal quantity. He then proceeded to consider the fiducial distribution for x̄′ and s′, the mean and the standard deviation, respectively, of m future observations x_{n+1}, …, x_{n+m}. By letting m tend to infinity, he obtained a simultaneous fiducial distribution for µ and σ. He also stated "In general, it appears that if statistics T₁, T₂, T₃, … contain jointly the whole of the information available respecting parameters θ₁, θ₂, θ₃, …, and if functions t₁, t₂, t₃, … of the T's and θ's can be found, the simultaneous distribution of which is independent of θ₁, θ₂, θ₃, …, then the fiducial distribution of θ₁, θ₂, θ₃, … simultaneously may be found by substitution." In essence Fisher had proposed a recipe for constructing simultaneous fiducial distributions for vector parameters. He applied this recipe to the problem of interval estimation of µ₁ − µ₂ based on independent samples from two normal distributions N(µ₁, σ₁²) and N(µ₂, σ₂²) with unknown means and variances. This is the celebrated Behrens-Fisher problem. Fisher noted that the resulting inference regarding µ₁ − µ₂ coincided with the approach proposed much earlier by Behrens (1929). He alluded to the test of the null hypothesis of no difference, based on the fiducial distribution of µ₁ − µ₂, as an exact test. This resulted in much controversy, as it was noted by Fisher's contemporaries that the Behrens-Fisher test was not an exact test in the usual frequentist sense. Moreover, this same test had been obtained by Jeffreys (1940) using a Bayesian argument with noninformative priors (now known as Jeffreys priors). Fisher argued that, while Jeffreys' approach gave the same answer as the fiducial approach, the logic behind Jeffreys' derivation was unacceptable because of the use of an unjustified prior distribution on the parameters. Fisher particularly objected to the practice of using uniform priors to model ignorance. This led to further controversy, especially between Fisher and Jeffreys.

In the same 1935 paper, Fisher gave a second example of the application of his recipe by deriving a fiducial distribution for φ in the balanced one-way random effects model

Y_{ij} = \mu + a_i + e_{ij}, \qquad i = 1, \dots, n_1; \; j = 1, \dots, n_2,

where a_i ∼ N(0, φ), e_{ij} ∼ N(0, θ), and all random variables are independent. An issue that arose from his treatment of this problem is that the fiducial
distribution assigned a positive probability to the event φ < 0 in spite of the fact that φ is a variance.

Fisher's 1935 paper resulted in a flurry of activity in fiducial inference. Most of this activity was directed towards finding deficiencies in fiducial inference and philosophical concerns regarding the interpretation of fiducial probability. The controversy seems to have arisen once Fisher's contemporaries realized that, unlike the case in early simple applications involving a single parameter, fiducial inference often led to procedures that were not exact in the frequentist sense. For a detailed discussion of the controversies concerning fiducial inference, the reader is referred to Zabell (1992). Fraser, in a series of articles (Fraser 1961, Fraser 1966), attempted to provide a rigorous framework for making inferences along the lines of Fisher's fiducial inference. He called his approach structural inference. Wilkinson (1977) attempted to explain and/or resolve some of the controversies regarding fiducial inference. Dawid & Stone (1982) provided further insight by, among other things, studying situations where fiducial inference led to exact confidence statements. A wealth of additional references on fiducial inference can be found in Salome (1998). Nevertheless, it is fair to say that fiducial inference failed to secure a place in mainstream statistics.

In Tsui & Weerahandi (1989), a new approach was proposed for constructing hypothesis tests using the concept of generalized P-values, and this idea was later extended to a method of constructing generalized confidence intervals using generalized pivotal quantities (Weerahandi 1993). Several papers have appeared since, in leading statistical journals, where confidence intervals have been constructed using generalized pivotal quantities in problems where exact frequentist solutions are unavailable. For a thorough exposition of generalized inference see Weerahandi (2004). Iyer & Patterson (2002) and Hannig, Iyer & Patterson (2006b) noted that every published generalized confidence interval was obtainable using the fiducial/structural arguments. In fact, Hannig et al. (2006b) not only established a clear connection between fiducial intervals and generalized confidence intervals, but also proved the asymptotic frequentist correctness of such intervals. They further provided some general methods for constructing GPQs. In particular, they showed that a special class of GPQs called fiducial GPQs (FGPQs) provides a direct frequentist interpretation of fiducial inference. However, their article focused on continuous distributions and did not address discrete distributions.

It is interesting to note that not much has been written about fiducial inference for parameters of a discrete distribution. Even for the single parameter
case such as the binomial distribution, Fisher was aware that there were difficulties with defining a unique fiducial density for the unknown binomial parameter π. In his 1935 paper (Fisher 1935b) titled "The Logic of Inductive Inference", Fisher gives an example where he suggests a clever device for "turning a discontinuous distribution, leading to statements of fiducial inequality, into a continuous distribution, capable of yielding exact fiducial statements, by means of a modification of experimental procedure." His device was to introduce randomization into the experimental procedure and is akin to randomized decision procedures. Inspired by Fisher's example, Stevens (1950) gave a more formal treatment of this problem where he used a supplementary random variable in an attempt to define a unique fiducial density for a parameter of a discrete distribution. He discussed his approach in great detail using the binomial distribution as an illustration. Unfortunately, this idea seems to have been lost, and subsequent researchers mostly focused on fiducial inference for continuous distributions. In 1996, in his Fisher Memorial Lecture at the American Statistical Association annual meetings, Efron gave a brief discussion of fiducial inference against the backdrop of the binomial distribution. He said, "Fisher was uncomfortable applying fiducial arguments to discrete distributions because of the ad hoc continuity corrections required, but the difficulties caused are more theoretical than practical." See Efron (1998). In fact, Efron's suggestion for how to handle discrete distributions is a special case of Stevens's (1950) proposal.

In this paper we provide a general definition of fiducial distributions for parameters that applies equally well to continuous as well as discrete parent distributions. The resulting inference is termed weak fiducial inference, rather than fiducial inference, to emphasize the fact that multiple fiducial distributions can be defined for the same parameter. However, the resulting interval estimates have, under certain regularity conditions, asymptotic frequentist exactness.

We close this section with some quotes. Zabell (1992) begins his Statistical Science paper with the statement "Fiducial inference stands as R. A. Fisher's one great failure." On the other hand, Efron, in his 1998 Statistical Science paper (based on his Fisher Memorial Lecture of 1996), in the section dealing with fiducial inference, has said "I am going to begin with the fiducial distribution, generally considered to be Fisher's biggest blunder." However, in the closing paragraph of the same section (Section 8), he says "Maybe Fisher's biggest blunder will become a big hit in the 21st century!"
2 The Fiducial Argument
The main aim of fiducial inference is to devise a distribution for the parameters of interest that captures all of the information that the data contain about these parameters. This fiducial distribution can later be used for devising inference procedures such as confidence sets. In this sense, a fiducial distribution is much like a Bayesian posterior distribution. Fisher wanted to accomplish this without assuming a prior distribution on the parameters. While our understanding of the fiducial argument cannot be entirely new given the large number of great minds who have thought about this problem, we are unaware of any prior work that formulates it in exactly the same way.

The idea behind a fiducial distribution, as we understand it, can be explained using the following simple example. Consider a random variable X from a normal distribution with unknown mean µ and variance 1, i.e., X = µ + Z where Z is standard normal. If x is a realized value of X corresponding to the realized value z of Z, then we have µ = x − z. Of course the value z is not observed. However, a contemplated value µ₀ of µ corresponds to the value x − µ₀ of z. Knowing that z is a realization from the N(0, 1) distribution, we can evaluate the likelihood of Z taking on the value x − µ₀. Speaking informally, one can say that the "plausibility" of the parameter µ taking on the value µ₀ "is the same" as the plausibility of the random variable Z taking on the value x − µ₀. Using this rationale, we write µ = x − Z, where x is regarded as fixed but Z is still considered a N(0, 1) random variable. This step, namely, shifting from the true relationship µ = x − z (z unobserved) to the relationship µ = x − Z, is what constitutes the fiducial argument. We can use the relation µ = x − Z to define a probability distribution for µ. This distribution is called the "fiducial distribution" of µ. In particular, a random variable M carrying the fiducial probability distribution of µ can be defined based on the probabilities of observing the value of Z needed to get the desired value of µ, i.e., define M so that

P(M \in (a, b)) = P(x - Z \in (a, b)) = P(Z \in (x - b, x - a)).  (2)
It will be useful to consider the random variable M⋆ = x − Z⋆, where Z⋆ is a standard normal random variable independent of Z. This random variable has the same distribution as M, the fiducial distribution for µ. In conclusion, notice that to obtain a random variable that has the distribution described in (2) we had to take the structural equation X = µ + Z, solve for µ = X − Z, and set M = x − Z⋆, where x is the observed value of X
and Z⋆ is a random variable independent of Z having the same distribution as Z. We will generalize this idea in Section 3.

There has been a lot of controversy surrounding the fiducial argument. For example Le Cam & Yang (2000) call it a "logically erroneous" argument. The main controversy was related to the philosophical and mathematical foundations of the procedure and some non-uniqueness paradoxes. In this paper we look at fiducial inference differently. We approach the fiducial argument as a tool for deriving inference procedures (much like the maximum likelihood principle). We then apply it to several examples and study its properties both analytically and through simulations. In general, like Bayesian inference, fiducial inference often leads to procedures with very good frequentist properties. In fact, we believe that if computer simulations had been feasible when Fisher introduced his fiducial argument, fiducial inference might not have been dismissed by mainstream statisticians. Fiducial inference is often asymptotically correct for much the same reasons as Bayesian inference is (see Section 5). Bayesian inference suffers from non-uniqueness due to the choice of prior. We will show that fiducial inference, as we present it, also suffers from a similar form of non-uniqueness (see Section 9).
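To make the argument concrete, here is a minimal simulation sketch (ours, not from the paper; the true mean, sample size, and nominal level are arbitrary choices) showing that equal tailed intervals taken from the fiducial distribution M = x − Z⋆ attain the advertised frequentist coverage in this model.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_true = 2.5                    # unknown in practice; fixed here to check coverage
n_rep, n_fid = 2000, 5000
covered = 0
for _ in range(n_rep):
    x = mu_true + rng.standard_normal()       # structural equation X = mu + Z
    m = x - rng.standard_normal(n_fid)        # fiducial draws M = x - Z*
    lo, hi = np.quantile(m, [0.025, 0.975])   # equal tailed 95% fiducial interval
    covered += (lo <= mu_true <= hi)
print(covered / n_rep)                        # close to 0.95
```

Here the coverage is in fact exact up to Monte Carlo error, since M = x − Z⋆ is the classical pivot-based interval in disguise.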
3 Fiducial Recipe
We will now generalize the idea described in Section 2 to arbitrary statistical models. Let X be a (possibly discrete) random vector with a distribution indexed by a parameter ξ ∈ Ξ. Assume that X can be expressed in the form

X = G(U, \xi),  (3)

where G is a jointly measurable function and U is a uniform(0, 1) random variable. We define a set-valued function

Q(x, u) = \{\xi : x = G(u, \xi)\}.  (4)

The function Q(x, u) can be understood as an inverse of the function G: here x is regarded as fixed, and Q returns the set of parameter values ξ compatible with a particular u. Finally, assume that for any measurable set S there is a random element V(S) with support S̄, where S̄ is the closure of S. We define a weak fiducial distribution of ξ as the conditional distribution of

V(Q(x, U^\star)) \,\big|\, Q(x, U^\star) \neq \emptyset.  (5)
Here x is the observed value of X and U⋆ is an independent copy of U.

Remark 1. In Equation (3), without loss of generality, U could be taken as any random variable or random vector whose distribution is free of unknown parameters, since any such distribution can be generated starting from a uniform(0, 1) variate. We will take advantage of this fact whenever convenient without further comment.

Remark 2. Notice that under Fisher's assumptions his fiducial density is a special case of our definition, as seen in Remark 18 in Section 9. Our form of the fiducial distribution (5) is influenced by Fraser's structural inference – see Appendix 3 of Dawid, Stone & Zidek (1973) for a very concise description of the structural inference idea. The main difference is that we do not assume a group structure, which is in our opinion unnecessary and in fact conceals the main issues. See also Remark 15 in Section 8.

Remark 3. The choice of a particular form of the structural equation (3) could influence the fiducial distribution. In the remainder of this paper we will regard data represented by a different structural equation as a different statistical problem even if they have the same distribution, cf. Fraser (1968).

Remark 4. This definition could be applied, at least in principle, to semiparametric problems. Of course then Q(X, U) will be a very large set and the choice of V(·) would influence the properties of the procedure to a great extent. This is similar to the big influence the choice of a prior has in Bayesian nonparametric problems.

The following examples provide simple illustrations of the definition of a weak fiducial distribution.

Example 1. Suppose U = (E₁, E₂) where the E_i are i.i.d. N(0, 1) and
X = (X₁, X₂) = G(µ, U) = (µ + E₁, µ + E₂) for some µ ∈ ℝ. So the X_i are i.i.d. N(µ, 1). Given a realization x = (x₁, x₂) of X, the set-valued function Q maps u = (e₁, e₂) ∈ ℝ² to a subset of ℝ and is given by

Q(x, u) = \begin{cases} \{x_1 - e_1\} & \text{if } x_1 - x_2 = e_1 - e_2, \\ \emptyset & \text{if } x_1 - x_2 \neq e_1 - e_2. \end{cases}

By definition, a weak fiducial distribution for µ is the distribution of x₁ − E₁⋆ conditional on E₁⋆ − E₂⋆ = x₁ − x₂, where U⋆ = (E₁⋆, E₂⋆) is an independent copy of U. Hence a weak fiducial distribution for µ is N(x̄, 1/2), where x̄ = (x₁ + x₂)/2.
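Numerically, the conditioning on the probability-zero event Q(x, U⋆) ≠ ∅ can be approximated by accepting only draws with E₁⋆ − E₂⋆ within a small tolerance of x₁ − x₂. The following rough sketch (ours; the data and tolerance are arbitrary, and the acceptance step only approximates the exact conditional distribution) recovers the N(x̄, 1/2) answer.

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = 1.3, 0.7                        # observed data
eps = 0.01                               # tolerance standing in for the zero-probability event
e = rng.standard_normal((2_000_000, 2))  # candidate draws of (E1*, E2*)
keep = np.abs((e[:, 0] - e[:, 1]) - (x1 - x2)) < eps
fid = x1 - e[keep, 0]                    # accepted fiducial draws of mu
print(fid.mean(), fid.var())             # approximately (x1 + x2)/2 = 1.0 and 1/2
```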
Example 2. Suppose U = (U₁, …, Uₙ) is a vector of i.i.d. uniform(0, 1) random variables U_i. Let p ∈ [0, 1]. Let X = (X₁, …, Xₙ) be defined by X_i = I(U_i < p). So the X_i are i.i.d. Bernoulli random variables with success probability p. Suppose x = (x₁, …, xₙ) is a realization of X. Let s = ∑_{i=1}^n x_i be the observed number of 1's. The mapping Q : [0, 1]ⁿ → [0, 1] is given by

Q(x, u) = \begin{cases} [0, u_{1:n}] & \text{if } s = 0, \\ (u_{n:n}, 1] & \text{if } s = n, \\ (u_{s:n}, u_{s+1:n}] & \text{if } s = 1, \dots, n-1 \text{ and } \sum_{i=1}^{n} I(x_i = 1)\, I(u_i \le u_{s:n}) = s, \\ \emptyset & \text{otherwise.} \end{cases}

Here u_{r:n} denotes the rth order statistic among u₁, …, uₙ. So a weak fiducial distribution for p is given by the distribution of V(Q(x, U⋆)) conditional on the event Q(x, U⋆) ≠ ∅, where V(Q(x, U⋆)) is any random variable whose support is contained in Q(x, U⋆). By the exchangeability of U₁⋆, …, Uₙ⋆ it follows that the stated conditional distribution of V(Q(x, U⋆)) is the same as the distribution of V([0, U⋆_{1:n}]) when s = 0, V((U⋆_{s:n}, U⋆_{s+1:n}]) for 0 < s < n, and V((U⋆_{n:n}, 1]) for s = n.

It will be useful to denote a random variable having the distribution described in (5) by R_ξ(x). We will call this random variable a Fiducial Quantity (FQ). Notice that
R_\xi(x) \overset{D}{=} \big( R_\xi(X) \,\big|\, X = x \big)

and the distribution of R_ξ(x) does not depend on the parameter ξ.

Remark 5. We are often interested in estimating θ = π(ξ) ∈ ℝ^q. We can then define

R_\theta(x) = \pi(R_\xi(x)).  (6)

In some cases this does not lead to satisfactory results. This happens when the function π has a zero derivative at the true value of ξ. In this case one can sometimes obtain an alternative solution by finding Y = η(X) sufficient for θ with distribution depending only on θ, and base the fiducial distribution of θ on Y instead of X. See Hannig et al. (2006b) for an example.

Remark 6. Since the distribution of R_θ(x) for each observed x is known (or at least accessible through simulations), we can use it to set up confidence
sets. The idea is that any confidence set based on the distribution of R_θ should be a reasonably good confidence set for θ. This is often true at least asymptotically, and it is confirmed by simulations for small samples in the examples we have considered.

Remark 7. The definition in (5) does not lead to a unique distribution. In fact there are two sources of non-uniqueness. The first source of non-uniqueness is the choice of the random variable V(Q(x, u)) if the set Q(x, u) has more than one element. This typically happens if we deal with discrete random variables. In this case the choice of V(Q(x, u)) is necessarily subjective. The second source of non-uniqueness comes from the fact that in some situations P(Q(x, U⋆) ≠ ∅) = 0. This situation typically arises if we deal with continuous distributions. The non-uniqueness is caused by the fact that the event {Q(x, U⋆) ≠ ∅} could be expressed using many different equation representations, each leading to a different conditional distribution. This is related to Borel's paradox, described for example in Casella & Berger (2002), Section 4.9.3. We believe that this issue is actually more serious than the first. We will discuss these issues in much greater detail in Section 8 below.

Remark 8. Consider a function F(X⋆, ξ) such that U⋆ = F(X⋆, ξ) has a uniform distribution on (0, 1). Such a function always exists if we allow for a possible additional randomization. For example this additional randomization is needed if X⋆ is discrete. For any value of X and U⋆, if we have |Q(X, U⋆)| = 1 then F(X⋆, ξ) exists without any additional randomization and Q(X, F(X⋆, ξ)) is a generalized pivot (Weerahandi 1993). In fact this is the basic idea of the construction of Iyer & Patterson (2002). If |Q(X, U⋆)| ≤ 1 one can still define a generalized pivot using conditional distribution functions. This construction can be found in Hannig et al. (2006b). In the general case where |Q(X, U⋆)| > 1, the construction of Hannig et al. (2006b) still applies but one will need to use additional randomization to derive a slightly more "generalized" version of a generalized pivot. We do not further discuss this general case here. Finally, we reiterate the observation of Hannig et al. (2006b) that all published generalized inference results are identical to corresponding fiducial results. These observations suggest that generalized inference could be viewed as yet another attempt at defining fiducial distributions.
Remark 9. Our definition of a fiducial distribution accommodates, in a very natural way, problems where the parameter space is constrained to a smaller set Ξ₀, e.g., N(µ, σ²) with µ > 0. All we have to do to incorporate this additional information into the weak fiducial distribution is to consider only parameters ξ ∈ Ξ₀ in (4). The conditioning in (5) then makes sure that this additional information is incorporated into the weak fiducial distribution.

Remark 10. The approach for handling parameter constraints discussed above simply truncates the fiducial distribution to the constrained parameter space. Notice that Bayesian inference deals with the problem of a constrained parameter space in the same way. Alternatively, one can deal with the constrained parameter space by mapping all the fiducial probability outside of the constrained space to the boundary, e.g., for N(µ, σ²) with µ > 0 one can consider max(R_µ(x), 0) instead of the constrained fiducial quantity calculated based on (4) with Ξ₀ = (0, ∞) × (0, ∞). While this approach is not consistent with the fiducial argument or with Bayesian inference, it often leads to good frequentist properties. Fisher himself faced the problem of a constrained parameter region in the one-way random effects model Y_ij = µ + A_i + e_ij, where A_i ∼ N(0, φ) and e_ij ∼ N(0, θ). Fisher derived a fiducial distribution for φ that assigned a positive probability to the event φ < 0. Buehler (1980), in his article on fiducial inference in R. A. Fisher: An Appreciation, points out that the problem of where to put the fiducial probability associated with the region φ < 0 has puzzled later researchers. The approach of assigning the forbidden probability to the boundary of the parameter space has been used by many authors in published work on generalized inference. See, for instance, Krishnamoorthy & Mathew (2004), Iyer, Wang & Mathew (2004), and Krishnamoorthy & Mathew (2002).
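As a small numerical illustration of the two options (ours; the stand-in draws are arbitrary), given a sample from an unconstrained fiducial distribution of µ one can either keep only the draws satisfying the constraint, as in the truncation approach, or map the forbidden mass to the boundary.

```python
import numpy as np

rng = np.random.default_rng(6)
mu_draws = rng.normal(0.3, 1.0, size=10_000)  # stand-in unconstrained fiducial draws of mu
truncated = mu_draws[mu_draws > 0]            # truncate to the constrained space, as in (4)
boundary = np.maximum(mu_draws, 0.0)          # map forbidden probability to the boundary
print(truncated.mean(), boundary.mean())
```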
4 Examples
The purpose of this section is to explain the use of the fiducial recipe through examples. We present one discrete, one continuous, and one more complicated example.

Fiducial inference for the multinomial distribution

The first series of examples considers fiducial inference for the multinomial distribution on k + 1 categories {1, …, k + 1}. The special case of the binomial
distribution (k = 1) has received some recent attention from Brown, Cai & DasGupta (2001), Brown, Cai & DasGupta (2002), and Cai (2005). These authors show that the classical solutions based on normal approximations do not have good small sample properties, and they recommend some alternative solutions. The one recommendation that stands out consistently is the interval estimate based on the posterior distribution arising from the Jeffreys prior. Later in this article we show that this is in fact one of the fiducial intervals. We also show that there is another fiducial solution for the binomial parameter p that does just as well.

Example 3. Let X₁, …, Xₙ be i.i.d. Multinomial(p) random variables, where p = (p₁, p₂, …, p_k), p_j ∈ [0, 1], j = 1, …, k, and ∑_{j=1}^k p_j ≤ 1. We will derive a weak fiducial distribution for p. Set q₀ = 0 and q_j = ∑_{l=1}^j p_l, j = 1, …, k. The structural equations for the X_i, i = 1, …, n, can be expressed as

X_i = \sum_{j=0}^{k} I_{[q_j, 1]}(U_i),  (7)
where U₁, …, Uₙ are i.i.d. Uniform(0, 1) random variables. Assume that we have observed x₁, …, xₙ and denote the number of occurrences of j by n_j. For j = 1, …, k + 1, define t_j = ∑_{r=1}^j n_r. In particular, t_{k+1} = n. Let U_{s:n} denote the sth order statistic among U₁, …, Uₙ. For simplicity of notation define t₀ = 0, U_{0:n} = 0 and U_{n+1:n} = 1. The set Q(x, U) ≠ ∅ if and only if

n = \sum_{j=1}^{k+1} \sum_{i=1}^{n} I(x_i = j)\, I\big( U_i \in (U_{t_{j-1}:n}, U_{t_j:n}] \big).
In this case Q(x, U) = Q⋆(x, U), where

Q^\star(x, U) = \Big\{ (p_1, \dots, p_k) : (q_1, \dots, q_k) \in \textstyle\bigtimes_{j=1}^{k} \big( U_{t_j:n}, U_{t_j+1:n} \big] \Big\}.

Here ×_i A_i denotes the Cartesian product of the sets A_i and q_j is as in (7). In particular, for j = 1, …, k, p_j = q_j − q_{j−1} and p_{k+1} = 1 − q_k. The exchangeability of U_i, i = 1, …, n, then implies that the conditional distribution of V(Q(x, U)), conditional on the event Q(x, U) ≠ ∅, is the same as the (unconditional) distribution of V(Q⋆(x, U)). By our definition
the weak fiducial quantity is R_p(x) = V(Q⋆(x, U)). Equivalently, there is a random vector D = (D₁, …, D_k) with support [0, 1]^k such that

R_p(x) = (R_1, R_2 - R_1, \dots, R_k - R_{k-1})^{\top},  (8)

where R_j = U_{t_j:n} + D_j(U_{t_j+1:n} − U_{t_j:n}). Notice that if n_j = 0 for some j = 2, …, k it would be possible to get a negative value for R_{p_i}, the ith element of R_p. This can be prevented by requiring the random vector D to satisfy D_j ≥ D_{j−1} whenever n_j = 0.

The observation made in the previous paragraph implies that the fiducial distribution depends on the particular choice of the structural equation (7). In particular, if one or more categories are not observed in our sample, we might get a different fiducial distribution by relabeling. We now further investigate this fiducial quantity in two special cases, the binomial distribution (k = 1) and the trinomial distribution (k = 2).

Special case 1 - the Binomial distribution

Example 4. For the special case of a binomial distribution, a fiducial quantity for p is

R_p(x) = U_{s:n} + D(U_{s+1:n} - U_{s:n})  (9)

with D being any random variable with support contained in [0, 1] and s being the observed number of successes. Recall that the joint density of (U_{s:n}, U_{s+1:n}) is
f_{(U_{s:n}, U_{s+1:n})}(u, v) = \frac{n!}{(s-1)!\,(n-s-1)!}\, u^{s-1} (1-v)^{n-s-1}, \qquad 0 < u < v < 1.
Therefore, the density of R_p is

f_{R_p}(p) = \int_0^1 \int_0^{\frac{p}{d} \wedge \frac{1-p}{1-d}} \binom{n}{s} s (p - dq)^{s-1} (n-s) \big( (1-p) - (1-d)q \big)^{n-s-1} \, dq \, dF_D(d) \; I_{(0,1)}(p),  (10)

where F_D(d) is the distribution function of D and x ∧ y = min{x, y}. If additionally D is continuous with density f_D, (10) simplifies to

f_{R_p}(p) = \int_0^p \int_p^1 \binom{n}{s} f_D\!\Big( \frac{p-u}{v-u} \Big) \frac{s u^{s-1} (n-s)(1-v)^{n-s-1}}{v-u} \, dv \, du \; I_{(0,1)}(p).  (11)
There are many reasonable choices for the distribution of D in the description of R_p. We have considered five different choices that appeared natural to us. For the first three choices we assumed D is random and independent of U₁, …, Uₙ. The maximum entropy choice is D ∼ uniform(0, 1). The maximum variance choice, suggested implicitly by Efron (1998), is D ∼ uniform{0, 1}. We remark that a direct calculation, cf. Grundy (1956), shows that these two choices lead to fiducial distributions that are not Bayesian posteriors with respect to any prior. The third choice D ∼ Beta(1/2, 1/2) leads to R_p ∼ Beta(s + 1/2, n − s + 1/2), which is the Bayesian posterior for the Jeffreys prior. The fourth choice is a little harder to describe in terms of D. It is R_p ∼ Beta(s + 1, n − s + 1). This is the scaled likelihood, or posterior with respect to the flat prior. Beta(s + 1, n − s + 1) is a fiducial distribution according to our definition, since it is stochastically larger than the distribution of U_{s:n}, which is Beta(s, n − s + 1), and stochastically smaller than the distribution of U_{s+1:n}, which is Beta(s + 1, n − s). This can be seen by noticing that conditional on U₁, …, Uₙ the distribution of D is given by D = 0 with probability U_{s:n}, D = 1 with probability 1 − U_{s+1:n}, and D ∼ U(0, 1) with probability U_{s+1:n} − U_{s:n}. The last choice is D = 1/2, corresponding to the midpoint of the interval (U_{s:n}, U_{s+1:n}).

To evaluate the performance of the fiducial distribution and compare the various choices of D we carried out an extensive simulation study. As shown in Section 6, fiducial inference is correct asymptotically. Therefore our simulation study concentrated mostly on small values of n. In particular we considered n = 3, 6, 9, …, 45, 48, 100, 1000 and p = 0.01, 0.02, …, 0.99. For each of the combinations of n and p we simulated 5000 evaluations of the probability Q(X) = P(R_p(X) < p | X) using each of the five variations of the fiducial distribution. If the fiducial inference were exact, Q(X) would follow the U(0, 1) distribution. The level of agreement of Q(X) with the U(0, 1) distribution was examined using QQ-plots. Since fiducial inference is a non-randomized procedure, the distribution of Q(X) can take only n values. Therefore it cannot be expected that the agreement with the uniform distribution would be very good for small values of n. However, the agreement improves dramatically as n increases. To illustrate this we show the QQ-plots for n = 12 and p = .1, .3, .5, .7, .9 in Figure 1. We also show QQ-plots for n = 6, 21, 48, 100, 1000 and p = .3 in Figure 2.
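For concreteness, the following sketch (ours, not from the paper; the counts are arbitrary) draws from the fiducial distribution (9) for three of the above choices of D, and uses the closed Beta form for the Jeffreys choice.

```python
import numpy as np

rng = np.random.default_rng(2)

def fiducial_binomial(n, s, choice, size=10_000):
    """Draw from R_p = U_{s:n} + D (U_{s+1:n} - U_{s:n}) given s observed successes."""
    u = np.sort(rng.uniform(size=(size, n)), axis=1)
    u = np.hstack([np.zeros((size, 1)), u, np.ones((size, 1))])  # U_{0:n} = 0, U_{n+1:n} = 1
    lo, hi = u[:, s], u[:, s + 1]
    if choice == "entropy":
        d = rng.uniform(size=size)          # maximum entropy: D ~ U(0, 1)
    elif choice == "variance":
        d = rng.integers(0, 2, size=size)   # maximum variance: D ~ uniform{0, 1}
    elif choice == "midpoint":
        d = 0.5                             # midpoint of (U_{s:n}, U_{s+1:n})
    else:
        raise ValueError(choice)
    return lo + d * (hi - lo)

n, s = 12, 4
r_var = fiducial_binomial(n, s, "variance")
r_jeffreys = rng.beta(s + 0.5, n - s + 0.5, size=10_000)  # Jeffreys choice in closed form
print(np.quantile(r_var, [0.025, 0.975]))                 # 95% fiducial interval for p
```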
[Figure 1 here: five QQ-plot panels titled "Lower CI - coverage", axes Actual p-value vs. Nominal p-value; legend: entropy, variance, Jeffreys, likelihood, midpoint.]

Figure 1: QQ-plots of Q(X) for n = 12 and p = .1, .3, .5, .7, .9. The black color corresponds to the area of natural fluctuation of a QQ-plot due to randomness. The colored graphs correspond to the QQ-plots of the various fiducial distributions.
[Figure 2 here: five QQ-plot panels titled "Lower CI - coverage", axes Actual p-value vs. Nominal p-value; legend: entropy, variance, Jeffreys, likelihood, midpoint.]

Figure 2: QQ-plots of Q(X) for n = 6, 21, 48, 100, 1000 and p = .3. The black color corresponds to the area of natural fluctuation of a QQ-plot due to randomness. The colored graphs correspond to the QQ-plots of the various fiducial distributions.
The closer the points on the QQ-plot are to the line y = x, the better the performance of the procedure. We can see straightaway that the scaled likelihood performs worse than any of the other choices. To make this comparison more rigorous we compute, for each of the choices of D, the following statistics:

A = \int_0^1 |F_Q(x) - x| \, dx, \quad \text{and} \quad D = \int_0^1 (x - F_Q(x)) \, dx,

where F_Q(x) is the empirical distribution function of the observed values of Q(X). Smaller values of A and D signify better overall fit. Since we are planning to use the fiducial distribution for inference, one can argue that the center of the distribution of Q(X) is of little importance. Therefore we will also check the level of agreement in the tails. To this end we define

A_l = \int_0^{.1} |F_Q(x) - x| \, dx, \quad D_l = \int_0^{.1} (x - F_Q(x)) \, dx,
A_u = \int_{.9}^{1} |F_Q(x) - x| \, dx, \quad \text{and} \quad D_u = \int_{.9}^{1} (F_Q(x) - x) \, dx.
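For illustration, these statistics can be computed from a sample of simulated Q(X) values as Riemann sums over the empirical distribution function; this is our own sketch, with an arbitrary evaluation grid.

```python
import numpy as np

def agreement_stats(q_values, grid=np.linspace(0.0, 1.0, 1001)):
    """Compute A, D and their tail versions from simulated values of Q(X)."""
    q = np.sort(np.asarray(q_values))
    F = np.searchsorted(q, grid, side="right") / q.size  # empirical cdf F_Q on the grid
    dx = grid[1] - grid[0]
    lower, upper = grid <= 0.1, grid >= 0.9
    return {
        "A":  np.sum(np.abs(F - grid)) * dx,
        "D":  np.sum(grid - F) * dx,
        "Al": np.sum(np.abs(F - grid)[lower]) * dx,
        "Dl": np.sum((grid - F)[lower]) * dx,
        "Au": np.sum(np.abs(F - grid)[upper]) * dx,
        "Du": np.sum((F - grid)[upper]) * dx,
    }
```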
Here we chose A_l, D_l to describe the average fit for typical lower tail CIs and A_u, D_u to describe the average fit for typical upper tail CIs. In both cases positive values of D_l and D_u correspond to being conservative, while negative values of D_l and D_u correspond to being anticonservative. For each fixed n we plotted the graphs of these statistics as functions of the probability p. For illustration we show plots of these quantities for n = 6, 21, 48, 100, 1000 in Figures 3, 4, and 5.

The overall conclusion is that the best choice is the maximum variance choice D ∼ uniform{0, 1}, which is consistently better than the other choices. However, D ∼ U(0, 1) and D ∼ Beta(1/2, 1/2) (the maximum entropy choice and the posterior with respect to the Jeffreys prior) were typically very close to the best choice. The last two choices were found not to perform as well. In particular the scaled likelihood underperformed the other choices by a large margin. In light of this we recommend using the choice D ∼ uniform{0, 1}.

Remark 11. Cai (2005) has investigated two-term Edgeworth expansions for the coverage of several one-sided binomial confidence intervals. We remark that similar calculations can be used to derive the two-term Edgeworth expansion for the fiducial distributions discussed here. In particular one can show that, just like confidence intervals based on the Jeffreys posterior, the
[Figure 3 here: five panels titled "LowerTail", axes p vs. Integral; legend: entropy, variance, Jeffreys, likelihood, midpoint.]

Figure 3: Plots of Al (solid line) and Dl (dashed line) as functions of p for n = 6, 21, 48, 100, 1000. Small values of Al and Dl are preferable. Positive values of Dl correspond to the method being conservative on average. The various colors correspond to various choices for the fiducial distribution.
[Figure 4 here: five panels titled "UpperTail", axes p vs. Integral; legend: entropy, variance, Jeffreys, likelihood, midpoint.]

Figure 4: Plots of Au (solid line) and Du (dashed line) as functions of p for n = 6, 21, 48, 100, 1000. Small values of Au and Du are preferable. Positive values of Du correspond to the method being conservative on average. The various colors correspond to various choices for the fiducial distribution.
[Figure 5 here: five panels titled "Overall", axes p vs. Integral; legend: entropy, variance, Jeffreys, likelihood, midpoint.]

Figure 5: Plots of A (solid line) and D (dashed line) as functions of p for n = 6, 21, 48, 100, 1000. Small values of A and D are preferable. The various colors correspond to various choices for the fiducial distribution.
maximum variance fiducial distribution leads to confidence intervals that are first-order matching, cf. Ghosh (1994).

Special case 2 - the Trinomial distribution

Example 5. Some aspects of the fiducial distribution for the parameters of a trinomial distribution have been investigated by Dempster (1968), who used a trinomial distribution as an example for his definition of upper and lower probabilities. In this example we investigate the small sample frequentist properties of the fiducial distribution for the trinomial parameters.

There are many reasonable choices for the distribution of D in (8). We have considered five different choices that appeared natural to us. Based on our experience from Example 4 we take D independent of U₁, …, Uₙ. Here are the choices: The maximum entropy choice is achieved by taking D as a uniform distribution on (0, 1)² if s₂ > 0 and D ∼ uniform{(x, y) : 0 < x < y < 1} if s₂ = 0. The Bayesian posterior for the Jeffreys prior is achieved by taking D₁, D₂ i.i.d. Beta(1/2, 1/2) if s₂ > 0 and D₁ ∼ Beta(1/2, 1/2), D₂ = 1 if s₂ = 0. The third choice is a first version of a maximum variance distribution. Here D ∼ uniform{0, 1}² if s₂ > 0 and D ∼ uniform{(0, 0), (0, 1), (1, 1)} if s₂ = 0. This is obtained by maximizing the determinant of the covariance matrix of R_p(x). Notice that it is also the uniform distribution on the vertices of Q(x, U). The fourth choice is a second version of a maximum variance distribution. This is obtained by maximizing the smallest eigenvalue of the covariance matrix of R_p(x). Notice that this distribution is supported on the vertices of Q(x, U). The last choice is the uniform distribution on the boundary of Q(x, U). Finally we remark that the scaled likelihood (Bayesian posterior with respect to the flat prior) is not among the fiducial distributions and will not be included in the simulation.

To evaluate the performance of the fiducial distribution and compare the various choices of D we performed an extensive simulation study. As shown in Section 6, fiducial inference is correct asymptotically. Therefore our simulation study concentrated mostly on small values of n. In particular we considered n = 5, 10, 15, …, 30, 300 and p₁, p₂ ∈ {0.05, 0.1, …, 0.95} with p₁ + p₂ < 1. For each combination of the parameters n, p₁, p₂ we simulated a sample of 2000 observations from the
trinomial distribution. For each of the trinomial observations and each choice of D we generated a sample of 3000 observations from the fiducial distribution R_p(x). In order to evaluate the quality of the joint fiducial distribution we then evaluated the empirical coverage of the one-sided equal tailed region. In particular, for any random vector X and 0 < α < 1 we define the one-sided equal tailed region C(X, α) as the set {(x, y) : x ≤ x₀, y ≤ y₀} satisfying P(X ∈ {(x, y) : x ≤ x₀, y ≤ y₀}) = α and P({(x, y) : x > x₀}) = P({(x, y) : y > y₀}). Also, for simplicity of formulas, denote A(X, p) = inf{α : p ∈ C(X, α)}. The performance can then be evaluated by estimating the probability Q(X) = P(R_p(X) ∈ C(R_p(X), A(R_p(X), p)) | X) using the simulated data for each of the five variations of the fiducial distribution. If the fiducial inference were exact, Q(X) would follow the U(0, 1) distribution. The level of agreement of Q(X) with a U(0, 1) distribution was examined using QQ-plots. Since fiducial inference is a non-randomized procedure, the distribution of Q(X) can take only finitely many values. Therefore it can be expected that the agreement with the uniform distribution will be poor for small values of n and will improve dramatically as n increases. Since the QQ-plots generated for the trinomial distribution are very similar to the figures shown in Example 4, we do not display them here to save space.

The closer the points of the QQ-plot are to the line y = x, the better the performance of the procedure. We define A, A_l and A_u as in Example 4. Since we have one more parameter than in the binomial case we need a new way to display the comparison between the procedures. For each fixed n, p₁, p₂ and each of the five procedures we calculated the relative efficiency of procedure i as min_j A(j)/A(i), where A(i) is the value of A for procedure i. Values close to 1 mean a relatively good performance, while small values mean relatively bad performance. For each fixed n we plotted an image containing a matrix of cells comparing these relative efficiencies. The cells are placed on the image according to the values of p₁ and p₂. For illustration we show plots of these quantities for n = 5, 10, 30, 300 in Figures 6, 7, and 8.

The overall conclusion is that the best choice for D is the first maximum variance choice (called vertex in the figures), for which we have D ∼ uniform{0, 1}². This is typically better than the other choices. In particular this choice seems to consistently outperform the Bayesian posterior computed with respect to the Jeffreys prior.
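A sketch of the first maximum variance ("vertex") choice for the trinomial case, following (8), is given below; this is our own illustration (the counts are arbitrary and, for simplicity, all three categories are assumed to have been observed, so that s₂ > 0).

```python
import numpy as np

rng = np.random.default_rng(3)

def fiducial_trinomial_vertex(counts, size=10_000):
    """Draw (p1, p2) from (8) with the vertex choice D ~ uniform{0,1}^2."""
    n = sum(counts)
    t1, t2 = counts[0], counts[0] + counts[1]    # t_j = n_1 + ... + n_j
    u = np.sort(rng.uniform(size=(size, n)), axis=1)
    u = np.hstack([np.zeros((size, 1)), u, np.ones((size, 1))])
    d = rng.integers(0, 2, size=(size, 2))       # D ~ uniform on the vertices {0,1}^2
    q1 = u[:, t1] + d[:, 0] * (u[:, t1 + 1] - u[:, t1])
    q2 = u[:, t2] + d[:, 1] * (u[:, t2 + 1] - u[:, t2])
    return np.column_stack([q1, q2 - q1])        # p1 = q1, p2 = q2 - q1

sample = fiducial_trinomial_vertex((3, 4, 5))
print(sample.mean(axis=0))
```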
[Figure 6 here: four panels titled "Relative Efficiency for joint, LowerTail", laid out over p1 and p2; legend: entropy, Jeffreys, vertex, maximin, edge.]

Figure 6: Plots of relative efficiency based on Al for n = 5, 10, 30, 300. The longer the bar corresponding to each method the better the method. The various colors correspond to various choices of D.
Fiducial Inference for N(µ, σ²)

In the following example we will derive a joint fiducial distribution for the parameters µ and σ² based on a random sample from N(µ, σ²). We believe that it is worthwhile to demonstrate the use of the fiducial recipe in this simple case. We will derive the fiducial distribution using two different methods, with additional discussion following in Section 8.

Example 6. Let X₁, …, Xₙ be i.i.d. N(µ, σ²). We will offer two different approaches to finding the fiducial distribution. Our first approach uses the minimal sufficient statistic (X̄ₙ, Sₙ²). One has the following structural equations.
[Figure 7 here: four panels titled "Relative Efficiency for joint, UpperTail", laid out over p1 and p2; legend: entropy, Jeffreys, vertex, maximin, edge.]

Figure 7: Plots of relative efficiency based on Au for n = 5, 10, 30, 300. The longer the bar corresponding to each method the better the method. The various colors correspond to various choices of D.
\bar{X}_n = \mu + \frac{\sigma Z}{\sqrt{n}}, \qquad S_n^2 = \frac{\sigma^2 V}{n-1},  (12)

where Z is standard normal and V has a chi-square distribution with n − 1 degrees of freedom. By solving the structural equations (12) we get

Q(\bar{x}_n, s_n^2; z, v) = \left\{ \left( \bar{x}_n - \sqrt{\frac{(n-1)s_n^2}{nv}}\; z, \; \frac{(n-1)s_n^2}{v} \right) \right\}.  (13)
[Figure 8 here: four panels titled "Relative Efficiency for joint, Overall", laid out over p1 and p2; legend: entropy, Jeffreys, vertex, maximin, edge.]

Figure 8: Plots of relative efficiency based on A for n = 5, 10, 30, 300. The longer the bar corresponding to each method the better the method. The various colors correspond to various choices of D.
Since the set Q(x̄ₙ, sₙ²; z, v) is always a singleton we have

R_{(\mu,\sigma^2)}(\bar{x}_n, s_n^2) = \left( \bar{x}_n - \sqrt{\frac{(n-1)s_n^2}{nV}}\; Z, \; \frac{(n-1)s_n^2}{V} \right).

A simple calculation shows that the density of R_{(µ,σ²)} is

f_{R_{(\mu,\sigma^2)}}(m, h) = \frac{ e^{ -\frac{(n-1)s_n^2}{2h} - \frac{(m - \bar{x}_n)^2}{2h/n} } \, \big( (n-1)s_n^2 \big)^{\frac{n-1}{2}} }{ \sqrt{\pi/n}\; \Gamma\big(\frac{n-1}{2}\big)\, 2^{n/2}\, h^{n/2+1} } \; I_{(0,\infty)}(h).  (14)

This is the joint fiducial density proposed by Fisher (1935a).
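Sampling from this fiducial distribution is straightforward because (13) is an explicit function of (Z, V). The following sketch (ours; the data are simulated for illustration) reproduces the classical t-interval for µ, in line with the remark below on exactness.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(10.0, 2.0, size=8)               # example data
n, xbar, s2 = x.size, x.mean(), x.var(ddof=1)

size = 100_000
z = rng.standard_normal(size)
v = rng.chisquare(n - 1, size)
sigma2_fid = (n - 1) * s2 / v                   # fiducial draws of sigma^2
mu_fid = xbar - np.sqrt(sigma2_fid / n) * z     # fiducial draws of mu, per (13)

print(np.quantile(mu_fid, [0.025, 0.975]))      # matches xbar +- t_{.975, n-1} s / sqrt(n)
```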
We will now derive the fiducial distribution without the use of the minimal sufficient statistic. This particular derivation is based on Fraser's structural approach (Fraser 1968). We can describe the distribution of X by means of the structural equations X_i = µ + σZ_i, i = 1, …, n, where the Z_i are i.i.d. standard normal random variables. Solving the first and second equations for µ, σ² we get

Q(x_1, \dots, x_n; z_1, \dots, z_n) = \begin{cases} \left\{ \left( \frac{z_1 x_2 - z_2 x_1}{z_1 - z_2}, \left( \frac{x_1 - x_2}{z_1 - z_2} \right)^2 \right) \right\} & \text{if } x_l = \frac{z_1 x_2 - z_2 x_1}{z_1 - z_2} + \frac{x_1 - x_2}{z_1 - z_2}\, z_l, \; l = 3, \dots, n, \\ \emptyset & \text{otherwise.} \end{cases}

Defining

M = \frac{Z_1 x_2 - Z_2 x_1}{Z_1 - Z_2}, \qquad H = \left( \frac{x_1 - x_2}{Z_1 - Z_2} \right)^2, \qquad R_l = M + \frac{x_1 - x_2}{Z_1 - Z_2}\, Z_l,

we can then interpret the fiducial distribution (5) as the conditional distribution of (M, H) given R = x, where R = (R₃, …, Rₙ) and x = (x₃, …, xₙ). A simple calculation shows that the joint density of (M, H, R) is

f_{M,H,R}(m, h, x) = \frac{ e^{ -\sum_{i=1}^{n} (m - x_i)^2 / (2h) } \, |x_1 - x_2| }{ 2 (2\pi)^{n/2} \, h^{n/2+1} } \; I_{(0,\infty)}(h),  (15)
and therefore the fiducial distribution f_{M,H|R=x}(m, h) is the same as the one stated in (14). The fiducial density is associated with the usual t and chi-square distributions, and inference based on it therefore leads to classical inference. It is well known (Mood, Graybill & Boes 1974) that inference based on the distribution (14) leads to exact frequentist inference even for small sample sizes.

Fiducial inference for a mixture of two normals

Example 7. In this example we consider the fiducial distribution for the parameters of a mixture of two normal distributions. This is a prototypical
example that can be used to construct fiducial distributions for many other problems. In particular one can use the ideas demonstrated in this example to construct a robust fiducial confidence interval for a mean of a normal sample by considering a mixture of normal and Cauchy distributions. To our knowledge this is the first time the fiducial paradigm has been used in such a complex situation.

Let X₁, …, Xₙ be independent random variables following either the N(µ₁, σ₁²) or the N(µ₂, σ₂²) distribution. Moreover, assume that each of the observations comes from the second distribution with probability p, independently of the others. For identifiability reasons we assume that µ₁ < µ₂. We also assume that we observe at least two data points from each distribution. Our goal will be to find the fiducial distribution of (µ₁, σ₁², µ₂, σ₂², p). We can write a set of structural equations for X₁, …, Xₙ as

X_i = (\mu_1 + \sigma_1 Z_i)\, I_{(0,p)}(U_i) + (\mu_2 + \sigma_2 Z_i)\, I_{(p,1)}(U_i), \qquad i = 1, \dots, n,

where the Z_i are i.i.d. N(0, 1) and the U_i are i.i.d. U(0, 1) random variables. When finding the set-valued function Q we need to realize that this inversion will be stratified based on the possible assignments of the observed values x_i to one of the two groups. For simplicity of notation, the observed points x and the corresponding z values assigned to groups 1 and 2 are denoted by v₁, …, v_s and h₁, …, h_s, and by w₁, …, w_{n−s} and r₁, …, r_{n−s}, respectively, where s is the number of observations assigned to the first group. We can then write

Q(x_1, \dots, x_n; z_1, \dots, z_n, u_1, \dots, u_n) = \begin{cases} \left\{ \left( \frac{h_1 v_2 - h_2 v_1}{h_1 - h_2}, \left( \frac{v_1 - v_2}{h_1 - h_2} \right)^2, \frac{r_1 w_2 - r_2 w_1}{r_1 - r_2}, \left( \frac{w_1 - w_2}{r_1 - r_2} \right)^2 \right) \right\} \times (u_{s:n}, u_{s+1:n}), \\ \quad \text{for each assignment of the } x_i \text{ to the two groups, if} \\ \quad v_l = \frac{h_1 v_2 - h_2 v_1}{h_1 - h_2} + \frac{v_1 - v_2}{h_1 - h_2}\, h_l, \; l = 3, \dots, s, \text{ and } w_l = \frac{r_1 w_2 - r_2 w_1}{r_1 - r_2} + \frac{w_1 - w_2}{r_1 - r_2}\, r_l, \; l = 3, \dots, n - s; \\ \emptyset \quad \text{otherwise.} \end{cases}

Similarly as in the previous examples, for each possible assignment of the observations to the two groups, set

M_1 = \frac{H_1 v_2 - H_2 v_1}{H_1 - H_2}, \quad N_1 = \left( \frac{v_1 - v_2}{H_1 - H_2} \right)^2, \quad M_2 = \frac{R_1 w_2 - R_2 w_1}{R_1 - R_2}, \quad N_2 = \left( \frac{w_1 - w_2}{R_1 - R_2} \right)^2, \quad P = U_{s:n} + \bar{U}(U_{s+1:n} - U_{s:n}),

K_l = M_1 + \frac{v_1 - v_2}{H_1 - H_2}\, H_l, \; l = 3, \dots, s, \quad \text{and} \quad L_l = M_2 + \frac{w_1 - w_2}{R_1 - R_2}\, R_l, \; l = 3, \dots, n - s.
We then interpret the conditional distribution (5) as

\lim_{\varepsilon \to 0+} \sum_{s=3}^{n-2} \sum_{\text{assignments}} P\Big( M_1 \in (m_1, m_1 + \varepsilon), N_1 \in (n_1, n_1 + \varepsilon), M_2 \in (m_2, m_2 + \varepsilon), N_2 \in (n_2, n_2 + \varepsilon), P \in (p, p + \varepsilon) \,\Big|\, K_l \in (v_l, v_l + \varepsilon), L_j \in (w_j, w_j + \varepsilon) \Big)
= C^{-1} \sum_{s=3}^{n-2} \sum_{\text{assignments}} \frac{f_P(p, s)}{\binom{n}{s}}\, f_{M_1,N_1,K}(m_1, n_1, v)\, f_{M_2,N_2,L}(m_2, n_2, w),  (16)

where f_P is as defined in (10) and both f_{M_1,N_1,K} and f_{M_2,N_2,L} are as defined in (15). The constant C on the left-hand side of (16) is

C = \sum_{s=3}^{n-2} \sum_{\text{assignments}} \int \cdots \int \frac{f_P(p, s)}{\binom{n}{s}}\, f_{M_1,N_1}(m_1, n_1, v)\, f_{M_2,N_2}(m_2, n_2, w) = \sum_{s=3}^{n-2} \sum_{\text{assignments}} \frac{\Gamma\big(\frac{s-1}{2}\big)\, \Gamma\big(\frac{n-s-1}{2}\big)}{\binom{n}{s}\, \pi^{n/2-1}} \sum_{1 \le i \dots}

… we can also consider the equal tailed regions. In fact the conditions on the region are so flexible that they allow most typical multiple comparison regions. We will demonstrate this in Example 12 in Section 7.

Theorem 1. Suppose Assumptions 1 hold and γₙ → γ. Furthermore assume that there is a function ζ : ℝᵏ → ℝᵈ such that for any sₙ ∈ ℝᵏ satisfying √n(sₙ − t(ξ)) → h we have

\sqrt{n}\big( \zeta(s_n) - \theta \big) \to A h,  (19)

where the matrix A was defined in Assumption 2b. Then

\lim_{n \to \infty} P_\xi\big( \theta \in C(R_\theta(S), \zeta(S), S, \gamma_n) \big) = \gamma.
In particular, C(R_θ(S), ζ(S), S, γ) is a confidence region for θ with asymptotic coverage probability equal to γ.

Proof. Assumption 1 and Skorokhod's representation theorem (Billingsley 1995) imply that we can assume without loss of generality that

\sqrt{n}\,(S - t(\xi)) \to H \quad \text{a.s.}  (20)
This, assumption 2a, and (19) assure

\sqrt{n}\big( R_\theta(S) - \theta \big) \xrightarrow{D} R(H) \quad \text{a.s.}, \qquad \sqrt{n}\big( \zeta(S) - \theta \big) \to A H \quad \text{a.s.}  (21)
(Here the a.s. means for almost all sample paths of the process S_n and subsequently almost all values of H.) Therefore, by (20), (21), and assumption 3d,

C\big( \sqrt{n}(R_\theta(S) - \theta), \sqrt{n}(\zeta(S) - \theta), S, \gamma_n \big) \to C\big( R(H), AH, t(\xi), \gamma \big) \quad \text{a.s.}  (22)

Also, by assumption 3c we see that

P_\xi\big( \theta \in C(R_\theta(S), \zeta(S), S, \gamma_n) \big) = P_\xi\big( 0 \in C( \sqrt{n}(R_\theta(S) - \theta), \sqrt{n}(\zeta(S) - \theta), S, \gamma_n ) \big).

To finish the proof, we will show the following convergence:

P_\xi\big( 0 \in C( \sqrt{n}(R_\theta(S) - \theta), \sqrt{n}(\zeta(S) - \theta), S, \gamma_n ) \big) \to P_\xi\big( 0 \in C( R(H), AH, t(\xi), \gamma ) \big).

First notice that R(h) − Ah has a multivariate normal distribution with mean zero and covariance matrix Σ_R. It is the same distribution as the distribution of −AH. Assumption 2a implies

\{ h : 0 \in C(R(h), Ah, t(\xi), \gamma) \} = \{ h : -Ah \in C(R(h) - Ah, 0, t(\xi), \gamma) \} = \{ h : -Ah \in C(-AH, 0, t(\xi), \gamma) \}.  (23)

For simplicity of notation denote H_n = √n(S − t(ξ)). Also denote

B_n = \big\{ h : 0 \in C\big( \sqrt{n}\big( R_\theta(t(\xi) + h/\sqrt{n}) - \theta \big), \sqrt{n}\big( \zeta(t(\xi) + h/\sqrt{n}) - \theta \big), t(\xi) + h/\sqrt{n}, \gamma_n \big) \big\}

and B = {h : 0 ∈ C(R(h), Ah, t(ξ), γ)}. The sets are chosen to satisfy

\{ 0 \in C( \sqrt{n}(R_\theta(S) - \theta), \sqrt{n}(\zeta(S) - \theta), S, \gamma_n ) \} = \{ H_n \in B_n \} \quad \text{and} \quad \{ 0 \in C( R(H), AH, t(\xi), \gamma ) \} = \{ H \in B \}.
As noted before we have H_n →_D H. Moreover, assumptions 2b and 3d imply that B is open, ∂B = {h : 0 ∈ ∂C(R_θ(h), γ)} and B_n → B. Assumption 3a and (23) additionally imply that P(H ∈ ∂B) = 0. Denote

D_m = \bigcup_{k=m}^{\infty} B_k \setminus \Big( \bigcap_{k=m}^{\infty} B_k \Big)^{\circ}.

Notice that by assumption 3d we have D_m ↓ D ⊂ ∂B and P(H ∈ D) = 0. Moreover, if m ≤ n, then B_n △ B ⊂ D_m.
Fix an ε > 0. Continuity of probability implies that there is m1 such that Pξ (H ∈ Cm1 ) < ε. Consequently convergence in distribution implies that there is m2 such that for all n > m2 , Pξ (Hn ∈ Cm2 ) < ε. This implies that for n > max(m1 , m2 ) |Pξ (Hn ∈ Bn ) − P (Hn ∈ B)| ≤ P (Hn ∈ Cm1 ) < ε. Finally notice that |Pξ (Hn ∈ Bn ) − Pξ (H ∈ B)| ≤ |Pξ (Hn ∈ Bn ) − Pξ (Hn ∈ b)| + |Pξ (H ∈ Bn ) − Pξ (H ∈ B)|. Thus the assumption 3b and (23) together with the definition of convergence in distribution imply √ √ Pξ (0 ∈ C( n(Rθ (S) − θ), n(ζ(S) − θ), S, γn ) = Pξ (Hn ∈ Bn ) → Pξ (H ∈ B) = Pξ (0 ∈ C (R(H), AH, t(ξ), γ) = γ. This concludes the proof of the theorem. Remark 14. It is fairly straightforward to generalize the statements of Theorem 1 for distributions that are not in the domain of attraction of the normal distribution. Some examples in that direction have been explored in (Hannig et al. 2006b). However, the main ideas are better demonstrated within the setting we have chosen. In particular the key condition 2b is easier to understand if the limiting distribution is normal. The main issue faced when dealing with continuous distributions is related to the need to use conditioning. In this section we first prove a corollary to the general theorem showing that under some suitable conditions the fiducial distribution is approximately correct even in the presence of conditioning. We illustrate this with examples. Assume that the set Q(x, u) defined in (4) is either a singleton or empty. Additionally assume there are functions Rθ (x, u) and R0 (x, u) satisfying {Rθ (x, u)} = Q(x, u) and R0 (x, u) = 0 whenever Q(x, u) 6= ∅. The fiducial distribution in (5) can be then interpreted as the distribution of Rθ (x, U ) | R0 (x, U ) = 0.
\[
R_\theta(x, U) \mid R_0(x, U) = 0. \tag{24}
\]
We will now state assumptions under which confidence regions based on (24) lead to asymptotically correct inference.
Assumptions 2. For a fixed γ ∈ (0, 1) assume the following:

1. There exists t(ξ) ∈ Rᵏ such that
\[
\sqrt{n}\,\bigl(S_1 - t_1(\xi), \ldots, S_k - t_k(\xi)\bigr)^{\!\top} \xrightarrow{D} H = (H_1, \ldots, H_k)^{\top}, \tag{25}
\]
where H has a non-degenerate multivariate normal distribution with mean 0 and variance ΣH.

2. There are matrices Aθ and A₀ such that for each fixed h ∈ Rᵏ and for any xn ∈ Rᵏ satisfying √n(xn − t(ξ)) → h:

(a)
\[
R_n(x) = \sqrt{n}\left(\begin{pmatrix} R_\theta(x_n, U) \\ R_0(x_n, U) \end{pmatrix} - \begin{pmatrix} \theta \\ 0 \end{pmatrix}\right) \xrightarrow{D} \begin{pmatrix} A_\theta \\ A_0 \end{pmatrix}(h - H^\star) = R(h). \tag{26}
\]
Here H⋆ is independent of and has the same distribution as H.

(b) R(h) (defined on the right-hand side of (26)) has a non-degenerate normal distribution. The density of Rn(x), denoted by fn(rθ, r₀), converges to the density of R(h). Moreover, for each fixed rθ the functions fn(rθ, ·) are uniformly integrable.

3. The region C(X, z, s, γ) satisfies Assumptions 1.3 with matrix A = Aθ − Aθ ΣH A₀ᵀ (A₀ ΣH A₀ᵀ)⁻¹ A₀.

Theorem 2. Suppose Assumptions 2 hold, γn → γ, and there is a function ζ satisfying the same condition as in Theorem 1. Then
\[
\lim_{n\to\infty} E\,P_\xi\bigl(\theta \in C\bigl(R_\theta(s, U^\star)\mid R_0(s, U^\star)=0,\; \zeta(S),\, S,\, \gamma_n\bigr)\,\big|\, S = s\bigr) = \gamma.
\]
In particular, for each observed s, C(Rθ(s, U⋆) | R₀(s, U⋆) = 0, γ) is a confidence region for θ with asymptotic coverage probability equal to γ.

Proof. Assumption 2.1 and Skorokhod's representation theorem (Billingsley 1995) imply that we may assume without loss of generality that √n(S − t(ξ)) → H a.s. This, together with assumptions 2.2a and 2.2b, assures that
\[
\sqrt{n}\,\bigl(R_\theta(S, U^\star) - \theta\bigr) \,\big|\, R_0(S, U^\star) = 0 \;\xrightarrow{D}\; A_\theta(H - H^\star) \,\big|\, A_0(H - H^\star) = 0 \quad \text{a.s.}
\]
(Here the a.s. again means for almost all values of Sn and H.) Set A = Aθ − Aθ ΣH A₀ᵀ (A₀ ΣH A₀ᵀ)⁻¹ A₀. The random vector Aθ(h − H⋆) | A₀(h − H⋆) = 0 has a normal distribution with mean Ah and variance
\[
A_\theta \Sigma_H A_\theta^{\top} - A_\theta \Sigma_H A_0^{\top} \bigl(A_0 \Sigma_H A_0^{\top}\bigr)^{-1} A_0 \Sigma_H A_\theta^{\top},
\]
which is also the variance of AH. This verifies condition 2b; the theorem now follows from Theorem 1.
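For completeness, the conditional mean and variance used in this proof follow from the standard multivariate normal conditioning formula (a textbook fact, not specific to this paper): if (X, Y) is jointly normal, then
\[
X \mid Y = y \;\sim\; N\bigl(\mu_X + \Sigma_{XY}\Sigma_{YY}^{-1}(y - \mu_Y),\; \Sigma_{XX} - \Sigma_{XY}\Sigma_{YY}^{-1}\Sigma_{YX}\bigr).
\]
Taking X = Aθ(h − H⋆) and Y = A₀(h − H⋆) with H⋆ ∼ N(0, ΣH) and y = 0 gives µ_X = Aθh, µ_Y = A₀h, Σ_XX = Aθ ΣH Aθᵀ, Σ_XY = Aθ ΣH A₀ᵀ and Σ_YY = A₀ ΣH A₀ᵀ, so the conditional mean is Aθh − Aθ ΣH A₀ᵀ(A₀ ΣH A₀ᵀ)⁻¹A₀h = Ah and the conditional variance is the expression displayed above.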
6 Discrete Distributions
In this section we explore some issues related to fiducial inference for discrete distributions. We show that the conditions of Theorem 1 can be directly verified for the most common discrete distributions.

Example 8. (Continuation of Example 3) Let X₁, ..., Xn be i.i.d. Multinomial(p) random variables, where p = (p₁, p₂, ..., p_k), p_j ∈ (0, 1), j = 1, ..., k, and ∑_{j=1}^k p_j < 1. The fiducial distribution for this model was derived in (8). Using Theorem 1 we now show that inference based on R_p(s) has good frequentist properties asymptotically.

Since we will consider equal-tailed regions based on the distribution of R_p(s), Assumptions 1.3 are automatically verified. Let S_j denote the number of times we observe the value j among X₁, ..., Xn; recall that (S₁, ..., S_k)ᵀ has a multinomial(n, p₁, ..., p_k) distribution, so for the vector S of sample proportions √n(S − p) →ᴰ H, where H ∼ N(0, Σ) and Σ = Diag(p) − ppᵀ. This verifies assumption 1.1.

Notice that for any sequence of integers k_n with 0 ≤ k_n ≤ n we have n(U_{k_n+1:n} − U_{k_n:n}) →ᴰ Γ(1, 1). Fix h, set s = p + h/√n, and denote W_n = (U_{ns₁:n}, U_{n(s₁+s₂):n}, ..., U_{n(s₁+⋯+s_k):n}). A simple calculation shows that √n(W_n − q) →ᴰ N(g, Σ̂), where g_j = ∑_{l=1}^j h_l, Σ̂_{i,j} = min(q_i, q_j)(1 − max(q_i, q_j)), and q_j = ∑_{l=1}^j p_l. Thus by Slutsky's theorem
\[
\sqrt{n}\,\bigl(R_p(S) - p\bigr) \xrightarrow{D} N(h, \Sigma).
\]
Assumptions 1 are thus verified. In particular, we can conclude that fiducial confidence sets will have asymptotically correct frequentist coverage regardless of the choice of the distribution V(·).
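The recipe in this example is easy to simulate. Below is a minimal Python sketch (our illustration, not code from the paper; the function name is ours and all cell counts are assumed positive): it generates fiducial draws of p by placing the cut point q_j uniformly between the uniform order statistics at the cumulative counts, i.e., one concrete choice of V(Q(x, u)).

```python
import numpy as np

def fiducial_multinomial(counts, n_draws=10_000, seed=None):
    """Fiducial draws for multinomial cell probabilities.

    counts : observed cell counts (s_1, ..., s_K), all assumed positive,
             summing to n.  For each draw, the cut point q_j is placed
             uniformly between the uniform order statistics U_{c_j:n} and
             U_{c_j+1:n}, where c_j = s_1 + ... + s_j; the fiducial p is
             the vector of successive differences of the cut points.
    """
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    n = counts.sum()
    c = np.cumsum(counts)[:-1]                 # c_1 < ... < c_{K-1} < n
    out = np.empty((n_draws, counts.size))
    for i in range(n_draws):
        u = np.sort(rng.uniform(size=n))       # order statistics U_{1:n} <= ... <= U_{n:n}
        q = rng.uniform(u[c - 1], u[c])        # q_j ~ Uniform(U_{c_j:n}, U_{c_j+1:n})
        out[i] = np.diff(np.concatenate(([0.0], q, [1.0])))
    return out

# Equal-tailed 95% fiducial interval for p_1 given hypothetical counts (12, 5, 3):
draws = fiducial_multinomial([12, 5, 3], seed=1)
print(np.quantile(draws[:, 0], [0.025, 0.975]))
```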
We now derive a weak fiducial distribution for a sample from a general one-parameter discrete distribution.

Example 9. Assume that X is a discrete random variable and ξ ∈ R. Then the function G can be chosen to satisfy s = G(u, ξ) if and only if P(X < s|ξ) < u ≤ P(X ≤ s|ξ). Additionally assume that P(X ≤ s|ξ) is a monotone continuous function of ξ. Then we can define functions q₊ and q₋ satisfying
\[
q_+(x, u) = \xi \ \text{ if } P(X \le x \mid \xi) = u, \qquad q_-(x, u) = \xi \ \text{ if } P(X < x \mid \xi) = u.
\]
Finally, assume that V(a, b) is the uniform distribution on (a, b). Then
\[
R_\xi(x) = V(Q(x, U^\star)) = q_-(x, U^\star) + \bigl\{q_+(x, U^\star) - q_-(x, U^\star)\bigr\}\bar{U},
\]
where x is the observed value of X and U⋆ and Ū are independent uniform(0, 1) random variables. A code sketch of this recipe follows.
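As an illustration (ours, for a hypothetical binomial instance), q₊ and q₋ can be computed by solving P(X ≤ x | ξ) = u numerically in ξ. For a Binomial(m, p) observation the CDF is continuous and strictly decreasing in p, so root-finding applies directly; boundary observations x = 0 or x = m would need clipping and are excluded here.

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import binom

def fiducial_binom_p(x, m, n_draws=10_000, seed=None):
    """Weak fiducial draws for the success probability of a Binomial(m, p),
    given one observation x with 0 < x < m.

    q_plus(x, u) solves P(X <= x | p) = u and q_minus(x, u) solves
    P(X < x | p) = u; each draw is uniform on (q_minus, q_plus)."""
    rng = np.random.default_rng(seed)

    def q_plus(u):   # P(X <= x | p) = u
        return brentq(lambda p: binom.cdf(x, m, p) - u, 1e-12, 1 - 1e-12)

    def q_minus(u):  # P(X <= x-1 | p) = u, i.e. P(X < x | p) = u
        return brentq(lambda p: binom.cdf(x - 1, m, p) - u, 1e-12, 1 - 1e-12)

    u = rng.uniform(size=n_draws)       # U*
    ubar = rng.uniform(size=n_draws)    # independent U-bar
    lo = np.array([q_minus(v) for v in u])
    hi = np.array([q_plus(v) for v in u])
    return lo + (hi - lo) * ubar

draws = fiducial_binom_p(x=7, m=20, seed=1)
print(np.quantile(draws, [0.025, 0.975]))   # equal-tailed 95% fiducial interval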
Example 10. Let X be a Poisson(λ) random variable. As in Example 9 define q₊(x, u) = λ if P(X ≤ x|λ) = u and q₋(x, u) = λ if P(X < x|λ) = u. Notice that q₋(x, U) < λ < q₊(x, U) if and only if the Poisson random variable G(U, λ) = x. Using an appropriate Poisson process we can rewrite this event as E(x) < λ < E(x) + E, where E(x) is a Gamma(x, 1) random variable and E is an exponential(1) random variable independent of E(x). Notice that E(x) and E are independent of X. Thus the weak FQ can be written as
\[
R_\lambda(x) = V(Q(x, U^\star)) = E(x) + \bar{U}E, \tag{27}
\]
where x is the observed value of X and Ū is uniform(0, 1), independent of X, E(x) and E. If we choose Ū to have a Beta(1/2, 1/2) distribution instead of uniform(0, 1) in (27), Rλ will have a Gamma(x + 1/2) distribution, which again corresponds to the Bayesian solution using the Jeffreys prior. A particularly interesting case is Ū = 1, which leads to a Gamma(x + 1) distribution. This is the scaled likelihood function L(λ; x)/∫₀^∞ L(λ; x) dλ.

Let us now consider X₁, ..., Xn i.i.d. Poisson(λ) random variables and denote S = (X₁ + ⋯ + Xn)/n. Clearly
\[
R_\lambda(x) = \frac{E(ns) + \bar{U}E}{n}
\]
is a weak FQ for λ.
By the central limit theorem √n(S − λ) →ᴰ N(0, λ), and a simple calculation shows that if s = λ + h/√n then
\[
\frac{E(ns) - n\lambda}{\sqrt{n}} \xrightarrow{D} N(h, \lambda).
\]
This verifies Assumptions 1.
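Formula (27) and its n-sample version are easy to check by simulation. Below is a small Python sketch (our illustration; function names are ours): it draws from the weak FQ Rλ = (E(ns) + ŪE)/n and estimates the frequentist coverage of the resulting equal-tailed intervals.

```python
import numpy as np

def fiducial_poisson_draws(total, n, n_draws=10_000, rng=None):
    """Draws from the weak FQ (E(n*s) + Ubar*E)/n for Poisson data,
    where total = n*s is the observed sum of the sample."""
    rng = np.random.default_rng(rng)
    e_x = rng.gamma(shape=total, scale=1.0, size=n_draws)  # E(ns) ~ Gamma(ns, 1)
    e = rng.exponential(size=n_draws)                      # E ~ Exp(1)
    ubar = rng.uniform(size=n_draws)                       # Ubar ~ U(0, 1)
    return (e_x + ubar * e) / n

def coverage(lam=2.0, n=20, reps=2000, level=0.95, seed=1):
    rng = np.random.default_rng(seed)
    alpha, hits = 1 - level, 0
    for _ in range(reps):
        total = rng.poisson(lam, size=n).sum()
        if total == 0:          # Gamma(0, 1) is degenerate; skip this corner case
            continue
        d = fiducial_poisson_draws(total, n, n_draws=2000, rng=rng)
        lo, hi = np.quantile(d, [alpha / 2, 1 - alpha / 2])
        hits += lo <= lam <= hi
    return hits / reps

print(coverage())   # should be close to 0.95
```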
7 Continuous Distributions
Example 11. The first example is motivated by an unbalanced variance components model. Such models arise in heritability studies in animal breeding experiments (Burch & Iyer 1997), quality improvement studies in manufacturing processes (Burdick, Borror & Montgomery 2005), characterizing sources of error in general variance components models (Liao & Iyer 2004), and in many other applications.

In the simplest case one has the following normal components of variance model: Yij = µ + Ai + eij, where µ is an unknown parameter, the Ai are i.i.d. N(0, φ), the eij are i.i.d. N(0, θ), and all random variables are jointly independent. In metrology, Yij might be the diameter measurement of a part (ball-bearing) and µ the mean diameter of the population of ball-bearings output by the process. A random sample of ball-bearings is selected; the true diameter of the i-th ball-bearing is µ + Ai, and ball-bearing i is measured ni times. If ni = n for all i we have a balanced one-way random effects model; in the case of unequal ni we have an unbalanced one-way random model.

In the balanced case the complete sufficient statistics are well known (Searle, Casella & McCulloch 1992). In the unbalanced case the minimal sufficient statistics are incomplete. Inference about φ and θ is typically based on K independent quadratic forms which have scaled chi-square distributions and whose expected values have the form θ + ciφ for some known ci, i = 1, ..., K. The simplest challenging case is K = 3. Hence we consider the following scenario and illustrate our procedure for obtaining a weak fiducial distribution for (φ, θ). Let
\[
S_1 = \frac{(c_1\varphi + \theta)U_1}{n_1}, \qquad S_2 = \frac{(c_2\varphi + \theta)U_2}{n_2}, \qquad S_3 = \frac{\theta U_3}{n_3},
\]
where c₁ > c₂ > 0, and U₁, U₂, U₃ are independent chi-square random variables with n₁, n₂, n₃ degrees of freedom respectively. Solving the first two equations for φ and θ and then plugging the results into the third equation suggests defining
\[
W_1 = \frac{n_1 s_1}{(c_1-c_2)U_1} - \frac{n_2 s_2}{(c_1-c_2)U_2}, \qquad
W_2 = -\frac{c_2 n_1 s_1}{(c_1-c_2)U_1} + \frac{c_1 n_2 s_2}{(c_1-c_2)U_2},
\]
\[
W_3 = \frac{U_3}{n_3}\left(-\frac{c_2 n_1 s_1}{(c_1-c_2)U_1} + \frac{c_1 n_2 s_2}{(c_1-c_2)U_2}\right).
\]
Here s₁, s₂, s₃ are again the observed values of the statistics S₁, S₂, S₃. The fiducial distribution of (φ, θ) defined by (5) can then be interpreted as the conditional distribution of W₁, W₂ | W₃ = s₃. A routine calculation shows that the joint density of W₁, W₂, W₃ is
\[
f_W(w_1, w_2, w_3; s_1, s_2) = \frac{|c_1-c_2|\,(n_1 s_1)^{\frac{n_1}{2}}(n_2 s_2)^{\frac{n_2}{2}}\, n_3^{\frac{n_3}{2}}\, w_3^{\frac{n_3}{2}-1} \exp\!\left\{-\frac{1}{2}\!\left(\frac{n_1 s_1}{w_1 c_1 + w_2} + \frac{n_2 s_2}{w_1 c_2 + w_2} + \frac{n_3 w_3}{w_2}\right)\right\}}{2^{\frac{n_1+n_2+n_3}{2}}\,\Gamma\!\left(\frac{n_1}{2}\right)\Gamma\!\left(\frac{n_2}{2}\right)\Gamma\!\left(\frac{n_3}{2}\right)\, (w_1 c_1 + w_2)^{\frac{n_1}{2}+1}(w_1 c_2 + w_2)^{\frac{n_2}{2}+1}\, w_2^{\frac{n_3}{2}}}.
\]
The fiducial distribution of (φ, θ) therefore has density
\[
\frac{f_W(w_1, w_2, s_3; s_1, s_2)}{\iint f_W(w_1', w_2', s_3; s_1, s_2)\, dw_1'\, dw_2'}.
\]
Consequently, the fiducial distribution of φ has density
\[
\frac{\int f_W(w_1, w_2', s_3; s_1, s_2)\, dw_2'}{\iint f_W(w_1', w_2', s_3; s_1, s_2)\, dw_1'\, dw_2'}.
\]
To set up confidence regions one can use numerical integration; a sketch is given below. The fiducial distribution of φ does not lead to exact frequentist inference. However, simulation results suggest good practical properties. For details on the simulations and some generalizations we refer the reader to the technical report E, Hannig & Iyer (2006).
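The following is a minimal Python sketch (ours) of the numerical-integration route: it evaluates the density f_W reconstructed above on a grid in (w₁, w₂) with w₃ = s₃ fixed, integrates w₂ out, normalizes, and reads off equal-tailed quantiles for φ. The grid ranges and the input values are arbitrary illustrative choices.

```python
import numpy as np
from scipy.special import gammaln
from scipy.integrate import trapezoid

def fW(w1, w2, w3, s1, s2, n1, n2, n3, c1, c2):
    """Joint density f_W(w1, w2, w3; s1, s2) displayed above, computed in
    log space for numerical stability; zero outside the support."""
    d1, d2 = c1 * w1 + w2, c2 * w1 + w2
    ok = (d1 > 0) & (d2 > 0) & (w2 > 0)
    d1s, d2s = np.where(ok, d1, 1.0), np.where(ok, d2, 1.0)  # safe placeholders
    w2s = np.where(ok, w2, 1.0)
    a1, a2 = n1 * s1, n2 * s2
    log_f = (np.log(abs(c1 - c2))
             + 0.5 * n1 * np.log(a1) + 0.5 * n2 * np.log(a2)
             + 0.5 * n3 * np.log(n3) + (0.5 * n3 - 1.0) * np.log(w3)
             - 0.5 * (a1 / d1s + a2 / d2s + n3 * w3 / w2s)
             - 0.5 * (n1 + n2 + n3) * np.log(2.0)
             - gammaln(0.5 * n1) - gammaln(0.5 * n2) - gammaln(0.5 * n3)
             - (0.5 * n1 + 1.0) * np.log(d1s) - (0.5 * n2 + 1.0) * np.log(d2s)
             - 0.5 * n3 * np.log(w2s))
    return np.where(ok, np.exp(log_f), 0.0)

# Hypothetical data summaries and design constants:
s1, s2, s3 = 4.0, 2.5, 1.0
n1, n2, n3 = 8, 8, 8
c1, c2 = 5.0, 2.0

w1 = np.linspace(-2.0, 8.0, 500)            # grid for phi
w2 = np.linspace(1e-3, 10.0, 600)           # grid for theta
W1, W2 = np.meshgrid(w1, w2, indexing="ij")
dens = fW(W1, W2, s3, s1, s2, n1, n2, n3, c1, c2)

marg = trapezoid(dens, w2, axis=1)          # integrate theta out
marg /= trapezoid(marg, w1)                 # normalized fiducial density of phi
cdf = np.cumsum(marg) * (w1[1] - w1[0])
lo, hi = w1[np.searchsorted(cdf, 0.025)], w1[np.searchsorted(cdf, 0.975)]
print(f"Equal-tailed 95% fiducial interval for phi: ({lo:.3f}, {hi:.3f})")
```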
To show that the fiducial distribution leads at least to asymptotically proper frequentist coverage, define n = n₁ + n₂ + n₃ and assume that nᵢ/n → pᵢ ∈ (0, 1). Also notice that √n(S − (c₁φ + θ, c₂φ + θ, θ)) →ᴰ H, where
\[
H \sim N\!\left(0,\; \begin{pmatrix} \frac{2(c_1\varphi+\theta)^2}{p_1} & 0 & 0 \\ 0 & \frac{2(c_2\varphi+\theta)^2}{p_2} & 0 \\ 0 & 0 & \frac{2\theta^2}{p_3} \end{pmatrix}\right). \tag{28}
\]
As in Assumptions 2 define
\[
Z_1 = \sqrt{n}\,(W_1 - \varphi), \qquad Z_2 = \sqrt{n}\,(W_2 - \theta), \qquad Z_3 = \sqrt{n}\,(W_3 - s_3),
\]
and set h = (h₁, h₂, h₃) to satisfy
\[
s_1 = (c_1\varphi + \theta) + \frac{h_1}{\sqrt{n}}, \qquad s_2 = (c_2\varphi + \theta) + \frac{h_2}{\sqrt{n}}, \qquad s_3 = \theta + \frac{h_3}{\sqrt{n}}.
\]
Then the density of (Z₁, Z₂, Z₃) is
\[
f(z_1, z_2, z_3) = f_W\!\left(\varphi + \frac{z_1}{\sqrt{n}},\; \theta + \frac{z_2}{\sqrt{n}},\; \theta + \frac{h_3}{\sqrt{n}} + \frac{z_3}{\sqrt{n}};\; c_1\varphi + \theta + \frac{h_1}{\sqrt{n}},\; c_2\varphi + \theta + \frac{h_2}{\sqrt{n}}\right) n^{-3/2}.
\]
This function is bounded from above by an integrable function. Moreover, if we set
\[
B = \begin{pmatrix} A_\theta \\ A_0 \end{pmatrix} = \begin{pmatrix} \frac{1}{c_1-c_2} & -\frac{1}{c_1-c_2} & 0 \\[2pt] -\frac{c_2}{c_1-c_2} & \frac{c_1}{c_1-c_2} & 0 \\[2pt] -\frac{c_2}{c_1-c_2} & \frac{c_1}{c_1-c_2} & -1 \end{pmatrix},
\]
then f(z₁, z₂, z₃) converges as n → ∞ to a multivariate normal density; in fact, it is the density of the random variable B(h − H), where H is defined in (28). Thus the Lebesgue dominated convergence theorem and Theorem 2 imply that confidence intervals based on the fiducial density of φ have asymptotically correct frequentist properties.

Example 12. We include this last example to show that the regions defined in Assumptions 1.3 are flexible enough to allow for typical multiple comparison intervals. Suppose that for each i = 1, ..., K, Yij, j = 1, ..., nᵢ, are i.i.d. N(µᵢ, σᵢ²). The K samples are also assumed independent of each other. We are interested in the problem of constructing simultaneous confidence intervals for δij = µᵢ − µⱼ for all i ≠ j.

We first observe that by independence the fiducial distribution for δij is the same as the distribution of the FGPQ given by R_{δij}(S, S⋆, ξ) = R_{µᵢ} − R_{µⱼ}, where
\[
R_{\mu_p} = \bar{Y}_p - \frac{S_p}{S_p^\star}\bigl(\bar{Y}_p^\star - \mu_p\bigr)
\]
is the FGPQ for µ_p (see Example 6). Define
\[
D(S, S^\star, \xi) = \max_{i \ne j} \frac{\bigl|(\bar{Y}_i - \bar{Y}_j) - R_{\delta_{ij}}(S, S^\star, \xi)\bigr|}{\sqrt{V_{ij}}},
\]
where V_{ij} = S_i²/n_i + S_j²/n_j is a consistent estimator of the variance of Ȳᵢ − Ȳⱼ. The 100(1 − α)% two-sided simultaneous FGCIs for the pairwise differences δij, i ≠ j, of the means of more than two independent normal distributions are [L_{ij}, U_{ij}], where
\[
L_{ij} = \bar{Y}_i - \bar{Y}_j - d_{1-\alpha}\sqrt{V_{ij}}, \qquad U_{ij} = \bar{Y}_i - \bar{Y}_j + d_{1-\alpha}\sqrt{V_{ij}}, \tag{29}
\]
and d_γ denotes the 100γ-percentile of the conditional distribution of D(S, S⋆, ξ) given S = s. To set up confidence regions one can use simulation; a sketch follows at the end of this example. The simultaneous fiducial confidence intervals for δij do not lead to exact frequentist inference. However, simulation results suggest very good practical properties. For details on the simulations and some generalizations we refer the reader to Abdel-Karim (2005) and Hannig, E, Abdel-Karim & Iyer (2006a).

To show that the fiducial distribution leads at least to asymptotically proper frequentist coverage, define n = ∑_{k=1}^K n_k and assume that nᵢ/n → pᵢ ∈ (0, 1). It is fairly straightforward to see that S = (Ȳ₁, S₁², ..., Ȳ_K, S_K²)ᵀ satisfies Assumption 1.1. Similarly, R = (R_{δ₁₂}, R_{δ₁₃}, ..., R_{δ₍K−1₎K})ᵀ satisfies Assumptions 1.2, with the K(K − 1)/2 × 2K matrix
\[
A = \begin{pmatrix}
1 & 0 & -1 & 0 & 0 & 0 & \cdots & 0 & 0 & 0 & 0 \\
1 & 0 & 0 & 0 & -1 & 0 & \cdots & 0 & 0 & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \ddots & \vdots & \vdots & \vdots & \vdots \\
0 & 0 & 0 & 0 & 0 & 0 & \cdots & 1 & 0 & -1 & 0
\end{pmatrix}.
\]
Similarly, the assumption in (19) will be satisfied with the function ζ(S) = A·S. Finally, we need to show that the region described in (29) satisfies Assumptions 1.3. To that end, observe that the conditional distribution of D(S, S⋆, ξ) | S can be represented as a function of the distribution of R, ζ(S), and S. Here, the estimator of variance nV_{ij} can be expressed as a continuous function of S. The various conditions of this assumption now follow by Slutsky's lemma and simple algebra.
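The simulation is straightforward because, conditionally on the data, (Ȳ_p⋆ − µ_p)/(S_p⋆/√n_p) has a t_{n_p−1} distribution, so R_{µ_p} reduces to ȳ_p − s_p T_p/√n_p with independent t variables T_p. The following Python sketch (ours; function name and data values are hypothetical) estimates d_{1−α} by Monte Carlo and forms the intervals (29):

```python
import numpy as np
from itertools import combinations

def simultaneous_fgci(ybar, s, n, alpha=0.05, n_mc=50_000, seed=None):
    """Simultaneous 100(1-alpha)% fiducial intervals for all mu_i - mu_j.

    ybar, s, n : per-group sample means, sample SDs, and sample sizes.
    Conditionally on the data, R_{mu_p} = ybar_p - s_p * T_p / sqrt(n_p)
    with independent T_p ~ t_{n_p - 1}, so
    D = max_{i<j} |s_i T_i / sqrt(n_i) - s_j T_j / sqrt(n_j)| / sqrt(V_ij).
    """
    rng = np.random.default_rng(seed)
    ybar, s, n = map(np.asarray, (ybar, s, n))
    K = len(ybar)
    t = rng.standard_t(df=n - 1, size=(n_mc, K))        # T_p, p = 1, ..., K
    dev = s * t / np.sqrt(n)                            # s_p T_p / sqrt(n_p)
    pairs = list(combinations(range(K), 2))
    v = np.array([s[i]**2 / n[i] + s[j]**2 / n[j] for i, j in pairs])
    d = np.max(np.abs(dev[:, [i for i, _ in pairs]]
                      - dev[:, [j for _, j in pairs]]) / np.sqrt(v), axis=1)
    crit = np.quantile(d, 1 - alpha)                    # d_{1-alpha}
    return {(i, j): (ybar[i] - ybar[j] - crit * np.sqrt(vij),
                     ybar[i] - ybar[j] + crit * np.sqrt(vij))
            for (i, j), vij in zip(pairs, v)}

# Hypothetical summaries for K = 3 groups:
print(simultaneous_fgci(ybar=[10.1, 9.4, 11.0], s=[1.2, 0.9, 1.5], n=[8, 10, 7], seed=1))
```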
8 Non-uniqueness of fiducial distribution
The fiducial recipe of Section 3 seems to provide an approach for deriving statistical procedures that have good properties. Unfortunately, it does not lead to a unique fiducial distribution. There are two main sources of this non-uniqueness.

The more obvious one is the fact that the sets Q(X, U⋆) might have more than one element. This means that we would not be able to find the exact value of ξ even if we knew both X and U. Consequently, the data itself is not able to tell us which value of ξ was used. In order to resolve this non-uniqueness one has to have some a priori way of choosing between the elements of Q(X, U⋆). Fortunately, in most applications where Assumptions 1 are satisfied we also observe that √n diam(Q(X, U⋆)) → 0. This means that in these cases the role of the a priori information is negligible asymptotically. Of course, such a situation can be expected only in parametric problems. However, just like the choice of a prior in Bayesian methods, the a priori choice of V(Q(x, u)) will play a big role in non-parametric and semi-parametric problems. Based on our experience with the problems we investigated, we recommend the use of a V(Q(x, u)) that is independent of the data and that maximizes the determinant of the variance of the fiducial distribution. Another useful option is to use the uniform distribution on Q(x, u). This second option should work reasonably well and be reasonably easy to implement even if we deal with higher-dimensional problems.

Another way of resolving this problem is using upper and lower probabilities, cf. Dempster (1968). In particular, instead of defining a single fiducial distribution on the parameter space we define an upper and a lower fiducial distribution. In our setting the upper probability is obtained as the supremum over possible choices of the distributions V(Q(x, u)), while the lower probability is obtained as the infimum over possible choices of the distributions V(Q(x, u)). Therefore, if one refuses to use any subjective prior information, one can still use the fiducial recipe for obtaining statistical procedures using the upper and lower probabilities; a small illustration follows.
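To make the upper/lower idea concrete, here is a small Python sketch (our illustration, building on Example 10): for a Poisson observation the set Q(x, U⋆) is the interval (E(x), E(x) + E), so for an event of the form {λ ≤ t} the supremum over choices of V is attained by putting all mass at the left endpoint and the infimum at the right endpoint.

```python
import numpy as np

def poisson_upper_lower(x, t, n_draws=100_000, seed=None):
    """Monte Carlo upper and lower fiducial probabilities of {lambda <= t}
    for a single Poisson observation x > 0.  Q(x, U*) = (E(x), E(x) + E)
    with E(x) ~ Gamma(x, 1) and E ~ Exp(1) independent; the upper
    probability uses the left endpoint, the lower the right endpoint."""
    rng = np.random.default_rng(seed)
    left = rng.gamma(shape=x, scale=1.0, size=n_draws)   # E(x)
    right = left + rng.exponential(size=n_draws)         # E(x) + E
    return (left <= t).mean(), (right <= t).mean()       # (upper, lower)

upper, lower = poisson_upper_lower(x=4, t=5.0, seed=1)
print(f"upper = {upper:.3f}, lower = {lower:.3f}")
```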
The second source of non-uniqueness is caused by the Borel paradox. If in the fiducial recipe (5) we have P(Q(x, u) ≠ ∅) = 0, the resulting fiducial distribution depends on the way we decide to interpret the conditioning. We consider this to be a more severe problem because it is much harder to investigate and resolve. To demonstrate the severity of the situation, consider the following continuation of Example 6.

Example 13. Let X₁, ..., Xn be i.i.d. N(µ, σ²). In Example 6 we showed two different ways of implementing the fiducial recipe that both led to the same desirable solution. Unfortunately, there are many other ways of implementing the fiducial recipe that do not lead to good solutions. We demonstrate one of them here.

We again write the structural equation as Xᵢ = µ + σZᵢ, i = 1, ..., n. For simplicity of notation assume that n is even, i.e., n = 2k. Define
\[
M_j = \frac{z_{2j-1}x_{2j} - z_{2j}x_{2j-1}}{z_{2j-1} - z_{2j}}, \qquad H_j = \left(\frac{x_{2j-1} - x_{2j}}{z_{2j-1} - z_{2j}}\right)^{\!2}, \qquad j = 1, \ldots, k.
\]
Therefore we can write
\[
Q(x_1, \ldots, x_n; z_1, \ldots, z_n) = \begin{cases} \{(M_1, H_1)\} & \text{if } M_j = M_1,\ H_j = H_1,\ j = 2, \ldots, k, \\ \emptyset & \text{otherwise.} \end{cases}
\]
Defining D_{j,1} = M_j − M₁ and D_{j,2} = H_j − H₁, j = 2, ..., k, we can interpret the fiducial distribution (5) as the conditional distribution of (M₁, H₁) | D = 0. A simple calculation shows that this conditional distribution has density
\[
f_{R_{(\mu,\sigma^2)}}(m, h) = \frac{\bigl((n-1)s_n^2\bigr)^{n-\frac{3}{2}}\, e^{-\frac{(n-1)s_n^2}{2h} - \frac{(m-\bar{x}_n)^2}{2h/n}}}{\sqrt{\pi/n}\;\Gamma\!\left(n - \tfrac{3}{2}\right)\, 2^{n-1}\, h^{n}}\, I_{(0,\infty)}(h). \tag{30}
\]
Here x̄ₙ = ∑_{i=1}^n xᵢ/n and sₙ² = ∑_{i=1}^n (xᵢ − x̄ₙ)²/(n − 1).

The distribution derived in (30) is different from the one derived in (14), and inference based on (30) will not lead to correct frequentist inference. In particular, the confidence intervals for the variance will be too large; indeed, the coverage probability of any lower-tail confidence interval will converge to 0 as n → ∞. A numerical illustration follows.
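To see the miscalibration concretely: integrating m out of the reconstructed density (30) shows that (n − 1)sₙ²/h follows a χ²₂ₙ₋₃ distribution (our calculation from (30), stated here as an assumption), whereas the classical fiducial solution for σ² — which we take to be what (14) gives — is (n − 1)sₙ²/χ²ₙ₋₁. The Python sketch below (ours) contrasts the two:

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 50, 4.0

# One simulated dataset:
x = rng.normal(0.0, np.sqrt(sigma2), size=n)
a = (n - 1) * x.var(ddof=1)                    # (n-1) * s_n^2

# sigma^2 draws under the classical fiducial solution and under (30):
classical = a / rng.chisquare(n - 1, size=100_000)      # (n-1)s^2 / chi2_{n-1}
paradox = a / rng.chisquare(2 * n - 3, size=100_000)    # marginal implied by (30)

print("classical median:", np.median(classical))   # near sigma2 = 4
print("paradox   median:", np.median(paradox))     # near sigma2 / 2 -- badly off
```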
The problem illustrated in Examples 6 and 13 is an instance of the Borel paradox – see for example Section 4.9.3 of Casella & Berger (2002) and also Hannig (1996) for a thorough discussion of this paradox. The main message of the Borel paradox is that conditioning on an event of probability zero depends greatly on the context in which we interpret the condition.

Consider in particular X | Y = 0, where (X, Y) is jointly continuous. There is a random variable U such that (X, U) is jointly continuous and {Y = 0} = {U = 0}, but the conditional density of X | Y = 0 is different from the conditional density of X | U = 0, even though the condition is the same in both cases. Since there is no theoretical reason that would deem either X | Y = 0 or X | U = 0 superior to the other, people often rely on the context of the problem to make the choice, e.g., conditional distributions in regression settings. However, one can often come up with a modification of the "story" behind the problem that leads naturally to a different choice of the conditioning variable. This can then be presented as a paradox – two apparently equivalent formulations of the same statistical problem lead to different answers.

The interpretation of the conditioning we used in Example 13 is "legal". However, it does not appear intuitively desirable, because it is unnecessarily complicated in comparison to the conditioning in Example 6. In the remainder of this section we explore two more ways of interpreting the conditional distribution in (5). They also lead to different answers, reaffirming the Borel paradox.

Example 14. Another important way of interpreting the conditional probability is through the following limiting process. Let x ∈ Rⁿ and define the cube xᵉ = (x₁ − ε, x₁ + ε) × ⋯ × (xₙ − ε, xₙ + ε). Let us also assume that X ∈ Rⁿ is a continuous random vector with distribution indexed by a parameter ξ ∈ Ξ, where the parameter space Ξ is an open subset of Rᵖ. Denote its density by f_X(x|ξ). Additionally assume that the cardinality of the set Q(x, u) (cf. (4)) satisfies |Q(x, u)| ≤ 1. Finally, assume that for all x there are ε > 0 and C < ∞ such that for all y ∈ xᵉ we have ∫_Ξ f_X(y|ξ̄) dξ̄ < C. Then the density of the conditional distribution in the definition of the fiducial distribution (5) can be interpreted as
\[
r(\xi \mid x) = \lim_{\varepsilon \to 0} \frac{P(G(U^\star, \xi) \in x^\varepsilon)}{P\bigl(\text{there exists } \bar{\xi},\ G(U^\star, \bar{\xi}) \in x^\varepsilon\bigr)} = \frac{f_X(x \mid \xi)}{\int_\Xi f_X(x \mid \bar{\xi})\, d\bar{\xi}}. \tag{31}
\]
The second equality in (31) follows from the bounded convergence theorem and the fact that P(G(U⋆, ξ) ∈ xᵉ) = ∫_{xᵉ} f_X(y|ξ) dy.

The result of (31) implies that, under our conditions, the Bayesian posterior with respect to the flat prior, i.e., the scaled likelihood, can be understood as a fiducial distribution. This is rather amusing, as it was Fisher's strong dislike of this particular Bayesian posterior that led to his invention of fiducial inference.
The conditions we imposed to derive (31) are very strong. In fact, the same conclusion can be derived under much milder conditions.

It is a well-known fact that the Bayesian posterior distribution with respect to a flat prior displays some unfavorable frequentist behavior; other priors often lead to better performance. The fiducial setting allows us to give another argument illustrating this phenomenon.

Example 15. Let us assume that the parameter of interest ξ is p-dimensional. Recall the structural equation (3), X = G(U, ξ). Write G = (g₁, ..., gₙ) so that Xᵢ = gᵢ(U, ξ) for i = 1, ..., n. Furthermore, set X₀ = (X₁, ..., X_p) and G₀ = (g₁, ..., g_p). We assume that, for each fixed u ∈ (0, 1), the mapping G₀(u, ·) is invertible. We denote this inverse mapping by Q₀(x₀, u) = (q₁(x₀, u), ..., q_p(x₀, u)); thus Q₀(G₀(u, ξ), u) = ξ. Now let X_c = (X_{p+1}, ..., Xₙ) and G_c = (g_{p+1}, ..., gₙ). Substituting ξ = Q₀(X₀, U) into the equations X_j = g_j(U, ξ), j = p + 1, ..., n, we get the identity
\[
X_c = G_c\bigl(U, Q_0(X_0, U)\bigr). \tag{32}
\]
Therefore the observed values x have to lie on a p-dimensional manifold in order for Q(x, u⋆) ≠ ∅. Moreover, the fiducial distribution (5) can be interpreted as the limiting distribution of
\[
\lim_{\varepsilon \to 0}\; Q_0(x_0, U^\star) \,\big|\, G_c\bigl(U^\star, Q_0(x_0, U^\star)\bigr) \in x_c^\varepsilon. \tag{33}
\]
If the random vector (Q₀(x₀, U⋆), G_c(U⋆, Q₀(x₀, U⋆))) is jointly continuous, the limiting distribution in (33) is well defined and unique. In fact, it is the conditional density of Q₀(x₀, U⋆) | G_c(U⋆, Q₀(x₀, U⋆)) = x_c.

We feel that this interpretation of the conditional distribution in (5) is the most appealing. Since the data must lie on a p-dimensional manifold, it is much preferable to increase the width of only the (n − p)-dimensional observation x_c, as opposed to increasing the width of the whole n-dimensional observation x as done in the limiting argument of Example 14. This is based on the heuristic argument that increasing the width unnecessarily leads to a loss of information, and it is supported by the fact that in most practical situations we are aware of, (33) gives rise to statistical procedures with better properties than those based on (31).

Given u, by virtue of (32), it follows that X must lie on the manifold M(u). Note that the same manifold may be definable using a different set
of equations, leading to a possibly different distribution in (33). We suggest that in this case we assume that the selection of observations into x₀ was done randomly, i.e., we average over all possible assignments. After taking this average we take the limit ε → 0 in (33). This procedure was used in Example 7 and is the reason why the terms of the type ∑_{1≤i