Sufficiency and Completeness... Towards an UMVUE

Sufficiency: A statistic is any function of the data $X_1, X_2, \ldots, X_n$ that does not depend on any unknown parameters. Examples of statistics are: $\bar{X}$, $S^2$, $X_{(1)}$, $X_{(n)}$, $X_{(n)} - X_{(1)}$, $\bar{X}/S^2$. (Note: “does not depend on any unknown parameters” means that $\beta X$, for example, for $\beta$ unknown, is not a statistic. However, the statistics have distributions which may depend on unknown parameters. For example, the statistic $\bar{X}$ may have variance $1/(\beta^2 n)$.)
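For concreteness, here is a tiny numerical sketch (the data values below are arbitrary, made-up numbers): each quantity computed is a function of the data alone, and hence a statistic.

```python
import numpy as np

# A toy data set standing in for x1, ..., xn (arbitrary, made-up values).
x = np.array([2.3, 1.7, 4.1, 3.0, 2.8])

xbar  = x.mean()             # the sample mean, X-bar
s2    = x.var(ddof=1)        # the sample variance, S^2
x_min = x.min()              # X_(1), the minimum order statistic
x_max = x.max()              # X_(n), the maximum order statistic
rng   = x_max - x_min        # the range, X_(n) - X_(1)
ratio = xbar / s2            # X-bar / S^2

print(xbar, s2, x_min, x_max, rng, ratio)
```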
Suppose that the random sample comes from a distribution with a pdf $f(x; \theta)$ where $\theta$ is an unknown parameter. Furthermore, suppose we are trying to estimate $\theta$. A “sufficient” statistic is a way of condensing the information provided by the data set into a smaller amount of data in such a way that we have not lost any information about the unknown $\theta$. For example(s):

1. If $X_1, X_2, \ldots, X_n$ is a random sample from a distribution with mean $\theta$, we would probably estimate $\theta$ with $\bar{X} = \frac{1}{n}\sum X_i$. So, it would probably be “sufficient” if we only knew the statistic $\bar{X}$ or even $\sum X_i$. (The size of the random sample is assumed to be known.)
2. If $X_1, X_2, \ldots, X_n$ is a random sample from the unif$(0, \theta)$ distribution, it would probably be “sufficient” if we knew only the statistic $X_{(n)}$, the maximum value of the data set.
Let us now formalize these thoughts. Consider our “data” $X_1, X_2, \ldots, X_n$. Let $T$ denote a statistic. That is, $T$ is some function of the data: $T = t(X_1, X_2, \ldots, X_n)$, where “little t” will be used to indicate an observed value $t = t(x_1, x_2, \ldots, x_n)$. Now, for these special “sufficient” statistics, we will use the letter $S$:
$$S = s(X_1, X_2, \ldots, X_n), \qquad s = s(x_1, x_2, \ldots, x_n).$$
Definition: Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with unknown parameter $\theta$. $S$ is a sufficient statistic for $\theta$ if the conditional distribution of $(X_1, X_2, \ldots, X_n)$ given $S$ does not depend on $\theta$. That is, if we compute $f_{\vec{X}|S}(\vec{x}|s)$, we should get a function that does not depend on $\theta$.
Let us redo the example that we saw in class.

Example 1: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from the Bernoulli distribution with parameter $p$. Each $X_i$ can take on the value 1 with probability $p$ and 0 with probability $1-p$. So, we can estimate $p$ by computing $\hat{p}$, the sample proportion of 1’s observed in $X_1, X_2, \ldots, X_n$. Since they are all 1’s and 0’s,
$$\hat{p} = \frac{\sum_{i=1}^n X_i}{n}.$$
In other words, for estimating $p$, it is probably “sufficient” that we know only $\sum X_i$ as opposed to the original list of 1’s and 0’s given by $(X_1, X_2, \ldots, X_n)$. Let us verify, by definition, that $S := \sum X_i$ is a sufficient statistic. Since $X_i \sim \text{Bern}(p)$, the pdf is:
$$f(x_i; p) = p^{x_i}(1-p)^{1-x_i} \cdot I_{\{0,1\}}(x_i)$$
and the joint pdf is therefore
$$f(\vec{x}; p) = f(x_1, \ldots, x_n; p) = \prod_{i=1}^n f(x_i; p) = p^{\sum x_i}(1-p)^{n-\sum x_i} \prod_{i=1}^n I_{\{0,1\}}(x_i).$$
Now
$$S = \sum_{i=1}^n X_i \;\Rightarrow\; S \sim \text{bin}(n, p),$$
so $S$ has pdf
$$f_S(s; p) = \binom{n}{s} p^s (1-p)^{n-s} \cdot I_{\{0,1,\ldots,n\}}(s).$$
We are ready to compute the conditional pdf for $\vec{X} = (X_1, \ldots, X_n)$ given that $S$ equals some value $s$:
$$f_{\vec{X}|S}(\vec{x}|s) = P(X_1 = x_1, \ldots, X_n = x_n \mid S = s) = \frac{P(X_1 = x_1, \ldots, X_n = x_n, S = s)}{P(S = s)}.$$
Since $S = \sum X_i$ and $s = \sum x_i$, the numerator is either going to be 0 or $P(X_1 = x_1, \ldots, X_n = x_n)$. For example,
$$P(X = 2, Y = 3, X + Y = 7) = 0 \quad\text{but}\quad P(X = 2, Y = 3, X + Y = 5) = P(X = 2, Y = 3).$$
So, $f_{\vec{X}|S}(\vec{x}|s) = 0$ if $s \neq \sum x_i$. In the case that $s = \sum x_i$,
$$f_{\vec{X}|S}(\vec{x}|s) = \frac{P(X_1 = x_1, \ldots, X_n = x_n)}{P(S = s)} = \frac{f(\vec{x}; p)}{f_S(s; p)} = \frac{p^{\sum x_i}(1-p)^{n-\sum x_i} \cdot \prod I_{\{0,1\}}(x_i)}{\binom{n}{s} p^s (1-p)^{n-s} \cdot I_{\{0,1,\ldots,n\}}(s)} = \frac{1}{\binom{n}{s}} \prod I_{\{0,1\}}(x_i).$$
About the indicators...

1. I didn’t divide one indicator into another; I simply replaced the information contained in both by a single indicator based on the original assumption that $s = \sum x_i$.
2. Don’t let the indicator manipulation take away from your understanding of sufficient statistics. Either
   (a) leave them out altogether, as long as they don’t depend on (hold information about) $p$, or
   (b) leave them in but don’t simplify the ratio.

No matter what you do with the indicators, though, we see that the ultimate expression for $f_{\vec{X}|S}(\vec{x}|s)$ does not depend on the parameter $p$. Therefore, we have shown that for a random sample from the Bernoulli distribution with parameter $p$, $S = \sum X_i$ is a sufficient statistic for $p$.
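As an optional numerical sanity check (a minimal Monte Carlo sketch, not part of the derivation above; the values of $n$, $s$, and the two $p$'s are arbitrary choices), the snippet below conditions simulated Bernoulli samples on $S = s$ and tabulates which arrangement of 1’s and 0’s occurred. If $S$ is sufficient, each of the $\binom{n}{s}$ arrangements should appear with conditional frequency about $1/\binom{n}{s}$, whatever $p$ generated the data.

```python
import numpy as np
from collections import Counter

# Monte Carlo sketch: condition Bernoulli(p) samples on S = sum(X_i) = s and
# tabulate the arrangements of 1's and 0's that occurred (n, s, p are arbitrary).
rng = np.random.default_rng(0)
n, s, reps = 4, 2, 200_000

for p in (0.2, 0.7):
    x = rng.binomial(1, p, size=(reps, n))
    kept = x[x.sum(axis=1) == s]                       # keep only samples with S = s
    counts = Counter(tuple(int(v) for v in row) for row in kept)
    total = sum(counts.values())
    # C(4, 2) = 6 arrangements, each with conditional probability 1/6 ≈ 0.167
    print(p, {k: round(c / total, 3) for k, c in sorted(counts.items())})
```

With $n = 4$ and $s = 2$, both runs should print frequencies near $1/6 \approx 0.167$ for all six arrangements, even though the two values of $p$ are very different.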
Note that the conditional pdf for $\vec{X}$ given $S$ is given by
$$f_{\vec{X}|S}(\vec{x}|s) = \frac{f(\vec{x}, s; p)}{f_S(s; p)}$$
for both discrete and continuous random variables, though in the continuous case we would not have been able to use probabilities like $P(X_1 = x_1, \ldots, X_n = x_n, S = s)$ in our calculations.
As in the above example, the “$S = s$” part of the numerator is always going to drop out of the calculation, so, in general, the computation of $f_{\vec{X}|S}(\vec{x}|s)$ is
$$f_{\vec{X}|S}(\vec{x}|s) = \frac{f(\vec{x}, s; \theta)}{f_S(s; \theta)} = \frac{f(\vec{x}; \theta)}{f_S(s; \theta)},$$
and if $S$ is sufficient, this will reduce to some function of the $x$’s, say
$$\frac{f(\vec{x}; \theta)}{f_S(s; \theta)} = h(\vec{x}), \quad\text{or}\quad f(\vec{x}; \theta) = h(\vec{x}) \cdot f_S(s; \theta). \tag{1}$$
(Note: $h$ may contain $n$’s, but $n$ is considered just a fixed, known number; we don’t consider $h$ as a “function of $n$”. Also, $h$ may contain $s$’s, but $s$ is just some function of the $x$’s, for example $s = \sum x_i$, and can just be replaced by that function of the $x$’s.)
Okay, so in the above example, we guessed at a sufficient statistic and then used the definition of a sufficient statistic to verify that it was, in fact, sufficient. Coming up with the sufficient statistic is not always so easy. Fortunately, we have the following criterion, inspired by (1).

The Factorization Criterion for Sufficiency: Let $X_1, X_2, \ldots, X_n$ be a random sample from a distribution with pdf $f(x; \theta)$ for some $\theta \in \Omega$. Then

$S = s(X_1, X_2, \ldots, X_n)$ is sufficient for $\theta$
$$\Updownarrow$$
$f(\vec{x}; \theta) = h(\vec{x}) \cdot g(s(\vec{x}); \theta)$ for some functions $h$ and $g$, where, as the notation suggests,

• $h$ does not depend on $\theta$
• $g$ depends on $\vec{x}$ only through $s(\vec{x})$

(In the above, $\Omega$ represents the parameter space. For example, some parameters, like the rate for an exponential distribution or the variance for a normal distribution, are required to be strictly positive. Some parameters, like the mean for a normal distribution, can be anything from $-\infty$ to $\infty$.)
Let us now revisit the Bernoulli example...
$$X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Bernoulli}(p)$$
$$\Downarrow$$
$$f(x; p) = p^x (1-p)^{1-x} \cdot I_{\{0,1\}}(x)$$
$$\Downarrow$$
$$f(\vec{x}; p) = \underbrace{\prod_{i=1}^n I_{\{0,1\}}(x_i)}_{h(\vec{x})} \cdot \underbrace{p^{\sum x_i}(1-p)^{n-\sum x_i}}_{g(s(\vec{x});\, p)}$$
where $s(\vec{x}) = \sum x_i$.
We identify the $h$ function by factoring out as much “$x$-stuff” as possible that does not involve the parameter. We identify the $g$ function by looking at everything that is left, that couldn’t be separated. We identify the sufficient statistic by seeing how the $x$’s are represented in the $g$ function. In this example, we then have that $S = \sum X_i$ is a sufficient statistic!
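If you like to check algebra by machine, here is an optional symbolic sketch (not part of the notes; the particular 0/1 sample is an arbitrary choice): it multiplies out the Bernoulli joint pmf for one observed sample and confirms that it collapses to $p^s(1-p)^{n-s}$, so the data enter only through $s = \sum x_i$.

```python
import sympy as sp

p = sp.symbols('p', positive=True)
x = [1, 0, 1, 1, 0]                      # a hypothetical observed 0/1 sample
n, s = len(x), sum(x)

# joint pmf built factor by factor, exactly as prod p^{x_i} (1-p)^{1-x_i}
joint = sp.Mul(*[p**xi * (1 - p)**(1 - xi) for xi in x])
# the g(s(x); p) factor of the factorization
g = p**s * (1 - p)**(n - s)

print(sp.simplify(joint - g))            # prints 0: with the indicators dropped, h(x) is just 1
```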
(Note: It is possible that we can’t separate anything out. In this case the function $h$ is simply 1; $h(\vec{x}) = 1$. This is what would have happened here if we ignored the indicators. However, if the indicator contains the parameter, we must use the indicator; it will become part of the function $g$, as in the next example!)
Example 2: Suppose that $X_1, X_2, \ldots, X_n$ is a random sample from the unif$(0, \theta)$ distribution. Then
$$f(x; \theta) = \frac{1}{\theta} \cdot I_{(0,\theta)}(x)$$
and the joint pdf is
$$f(\vec{x}; \theta) = \prod_{i=1}^n \frac{1}{\theta} \cdot I_{(0,\theta)}(x_i) = \frac{1}{\theta^n} \prod_{i=1}^n I_{(0,\theta)}(x_i).$$
Let’s simplify the indicator:

• $\prod_{i=1}^n I_{(0,\theta)}(x_i)$ equals one if all the $x_i$’s are between 0 and $\theta$. We will be guaranteed this happening if both the min and max are between 0 and $\theta$. So,
$$\prod_{i=1}^n I_{(0,\theta)}(x_i) = I_{(0,\theta)}(x_{(1)}) \cdot I_{(0,\theta)}(x_{(n)}).$$
There are other ways we can simplify this, for example
$$\prod_{i=1}^n I_{(0,\theta)}(x_i) = I_{(0,\theta)}(x_{(1)}) \cdot I_{(x_{(1)},\theta)}(x_{(n)})$$
or
$$\prod_{i=1}^n I_{(0,\theta)}(x_i) = I_{(0,\theta)}(x_{(n)}) \cdot I_{(0,x_{(n)})}(x_{(1)}).$$
• There is no wrong choice (indeed, there is no need to simplify the indicator at all), but I will choose the final factorization because it has $\theta$ depending on a minimal amount of information. (I will say more about this after this example.)

Okay, so we have
$$f(\vec{x}; \theta) = \underbrace{I_{(0,x_{(n)})}(x_{(1)})}_{h(\vec{x})} \cdot \underbrace{\frac{1}{\theta^n}\, I_{(0,\theta)}(x_{(n)})}_{g(s(\vec{x});\, \theta)}.$$
Again, the $g$ part is the part where we can’t separate the $x$’s and the parameter. The sufficient statistic is then given by the way the $x$’s appear in this function. Therefore, we have that $S = X_{(n)}$ is a sufficient statistic for $\theta$. (At this point we celebrate! Don’t you think that if you wanted to estimate $\theta$, the highest point for the unif$(0, \theta)$ distribution, it would be “sufficient” if we knew only the maximum value of the sample in place of the entire sample?)
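Here, too, a small Monte Carlo sketch can illustrate the definition (this is an aside, not from the notes; $n$, $m$, the window width, and the two $\theta$ values are arbitrary choices): conditional on the maximum $X_{(n)}$ landing near a fixed value $m$, the remaining observations should look like draws from unif$(0, m)$ no matter which $\theta$ generated the data.

```python
import numpy as np

# Crude conditioning on X_(n) ≈ m via a small acceptance window (illustration only).
rng = np.random.default_rng(0)
n, m, eps = 5, 2.0, 0.05

def rest_given_max_near_m(theta, reps=400_000):
    """Return the non-maximum observations from samples whose maximum is within eps of m."""
    x = rng.uniform(0, theta, size=(reps, n))
    x = x[np.abs(x.max(axis=1) - m) < eps]       # keep samples with maximum near m
    x = np.sort(x, axis=1)[:, :-1]               # drop the maximum from each kept row
    return x.ravel()

for theta in (2.5, 6.0):                         # two very different thetas, both > m
    rest = rest_given_max_near_m(theta)
    # If X_(n) is sufficient, these summaries agree across thetas and resemble unif(0, m):
    # mean near m/2 = 1.0, quartiles near (0.5, 1.0, 1.5).
    print(theta, rest.size, round(rest.mean(), 3),
          np.quantile(rest, [0.25, 0.5, 0.75]).round(3))
```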
Example 2 raises a few questions. What if we had factored that product of indicators differently?

Q: Is the sufficient statistic unique?
A: No.

Q: Can we have a sufficient vector?
A: Yes. For example, $(X_{(1)}, X_{(n)})$ is also sufficient for $\theta$. This is what we would have gotten with either of the first two simplifications of the indicator.

Q: So I suppose that the vector given by the entire sample, $(X_1, \ldots, X_n)$, is also sufficient for estimating $\theta$?
A: Yes!
We can condense the data by a lot or a little, or not at all.

Q: Is there a greatest reduction?
A: Yes. This would be called a minimal sufficient statistic.
Now, off the subject of sufficient statistics, I will remind you of (or introduce you to) two useful properties concerning expected value and variance.
1. Let $X$ and $Y$ be random variables. Then
$$E[X] = E[E[X|Y]].$$
Proof (continuous case): Note that

• $E[X]$ is a non-random constant.
• $E[X|Y = y]$ is a non-random function of $y$. The $X$ has been “averaged out” and would not appear in the answer.
• $E[X|Y]$ is a function of the random variable $Y$. It is precisely the function from the last bullet with $Y$ plugged in for $y$.

Now
$$
\begin{aligned}
E[E[X|Y]] &= \int E[X|y]\, f_Y(y)\, dy \\
&= \int \left[ \int x \cdot f_{X|Y}(x|y)\, dx \right] f_Y(y)\, dy \\
&= \int\!\!\int x \cdot f_{X|Y}(x|y) \cdot f_Y(y)\, dy\, dx \\
&= \int\!\!\int x \cdot \frac{f(x,y)}{f_Y(y)} \cdot f_Y(y)\, dy\, dx \\
&= \int x \int f(x,y)\, dy\, dx \\
&= \int x \cdot f(x)\, dx = E[X].
\end{aligned}
$$
(Both properties are illustrated with a small numerical check after property 2 below.)
2. Let $X$ and $Y$ be random variables. Then
$$Var[X] = Var(E[X|Y]) + E[Var(X|Y)].$$
Proof:
$$
\begin{aligned}
Var(E[X|Y]) &= E\!\left[(E[X|Y])^2\right] - \left(E[E[X|Y]]\right)^2 \\
&\overset{\text{by 1}}{=} E\!\left[(E[X|Y])^2\right] - (E[X])^2 \\
&= E\!\left[E(X^2|Y) - Var(X|Y)\right] - (E[X])^2 \\
&= E[E(X^2|Y)] - E[Var(X|Y)] - (E[X])^2 \\
&= E(X^2) - E[Var(X|Y)] - (E[X])^2 \\
&= Var[X] - E[Var(X|Y)].
\end{aligned}
$$
Rearranging gives $Var[X] = Var(E[X|Y]) + E[Var(X|Y)]$.
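As a quick illustration of both properties (a minimal Monte Carlo sketch under an assumed toy model, not something from the notes), take $Y \sim \text{Exp}(1)$ and $X \mid Y \sim N(Y, 1)$, so that $E[X|Y] = Y$ and $Var(X|Y) = 1$. Property 1 then says $E[X] = E[Y] = 1$, and property 2 says $Var[X] = Var(Y) + 1 = 2$. The snippet below checks both numerically.

```python
import numpy as np

# Assumed toy model for illustration: Y ~ Exp(1), X | Y ~ Normal(Y, 1),
# so E[X|Y] = Y and Var(X|Y) = 1.
rng = np.random.default_rng(0)
N = 1_000_000

y = rng.exponential(scale=1.0, size=N)      # draws of Y
x = rng.normal(loc=y, scale=1.0)            # one X drawn for each Y

# Property 1: E[X] = E[E[X|Y]]  (both numbers should be near 1)
print(x.mean(), y.mean())

# Property 2: Var[X] = Var(E[X|Y]) + E[Var(X|Y)]  (both numbers should be near 2)
print(x.var(), y.var() + 1.0)
```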