ESS 522
2014

15. Probability Distributions and Hypothesis Testing with Means
Much of the material for this lecture can be found in Chapters 1 and 2 of Paul Wessel's notes. One of the most important concepts in statistics is the idea of a probability distribution.

Discrete probability distribution

If we imagine n discrete independent trials of a process that has a probability p of a "successful" outcome (p = 1/6 for throwing a 6 with a balanced die; p = 1/2 for getting tails with the toss of a coin), the probability of x successful outcomes is

    p(x) = \binom{n}{x} p^x (1-p)^{n-x}                                (15-1)

where

    \binom{n}{x} = \frac{n!}{x!\,(n-x)!}                               (15-2)

This is the binomial probability distribution, which we can plot as a histogram. If you take a statistics class it will feature prominently, but you may not come across it as much in the geosciences.
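As a concrete sketch, equation (15-1) can be evaluated in base Matlab with nchoosek; the example values here (ten throws of a balanced die) are assumed purely for illustration:

    % Sketch: binomial probabilities from equation (15-1) in base Matlab.
    n = 10;                                  % number of independent trials
    p = 1/6;                                 % probability of success (a 6 per throw)
    x = 0:n;                                 % possible numbers of successes
    px = arrayfun(@(k) nchoosek(n, k) * p^k * (1-p)^(n-k), x);
    bar(x, px)                               % plot the distribution as a histogram
    xlabel('x (number of successes)'); ylabel('p(x)')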
Continuous probabilities – the normal distribution

For most geoscientific applications, we deal with observations that have a continuous spectrum of values. For these the probability distribution is continuous and we can write

    \int_{-\infty}^{\infty} p(x)\, dx = 1                              (15-3)
The sum of the probabilities (or the integral for a continuous probability distribution) has to equal unity. The most common continuous distribution by far is the normal or Gaussian distribution, which we can write as

    p(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[ -\frac{1}{2} \left( \frac{x-\mu}{\sigma} \right)^2 \right]        (15-4)

If we normalize to the mean and standard deviation according to

    z = \frac{x - \mu}{\sigma}                                         (15-5)

equation (15-4) gives

    p(z) = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{z^2}{2} \right)     (15-6)

The normal distribution is the well-known bell-shaped distribution (Figure 1). We can write the probability of z ≤ a as
    P(z \le a) = \int_{-\infty}^{a} p(z)\, dz = \frac{1}{2} + \int_{0}^{a} p(z)\, dz = \frac{1}{2} + \frac{1}{\sqrt{2\pi}} \int_{0}^{a} \exp\left( -\frac{z^2}{2} \right) dz = \frac{1}{2} \left[ 1 + \mathrm{erf}\left( \frac{a}{\sqrt{2}} \right) \right]        (15-7)

The term erf refers to the error function (there is a Matlab function erf), which you will have come across if you have taken a geophysics class that covered the cooling/heating of a half space. From equation (15-7), we can write

    p(z > a) = 1 - P(z \le a)                                          (15-8)

which gives the chance of z lying under one tail of the distribution. Since the distribution is symmetrical, we can also write the two-sided probability

    p(|z| > a) = 2 \left[ 1 - P(z \le a) \right]                       (15-9)
The probabilities of z lying within 1, 2 and 3 standard deviations (i.e., a = 1σ, a = 2σ, and a = 3σ) are 68.27%, 95.45% and 99.73%, respectively.
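As a quick check, these probabilities follow directly from equation (15-7) using Matlab's built-in erf:

    % Check of the 1-, 2- and 3-sigma probabilities via equations (15-7) and (15-9).
    a = [1 2 3];                        % distances from the mean in standard deviations
    P = 0.5 * (1 + erf(a / sqrt(2)));   % P(z <= a), equation (15-7)
    two_sided = 1 - 2 * (1 - P);        % P(|z| <= a), from equation (15-9)
    fprintf('%d sigma: %7.4f%%\n', [a; 100 * two_sided])
    % prints 68.27%, 95.45% and 99.73%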
By convention we define critical probabilities z_{1-α} as the z value above which a fraction 1 − α of the data will lie, and z_α as the z value above which a fraction α of the data will lie (Figure 1). Since the distribution is symmetric, z_{1-α} = −z_α.

There are several reasons why the normal distribution is used widely:
1. It is analytical.
2. It often describes populations quite well.
3. When it does not work, it will sometimes describe populations well if we change the variable (e.g., √x or log(x)).
4. If the population has a lot of outliers, it will often work quite well if we use robust techniques to eliminate the outliers.
5. The central limit theorem – perhaps the most important reason.

Central limit theorem

The central limit theorem states that, whatever the probability distribution of x, the probability distribution of the means of repeated samples of n random values of x tends to become normally distributed as n increases, with

    z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}                        (15-10)
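A minimal numerical sketch of this behavior, using an assumed example with exponentially distributed data (a skewed population with mean and standard deviation both equal to 1), in base Matlab:

    % Sketch: distribution of sample means for a skewed (exponential) population.
    rng(0);                            % reproducible random numbers
    n = 30;                            % size of each sample
    nrep = 10000;                      % number of repeated samples
    x = -log(rand(n, nrep));           % exponential deviates, mu = sigma = 1
    xbar = mean(x);                    % one sample mean per column
    z = (xbar - 1) / (1 / sqrt(n));    % standardize with equation (15-10)
    hist(z, 50)                        % histogram is close to a standard normal

Even though each individual value is drawn from a strongly skewed distribution, the histogram of standardized sample means is close to the bell shape of Figure 1.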
Student's t distribution – Testing a mean

Equation (15-10) gives us a means of testing whether a sample mean belongs to a population with a mean µ. However, it assumes that we know the population variance σ². If our only estimate of the population variance comes from our sample variance s², then equation (15-10) becomes

    t = \frac{\bar{x} - \mu}{s / \sqrt{n}}                             (15-11)

where t is Student's t variable. The probability distribution of t is wider than the normal distribution (Figure 1) because our incomplete knowledge of σ introduces additional uncertainty. The t distribution depends on the number of degrees of freedom (n − 1) used to estimate s. As (n − 1) increases, the t distribution narrows (Figure 1), and for n > 30 (n > 100 if you are really fussy) it is close to the normal distribution. Equation (15-11) leads to 95% confidence limits for the population mean based on our sample:

    \mu = \bar{x} \pm t_{0.025,\,n-1} \frac{s}{\sqrt{n}}               (15-12)

where t_{0.025,n−1} is the critical t value for n − 1 degrees of freedom that leaves a probability of 0.025 in the upper tail at t > t_{0.025,n−1}. The t distribution is symmetric, so there is a similar probability for t < −t_{0.025,n−1}. Obviously we can get (1 − 2α) confidence limits by replacing t_{0.025,n−1} with t_{α,n−1}. Matlab does not have many user-friendly statistical functions unless you purchase specialized toolboxes. Either you can use old-fashioned tables to find critical values or you can find them on the web – I have used http://www.quantitativeskills.com/sisa/.
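If the Statistics Toolbox is available, its tinv function gives the critical value directly; here is a minimal sketch of equation (15-12) using a hypothetical sample:

    % Sketch: 95% confidence limits on a population mean (equation 15-12).
    % Assumes tinv from the Statistics Toolbox for the critical t value.
    x = [9.8 10.2 10.1 9.7 10.0 10.3 9.9];   % hypothetical sample
    n = length(x);
    xbar = mean(x);
    s = std(x);                              % sample standard deviation
    tcrit = tinv(0.975, n - 1);              % t_{0.025,n-1}: 2.5% in the upper tail
    ci = xbar + tcrit * s / sqrt(n) * [-1 1];
    fprintf('mean = %.3f, 95%% CI = [%.3f, %.3f]\n', xbar, ci)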
Comparing two means

We can also use the t distribution to determine whether two sample means belong to the same population. We can write 95% confidence limits on the difference between the two sample means as

    (\mu_1 - \mu_2) = (\bar{x}_1 - \bar{x}_2) \pm t_{0.025,\, n_1+n_2-2} \sqrt{ \frac{(n_1-1) s_1^2 + (n_2-1) s_2^2}{n_1 + n_2 - 2} \left( \frac{1}{n_1} + \frac{1}{n_2} \right) }        (15-13)
where the t distribution is calculated for n_1 + n_2 − 2 degrees of freedom. If the uncertainty in the difference in means encloses zero, then we cannot reject the hypothesis that the samples come from populations with the same mean at the 95% confidence level. Note that there is no way of proving the hypothesis that the samples come from a single population or from two populations with the same mean; we can only infer that it is possible or unlikely.
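To make equation (15-13) concrete, here is a sketch with two hypothetical samples (again assuming tinv from the Statistics Toolbox):

    % Sketch: 95% confidence limits on the difference of two means (eq. 15-13).
    x1 = [12.1 11.8 12.4 12.0 11.9 12.3];    % hypothetical sample 1
    x2 = [11.5 11.9 11.7 11.6 12.0];         % hypothetical sample 2
    n1 = length(x1); n2 = length(x2);
    dof = n1 + n2 - 2;                       % degrees of freedom
    sp2 = ((n1-1)*var(x1) + (n2-1)*var(x2)) / dof;  % pooled variance
    se = sqrt(sp2 * (1/n1 + 1/n2));          % standard error of the difference
    ci = (mean(x1) - mean(x2)) + tinv(0.975, dof) * se * [-1 1];
    % If this interval encloses zero, we cannot reject equal population
    % means at the 95% confidence level.
    fprintf('95%% CI on the difference: [%.3f, %.3f]\n', ci)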