(2) be able to apply this knowledge to suitable problems in statistics. Examples ...
Freund, J. E. Mathematical Statistics with Applications, Pearson (2004). Hogg ...
Lecture Notes on MS237 Mathematical statistics Lecture notes by Janet Godolphin
2010
ii
Contents 1
2
Introductory revision material 1.1 Basic probability . . . . . . . . . . . . . . . . 1.1.1 Terminology . . . . . . . . . . . . . . 1.1.2 Probability axioms . . . . . . . . . . . 1.1.3 Conditional probability . . . . . . . . . 1.1.4 Self-study exercises . . . . . . . . . . . 1.2 Random variables and probability distributions 1.2.1 Random variables . . . . . . . . . . . . 1.2.2 Expectation . . . . . . . . . . . . . . . 1.2.3 Self-study exercises . . . . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
. . . . . . . . .
3 3 3 4 5 7 9 9 10 12
Random variables and distributions 2.1 Transformations . . . . . . . . . . . . . 2.1.1 Self-study exercises . . . . . . . 2.2 Some standard discrete distributions . . 2.2.1 Binomial distribution . . . . . . 2.2.2 Geometric distribution . . . . . 2.2.3 Poisson distribution . . . . . . . 2.2.4 Self-study exercises . . . . . . . 2.3 Some standard continuous distributions 2.3.1 Uniform distribution . . . . . . 2.3.2 Exponential distribution . . . . 2.3.3 Pareto distribution . . . . . . . 2.3.4 Self-study exercises . . . . . . . 2.4 The normal (Gaussian) distribution . . . 2.4.1 Normal distribution . . . . . . . 2.4.2 Properties . . . . . . . . . . . . 2.4.3 Self-study exercises . . . . . . . 2.5 Bivariate distributions . . . . . . . . . . 2.5.1 Definitions and notation . . . . 2.5.2 Marginal distributions . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
13 13 14 15 15 16 17 18 19 19 19 20 21 22 22 23 24 25 25 26
iii
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . .
CONTENTS
iv
. . . . . . . .
26 27 30 31 31 31 32 34
3 Further distribution theory 3.1 Multivariate distributions . . . . . . . . . . . . . . . . . . . . . . 3.1.1 Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.2 Mean and covariance matrix . . . . . . . . . . . . . . . . 3.1.3 Properties . . . . . . . . . . . . . . . . . . . . . . . . . . 3.1.4 Self-study exercises . . . . . . . . . . . . . . . . . . . . . 3.2 Transformations . . . . . . . . . . . . . . . . . . . . . . . . . . . 3.2.1 The univariate case . . . . . . . . . . . . . . . . . . . . . 3.2.2 The multivariate case . . . . . . . . . . . . . . . . . . . . 3.2.3 Self-study exercises . . . . . . . . . . . . . . . . . . . . . 3.3 Moments, generating functions and inequalities . . . . . . . . . . 3.3.1 Moment generating function . . . . . . . . . . . . . . . . 3.3.2 Cumulant generating function . . . . . . . . . . . . . . . 3.3.3 Some useful inequalities . . . . . . . . . . . . . . . . . . 3.3.4 Self-study exercises . . . . . . . . . . . . . . . . . . . . . 3.4 Some limit theorems . . . . . . . . . . . . . . . . . . . . . . . . 3.4.1 Modes of convergence of random variables . . . . . . . . 3.4.2 Limit theorems for sums of independent random variables 3.4.3 Self-study exercises . . . . . . . . . . . . . . . . . . . . . 3.5 Further discrete distributions . . . . . . . . . . . . . . . . . . . . 3.5.1 Negative binomial distribution . . . . . . . . . . . . . . . 3.5.2 Hypergeometric distribution . . . . . . . . . . . . . . . . 3.5.3 Multinomial distribution . . . . . . . . . . . . . . . . . . 3.5.4 Self-study exercises . . . . . . . . . . . . . . . . . . . . . 3.6 Further continuous distributions . . . . . . . . . . . . . . . . . . 3.6.1 Gamma and beta functions . . . . . . . . . . . . . . . . . 3.6.2 Gamma distribution . . . . . . . . . . . . . . . . . . . . 3.6.3 Beta distribution . . . . . . . . . . . . . . . . . . . . . . 3.6.4 Self-study exercises . . . . . . . . . . . . . . . . . . . . .
35 35 35 36 36 38 38 38 39 41 41 41 42 42 45 45 45 47 48 49 49 50 50 51 52 52 52 53 55
2.6
2.5.3 Conditional distributions . . . . 2.5.4 Covariance and correlation . . . 2.5.5 Self-study exercises . . . . . . . Generating functions . . . . . . . . . . 2.6.1 General . . . . . . . . . . . . . 2.6.2 Probability generating function . 2.6.3 Moment generating function . . 2.6.4 Self-study exercises . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
CONTENTS 4
Normal and associated distributions 4.1 The multivariate normal distribution . . . . . 4.1.1 Multivariate normal . . . . . . . . . . 4.1.2 Properties . . . . . . . . . . . . . . . 4.1.3 Marginal and conditional distributions 4.1.4 Self-study exercises . . . . . . . . . . 4.2 The chi-square, t and F distributions . . . . . 4.2.1 Chi-square distribution . . . . . . . . 4.2.2 Student’s t distribution . . . . . . . . 4.2.3 Variance ratio (F) distribution . . . . 4.3 Normal theory tests and confidence intervals . 4.3.1 One-sample t-test . . . . . . . . . . . 4.3.2 Two-samples . . . . . . . . . . . . . 4.3.3 k samples (One-way Anova) . . . . . 4.3.4 Normal linear regression . . . . . . .
v
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
. . . . . . . . . . . . . .
57 57 57 58 60 61 62 62 64 65 65 65 66 67 68
CONTENTS
1
MS237 Mathematical Statistics Level 2
Spring Semester
Credits 15
Course Lecturer in 2010 D. Terhesiu
email:
[email protected]
Class Test The Class Test will be held on Thursady 11th March (week 5), starting at 12.00. Class tests will include questions of the following types: • examples and proofs previously worked in lectures, • questions from the self-study exercises, • previously unseen questions in a similar style. The Class Test will comprise 15% of the overall assessment for the course.
Coursework Distribution: Coursework will be distributed at 14.00 on Friday 26th March. Collection: Coursework will be collected on Thursday 29th April in Room LTB. The Coursework will comprise 10% of the overall assessment for the course.
Chapter 1 Chapter 1 contains and reviews prerequisite material from MS132. Due to time constraints, students are expected to work through at least part of this material independently at the start of the course.
Objectives and learning outcomes This module provides theoretical background for many of the topics introduced in MS132 and for some of the topics which will appear in subsequent statistics modules. At the end of the module, you should (1) be familiar with the main results of statistical distribution theory; (2) be able to apply this knowledge to suitable problems in statistics
Examples, exercises, and problems Blank spaces have been left in the notes at various positions. These are for additional material and worked examples presented in the lectures. Most chapters end
CONTENTS
2
with a set of self-study exercises, which you should attempt in your own study time in parallel with the lectures. In addition, six exercise sheets will be distributed during the course. You will be given a week to complete each sheet, which will then be marked by the lecturer and returned with model solutions. It should be stressed that completion of these exercise sheets is not compulsory but those students who complete the sheets do give themselves a considerable advantage!
Selected texts Freund, J. E. Mathematical Statistics with Applications, Pearson (2004) Hogg, R. V. and Tanis, E. A. Probability and Statistical Inference, Prentice Hall (1997) Lindgren, B. W. Statistical Theory, Macmillan (1976) Mood, A. M., Graybill, F. G. and Boes, D. C. Introduction to the Theory of Statistics, McGraw-Hill (1974) Wackerly, D.D., Mendenhall, W., and Scheaffer, R.L. Mathematical Statistics with Applications, Duxbury (2002)
Useful series These series will be useful during the course:
−1
(1 − x)
=
∞ ∑
xk = 1 + x + x2 + x3 + · · ·
k=0
(1 − x)−2 = x
e
=
∞ ∑ k=0 ∞ ∑ k=0
(k + 1)xk = 1 + 2x + 3x2 + 4x3 + · · · xk x2 x 3 =1+x+ + + ··· k! 2! 3!
Chapter 1 Introductory revision material This chapter contains and reviews prerequisite material from MS132. If necessary you should review your notes for that module for additional details. Several examples, together with numerical answers, are included in this chapter. It is strongly recommended that you work independently through these examples in order to consolidate your understanding of the material.
1.1
Basic probability
Probability or chance can be measured on a scale which runs from zero, which represents impossibility, to one, which represents certainty.
1.1.1
Terminology
A sample space, Ω, is the set of all possible outcomes of an experiment. An event E ∈ Ω is a subset of Ω. Example 1 Experiment: roll a die twice. Possible events are E1 = {1st face is a 6}, E2 = {sum of faces = 3}, E3 = {sum of faces is odd}, E4 = {1st face - 2nd face = 3}. Identify the sample space and the above events. Obtain their probabilities when the die is fair.
3
CHAPTER 1. INTRODUCTORY REVISION MATERIAL
4 Answer: 1 2 first 3 roll 4 5 6
(1,1) (2,1) (3,1) (4,1) (5,1) (6,1)
(1,2) (2,2) (3,2) (4,2) (5,2) (6,2)
p(E1 ) = 16 ; p(E2 ) =
1 ; 18
second roll (1,3) (1,4) (2,3) (2,4) (3,3) (3,4) (4,3) (4,4) (5,3) (5,4) (6,3) (6,4)
(1,5) (2,5) (3,5) (4,5) (5,5) (6,5)
p(E3 ) = 21 ; p(E4 ) =
(1,6) (2,6) (3,6) (4,6) (5,6) (6,6) 1 . 12
Combinations of events Given events A and B, further events can be identified as follows. • The complement of any event A, written A¯ or Ac , means that A does not occur. • The union of any two events A and B, written A ∪ B, means that A or B or both occur. • The intersection of A and B, written as A ∩ B, means that both A and B occur. Venn diagrams are useful in this context.
1.1.2 Probability axioms Let F be the class of all events in Ω. A probability (measure) P on (Ω, F) is a real-valued function satisfying the following three axioms: 1. P (E) ≥ 0 for every E ∈ F 2. P (Ω) = 1 3. Suppose the events E1 and E2 are mutually exclusive (that is, E1 ∩ E2 = ∅). Then P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) Some consequences: ¯ = 1 − P (E) (so in particular P (∅) = 0) (i) P (E) (ii) For any two events E1 and E2 we have the addition rule P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) − P (E1 ∩ E2 )
1.1. BASIC PROBABILITY
5
Example 1: (continued) Obtain P (E1 ∩ E2 ), P (E1 ∪ E2 ), P (E1 ∩ E3 ) and P (E1 ∪ E3 ) Answer: P (E1 ∩ E2 ) = P (∅) = 0 1 P (E1 ∪ E2 ) = P (E1 ) + P (E2 ) = 61 + 18 = 29 1 3 P (E1 ∩ E3 ) = P (6, 1), (6, 3), (6, 5) = 36 = 12 P (E1 ∪ E3 ) = P (E1 ) + P (E3 ) − P (E1 ∩ E3 ) =
1 6
+ 12 −
1 12
=
7 12
[Notes on axioms: (1) In order to cope with infinite sequences of events, it is necessary to strengthen axiom 3 to ∑∞ 3’. P (∪∞ i=1 ) = i=1 P (Ei ) for any sequence (E1 , E2 , · · · ) of mutually exclusive events. (2) When Ω is noncountably infinite, in order to make the theory rigorous it is usually necessary to restrict the class of events F to which probabilities are assigned.]
1.1.3
Conditional probability
Supose P (E2 ) 6= 0. The conditional probability of the event E1 given E2 is defined as P (E1 ∩ E2 ) . P (E1 |E2 ) = P (E2 ) The conditional probability is undefined if P (E2 ) = 0. The conditional probability formula above yields the multiplication rule: P (E1 ∩ E2 ) = P (E1 )P (E2 |E1 ) = P (E2 )P (E1 |E2 ) Independence Events E1 and E2 are said to be independent if P (E1 ∩ E2 ) = P (E1 )P (E2 ) . Note that this implies that P (E1 |E2 ) = P (E1 ) and P (E2 |E1 ) = P (E2 ). Thus knowledge of the occurrence of one of the events does not affect the likelihood of occurrence of the other. Events E1 , . . . , Ek are pairwise independent if P (Ei ∩Ej ) = P (Ei )P (Ej ) for all ∏ i 6= j. They are mutually independent if for all subsets P (∩j Ej ) = j P (Ej ). Clearly, mutual independence ⇒ pairwise independence, but the converse is false (see question 4 of the self study exercises).
6
CHAPTER 1. INTRODUCTORY REVISION MATERIAL
Example 1 (continued): Find P (E1 |E2 ) and P (E1 |E3 ). Are E1 , E2 independent? 1 ∩E2 ) 1 ∩E3 ) Answer: P (E1 |E2 ) = P (E = 0, P (E1 |E3 ) = P (E = 1/12 = 61 P (E2 ) P (E3 ) 1/2 P (E1 )P (E2 ) 6= 0 so P (E1 ∩ E2 ) 6= P (E1 )P (E2 ) and thus E1 and E2 are not independent.
Law of total probability (partition law) Suppose that B1 , . . . , Bk are mutually exclusive and exhaustive events (i.e. Bi ∩ Bj = ∅ for all i 6= j and ∪i Bi = Ω). Let A be any event. Then P (A) =
k ∑
P (A|Bj )P (Bj )
j=1
Bayes’ Rule Suppose that events B1 , . . . , Bk are mutually exclusive and exhaustive and let A be any event. Then P (Bj |A) =
P (A|Bj )P (Bj ) P (A|Bj )P (Bj ) =∑ P (A) i P (A|Bi )P (Bi )
Example 2: (Cancer diagnosis) A screening programme for a certain type of ¯ = 0.05, where D is the event cancer has reliabilities P (A|D) = 0.98 , P (A|D) “disease is present” and A is the event “test gives a positive result”. It is known that 1 in 10, 000 of the population has the disease. Suppose that an individual’s test result is positive. What is the probability that that person has the disease? Answer: We require P (D|A). First find P (A). ¯ (D) ¯ = 0.98 × 0.0001 + 0.05 × 0.9999 = P (A) = P (A|D)P (D) + P (A|D)P 0.050093. By Bayes’ rule; P (D|A) =
P (A|D)P (D) P (A)
=
0.0001×0.98 0.050093
= 0.002.
The person is still very unlikely to have the disease even though the test is positive. Example 3: (Bertrand’s Box Paradox) Three indistinguishable boxes contain black and white beads as shown: [ww], [wb], [bb]. A box is chosen at random
1.1. BASIC PROBABILITY
7
and a bead chosen at random from the selected box. What is the probability of that the [wb] box was chosen given that selected bead was white? Answer: E ≡ ’chose the [wb] box’, W ≡ ’selected bead is white’. By the partition law: P (W ) = 1 × 13 + 12 × 13 + 0 × 13 = 12 . Now using Bayes’ rule P (E|W ) =
P (E)P (W |E) P (W )
=
1 × 12 3 1 2
=
1 3
(i.e. even though a bead from the selected
box has been seen, the probability that the box is [wb] is still 13 ).
1.1.4
Self-study exercises
1. Consider families of three children, a typical outcome being bbg (boy-boygirl in birth order) with probability 81 . Find the probabilities of (i) 2 boys and 1 girl (any order), (ii) at least one boy, (iii) consecutive children of different sexes. Answer: (i) 38 ; (ii) 87 ; (iii) 14 . 2. Use pA = P (A), pB = P (B) and pAB = P (A ∩ B) to obtain expressions for: ¯ (a) P (A¯ ∪ B), (b) P (A¯ ∩ B), (c) P (A¯ ∪ B), ¯ (d) P (A¯ ∩ B), ¯ ∪ (B ∩ A)). ¯ (e) P ((A ∩ B) Describe each event in words. (Use a Venn diagram.) Answer: (a) 1−pAB ; (b) pB −pAB ; (c) 1−pA +pAB ; (d) 1−pA −pB +pAB ; (e) pA + pB − 2pAB . 3. (i) Express P (E1 ∪ E2 ∪ E3 ) in terms of the probabilities of E1 , E2 , E3 and their intersections only. Illustrate with a sketch. (ii)Three types of fault can occur which lead to the rejection of a certain manufactured item. The probabilities of each of these faults (A, B and C)
8
CHAPTER 1. INTRODUCTORY REVISION MATERIAL occurring are 0.1, 0.05 and 0.04 respectively. The three faults are known to be interrelated; the probability that A & B both occur is 0.04, A & C 0.02, and B & C 0.02. The probability that all three faults occur together is 0.01. What percentage of items are rejected? Answer: (i) P (E1 )+P (E2 )+P (E3 )−P (E1 ∩E2 )−P (E1 ∩E3 )−P (E2 ∩ E3 ) + P (E1 ∩ E2 ∩ E3 ) (ii) P (A ∪ B ∪ C) = .01 + .05 + .04 − (.04 + .02 + .02) + .01 = 0.12 1 4. Two fair dice rolled: 36 possible outcomes each with probability 36 . Let E1 = {odd face 1st}, E2 = {odd face 2nd}, E3 = {one odd, one even}, so P (E1 ) = 21 , P (E2 ) = 12 , P (E3 ) = 12 . Show that E1 , E2 , E3 are pairwise independent, but not mutually independent.
Answer: P (E2 |E1 ) = 21 = P (E2 ), P (E3 |E1 ) = 12 = P (E3 ), P (E3 |E2 ) = 1 = P (E3 ), so E1 , E2 , E3 are pairwise independent. But P (E1 ∩E2 ∩E3 ) = 2 0 6= P (E1 )P (E2 )P (E3 ), so E1 , E2 , E3 are not mutually independent. 5. An engineering company uses a ‘selling aptitude test’ to aid it in the selection of its sales force. Past experience has shown that only 65% of all persons applying for a sales position achieved a classification of ‘satisfactory’ in actual selling and of these 80% had passed the aptitude test. Only 30% of the ‘unsatisfactory’ persons had passed the test. What is the probability that a candidate would be a ‘satisfactory’ salesperson given that they had passed the aptitude test? Answer: A = pass aptitude test, S = satisfactory. P (S) = 0.65, P (A|S) = ¯ = 0.3. Therefore P (A) = (0.65 × 0.8) + (0.35 × 0.3) = 0.625 0.8, P (A|S) so P (S|A) = P (S)P (A|S)/P (A) = (0.65 × 0.8)/0.625 = 0.832.
1.2. RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
1.2
9
Random variables and probability distributions
1.2.1 Random variables A random variable X is a real-valued function on the sample space Ω; that is, X : Ω → R. If P is a probability measure on (Ω, F) then the induced probability measure on R is called the probability distribution of X. A discrete random variable X takes values x1 , x2 , . . . with probabilities p(x1 ), p(x2 ), . . ., where p(x) = pr(X = x) = P ({ω : X(ω) = x}) is the probability mass function (pmf) of X. (E.g. X = place of horse in race, grade of egg.) Example 4: (i) Toss a coin twice: outcomes HH, HT, TH, TT. The random variable X = number of heads, takes values 0, 1, 2. 1 (ii) Roll two dice: X = total score. Probabilities for X are P(X = 2) = 36 , P(X = 2 3 3) = 36 , P(X = 4) = 36 etc. Example 5: X takes values 1, 2, 3, 4, 5 with probabilities k, 2k, 3k, 4k, 5k. Calculate k and P(2 ≤ X ≤ 4). ∑ Answer: 1 = 5x=1 P (x) = k(1 + 2 + 3 + 4 + 5) = 15k so k = 2 3 4 P (2 ≤ X ≤ 4) = P (2) + P (3) + P (4) = 15 + 15 + 15 = 35 .
1 . 15
A continuous random variable X takes values over an interval. E.g. X = time over racecourse, weight of egg. Its probability density function (pdf) f (x) is defined by ∫ b pr(a < X < b) = f (x)dx . Note that f (x) ≥ 0 for all x, and
∫∞ −∞
a
f (x)dx = 1.
Example 6: Let f (x) = k(1 − x2 ) on (−1, 1). Calculate k and pr(|X| > 1/2). ∫∞
∫1
⇒ k = 34 k(1 − x2 )dx = k[x − 31 x3 ]1−1 = 4k 3 ∫1 5 P (|X| > 1/2) = 1 − P (− 21 ≤ X ≤ 12 ) = 1 − −2 1 k(1 − x2 )dx = 1 − 11k = 16 12 Answer: 1 =
−∞
f (x)dx =
−1
2
A mixed discrete/continuous random variable is such that the probability is shared
CHAPTER 1. INTRODUCTORY REVISION MATERIAL
10
∫ ∑ between discrete and continuous components with p(x) + f (x)dx = 1, e.g. rainfall on given day, waiting time in queue, flow in pipe, contents of reservoir. The distribution function F of the random variable X is defined as F (x) = pr(X ≤ x) = P ({ω : X(ω) ≤ x}). Thus F (−∞) = 0, F (∞) = 1, F is monotone increasing, and pr(a < X ≤ b) = F (b) − F (a). Discrete case: F (x) =
∑ u≤x
Continuous case: F (x) =
p(u)
∫x −∞
f (u)du and F 0 (x) = f (x).
1.2.2 Expectation The expectation (or expected value or mean) of the random variable X is defined as ∑ xp(x) X discrete µ = E(X) = ∫ xf (x)dx X continuous The Variance of X is σ 2 = Var(X) = E{(X −µ)2 }. Equivalently σ 2 = E(X 2 )− {E(X)}2 (exercise: prove). σ is called the standard deviation. Functions of X: (i) E{h(X)} =
∑ h(x)p(x) ∫
X discrete
h(x)f (x)dx X continuous
(ii) E(aX + b) = aE(X) + b, Var(aX + b) = a2 Var(X). Proof (for discrete X) (i) h(X) takes values h(x1 ), h(x2 ), . . . with probabilities p(x1 ), p(x2 ), . . ., so, ∑ by definition, E{h(X)} = h(x1 )p(x1 ) + h(x2 )p(x2 ) + · · · = h(x)p(x). ∑ ∑ ∑ (ii) E[aX + b] = (aX + b)P (x) = a xP (x) + b P (x) = aE[X] + b
1.2. RANDOM VARIABLES AND PROBABILITY DISTRIBUTIONS
11
Var[aX+b] = E[(aX + b) − E[aX + b]2 ] = E[aX + b − aE[X] − b2 ] = E[a2 (X− E[X])2 ] = a2 Var[X] Example 7: X = 0, 1, 2 with probabilities 1/4, 1/2, 1/4. Find E(X), E(X − 1), E(X 2 ) and Var(X). Answer: E[X] = 0 × 41 + 1 × 21 + 2 ×
1 4
=1
E[X − 1] = E[X] − 1, E[X 2 ] = 02 × 14 + 12 × 12 + 22 ×
1 4
=
3 2
Var[X] = E[X 2 ] − E[X]2 = 12 . Example 8: f (x) = k(1+x)−4 on (0, ∞). Find k and hence obtain E(X), E{(1+ X)−1 }, E(X 2 ) and Var(X). Answer: 1 = k
∫∞ 0
(1 + x)−4 dx = k[− 31 (1 + x)−3 ]∞ 0 =
k 3
⇒k=3
∫∞ ∫∞ E[X] = 3 0 x(1 + x)−4 dx = 3 1 (u − 1)u−4 du = 3[− 12 u−2 + 31 u−3 ]∞ 1 = 1 1 1 3( 2 − 3 ) = 2 E[(1 + X)−1 ] = 3
∫∞ 0
(1 + x)−5 dx = 3[− 41 (1 + x)−4 ]∞ 0 =
3 4
∫ ∫∞ E[X 2 ] = 3 x2 (1+x)−4 dx = 3 1 (u−1)2 u−4 du = 3[−u−1 +u−2 − 13 u−3 ]∞ 1 = 1 Var[X] = E[X 2 ] − E[X]2 = 43 .
12
CHAPTER 1. INTRODUCTORY REVISION MATERIAL
1.2.3 Self-study exercises 3 1 1. X takes values 0, 1, 2, 3 with probabilities 14 , 15 , 10 , 4 . Compute (as fractions) E(X), E(2X + 3), Var(X) and Var(2X + 3).
31 61 73 Answer: E(X) = 20 , E(2X + 3) = 2E(X) + 3 = 10 , E(X 2 ) = 20 , so 2 499 499 2 Var(X) = E(X ) − E(X) = 400 , Var(2X + 3) = 4Var(X) = 100 .
2. The random variable X has density function f (x) = kx(1 − x) on (0,1), f (x) = 0 elsewhere. Calculate k and sketch f (x). Compute the mean and variance of X, and pr (0.3 ≤ X ≤ 0.6). Answer: k = 6, E(X) = 12 , Var(X) =
1 , 20
pr (0.3 ≤ X ≤ 0.6) = 0.432.
Chapter 2 Random variables and distributions 2.1
Transformations
Suppose that X has distribution function FX (x) and that the distribution function FY (y) of Y = h(X) is required, where h is a strictly increasing function. Then FY (y) = pr(Y ≤ y) = pr(h(X) ≤ y) = pr(X ≤ x) = FX (x) where x ≡ x(y) = h−1 (y). If X is continuous and h is differentiable, then it follows that Y has density
fY (y) =
dFY (y) dFX (x) dx = = fX (x) . dy dy dy
On the other hand, if h is strictly decreasing then FY (y) = pr(Y ≤ y) = pr(h(X) ≤ y) = pr(X ≥ x) = 1 − FX (x) which yields fY (y) = −fX (x)(dx/dy). Both formulae are covered by dx fY (y) = fX (x) . dy 13
14
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
Example 9: Suppose that X has pdf fX (x) = 2e−2x on (0, ∞). Obtain the pdf of Y = log X.
Probability integral transform. Let X be a continuous random variable with distribution function F (x). Then Y = F (X) is uniformly distributed on (0, 1). Proof. First note that 0 ≤ Y ≤ 1. Let 0 ≤ y ≤ 1; then pr(Y ≤ y) = pr(F (X) ≤ y) = pr(X ≤ F −1 (y)) = F (F −1 (y)) = y, so Y has pdf f (y) = 1 on (0, 1) (by differentiation), which is the density of the uniform distribution on (0, 1). This result has an important application to the simulation of random variables:
2.1.1 Self-study exercises 1. X takes values 1, 2, 3, 4 with probabilities
1 1 3 2 , , , 10 5 10 5
and Y = (X − 2)2 .
(i) Find E(Y ) and Var(Y ) using the formula for E{h(X)}. (ii) Calculate the pmf of Y and use it to calculate E(Y ) and Var(Y ) directly.
2.2. SOME STANDARD DISCRETE DISTRIBUTIONS
15
2. The random variable X has pdf f (x) = 31 , x = 1, 2, 3, zero elsewhere. Find the pdf of Y = 2X + 1. 3. The random variable X has pdf f (x) = e−x on (0, ∞). Obtain the pdf of Y = eX . 4. Let X have the pdf f (x) = pdf of Y = X 3 .
( 1 )x 2
, x = 1, 2, 3, . . . , zero elsewhere. Find the
2.2 Some standard discrete distributions 2.2.1
Binomial distribution
Consider a sequence of independent trials in each of which there are only two possible results, ‘success’, with probability π, or ‘failure’, with probability 1 − π (independent Bernoulli trials). Outcomes can be represented as binary sequences, with 1 for success and 0 for failure, e.g. 110001 has probability ππ(1 − π)(1 − π)(1 − π)π, since the trials are independent. Let the random variable X be the number of successes in n trials, with n fixed. r n−r The probability of a particular sequence ( ) of r 1’s and n − r 0’s is π (1 − π) , n and the event {X = r} contains such sequences. Hence r ( p(r) = pr(X = r) =
n r
) π r (1 − π)n−r , r = 0, 1, . . . , n .
This is the pmf of the binomial (n, π) distribution. The name comes from the binomial theorem ) n ( ∑ n {π + (1 − π)} = π r (1 − π)n−r , r n
r=0
from which
∑ r
p(r) = 1 follows.
The mean is µ = nπ:
16
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
The variance is σ 2 = nπ(1 − π) (see exercise 3). Example 10: A biased coin with pr(head) = 2/3 is tossed five times. Calculate p(r).
2.2.2 Geometric distribution Suppose now that, instead of a fixed number of Bernoulli trials, one continues until a success is achieved, so that the number of trials, N , is now a random variable. Then N takes the value n if and only if the previous (n − 1) trials result in failures and the nth trial results in a success. Thus
p(n) = pr(N = n) = (1 − π)n−1 π, n = 1, 2, . . . .
This is the pmf of the geometric (π) distribution: the probabilities are in geometric progression. Note that the sum of the probabilities over n = 1, 2, . . . is 1. The mean is µ = 1/π:
2.2. SOME STANDARD DISCRETE DISTRIBUTIONS
17
The variance is σ 2 = (1 − π)/π 2 (see exercise 4). Eg. Toss a biased coin with pr(head) = 2/3. Then, on average, it takes three tosses to get a tail.
2.2.3
Poisson distribution
The pmf of the Poisson (λ) distribution is defined as e−λ λr p(r) = , r = 0, 1, 2, . . . , r! where λ > 0. Note that the sum of the probabilities over r = 0, 1, 2, . . . is 1 (exponential series). The mean is µ = λ:
The variance is σ 2 = λ (see exercise 6). The Poisson distribution arises in various contexts, one being the limit of a binomial(n, π) as n → ∞ and π → 0 with nπ = λ fixed. Example 11: (Random events in time.) Cars are recorded as they pass a checkpoint. The probability π that a car is level with the checkpoint at any given instant is very small, but the number n of such instants in a given time period is large. Hence Xt , the number of cars passing the checkpoint during a time interval t minutes, can be modelled as Poisson with mean proportional to t. For example, if
18
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
the average rate is two cars per minute, find the probability of exactly 3 cars in 5 minutes.
2.2.4 Self-study exercises 1. In a large consignment of widgets 5% are defective. What is the probability of getting one or two defectives in a four-pack? 2. X is binomial with mean 2 and variance 1. Compute pr(X ≤ 1). 3. Derive the variance of the binomial (n, π) distribution. [Hint: find E{X(X − 1)}.] 4. Derive the variance of the geometric (π) distribution. [Hint: find E{X(X − 1)}.] 5. A leaflet contains one thousand words and the probability that any one word contains a misprint is 0.005. Use the Poisson distribution to estimate the probability of 2 or fewer misprints. 6. Derive the variance of the Poisson (λ) distribution. [Hint: find E{X(X − 1)}.]
2.3. SOME STANDARD CONTINUOUS DISTRIBUTIONS
2.3
19
Some standard continuous distributions
2.3.1 Uniform distribution The pdf of the uniform (α, β) distribution is f (x) = (β − α)−1 , α < x < β . The mean is µ = (β + α)/2:
The variance is σ 2 = (β − α)2 /12 (see exercise 1). Application. Simulation of continuous random variables via the probability integral transform: see Section 2.1.
2.3.2
Exponential distribution
The pdf of the exponential (λ) distribution is f (x) = λe−λx , x > 0,
where λ > 0. The distribution function is F (x) = (verify). The mean is µ = 1/λ:
∫x 0
λe−λu du = 1 − e−λx
20
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
The variance is σ 2 = 1/λ2 (see exercise 4). Lack of memory property. pr(X > a + b|X > a) = pr(X > b) Proof:
For example, if the lifetime of a component is exponentially distributed, then the fact that it has lasted for 100 hours does not affect its chances of failing during the next 100 hours. That is, the component is not subject to ageing. Application to random events in time. Example: cars passing a checkpoint. The distribution of the waiting time, T , for the first event can be obtained as follows: pr(T > t) = pr(Nt = 0) = e−λt , since Nt , the number of events occurring during the time interval (0, t), has a Poisson distribution with mean λt. Hence T has distribution function F (t) = 1 − e−λt , that of the exponential (λ) distribution.
2.3.3 Pareto distribution The Pareto (α, β) distribution has pdf f (x) =
α , x > 0, β(1 + βx )α+1
where α > 0 and β > 0. The distribution function is F (x) = 1 − (1 + βx )−α (verify).
2.3. SOME STANDARD CONTINUOUS DISTRIBUTIONS
21
The mean is µ = β/(α − 1) for α > 1:
The variance is σ 2 = αβ 2 /{(α − 1)2 (α − 2)} for α > 2.
2.3.4
Self-study exercises
1. Obtain the variance of the uniform (α, β) distribution. 2. The lifetime of a valve has an exponential distribution with mean 350 hours. What proportion of valves will last 400 hours or longer? For how many hours should the valves be guaranteed so that only 1% are returned under guarantee? 3. A machine suffers random breakdowns at a rate of three per day. Given that it is functioning at 10am what is the probability that (i) no breakdown occurs before noon? (ii) the first breakdown occurs between 12pm and 1pm? 4. Obtain the variance of the exponential (λ) distribution. 5. The random variable X has the Pareto distribution with α = 3 , β = 1. Find the probability that X exceeds µ + 2σ, where µ, σ are respectively the mean and standard deviation of X.
22
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
2.4 The normal (Gaussian) distribution 2.4.1 Normal distribution The normal distribution is the most important distribution in Statistics, for both theoretical and practical reasons. Its pdf is (x−µ)2 1 f (x) = √ e− 2σ2 , −∞ < x < ∞ . σ 2π
The parameters µ and σ 2 are the mean and variance respectively. The distribution is denoted by N (µ, σ 2 ). Mean:
The importance of the normal distribution follows from its use as an approximation in various statistical methods (consequence of Central Limit Theorem: see Section 3.4.2), its convenience for theoretical manipulation, and its application to describe observed data. Standard normal distribution The standard normal distribution is N (0, 1), for which the distribution function has the special notation Φ(x). Thus ∫ x u2 1 √ e− 2 du . Φ(x) = 2π −∞ The function Φ is tabulated widely (e.g. New Cambridge Statistical Tables). Useful values are Φ(1.64) = 0.95, Φ(1.96) = 0.975. Example 12: Suppose that X is N (0, 1) and Y is N (2, 4). Use tables to calculate pr(X < 1), pr(X < −1), pr(−1.5 < X < −0.5), pr(Y < 1) and pr(Y 2 > 5Y − 6).
2.4. THE NORMAL (GAUSSIAN) DISTRIBUTION
23
2.4.2 Properties (i) If X is N (µ, σ 2 ) then aX + b is N (aµ + b, a2 σ 2 ). In particular, the standardized variate (X − µ)/σ is N (0, 1). (ii) if X1 is N (µ1 , σ12 ), X2 is N (µ2 , σ22 ) and X1 and X2 are independent, then X1 + X2 is N (µ1 + µ2 , σ12 + σ22 ). [Hence, from property (i), the distribution of X1 − X2 is N (µ1 − µ2 , σ12 + σ22 ).] ∑ ∑ ∑ (iii) If Xi , i = 1, . . . , n, are independent N (µi , σi2 ), then i Xi is N ( i µi , i σi2 ). (iv) The moment generating function (see Section 2.6.3) of N (µ, σ 2 ) is M (z) = 1 2 2 E(ezX ) = eµz+ 2 σ z . (Properties (i) - (iii) are easily proved via mgfs - see Section 2.6.3.) (v) Central moments of N (µ, σ 2 ). Let µr = E{(X − µ)r }, the rth central moment of X. Then √ µr = 0 for r odd, µr = (σ/ 2)r r!/(r/2)! for r even. Note that µ2 = σ 2 , the variance of X.
Sampling distribution of the sample mean Let X1 , . . . , Xn be independently and identically distributed (iid) as N (µ, σ 2 ). ¯ = n−1 ∑ Xi is N (µ, n−1 σ 2 ). This is the sampling Then the distribution of X distribution of the sample mean, a result of fundamental importance in Statistics.
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
24 Proof:
2.4.3 Self-study exercises 1. The distribution of lengths of rivets is normal with mean 2.5cm and sd 0.02cm. In a batch of 500 rivets how many would you expect on average to have length (i)less than 2.46cm, (ii) between 2.46cm and 2.53cm, (iii) greater than 2.53cm? (iv) What length is exceeded by only 1 in 1000 rivets? 2. Suppose that X is N (0, 1) and Y is N (2, 4). Use tables to calculate pr(Y − X < 1) and pr(X + 21 Y > 1.5). 3. Two resistors in series have resistances X1 and X2 ohms, where X1 is N (200, 4) and X2 is N (150, 3). What is the distribution of the combined resistance X = X1 + X2 ? Find the probability that X exceeds 355.5 ohms. 4. The fuel consumption of a fleet of 150 lorries is approximately normally distributed with mean 15 mpg and sd 1.5 mpg. (i) Compute the expected number of lorries that average between 13 and 14 mpg. (ii) What is the probability that the average of a random sample of four lorries exceeds 16 mpg?
2.5. BIVARIATE DISTRIBUTIONS
2.5
25
Bivariate distributions
2.5.1 Definitions and notation Suppose that X1 , X2 are two random variables defined on the same probability space (Ω, F, P ). Then P induces a joint distribution for X1 , X2 . The joint distribution function is defined as F (x1 , x2 ) = P ({ω : X1 (ω) ≤ x1 , X2 (ω) ≤ x2 }) = pr(X1 ≤ x1 , X2 ≤ x2 ) . In the discrete case the joint pmf is p(x1 , x2 ) = pr(X1 = x1 , X2 = x2 ). In the 1 ,x2 ) continuous case, the joint pdf is f (x1 , x2 ) = ∂F∂x(x1 ∂x . 2 Example 13: (discrete) Two biased coins are tossed. Score heads = 1 (with probability π), tails = 0 (with probability 1 − π). Let X1 = sum of scores, X2 = difference of scores (1st - 2nd). The tables below show (i) the possible values of X1 , X2 and their probabilities, (ii) the joint probability table for X1 , X2 .
(i) Outcome X1
00
01
10
11
X2 Prob (ii) X2 -1 X1
0
1
0 1 2
Example 14: (continuous) Suppose X1 and X2 have joint pdf f (x1 , x2 ) = k(1 − x1 x22 ) on (0, 1)2 . Obtain the value of k.
26
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
2.5.2 Marginal distributions These follow from the law of total probability. Discrete case. Marginal probability mass functions ∑ ∑ p1 (x1 ) = pr(X1 = x1 ) = p(x1 , x2 ) and p2 (x2 ) = pr(X2 = x2 ) = p(x1 , x2 ) x2
x1
Continuous case. Marginal probability density functions ∫ ∫ f1 (x1 ) = f (x1 , x2 )dx2 and f2 (x2 ) = f (x1 , x2 )dx1 ∫ ∑ Marginal means and variances. µ1 = E(X1 ) = x1 p1 (x1 ) (discrete) or x1 f1 (x1 )dx1 (continuous) σ12 = var(X1 ) = E{(X1 − µ1 )2 } = E(X12 ) − µ21 Likewise µ2 and σ22 .
2.5.3 Conditional distributions These follow from the definition of conditional probability. Discrete case. Conditional probability mass function of X1 given X2 is p1 (x1 |X2 = x2 ) =
pr(X1 = x1 |X2 = x2 ) pr(X1 = x1 , X2 = x2 ) p(x1 , x2 ) = = . pr(X2 = x2 ) p2 (x2 )
Similarly p2 (x2 |X1 = x1 ) =
p(x1 , x2 ) . p1 (x1 )
Continuous case. Conditional probability density function of X1 given X2 is f1 (x1 |X2 = x2 ) =
f (x1 , x2 ) . f2 (x2 )
f2 (x2 |X1 = x1 ) =
f (x1 , x2 ) . f1 (x1 )
Similarly
Independence. X1 and X2 are said to be independent if F (x1 , x2 ) = F1 (x1 )F2 (x2 ). Equivalently, p(x1 , x2 ) = p1 (x1 )p2 (x2 ) (discrete), or f (x1 , x2 ) = f1 (x1 )f2 (x2 ) (continuous).
2.5. BIVARIATE DISTRIBUTIONS
27
Example 15: Suppose that R and N have a joint distribution in which R|N is binomial (N, π) and N is Poisson (λ). Show that R is Poisson (λπ).
2.5.4
Covariance and correlation
The covariance between X1 and X2 is defined as σ12 = Cov(X1 , X2 ) = E{(X1 − µ1 )(X2 − µ2 )} = E(X1 X2 ) − µ1 µ2 , ∫ ∑ where E(X1 X2 ) = x1 x2 p(x1 , x2 ) (discrete) or x1 x2 f (x1 , x2 )dx1 dx2 (continuous). The correlation between X1 and X2 is ρ = Corr(X1 , X2 ) = Example 13: (continued) Marginal distributions: x1 = 0, 1, 2 with p1 (x1 ) = x2 = −1, 0, 1 with p2 (x2 ) = Marginal means: ∑ µ1 = x1 p1 (x1 ) = ∑ µ2 = x2 p2 (x2 ) = Variances:
σ12 . σ1 σ2
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
28
σ12 = σ22 =
∑ ∑
x21 p1 (x1 ) − µ21 = x22 p2 (x2 ) − µ22 =
Conditional distributions: e.g. p(x1 |X2 = 0) =
x1 = 0
x1 = 2
Independence: e.g. p(1, 0) = 0 but p1 (0)p2 (1) 6= 0, so X1 , X2 are not independent. ∑ Covariance: σ12 = x1 x2 p(x1 , x2 ) − µ1 µ2 = Example 14: (continued) Marginal distributions: ∫1 f1 (x1 ) = 0 k(1 − x1 x22 )dx2 = f2 (x2 ) = Marginal means: µ1 = µ2 = Variances: σ12 = σ22 =
∫1 0
∫1 0
∫1 0
∫1 0
∫1 0
k(1 − x1 x22 )dx1 =
x1 f1 (x1 )dx1 = x2 f2 (x2 )dx2 = x21 f1 (x1 )dx1 − µ21 = x22 f2 (x2 )dx2 − µ22 =
Conditional distributions: e.g. f (x2 |X1 = 13 ) = Independence: f (x1 , x2 ) = k(1 − x1 x22 ) , which does not factorise into f1 (x1 )f2 (x2 ) so X1 , X2 are not independent. Covariance: σ12 =
∫
x1 x2 f (x1 , x2 )dx1 dx2 − µ1 µ2
2.5. BIVARIATE DISTRIBUTIONS
29
Properties (i) E(aX1 + bX2 ) = aµ1 + bµ2 , Var(aX1 + bX2 ) = a2 σ12 + 2abσ12 + b2 σ22 Cov(aX1 + b, cX2 + d) = acσ12 , Corr(aX1 + b, cX2 + d) = Corr(X1 , X2 ) (note: invariance under linear transformation) Proof:
(ii) X1 , X2 independent ⇒ Cov(X1 , X2 ) = 0 . The converse is false. Proof:
(iii) −1 ≤ Corr(X1 , X2 ) ≤ +1, with equality if and only if X1 , X2 are linearly dependent. Proof:
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
30
(iv) E(Y ) = E{E(Y |X)} and Var(Y ) = E{Var(Y |X)} + Var{E(Y |X)} Proof:
2.5.5 Self-study exercises 1. Roll a fair die twice. Let X1 be the number of times that face 1 shows, and let X2 = [sum of faces/4], where [x] denotes the integer part of x. (a) Construct the joint probability table. (b) Calculate the two marginal pmfs p1 (x1 ) and p2 (x2 ) and the conditional pmfs p1 (x1 |x2 = 1) and p2 (x2 |x1 = 1). Are X1 and X2 independent? (c) Compute the means, µ1 and µ2 , variances, σ12 and σ22 , and covariance σ12 . Are X1 and X2 uncorrelated? 2. X1 and X2 have joint density f (x1 , x2 ) = 4x1 x2 for 0 ≤ x1 ≤ 1, 0 ≤ x2 ≤ 1. Calculate the marginal and conditional densities of X1 and X2 , their means and variances, and their correlation. 3. Calculate, in terms of the means, variances and covariances of X1 , X2 and X3 , E(2X1 + 3X2 ), Cov(2X1 , 3X2 ), Var(2X1 + 3X2 ) and Cov(2X1 + 3X2 , 4X2 + 5X3 ).
2.6. GENERATING FUNCTIONS
2.6
31
Generating functions
2.6.1 General The generating function for a sequence (an : n ≥ 0) is A(z) = a0 + a1 z + a2 z 2 + ∑ n ··· = ∞ n=0 an z . Here z is a dummy variable. The definition is useful only if the series converges. The idea is to replace the sequence (an ) by the function A(z), which may be easier to analyse than the original sequence. Examples: (i) If an = 1 for n = 0, 1, 2, . . . , then A(z) = (1 − z)−1 for |z| < 1 (geometric series). ( ) m (ii) If an = for n = 0, 1, . . . , m, and an = 0 for n > m, then A(z) = n (1 + z)m (binomial series).
2.6.2
Probability generating function
Let (pn ) be the pmf of some discrete random variable X, so pn = pr(X = n) ≥ 0 ∑ and n pn = 1. Define the probability generating function (pgf) of X by ∑ P (z) = E(z X ) = pn z n . n
Properties (i) |P (z)| ≤ 1 for |z| ≤ 1 . Proof:
(ii) µ = E(X) = P 0 (1) . Proof:
32
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
(iii) σ 2 = Var(X) = P 00 (1) + P 0 (1) − {P 0 (1)}2 . Proof:
(iv) Let X and Y be independent random variables with pgfs PX and PY respectively. Then the pgf of X + Y is given by PX+Y (z) = PX (z)PY (z) . Proof:
Example 16: (i) Find the pgf of the Poisson (λ) distribution. (ii) Let X1 , X2 be independent Poisson random variables with parameters λ1 , λ2 respectively. Obtain the distribution of X1 + X2 .
2.6.3
Moment generating function
The moment generating function (mgf) is defined as M (z) = E(ezX ) . The pgf tends to be used more for discrete distributions, and the mgf for continuous ones, although note that the two are related by M (z) = P (ez ).
2.6. GENERATING FUNCTIONS
33
Properties (i) µ = E(X) = M 0 (0), σ 2 = Var(X) = M 00 (0) − µ2 . Proof:
(ii) Let X and Y be independent random variables with mgfs MX (z) , MY (z) respectively. Then the mgf of X + Y is given by MX+Y (z) = MX (z)MY (z) . Proof:
Normal distribution. We prove properties (i) - (iv) of Section 2.4.2.
34
CHAPTER 2. RANDOM VARIABLES AND DISTRIBUTIONS
2.6.4 Self-study exercises 1. Show that the pgf of the binomial (n, π) distribution is {πz + (1 − π)}n . 2. (Zero-truncated Poisson distribution) Find the pgf of the discrete distribution with pmf p(r) = e−λ λr /{r!(1 − e−λ )} for r = 1, 2, . . .. Deduce the mean and variance. 3. The random variable X has density f (x) = k(1 + x)e−λx on (0, ∞) with λ > 0. Find the value of k. Show that the moment generating function M (z) = k{(z − λ)−2 − (z − λ)−1 }. Use it to calculate the mean and standard deviation of X.
Chapter 3 Further distribution theory 3.1
Multivariate distributions
Let X1 , . . . , Xp be p real-valued random variables on (Ω, F) and consider the joint distribution of X1 , . . . , Xp . Equivalently, consider the distribution of the random vector X1 X2 X= · · Xp
3.1.1
Definitions
The joint distribution function F (x) = pr(X ≤ x) = pr(X1 ≤ x1 , . . . , Xp ≤ xp ) The joint probability mass function (pmf) p(x) = pr(X = x) = pr(X1 = x1 , . . . , Xp = xp ) (discrete case) The joint probability density function (pdf) f (x) is such that ∫ pr(X ∈ A) = f (x)dx A
(continuous case) 35
36
CHAPTER 3. FURTHER DISTRIBUTION THEORY
The marginal distributions are those of the individual components: Fj (xj ) = pr(Xj ≤ xj ) = F (∞, . . . , xj , . . . , ∞) The conditional distributions are those of one component given another: F (xj |xk ) = pr(Xj ≤ xj |Xk = xk ) ∏ ∏ The Xj s are independent if F (x) = j Fj (xj ). Equivalently, p(x) = j pj (xj ) ∏ (discrete case), or f (x) = j fj (xj ) (continuous case). Means: µj = E(Xj ) Variances: σj2 = Var(Xj ) = E{(Xj − µj )2 } = E(Xj2 ) − µ2j Covariances: σjk = Cov(Xj , Xk ) = E{(Xj −µj )(Xk −µk )} = E(Xj Xk )−µj µk σ Correlations: ρjk = Corr(Xj , Xk ) = σjjk σk
3.1.2 Mean and covariance matrix
µ1 µ2 The mean vector of X is µ = E(X) = · · µp The covariance matrix (variance-covariance matrix, dispersion matrix) of X is σ11 σ12 · · · σ1p σ21 σ22 · · · σ2p · · · Σ= · · · σp1 σp2 · · · σpp Since the (i, j)th element of (X − µ)(X − µ)T is (Xi − µi )(Xj − µj ), we see that Σ = E{(X − µ)(X − µ)T } = E(XX T ) − µµT .
3.1.3 Properties Let X have mean µ and covariance matrix Σ. Let a , b be p-vectors and A be a q × p matrix. Then (i) E(aT X) = aT µ (ii) Var(aT X) = aT Σa . It follows that Σ is positive semi-definite. (iii) Cov(aT X, bT X) = aT Σb
3.1. MULTIVARIATE DISTRIBUTIONS (iv) Cov(AX) = AΣAT
(v) E(X T AX) = trace(AΣ) + µT Aµ
Proof:
37
38
CHAPTER 3. FURTHER DISTRIBUTION THEORY
3.1.4 Self-study exercises 1. Let X1 = I1 Y, X2 = I2 Y , where I1 , I2 and Y are independent and I1 and I2 take values ±1 each with probability 21 . Show that E(Xj ) = 0, Var(Xj ) = E(Y 2 ), Cov(X1 , X2 ) = 0. 2. Verify that E(X1 + · · · + Xp ) = µ1 + · · · + µp and Var(X1 + · · · + Xp ) = ∑ ij σij , where µi = E(Xi ) and σij = Cov(Xi , Xj ). ¯ has mean µ and variance Suppose now that the Xi ’s are iid. Verify that X σ 2 /p, where µ = E(Xi ) and σ 2 = Var(Xi ).
3.2 Transformations 3.2.1 The univariate case Problem: to find the distribution of Y = h(X) from the known distribution of X. The case where h is a one-to-one function was treated in Section 1.2.3. When h is many-to-one we use the following generalised formulae: ∑ Discrete case: pY (y) = pX (x) ∑ Continuous case: fY (y) = fX (x) dx dy where in both cases the summations are over the set {x : h(x) = y}. That is, we add up the contributions to the mass or density at y from all x values which map to y. Example 17: (discrete) Suppose pX (x) = px for x = 0, 1, 2, 3, 4, 5 and let Y = (X − 2)2 . Obtain the pmf of Y .
3.2. TRANSFORMATIONS
39
Example 18: (continuous) Suppose fX (x) = 2x on (0, 1) and let Y = (X − 12 )2 . Obtain the pdf of Y .
3.2.2
The multivariate case
Problem: to find the distribution of Y = h(X), where Y is s × 1 and X is r × 1, from the known distribution of X. ∑ Discrete case: pY (y) = pX (x) with the summation over the set {x : h(x) = y}. Continuous case: Case (i): h is a one-to-one transformation (so that s = r). Then the rule is dx fY (y) = fX (x(y)) dy + ( ) dx ∂xi is the Jacobian of transformation, with . where dx = ∂y dy dy j ij
Case (ii): s < r. First transform the s-vector Y to the r-vector Y 0 , where Yi0 = Yi , i = 1, . . . , s , and Yi0 , i = s + 1, . . . , r , are chosen for convenience. Now 0 find the density of Y 0 as above and then integrate out Ys+1 , . . . , Yr0 to obtain the marginal density of Y , as required. Case (iii): s = r but h(·) is not monotonic. Then there will generally be more than one value of x corresponding to a given y and we need to add the probability contributions from all relevant xs.
40
CHAPTER 3. FURTHER DISTRIBUTION THEORY
Example 19: (linear transformation) Suppose that Y = AX, where A is an r × r nonsingular matrix. Then fY (y) = fX (A−1 y)|A|−1 + .
Example 20: Suppose fX (x) = e−x1 −x2 on (0, ∞)2 . Obtain the density of Y1 = 1 (X1 + X2 ). 2
Sums and products If X1 and X2 are independent random variables with densities f1 and f2 , then ∫ (i) X1 + X2 has density g(u) = f1 (u − v)f2 (v)dv (convolution integral) ∫ (ii) X1 X2 has density g(u) = f1 (u/v)f2 (v)|v|−1 dv . Proof:
3.3. MOMENTS, GENERATING FUNCTIONS AND INEQUALITIES
41
3.2.3 Self-study exercises 1. If fX (x) = 29 (x + 1) on (−1, 2) and Y = X 2 , find fY (y). 2. If X has density f (x) calculate the density g(y) of Y = X 2 when (i) f (x) = 2xe−x on (0, ∞); 2
(ii) f (x) = 12 (1 + x) on |x| ≤ 1; (iii) f (x) =
1 2
on − 12 ≤ x ≤ 32 .
3. Let X1 and X2 be independent exponential (λ), and let Y1 = X1 + X2 and Y2 = X1 /X2 . Show that Y1 and Y2 are independent and find their densities.
3.3 3.3.1
Moments, generating functions and inequalities Moment generating function
The moment generating function of the random vector X is defined as M (z) = E(ez
TX
).
∑ Here z T X = j zj Xj . Properties Suppose X has mgf M (z). Then T (i) X + a has mgf ea z M (z) and aX has mgf M (az). ∑ (ii) The mgf of kj=1 Xj is M (z, . . . , z). (iii) If X1 , . . . , Xk are independent random variables with mgfs Mj (zj ), j=1,. . . ,k, ∏ then the mgf of X = (X1 , . . . , Xk )T is M (z) = kj=1 Mj (zj ), the product of the individual mgfs. Proof:
42
CHAPTER 3. FURTHER DISTRIBUTION THEORY
3.3.2 Cumulant generating function The cumulant generating function (cgf) of X is defined as K(z) = log M (z). The cumulants of X are defined as the coefficients κj in the power series expan∑ j sion K(z) = ∞ j=1 κj z /j!. The first two cumulants are κ1 = µ = E(X), κ2 = σ 2 = Var(X)
Similarly, the third and fourth cumulants are found to be κ3 = E(X − µ)3 , κ4 = 3/2 E(X − µ)4 − 3σ 4 . These are used to define the skewness, γ1 = κ3 /κ2 , and the kurtosis, γ2 = κ4 /κ22 . Cumulants of the sample mean. Suppose that X1 , . . . , Xn is a random sample from ¯ = n−1 ∑n Xj a distribution with cgf K(z) and cumulants κj . Then the mgf of X j=1 is {M (n−1 z)}n , so the cgf is log{M (n−1 z)}n = nK(n−1 z) = n
∞ ∑
κj (n−1 z)j /j! .
j=1
¯ is κj /nj−1 and it follows that X ¯ has mean κ1 = µ, Hence the jth cumulant of X variance κ2 /n = σ 2 /n, skewness (κ3 /n2 )/(κ2 /n)3/2 = γ1 /n1/2 and kurtosis (κ4 /n3 )/(κ2 /n)2 = γ2 /n.
3.3.3 Some useful inequalities Markov’s inequality Let X be any random variable with finite mean. Then for all a > 0 pr(|X| ≥ a) ≤
E|X| . a
3.3. MOMENTS, GENERATING FUNCTIONS AND INEQUALITIES
43
Proof:
Cauchy-Schwartz inequality Let X, Y be any two random variables with finite variances. Then {E(XY )}2 ≤ E(X 2 )E(Y 2 ) . Proof:
Jensen’s inequality If u(x) is a convex function then E{u(X)} ≥ u(E(X)) . Note that u(·) is convex if the curve y = u(x) has a supporting line underneath at each point, e.g. bowl-shaped. Proof:
44
CHAPTER 3. FURTHER DISTRIBUTION THEORY
Examples 1. Chebyshev’s inequality. Let Y be any random variable with finite variance. Then for all a > 0 pr(|Y − µ| ≥ a) ≤
σ2 . a2
2. Correlation inequality. 2 2 {Cov(X, Y )}2 ≤ σX σY (which implies that |Corr(X, Y )| ≤ 1).
3. |E(X)| ≤ E(|X|). [It follows that |E{h(Y )}| ≤ E{|h(Y )|} for any function h(·).]
3.4. SOME LIMIT THEOREMS
45
4. E{(|X|s )r/s } ≥ {E(|X|s )}r/s . [Thus {E(|X|r )}1/r ≥ {E(|X|s )}1/s and it follows that {E(|X|r )}1/r is an increasing function of r.]
5. A cumulant generating function is a convex function; i.e. K 00 (z) ≥ 0. Proof. K(z) = log M (z), so K 0 = M 0 /M and K 00 = {M M 00 − (M 0 )2 }/M 2 . Hence M (z)2 K 00 (z) = E(ezX )E(X 2 ezX ) − {E(XezX )}2 ≥ 0, by the CauchySchwartz inequality. (on writing XezX = (ezX/2 )(XezX/2 ))
3.3.4
Self-study exercises
1. Find the joint mgf M (z) of (X, Y ) when the pdf is f (x, y) = y)e−λ(x+y) on (0, ∞)2 . Deduce the mgf of U = X + Y .
1 3 λ (x 2
+
2. Find all the cumulants of the N (µ, σ 2 ) distribution. 1
[You may assume the mgf eµz+ 2 σ
2 z2
.]
3. Suppose that X is such that E(X) = 3 and E(X 2 ) = 13. Use Chebyshev’s inequality to determine a lower bound for pr(−2 < X < 8). 4. Show that {E(|X|)}−1 ≤ E(|X|−1 ).
3.4 3.4.1
Some limit theorems Modes of convergence of random variables
Let X1 , X2 , . . . be a sequence of random variables. There are a number of alternative modes of convergence of (Xn ) to a limit random variable X. Suppose first that X1 , X2 , . . . and X are all defined on the same sample space Ω.
CHAPTER 3. FURTHER DISTRIBUTION THEORY
46
Convergence in probability p
Xn → X if pr(|Xn − X| > ) → 0 as n → ∞ for all > 0. Equivalently, pr(|Xn − X| ≤ ) → 1. Often X = c, a constant. Almost sure convergence a.s.
Xn → X if pr(Xn → X) = 1. Again, often X = c. Also referred to as convergence ‘with probability one’. Almost sure convergence is a stronger property than convergence in probability. i.e. a.s. ⇒ p, but p 6⇒ a.s. Example 21: Consider independent Bernoulli trials with constant probability of success 21 . A typical sequence would be 01001001110101100010 . . .. Here the first 20 trials resulted in 9 successes, giving an observed proportion of ¯ 20 = 0.45 successes. X Intuitively, as we increase n we would expect this proportion to get closer to 1. However, this will not be the case for all sequences: for example, the sequence 11111111111111111111 has exactly the same probability as the earlier sequence, ¯ 20 = 1. but X It can be shown that the total probability of all infinite sequences for which the ¯ n → 1 ) = 1 so proportion of successes does not converge to 21 is zero; i.e. pr(X 2 p 1 a.s. 1 ¯ ¯ Xn → 2 (and hence also Xn → 2 ). Convergence in rth mean r
Xn → X if E|Xn − X|r → 0 as n → ∞. [rth mean ⇒ p, but rth mean 6⇔ a.s. ] Suppose now that the distribution functions are F1 , F2 , . . . and F . The random variables need not be defined on the same sample spaces for the following definition. Convergence in distribution d
Xn → X if Fn (x) → F (x) as n → ∞ at each continuity point of F . We say that the asymptotic distribution of Xn is F . [p ⇒ d, but d 6⇒ p] A useful result. Let (Xn ), (Yn ) be two sequences of random variables such that
3.4. SOME LIMIT THEOREMS
47
p
d
Xn → X and Yn → c, a constant. Then d
d
d
Xn + Yn → X + c , Xn Yn → cX , Xn /Yn → X/c (c 6= 0).
3.4.2
Limit theorems for sums of independent random variables
Let X1 , X2 , . . . be a sequence of iid random variables with (common) mean µ. Let ∑ ¯ n = n−1 Sn . Sn = ni=1 Xi , X p ¯n → Weak Law of Large Numbers (WLLN). If E|Xi | < ∞ then X µ.
¯n) = µ Proof (case σ 2 = Var(Xi ) < ∞). Use Chebyshev’s inequality: since E(X we have, for every > 0, ¯ n − µ| > ) ≤ pr(|X
¯n) Var(X σ2 = →0 2 n2
as n → ∞. Example 21: (continued). Here σ 2 = Var(Xi ) = ¯ n , the proportion of successes. applies to X
1 4
(Bernoulli r.v.) so the WLLN
¯ n a.s. Strong Law of Large Numbers (SLLN). If E|Xi | < ∞ then X → µ. [The proof is more tricky and is omitted.] Central Limit Theorem (CLT). If σ 2 = Var(Xi ) < ∞ then Sn − nµ d √ → N (0, 1) . σ n Equivalently, ¯n − µ d X √ → N (0, 1) . σ/ n Proof. Suppose that Xi has mgf M (z). Write Zn = given by (
zZn
MZn (z) = E(e
Sn √ −nµ . σ n
The mgf of Zn is
)}n √ ){ ( zµ n z √ . ) = exp − M σ σ n
CHAPTER 3. FURTHER DISTRIBUTION THEORY
48 Therefore the cgf of Zn is
( ) √ zµ n z √ KZn (z) = log MZn (z) = − + nK σ σ n { ( ) ( )2 } ( ) √ 2 σ 1 zµ n z z √ √ + +O √ = − +n µ σ 2 σ n σ n n ) ( √ √ zµ n zµ n z 2 z2 1 = − → + + +O √ σ σ 2 2 n as n → ∞, which is the cgf of the N (0, 1) distribution, as required. [Note on the proof of the CLT. In cases where the mgf does not √ exist, a similar proof izX j can be given in terms of the function φ(z) = E(e ) where i = −1. φ(·) is called the characteristic function and always exists.]
Example 21: (continued). Normal approximation to the binomial Suppose now that the success probability is π, so that pr(Xi = 1) = π. Then √ √ ¯ µ = π and σ 2 = π(1 − π), so the CLT gives n(X n − π)/ {π(1 − π)} is approximately N (0, 1). p ¯n → Furthermore, X π by the WLLN, and it follows from the ‘useful result’ that √ √ ¯ ¯ ¯ n )} is also approximately N (0, 1). n(Xn − π)/ {Xn (1 − X Poisson limit of binomial. Suppose that Xn is binomial (n, π) where π is such that d nπ → λ as n → ∞. Then Xn → Poisson(λ). ∑ Proof. Xn is expressible as ni=1 Yi , where the Yi are independent Bernoulli random variables with pr(Yi = 1) = π. Thus Xn has pgf (1 − π + πz)n = {1 − n−1 λ(1 − z) + o(n−1 )}n → exp{−λ(1 − z)} as n → ∞, which is the pgf of the Poisson (λ) distribution.
3.4.3 Self-study exercises 1. In a large consignment of manufactured items 25% are defective. A random sample of 50 is drawn. Use the binomial distribution to compute the exact probability that the number of defectives in the sample is five or fewer. Use the CLT to approximate this answer. 2. The random variable Y has the Poisson (50) distribution. Use the CLT to find pr(Y = 50), pr(Y ≤ 45) and pr(Y > 60).
3.5. FURTHER DISCRETE DISTRIBUTIONS
49
3. A machine in continuous use contains a certain critical component which has an exponential lifetime distribution with mean 100 hours. When a component fails it is immediately replaced by one from the stock, originally of 90 such components. Use the CLT to find the probability that the machine can be kept running for a year without the stock running out.
3.5
Further discrete distributions
3.5.1 Negative binomial distribution Let X be the number of Bernoulli trials until the kth success. Then pr(X = x) = pr(k − 1 successes in first x − 1 trials, followed by success on kth trial) ) ( x−1 π k−1 (1 − π)x−k × π = k−1 (where the first factor comes from the binomial distribution). Hence define the pmf of the negative binomial (k, π) distribution as ( ) x−1 p(x) = π k (1 − π)x−k , x = k, k + 1, . . . k−1 The mean is k/π: The variance is k(1 − π)/π 2 (see exercise 1). The pgf is {π/(z −1 − 1 + π)}k :
The name “negative binomial” comes from the binomial expansion ∞ ∑ 1 = π k {1 − (1 − π)}−k = p(x) x=k
CHAPTER 3. FURTHER DISTRIBUTION THEORY
50
where p(x) are the negative binomial probabilities. (Exercise: verify)
3.5.2 Hypergeometric distribution An urn contains n1 red beads and n2 black beads. Suppose that m beads are drawn without replacement and let X be the number of red beads in the sample. Note that, since X ≤ n1 and X ≤ m, the possible values of X are 0, 1, ..., min(n1 , m). Then
no. of selections of x reds and m − x blacks total no. of selections of m beads ( )( ) n1 n2 x m−x ( ) = , x = 0, 1, ..., min(n1 , m) . n1 + n2 m
p(x) = pr(X = x) =
This is the pmf of the hypergeometric (n1 , n2 , m) distribution. The mean is n1 m/(n1 + n2 ) and the variance is n1 n2 m(n1 + n2 − m)/{(n1 + n2 )2 (n1 + n2 − 1)}.
3.5.3 Multinomial distribution An urn contains nj beads of colour j (j = 1, . . . k). Suppose that m beads are drawn with replacement and let Xj be the number of beads of colour j in the ∑ sample. Then, for xj = 0, 1, . . . , m and kj=1 xj = m, ( p(x) = pr(X = x) =
m x
) π1x1 π2x2 · · · πkxk ,
∑ where πj = nj / ki=1 ni . This is the pmf of the multinomial (k, m, π) distribution. Here ( ) m = no. of different orderings of x1 + · · · + xk beads x ( ) m! = x1 ! · · · x k !
3.5. FURTHER DISCRETE DISTRIBUTIONS
51
and the probability of any given order is π1x1 π2x2 · · · πkxk . The name “multinomial” m comes from the multinomial ( ) expansion of (π1 +· · ·+πk ) in which the coefficient m of π1x1 π2x2 · · · πkxk is . x The means are mπj :
The covariances are σjk = m(δjk πj − πj πk ). ∏ X ∑ The joint pgf is E( j zj j ) = ( kj=1 πj zj )m :
3.5.4
Self-study exercises
1. Derive the variance of the negative binomial (k, π) distribution. [You may assume the formula for the pgf.] 2. Suppose that X1 , . . . , Xk are independent geometric (π) random variables. ∑ Using pgfs, show that kj=1 Xj is negative binomial (k, π). [Hence, the waiting times Xj between successes in Bernoulli trials are independent geometric, and the overall waiting time to the kth success is negative binomial.] 3. If X is multinomial (k, m, π) show that Xj is binomial (m, πj ), Xj + Xk is binomial (m, πj + πk ), etc. [Either by direct calculation or using the pgf.]
52
CHAPTER 3. FURTHER DISTRIBUTION THEORY
3.6 Further continuous distributions 3.6.1 Gamma and beta functions ∫∞ Gamma function: Γ(a) = 0 xa−1 e−x dx for a > 0 Integration by parts gives Γ(a) = (a − 1)Γ(a − 1). √ In particular, for integer a, Γ(a) = (a − 1)! (since Γ(1) = 1). Also, Γ(1/2) = π. ∫1 Beta function: B(a, b) = 0 xa−1 (1 − x)b−1 dx for a > 0, b > 0 Relationship with Gamma function: B(a, b) = Γ(a)Γ(b) Γ(a+b)
3.6.2 Gamma distribution

The pdf of the gamma (α, β) distribution is defined as

f(x) = \frac{β^α x^{α−1} e^{−βx}}{Γ(α)} ,   x > 0,

where α > 0 and β > 0. When α = 1, this is the exponential (β) distribution.
The mean is α/β:
The variance is α/β² (see exercise 2).
The mgf is (1 − z/β)^{−α} :
Note that the mode is (α − 1)/β if α ≥ 1, but f(x) → ∞ as x ↓ 0 if α < 1.
Example 22: The journey time of a bus on a nominal ½-hour route has the gamma (3, 6) distribution. What is the probability that the bus is over half an hour late?
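A numerical reading of Example 22, interpreting “over half an hour late” as a journey time exceeding 1 hour (note that scipy's gamma uses a scale parameter equal to 1/β):

```python
# Sketch: Example 22 read as P(T > 1 hour) for T ~ gamma(3, 6); scipy's scale is 1/beta.
from scipy.stats import gamma

print(gamma.sf(1.0, a=3, scale=1/6))   # = e^{-6}(1 + 6 + 18), approx. 0.062
```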
Sums of exponential random variables
Suppose that X1, . . . , Xn are iid exponential (λ) random variables. Then \sum_{i=1}^{n} X_i is gamma (n, λ).
Proof:
3.6.3 Beta distribution

The pdf of the beta (α, β) distribution is

f(x) = \frac{x^{α−1} (1 − x)^{β−1}}{B(α, β)} ,   0 < x < 1,

where α > 0 and β > 0.
The mean is α/(α + β):
The variance is αβ/{(α + β)²(α + β + 1)}.
The mode is (α − 1)/(α + β − 2) provided α ≥ 1, β ≥ 1 and α + β > 2.
Property
If X1 and X2 are independent, respectively gamma (ν1, λ) and gamma (ν2, λ), then U1 = X1 + X2 and U2 = X1/(X1 + X2) are independent, respectively gamma (ν1 + ν2, λ) and beta (ν1, ν2).

Proof
The inverse transformation is

\begin{pmatrix} X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} U_1 U_2 \\ U_1(1 − U_2) \end{pmatrix}

with Jacobian

\frac{dx}{du} = \begin{vmatrix} u_2 & u_1 \\ 1 − u_2 & −u_1 \end{vmatrix} = −u_1 .
Therefore

f_U(u) = \left[ \frac{λ^{ν_1} (u_1 u_2)^{ν_1−1} e^{−λu_1u_2}}{Γ(ν_1)} \right] \left[ \frac{λ^{ν_2} \{u_1(1 − u_2)\}^{ν_2−1} e^{−λu_1(1−u_2)}}{Γ(ν_2)} \right] |−u_1|
       = \left\{ \frac{λ^{ν_1+ν_2} u_1^{ν_1+ν_2−1} e^{−λu_1}}{Γ(ν_1 + ν_2)} \right\} \left\{ \frac{Γ(ν_1 + ν_2)}{Γ(ν_1)Γ(ν_2)} u_2^{ν_1−1} (1 − u_2)^{ν_2−1} \right\}

on (0, ∞) × (0, 1) and the result follows.
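A Monte Carlo sanity check of this property (the values of ν1, ν2, λ are arbitrary): the sum should have the gamma (ν1 + ν2, λ) mean, the ratio the beta (ν1, ν2) mean, and their sample correlation should be near zero.

```python
# Sketch: Monte Carlo check that U1 = X1 + X2 and U2 = X1/(X1 + X2) behave as claimed (illustrative values).
import numpy as np

rng = np.random.default_rng(1)
nu1, nu2, lam, N = 2.0, 3.5, 1.5, 200_000
X1 = rng.gamma(shape=nu1, scale=1/lam, size=N)
X2 = rng.gamma(shape=nu2, scale=1/lam, size=N)
U1, U2 = X1 + X2, X1 / (X1 + X2)
print(U1.mean(), (nu1 + nu2) / lam)   # gamma (nu1+nu2, lambda) mean
print(U2.mean(), nu1 / (nu1 + nu2))   # beta (nu1, nu2) mean
print(np.corrcoef(U1, U2)[0, 1])      # approx. 0, consistent with independence
```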
3.6.4 Self-study exercises
1. Suppose X has the gamma (2, 4) distribution. Find the probability that X exceeds µ + 2σ, where µ, σ are respectively the mean and standard deviation of X.

2. Derive the variance of the gamma (α, β) distribution. [Either by direct calculation or using the mgf.]

3. Find the distribution of − log X when X is uniform (0, 1). Hence show that if X1, . . . , Xk are iid uniform (0, 1) then − log(X1 X2 · · · Xk) is gamma (k, 1).

4. If X is gamma (ν, λ) show that log X has mgf λ^{−z} Γ(z + ν)/Γ(ν).

5. Suppose X is uniform (0, 1) and γ > 0. Show that Y = X^{1/γ} is beta (γ, 1).
Chapter 4 Normal and associated distributions

4.1 The multivariate normal distribution
4.1.1 Multivariate normal

The multivariate normal distribution, denoted Np(µ, Σ), has pdf

f(x) = |2πΣ|^{−1/2} \exp\{−\tfrac{1}{2}(x − µ)^T Σ^{−1} (x − µ)\}

on (−∞, ∞)^p. The mean is µ (p × 1) and the covariance matrix is Σ (p × p) (see property (v)).

Bivariate case, p = 2. Here

X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} ,   µ = \begin{pmatrix} µ_1 \\ µ_2 \end{pmatrix} ,   Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix} = \begin{pmatrix} σ_1^2 & ρσ_1σ_2 \\ ρσ_1σ_2 & σ_2^2 \end{pmatrix}

|2πΣ| = (2π)^2 σ_1^2 σ_2^2 (1 − ρ^2)

Σ^{−1} = (1 − ρ^2)^{−1} \begin{pmatrix} 1/σ_1^2 & −ρ/(σ_1σ_2) \\ −ρ/(σ_1σ_2) & 1/σ_2^2 \end{pmatrix} ,   giving

f(x_1, x_2) = \frac{1}{2πσ_1σ_2\sqrt{1 − ρ^2}} \exp\left[ −\frac{1}{2(1 − ρ^2)} \left\{ \left(\frac{x_1 − µ_1}{σ_1}\right)^2 − 2ρ\left(\frac{x_1 − µ_1}{σ_1}\right)\left(\frac{x_2 − µ_2}{σ_2}\right) + \left(\frac{x_2 − µ_2}{σ_2}\right)^2 \right\} \right]
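The explicit bivariate formula can be compared against a library implementation; the parameters µ = (0, 1), σ1 = 1, σ2 = 2, ρ = 0.5 below are arbitrary illustrative values (they match exercise 1 of Section 4.1.4).

```python
# Sketch: explicit bivariate normal density versus scipy's multivariate_normal (illustrative parameters).
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2, s1, s2, rho = 0.0, 1.0, 1.0, 2.0, 0.5
Sigma = np.array([[s1**2, rho*s1*s2], [rho*s1*s2, s2**2]])
x1, x2 = 0.3, -0.4
z1, z2 = (x1 - mu1)/s1, (x2 - mu2)/s2
f_explicit = (np.exp(-(z1**2 - 2*rho*z1*z2 + z2**2) / (2*(1 - rho**2)))
              / (2*np.pi*s1*s2*np.sqrt(1 - rho**2)))
print(f_explicit, multivariate_normal([mu1, mu2], Sigma).pdf([x1, x2]))  # the two agree
```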
4.1.2 Properties

(i) Suppose X is Np(µ, Σ) and let Y = T^{−1}(X − µ), where Σ = TT^T. Then Yi, i = 1, . . . , p, are independent N(0, 1).

(ii) The joint mgf of Np(µ, Σ) is e^{µ^T z + \frac{1}{2} z^T Σ z}. (C.f. property (iv), Section 2.4.2.)
(iii) If X is Np(µ, Σ) then AX + b (where A is q × p and b is q × 1) is Nq(Aµ + b, AΣA^T). (C.f. property (i), Section 2.4.2.)

(iv) If X_i, i = 1, . . . , n, are independent Np(µ_i, Σ_i), then \sum_i X_i is Np(\sum_i µ_i, \sum_i Σ_i). (C.f. property (iii), Section 2.4.2.)
(v) Moments of Np(µ, Σ). Obtain by differentiation of the mgf. In particular, differentiating w.r.t. zj and zk gives E(Xj) = µj, Var(Xj) = Σjj and Cov(Xj, Xk) = Σjk. Note that if X1, . . . , Xp are all uncorrelated (i.e. Σjk = 0 for j ≠ k) then X1, . . . , Xp are independent N(µj, σ_j^2).

(vi) If X is Np(µ, Σ) then a^T X and b^T X are independent if and only if a^T Σ b = 0. Similarly for A^T X and B^T X.
4.1.3 Marginal and conditional distributions

Suppose that X is Np(µ, Σ). Partition X^T as (X_1^T, X_2^T) where X_1 is p1 × 1, X_2 is p2 × 1 and p1 + p2 = p. Correspondingly µ^T = (µ_1^T, µ_2^T) and

Σ = \begin{pmatrix} Σ_{11} & Σ_{12} \\ Σ_{21} & Σ_{22} \end{pmatrix} .

Note that Σ_{21}^T = Σ_{12}, and X_1 and X_2 are independent if and only if Σ_{12} = 0 (since
the joint density factorises if and only if Σ_{12} = 0).
The marginal distribution of X_1 is N_{p_1}(µ_1, Σ_{11}). Proof:
The conditional distribution of X_2 | X_1 is N_{p_2}(µ_{2.1}, Σ_{22.1}), where

µ_{2.1} = µ_2 + Σ_{21} Σ_{11}^{−1} (X_1 − µ_1)
Σ_{22.1} = Σ_{22} − Σ_{21} Σ_{11}^{−1} Σ_{12}

(proof omitted). Note that µ_{2.1} is linear in X_1.
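The conditional mean and covariance formulas translate directly into a few lines of linear algebra; the partition and numbers below (taken from exercise 1) are purely illustrative.

```python
# Sketch: computing mu_{2.1} and Sigma_{22.1} for a partitioned normal (illustrative numbers).
import numpy as np

mu1, mu2 = np.array([0.0]), np.array([1.0])
S11, S12 = np.array([[1.0]]), np.array([[1.0]])
S21, S22 = np.array([[1.0]]), np.array([[4.0]])
x1 = np.array([0.5])
mu_cond = mu2 + S21 @ np.linalg.solve(S11, x1 - mu1)    # mu_2 + Sigma_21 Sigma_11^{-1} (x_1 - mu_1)
Sigma_cond = S22 - S21 @ np.linalg.solve(S11, S12)      # Sigma_22 - Sigma_21 Sigma_11^{-1} Sigma_12
print(mu_cond, Sigma_cond)                              # here: [1.5] and [[3.]]
```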
4.1.4 Self-study exercises

1. Write down the joint density of the N_2\left( \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 & 1 \\ 1 & 4 \end{pmatrix} \right) distribution in component form.

2. Suppose that X_i, i = 1, . . . , n, are independent Np(µ, Σ). Show that the sample mean vector, \bar{X} = n^{−1} \sum_i X_i, is Np(µ, n^{−1}Σ).

3. For the distribution in exercise 1, obtain the marginal distributions of X1 and X2 and the conditional distributions of X2 given X1 = x1 and X1 given X2 = x2.
4.2 The chi-square, t and F distributions

4.2.1 Chi-square distribution

The pdf of the chi-square distribution with ν degrees of freedom (ν > 0) is

f(u) = \frac{u^{\frac{1}{2}ν − 1} e^{−\frac{1}{2}u}}{2^{\frac{1}{2}ν} Γ(\frac{1}{2}ν)} ,   u > 0.
Denoted by χ²_ν. Note that the χ²_ν distribution is identical to the gamma (ν/2, 1/2) distribution (c.f. Section 3.6). It follows that the mean is ν, the variance is 2ν and the mgf is (1 − 2z)^{−ν/2}.

Properties
(i) Let ν be a positive integer and suppose that X1, . . . , Xν are iid N(0, 1). Then \sum_{i=1}^{ν} X_i^2 is χ²_ν. In particular, if X is N(0, 1) then X² is χ²_1.
(ii) If U_i, i = 1, . . . , n, are independent χ²_{ν_i} then \sum_{i=1}^{n} U_i is χ²_ν with ν = \sum_{i=1}^{n} ν_i.
(iii) If X is Np(µ, Σ) then (X − µ)^T Σ^{−1} (X − µ) is χ²_p.

Theorem (Joint distribution of the sample mean and variance)
Suppose that X1, . . . , Xn are iid N(µ, σ²). Let \bar{X} = n^{−1} \sum_i X_i be the sample mean and S² = (n − 1)^{−1} \sum_i (X_i − \bar{X})² the sample variance.
Then \bar{X} is N(µ, σ²/n), (n − 1)S²/σ² is χ²_{n−1}, and \bar{X} and S² are independent.
Proof:
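A Monte Carlo illustration of the theorem (not a proof), with arbitrary values of µ, σ and n:

```python
# Sketch: simulating the joint behaviour of the sample mean and variance (illustrative values; not a proof).
import numpy as np

rng = np.random.default_rng(2)
mu, sigma, n, N = 5.0, 2.0, 10, 100_000
Y = rng.normal(mu, sigma, size=(N, n))
xbar = Y.mean(axis=1)
s2 = Y.var(axis=1, ddof=1)
print(xbar.var(), sigma**2 / n)                       # Var(Xbar) = sigma^2/n
print(((n - 1) * s2 / sigma**2).var(), 2 * (n - 1))   # chi-square_{n-1} has variance 2(n-1)
print(np.corrcoef(xbar, s2)[0, 1])                    # approx. 0, consistent with independence
```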
4.2.2 Student’s t distribution

The pdf of the Student’s t distribution with ν degrees of freedom (ν > 0) is

f(t) = \frac{1}{B(\frac{1}{2}, \frac{1}{2}ν)\, ν^{\frac{1}{2}} \left(1 + \frac{t^2}{ν}\right)^{\frac{1}{2}(ν+1)}} ,   −∞ < t < ∞.

Denoted by t_ν.
The mean is 0 (provided ν > 1):
The variance is ν/(ν − 2) (provided ν > 2).

Theorem
If X is N(0, 1), U is χ²_ν and X and U are independent, then

T ≡ \frac{X}{\sqrt{U/ν}} ∼ t_ν .

Proof:
4.2.3 Variance ratio (F) distribution

The pdf of the variance ratio, or F, distribution with ν1, ν2 degrees of freedom (ν1, ν2 > 0) is

f(x) = \frac{\left(\frac{ν_1}{ν_2}\right)^{\frac{1}{2}ν_1} x^{\frac{1}{2}ν_1 − 1}}{B(\frac{ν_1}{2}, \frac{ν_2}{2}) \left(1 + \frac{ν_1 x}{ν_2}\right)^{\frac{1}{2}(ν_1+ν_2)}} ,   x > 0.
Denoted by F_{ν1,ν2}. The mean is ν2/(ν2 − 2) (provided ν2 > 2) and the variance is 2ν2²(ν1 + ν2 − 2)/{ν1(ν2 − 2)²(ν2 − 4)} (provided ν2 > 4).

Theorem
If U1 and U2 are independent, respectively χ²_{ν1} and χ²_{ν2}, then

F ≡ \frac{U_1/ν_1}{U_2/ν_2} ∼ F_{ν_1,ν_2} .

Proof:
It follows from the above result that (i) F_{ν1,ν2} ≡ 1/F_{ν2,ν1} and (ii) F_{1,ν} ≡ t²_ν. (Exercise: check)
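Both identities can be checked numerically with scipy's f and t distributions (the degrees of freedom and the quantile x are arbitrary):

```python
# Sketch: checking F_{nu1,nu2} = 1/F_{nu2,nu1} and F_{1,nu} = t_nu^2 numerically (illustrative values).
from scipy.stats import f, t

nu1, nu2, x = 3, 7, 2.4
print(f.cdf(x, nu1, nu2), 1 - f.cdf(1/x, nu2, nu1))   # P(F_{nu1,nu2} <= x) = P(F_{nu2,nu1} >= 1/x)
print(f.cdf(x, 1, nu2), 2*t.cdf(x**0.5, nu2) - 1)     # P(F_{1,nu} <= x) = P(|t_nu| <= sqrt(x))
```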
4.3 Normal theory tests and confidence intervals

4.3.1 One-sample t-test
Suppose that Y1, . . . , Yn are iid N(µ, σ²). Then, from Section 4.2, \bar{Y} = n^{−1} \sum_i Y_i (the sample mean) and S² = (n − 1)^{−1} \sum_i (Y_i − \bar{Y})² (the sample variance) are independent, respectively N(µ, σ²/n) and σ²χ²_{n−1}/(n − 1). Hence

Z = \frac{\bar{Y} − µ}{σ/\sqrt{n}}
is N(0, 1),

U = \frac{(n − 1)S^2}{σ^2}

is χ²_{n−1} and Z, U are independent. It follows that

T = \frac{Z}{\sqrt{U/(n − 1)}} = \frac{\bar{Y} − µ}{S/\sqrt{n}}

is t_{n−1}.

Applications:
Inference about µ: one-sample z-test (σ known) and t-test (σ unknown).
Inference about σ²: χ² test.
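In practice the one-sample t statistic is computed exactly as above, or obtained from a library; a minimal sketch with made-up data and null value µ0:

```python
# Sketch: one-sample t-test of H0: mu = mu0 (made-up data and null value).
import numpy as np
from scipy import stats

y = np.array([5.1, 4.8, 5.6, 5.0, 4.7, 5.3])
mu0 = 5.0
T = (y.mean() - mu0) / (y.std(ddof=1) / np.sqrt(len(y)))   # (Ybar - mu0)/(S/sqrt(n)) ~ t_{n-1} under H0
print(T, 2 * stats.t.sf(abs(T), df=len(y) - 1))            # statistic and two-sided p-value
print(stats.ttest_1samp(y, mu0))                           # the same result from scipy
```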
4.3.2 Two samples

Two independent samples. Suppose that Y11, . . . , Y1n1 are iid N(µ1, σ1²) and Y21, . . . , Y2n2 are iid N(µ2, σ2²).
Summary statistics: (n1, \bar{Y}_1, S_1²) and (n2, \bar{Y}_2, S_2²).
Pooled sample variance: S² = \frac{(n_1 − 1)S_1^2 + (n_2 − 1)S_2^2}{n_1 + n_2 − 2}.
From Section 4.2, if σ1² = σ2² = σ², say, then \bar{Y}_1 and (n1 − 1)S1² are independent, respectively N(µ1, n1^{−1}σ²) and σ²χ²_{n1−1}, and \bar{Y}_2 and (n2 − 1)S2² are independent, respectively N(µ2, n2^{−1}σ²) and σ²χ²_{n2−1}. Furthermore, (\bar{Y}_1, (n1 − 1)S1²) and (\bar{Y}_2, (n2 − 1)S2²) are independent.
Therefore \bar{Y}_1 − \bar{Y}_2 is N(µ1 − µ2, (n1^{−1} + n2^{−1})σ²), (n1 + n2 − 2)S² is σ²χ²_{n1+n2−2}, and \bar{Y}_1 − \bar{Y}_2 and (n1 + n2 − 2)S² are independent.
Therefore

T ≡ \frac{(\bar{Y}_1 − \bar{Y}_2) − (µ_1 − µ_2)}{S\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}

is t_{n1+n2−2}. Also, since S1², S2² are independent,

F ≡ \frac{S_1^2}{S_2^2} ∼ F_{n_1−1, n_2−1} .

Applications:
Inference about µ1 − µ2: two-sample z-test (σ known) and t-test (σ unknown).
Inference about σ1²/σ2²: F (variance ratio) test.
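A sketch of the pooled two-sample t statistic and the variance-ratio F statistic on made-up data (scipy's ttest_ind with equal_var=True uses the same pooled formula):

```python
# Sketch: pooled two-sample t-test and variance-ratio F-test (made-up data).
import numpy as np
from scipy import stats

y1 = np.array([10.2, 9.8, 11.1, 10.4, 9.9])
y2 = np.array([9.1, 9.5, 8.8, 9.9, 9.3, 9.0])
n1, n2 = len(y1), len(y2)
s2 = ((n1 - 1)*y1.var(ddof=1) + (n2 - 1)*y2.var(ddof=1)) / (n1 + n2 - 2)   # pooled variance
T = (y1.mean() - y2.mean()) / np.sqrt(s2 * (1/n1 + 1/n2))                   # ~ t_{n1+n2-2} under mu1 = mu2
print(T, stats.ttest_ind(y1, y2, equal_var=True).statistic)                 # the two agree
F = y1.var(ddof=1) / y2.var(ddof=1)                                         # ~ F_{n1-1,n2-1} under sigma1 = sigma2
print(F, 2 * min(stats.f.cdf(F, n1-1, n2-1), stats.f.sf(F, n1-1, n2-1)))    # F and a two-sided p-value
```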
Matched pairs
Observations (Yi1, Yi2 : i = 1, . . . , n) where the differences Di = Yi1 − Yi2 are independent N(µ, σ²). Then

T = \frac{\bar{D} − µ}{S/\sqrt{n}}

is t_{n−1}, where S² is the sample variance of the Di’s.
Application: Inference about µ from paired observations: paired-sample t-test.
4.3.3 k samples (One-way Anova)

Suppose we have k groups, with group means µ1, . . . , µk. Denote the independent observations by (Yi1, . . . , Yini : i = 1, . . . , k) with Yij ∼ N(µi, σ²), j = 1, . . . , ni, i = 1, . . . , k.
Summary statistics: ((ni, Si²) : i = 1, . . . , k).
Total sum of squares: ssT = \sum_{ij} (Y_{ij} − \bar{Y})², where \bar{Y} = n^{−1} \sum_{ij} Y_{ij} (the overall mean) and n = \sum_i n_i. Then ssT = ssW + ssB, where
ssW = \sum_{ij} (Y_{ij} − \bar{Y}_i)² = \sum_i (n_i − 1)S_i²   (the within-samples ss)
ssB = \sum_i n_i (\bar{Y}_i − \bar{Y})²   (the between-samples ss)
From Sections 4.1 and 4.2, (ni − 1)Si²/σ² is χ²_{ni−1}, independent of \bar{Y}_i. Hence ssW/σ² is χ²_{n−k}, independent of ssB. Also, by a similar argument to that of the Theorem in Section 4.2 (proof omitted), ssB is σ²χ²_{k−1} when µi = µ, say, for all i. Hence we obtain the F-test for equality of the group means µi:

F = \frac{ssB/(k − 1)}{ssW/(n − k)}

is F_{k−1, n−k} under the null hypothesis µ1 = · · · = µk.
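A sketch of the one-way ANOVA F statistic computed from ssB and ssW as above, compared with scipy.stats.f_oneway (made-up data for k = 3 groups):

```python
# Sketch: one-way ANOVA F statistic from ssB and ssW, versus scipy.stats.f_oneway (made-up data).
import numpy as np
from scipy import stats

groups = [np.array([4.1, 4.5, 4.3]), np.array([5.0, 5.2, 4.9, 5.1]), np.array([4.4, 4.6, 4.8])]
k, n = len(groups), sum(len(g) for g in groups)
ybar = np.concatenate(groups).mean()
ssB = sum(len(g) * (g.mean() - ybar)**2 for g in groups)   # between-samples ss
ssW = sum(((g - g.mean())**2).sum() for g in groups)       # within-samples ss
F = (ssB / (k - 1)) / (ssW / (n - k))                      # ~ F_{k-1, n-k} under equal group means
print(F, stats.f_oneway(*groups).statistic)                # the two agree
```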
4.3.4 Normal linear regression

Observations Y1, . . . , Yn are independently N(α + βxi, σ²), where x1, . . . , xn are given constants.
The least-squares estimator (\hat{α}, \hat{β}) is found by minimizing the sum of squares Q(α, β) = \sum_{i=1}^{n} (Y_i − α − βx_i)². By partial differentiation with respect to α and β, we obtain

\hat{β} = T_{xy}/T_{xx} ,   \hat{α} = \bar{Y} − \hat{β}\bar{x} ,

where Txx = \sum_i (x_i − \bar{x})² and Txy = \sum_i (x_i − \bar{x})(Y_i − \bar{Y}).
Note that, since both \hat{α} and \hat{β} are linear combinations of Y = (Y1, . . . , Yn)^T, they are jointly normally distributed. Using properties of expectation and covariance matrices, we find that (\hat{α}, \hat{β})^T is bivariate normal with mean (α, β) and covariance matrix

V = \frac{σ^2}{T_{xx}} \begin{pmatrix} n^{−1} \sum_i x_i^2 & −\bar{x} \\ −\bar{x} & 1 \end{pmatrix}

Sums of squares
Total ss: Tyy = \sum_i (Y_i − \bar{Y})²;
Residual ss: Q(\hat{α}, \hat{β});
Regression ss: Tyy − Q(\hat{α}, \hat{β}).

Results:
(a) Residual ss = Tyy − Txx \hat{β}², Regression ss = Txx \hat{β}² = T_{xy}²/Txx.
(b) E(Total ss) = Txx β² + (n − 1)σ², E(Regression ss) = Txx β² + σ², E(Residual ss) = (n − 2)σ².
(c) By a similar argument to that of the Theorem in Section 4.2 (proof omitted), Residual ss is σ²χ²_{n−2} and, if β = 0, Regression ss is σ²χ²_1, independently of Residual ss.

Application:
The residual mean square, S² = Residual ss/(n − 2), is an unbiased estimator of σ², \hat{β} is an unbiased estimator of β with estimated standard error S/\sqrt{T_{xx}}, and \hat{α} is an unbiased estimator of α with estimated standard error (S/\sqrt{T_{xx}})(\sum_i x_i^2/n)^{1/2}.
If β = β0 then

T = \frac{\hat{β} − β_0}{S/\sqrt{T_{xx}}}
is t_{n−2}, giving rise to tests and confidence intervals about β. If β = 0 then

F = \frac{Regression ss}{S^2}

is F_{1,n−2}, hence a test for β = 0. (Alternatively, and equivalently, use T = \hat{β}/(S/\sqrt{T_{xx}}) as t_{n−2}.)
The coefficient of determination is

r^2 = \frac{Regression ss}{Total ss} = \frac{T_{xy}^2}{T_{xx} T_{yy}}
(square of the sample correlation coefficient). The coefficient of determination gives the proportion of Y -variation attributable to regression on x.
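Finally, a sketch of the least-squares quantities above on made-up (x, Y) data, checked against scipy.stats.linregress:

```python
# Sketch: simple normal linear regression quantities (made-up data), compared with scipy.stats.linregress.
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = np.array([2.1, 2.9, 3.7, 4.2, 5.1, 5.8])
Txx = ((x - x.mean())**2).sum()
Txy = ((x - x.mean()) * (Y - Y.mean())).sum()
Tyy = ((Y - Y.mean())**2).sum()
beta_hat = Txy / Txx
alpha_hat = Y.mean() - beta_hat * x.mean()
S2 = (Tyy - Txx * beta_hat**2) / (len(x) - 2)   # residual mean square
T = beta_hat / np.sqrt(S2 / Txx)                # t_{n-2} statistic for beta = 0
r2 = Txy**2 / (Txx * Tyy)                       # coefficient of determination
res = stats.linregress(x, Y)
print(beta_hat, res.slope)                      # agree
print(alpha_hat, res.intercept)                 # agree
print(r2, res.rvalue**2)                        # agree
```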