Preparatory Course Econometrics
Probability Theory - Statistical Inference - Matrix Algebra
Prof. Dr. Christian Conrad, Dipl.-Vw. Daniel Rittler
University of Heidelberg
Winter term 2011/12
Preparatory Course Econometrics
Contents
1. Probability framework for statistical inference
2. Fundamentals of asymptotic distribution theory
3. Point estimation
4. Interval estimation
5. Statistical hypothesis testing
6. Fundamentals of matrix algebra
Preparatory Course Econometrics
Literature
Statistics:
Hogg, R. V. and A. T. Craig, Introduction to Mathematical Statistics, Prentice Hall, 1995.
Mosler, K. and F. Schmid, Wahrscheinlichkeitsrechnung und schließende Statistik, Springer, 2004.
Econometrics:
W. H. Greene, Econometric Analysis, 6th edition, Prentice Hall, 2008.
J. H. Stock and M. W. Watson, Introduction to Econometrics, 2nd edition, Addison-Wesley, 2007.
J. M. Wooldridge, Econometric Analysis of Cross Section and Panel Data, 2nd edition, MIT Press, 2002.
1 Probability framework for statistical inference
1.1 Calculus of probability
1.2 Probability measure
1.3 Random variables
1.4 Joint distributions
1.5 Conditional distributions
1 Probability framework for statistical inference 1.1 Calculus of probability
1.1.1 Notation
Our starting point is a random experiment with possible outcomes ω and the set of all outcomes Ω, called the sample space. An event A is a collection of outcomes and hence a subset of Ω.
Problem: Tossing a die once
Describe the sample space and the events A: "the outcome is an odd number" and B: "the outcome is an even number".
1 Probability framework for statistical inference 1.1 Calculus of probability
1.1.2 Elementary set operations
Union: A ∪ B = {x : x ∈ A or x ∈ B};
Intersection: A ∩ B = {x : x ∈ A and x ∈ B};
Complement: Aᶜ = {x : x ∉ A};
Relative complement: A\B = {x : x ∈ A and x ∉ B}.
1.1.3 Properties of set operations
A and B are disjoint if A ∩ B = ∅;
A and B are a partition of C if A ∩ B = ∅ and A ∪ B = C;
Commutativity: A ∪ B = B ∪ A and A ∩ B = B ∩ A;
Associativity: A ∪ (B ∪ C) = (A ∪ B) ∪ C and A ∩ (B ∩ C) = (A ∩ B) ∩ C;
Distributive laws: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C) and A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C);
De Morgan's laws: (A ∪ B)ᶜ = Aᶜ ∩ Bᶜ and (A ∩ B)ᶜ = Aᶜ ∪ Bᶜ.
1 Probability framework for statistical inference 1.2 Probability measure
Starting point: experiment with sample space Ω and relevant events A
Definition
A function P that assigns a real number to each event A is called a probability measure (probability) if
(i) 0 ≤ P(A) ≤ 1 for all events A,
(ii) P(Ω) = 1,
(iii) P(A1 ∪ A2 ∪ ...) = P(A1) + P(A2) + ... for all A1, A2, ... with Ai ∩ Aj = ∅ for i ≠ j.
Implications:
P(∅) = 0
P(Aᶜ) = 1 − P(A)
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
P(A) ≤ P(B) if A ⊆ B
P(A1 ∪ ... ∪ An) = P(A1) + ... + P(An) for all A1, A2, ..., An with Ai ∩ Aj = ∅ for i ≠ j
1 Probability framework for statistical inference 1.2 Probability measure
Problem: Tossing a fair coin two times Consider the random experiment "tossing a fair coin two times". Describe all possible outcomes; the sample space; all possible events; the probability measure.
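A minimal Python sketch (not part of the original slides) that enumerates this experiment; the event definitions A and B below are illustrative choices, not taken from the problem.

```python
from itertools import product

# Sample space of "tossing a fair coin two times": ordered pairs of h/t.
omega = list(product("ht", repeat=2))        # [('h','h'), ('h','t'), ('t','h'), ('t','t')]

# Probability measure for a fair coin: every outcome has probability 1/4.
P = {outcome: 1 / len(omega) for outcome in omega}

def prob(event):
    """P(A) for an event A given as a set of outcomes."""
    return sum(P[w] for w in event)

# Two example events (illustrative names, not from the slides):
A = {w for w in omega if w[0] == "h"}        # "first toss shows heads"
B = {w for w in omega if w.count("h") >= 1}  # "at least one heads"

print(prob(set(omega)))                                # P(Omega) = 1
print(prob(A | B), prob(A) + prob(B) - prob(A & B))    # inclusion-exclusion holds
```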
1 Probability framework for statistical inference 1.2 Probability measure
1.2.1 Conditional probability and statistical independence
Definition
Let A and B be two events in Ω with P(B) > 0. Then
P(A|B) := P(A ∩ B) / P(B)
is called the conditional probability of event A under the condition of event B.
Note that P(·|B) satisfies P(A|B) ≥ 0, P(Ω|B) = 1, and
P(A1 ∪ A2 | B) = P(A1|B) + P(A2|B) if A1 ∩ A2 = ∅.
1 Probability framework for statistical inference 1.2 Probability measure
Example: Consider a sample space Ω with cardinality |Ω| = n. Further, A and B are events with |A| = k and |B| = l. Finally, |A ∩ B| = m. Each element of the sample space occurs with equal probability.
Determine P(A), P(B), P(A ∩ B), and P(A|B);
Illustrate the problem graphically;
Interpret the conditional probability P(A|B).
1 Probability framework for statistical inference 1.2 Probability measure
Definition The definition of the conditional probability yields the multiplication theorem P(A ∩ B) = P(A|B) · P(B)
Definition
The events A and B are called statistically independent if P(A|B) = P(A). Hence, the occurrence of event B carries no information about the likelihood of event A.
Note: If A and B are statistically independent, then P(A ∩ B) = P(A) · P(B).
1 Probability framework for statistical inference 1.3 Random variables
1.3.1 Random variables
Definition
Consider the sample space Ω and the probability measure P. A mapping X : Ω → Ω′ ⊆ R is called a random variable.
The probability distribution (distribution) PX of X is given by
PX(B) = P(X⁻¹(B)) = P({ω ∈ Ω | X(ω) ∈ B}),
with B ⊆ Ω′.
Interpretation: A random variable X assigns a real number X(ω) = x to each outcome ω ∈ Ω of a random experiment. While X is a function, x is a real number, called the realised value. X induces a new sample space Ω′ and a new probability function PX on Ω′.
1 Probability framework for statistical inference 1.3 Random variables
Example: Two gamblers, S and T, toss two fair coins. S pays T two dollars if both coins show "heads". T pays S one dollar if exactly one coin shows "tails". If both coins show "tails", neither gambler receives a payment. X denotes the payment of S. Determine the distribution of the random variable X.
Sample space: Ω = {(h, h), (h, t), (t, h), (t, t)}.
The function X assigns the corresponding payoff of S to each outcome ω:
X(h, h) = −2, X(h, t) = 1, X(t, h) = 1, X(t, t) = 0.
New sample space: Ω′ = {−2, 0, 1}.
Distribution of X:
PX({−2}) = P(X⁻¹({−2})) = P({ω ∈ Ω | X(ω) = −2}) = P({(h, h)}) = 1/4
PX({1}) = P(X⁻¹({1})) = P({ω ∈ Ω | X(ω) = 1}) = P({(t, h), (h, t)}) = 1/2
PX({0}) = P(X⁻¹({0})) = P({ω ∈ Ω | X(ω) = 0}) = P({(t, t)}) = 1/4
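A quick Monte Carlo cross-check of this distribution (an illustrative sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Payoff of gambler S as a function of the two coin outcomes.
def payoff(c1, c2):
    if c1 == "h" and c2 == "h":
        return -2          # S pays T two dollars
    if c1 == "t" and c2 == "t":
        return 0           # no payment
    return 1               # exactly one tails: T pays S one dollar

coins = rng.choice(["h", "t"], size=(100_000, 2))
x = np.array([payoff(c1, c2) for c1, c2 in coins])

# Relative frequencies should be close to P_X({-2}) = 1/4, P_X({0}) = 1/4, P_X({1}) = 1/2.
for value in (-2, 0, 1):
    print(value, np.mean(x == value))
```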
1 Probability framework for statistical inference 1.3 Random variables
1.3.2 Cumulative distribution function

Definition
Consider the random variable X : Ω → R with probability distribution PX. The function FX : R → [0, 1] with
FX(x) := PX((−∞, x]) = P(X ≤ x)
is called the cumulative distribution function (CDF) of X.
Properties of the CDF:
1. FX is nondecreasing in x.
2. FX is right-continuous, that is, lim_{x→x0, x>x0} FX(x) = FX(x0).
3. lim_{x→−∞} FX(x) = 0.
4. lim_{x→∞} FX(x) = 1.
Moreover, PX((a, b]) = FX(b) − FX(a).
1 Probability framework for statistical inference 1.3 Random variables
1.3.3 Discrete and continuous random variables
A random variable X is discrete if FX(x) is a step function. The range of X consists of a countable set of real numbers x1, x2, .... The probability function takes the form
P(X = xi) = πi, i = 1, 2, ...
where 0 ≤ πi ≤ 1 and Σi πi = 1, so that
FX(xi) = P(X ≤ xi) = Σ_{xt ≤ xi} P(X = xt).

A random variable X is continuous if FX(x) is continuous in x. Probabilities are represented by the probability density function (PDF)
fX(x) = d/dx FX(x)
so that
FX(x) = P(X ≤ x) = ∫_{−∞}^{x} fX(u) du.
A function fX(x) is a PDF if and only if fX(x) ≥ 0 for all x ∈ R and ∫_R fX(x) dx = 1.
1 Probability framework for statistical inference 1.3 Random variables
Problem: Consider the function
fX(x) = e^(−x) if x ≥ 0, and fX(x) = 0 else.
i) Check whether fX is a probability density function.
ii) Derive the cumulative distribution function of X.
iii) Show fX and FX graphically.
iv) Determine P(X > 0.5) and P(0.5 < X ≤ 1).
v) Interpret the results of iv) graphically.
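A numerical sketch for parts i) and iv), assuming the standard exponential form of fX given above (the quadrature and scipy calls are for cross-checking only, not part of the slides):

```python
import numpy as np
from scipy import integrate, stats

f = lambda x: np.exp(-x)                       # density for x >= 0, 0 otherwise

# i) f integrates to one over [0, infinity), so it is a valid PDF.
total, _ = integrate.quad(f, 0, np.inf)
print(total)                                   # ~1.0

# ii) CDF of the standard exponential: F_X(x) = 1 - exp(-x) for x >= 0.
F = lambda x: 1 - np.exp(-x)

# iv) P(X > 0.5) and P(0.5 < X <= 1).
print(1 - F(0.5))
print(F(1) - F(0.5), stats.expon.cdf(1) - stats.expon.cdf(0.5))
```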
1 Probability framework for statistical inference 1.3 Random variables
1.3.4 Moments of random variables
Definition
For any real function g, we define the expectation E[g(X)] as follows. If X is discrete,
E[g(X)] = Σi g(xi) P(X = xi),
and if X is continuous,
E[g(X)] = ∫_{−∞}^{∞} g(x) fX(x) dx.
Note that ∫_{−∞}^{∞} g(x) fX(x) dx is well defined and finite if ∫_{−∞}^{∞} |g(x)| fX(x) dx < ∞. For g(x) = x, we have the ordinary definition of the expectation.
1 Probability framework for statistical inference 1.3 Random variables
Definition
For m > 0, we define the m-th moment of X as E(X^m) and the m-th central moment of X as E[(X − E(X))^m].
Special moments:
Mean: µX = E(X)
Variance: σX² = E[(X − µX)²]
σX = √(σX²) is called the standard deviation of X.
Z = (X − µX)/σX is called the standard score of X.
Problem: Show that E(Z) = 0 and Var(Z) = 1.
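As a quick illustration of the standard score (a sketch, not from the slides), the following simulation standardises draws from an arbitrarily chosen distribution, here Exp(scale = 2) with known mean 2 and standard deviation 2, and checks that the standardised values have mean ≈ 0 and variance ≈ 1:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=1_000_000)   # illustrative distribution with finite variance

mu, sigma = 2.0, 2.0          # population mean and standard deviation of Exp(scale=2)
z = (x - mu) / sigma          # standard score

print(z.mean(), z.var())      # close to 0 and 1
```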
1 Probability framework for statistical inference 1.3 Random variables
Problem: Consider the continuous random variable X and the function g(X) = a + bX with a, b ∈ R. Derive E(g(X)) and Var(g(X)).
1 Probability framework for statistical inference 1.3 Random variables
1.3.5 The binomial distribution
A random experiment is called a Bernoulli experiment if we are only interested in whether an event A occurs or not. Hence, the random variable Xi is given by
Xi = 1 if A occurs, and Xi = 0 if A does not occur.

Definition
A random variable Xi with codomain {0, 1} and parameter p ∈ (0, 1) is Bernoulli distributed (Xi ∼ Be(p)) if P(Xi = 1) = p.
Consider n independent Be(p)-distributed random variables X1, ..., Xn. The random variable X = Σi Xi with codomain {0, 1, ..., n} and parameters n ∈ N and p ∈ (0, 1) is binomially distributed (X ∼ B(n, p)) if
P(X = k) = (n choose k) · p^k · (1 − p)^(n−k).
1 Probability framework for statistical inference 1.3 Random variables
1.3.6 The normal distribution
Definition
A continuous random variable with density
fX(x) = (1 / (√(2π) σX)) · exp(−(x − µX)² / (2σX²))
with µX, x ∈ R and σX > 0 is called univariate normal. The mean and the variance of the normal distribution are µX and σX². It is conventional to write X ∼ N(µX, σX²).
The random variable Z := (X − µX)/σX ∼ N(0, 1) is called standard normally distributed.
1 Probability framework for statistical inference 1.4 Joint distributions
1.4.1 Joint CDF
Definition
A pair of random variables (X, Y) is a function from the sample space Ω into R². The joint CDF of (X, Y) is given by
F(X,Y)(x, y) = P(X ≤ x, Y ≤ y).
Properties of the joint CDF:
F(X,Y)(x, y) is nondecreasing in x and y.
F(X,Y)(x, y) is right-continuous in x and y.
(i) lim_{x→−∞} F(X,Y)(x, y) = 0 and lim_{y→−∞} F(X,Y)(x, y) = 0.
(ii) lim_{x,y→∞} F(X,Y)(x, y) = 1.
1 Probability framework for statistical inference 1.4 Joint distributions
1.4.2 Joint PDF
Definition
The joint PDF f(X,Y)(x, y) of (X, Y) is given by
f(X,Y)(x, y) = ∂²/∂x∂y F(X,Y)(x, y)
if (X, Y) is continuous, and by
f(X,Y)(x, y) = P(X = xi ∩ Y = yj)
if (X, Y) is discrete.
Properties of f(X,Y)(x, y) for continuous random variables:
f(X,Y)(x, y) ≥ 0 for all x and y.
∫_{−∞}^{∞} ∫_{−∞}^{∞} f(X,Y)(x, y) dx dy = 1.
F(X,Y)(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{y} f(X,Y)(u, v) dv du.
1 Probability framework for statistical inference 1.4 Joint distributions
1.4.3 Marginal distributions
Definition
Consider the bivariate continuous random variable (X, Y). The marginal distribution of X is given by
FX(x) = P(X ≤ x) = lim_{y→∞} F(X,Y)(x, y) = ∫_{−∞}^{x} ∫_{−∞}^{∞} f(X,Y)(u, y) dy du.

Definition
Consider the bivariate continuous random variable (X, Y). The marginal density of X is given by
fX(x) = d/dx FX(x) = ∫_{−∞}^{∞} f(X,Y)(x, y) dy.
1 Probability framework for statistical inference 1.4 Joint distributions
1.4.4 Statistical independence
Definition The random variables X and Y with the joint CDF F(X,Y) (x, y) are statistically independent if F(X,Y) (x, y) = FX (x) · FY (y)
for all x, y ∈ R.
Interpretation: The statistical independence of X and Y is equivalent to the statistical independence of the events AX = {X ≤ x} and BY = {Y ≤ y} for all x, y ∈ R. Equivalently, X and Y are statistically independent if f(X,Y) (x, y) = fX (x) · fY (y) for all x, y ∈ R.
1 Probability framework for statistical inference 1.4 Joint distributions
1.4.5 Expectation and Covariance
Definition
For any real-valued function g(x, y),
E[g(X, Y)] = Σi Σj g(xi, yj) P(X = xi, Y = yj), if X, Y are discrete;
E[g(X, Y)] = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x, y) f(X,Y)(x, y) dx dy, if X, Y are continuous.

Definition
The covariance between X and Y is Cov(X, Y) = σXY = E[(X − µX)(Y − µY)].
The correlation between X and Y is Corr(X, Y) = ρXY = σXY / (σX σY).
1 Probability framework for statistical inference 1.4 Joint distributions
Properties of covariance and correlation
Corr(X, Y) ∈ [−1, 1] is a measure of linear dependence, free of units of measurement;
Cov(a + bX, c + dY) = bd · Cov(X, Y);
Cov(X, Y) = E[XY] − E[X]E[Y];
X, Y statistically independent ⇒ Cov(X, Y) = Corr(X, Y) = 0;
Cov(X, Y), Corr(X, Y) ≠ 0 ⇒ X, Y are statistically dependent;
Cov(X, Y) = Corr(X, Y) = 0 does not imply that X, Y are statistically independent.
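The last implication is easy to see numerically. The sketch below (not from the slides) uses X standard normal and Y = X², an illustrative pair that is uncorrelated but clearly dependent:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1_000_000)     # symmetric around 0
y = x ** 2                         # a deterministic function of x, hence dependent on x

# Sample covariance and correlation are close to zero although Y is determined by X.
print(np.cov(x, y)[0, 1])          # ~0
print(np.corrcoef(x, y)[0, 1])     # ~0
```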
1 Probability framework for statistical inference 1.4 Joint distributions
Problem: Consider the bivariate random variable (X, Y) with density function
f(X,Y)(x, y) = e^(−x−y) if x ≥ 0 and y ≥ 0, and f(X,Y)(x, y) = 0 else.
i) Determine the marginal density functions fX(x) and fY(y).
ii) Check whether X and Y are statistically independent.
iii) Determine Cov(X, Y) without computation.
1 Probability framework for statistical inference 1.4 Joint distributions
Expectation and variance of X + Y and X − Y
Consider the random variables X and Y. The expectation of X + Y is given by
E(X + Y) = E(X) + E(Y).
The variance of X + Y is given by
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y)
and the variance of X − Y is given by
Var(X − Y) = Var(X) + Var(Y) − 2Cov(X, Y).
An implication is that if X and Y are independent, then Var(X + Y) = Var(X) + Var(Y), since Cov(X, Y) = 0.
Problem: Prove that Var(X + Y) = Var(X) + Var(Y) + 2Cov(X, Y).
1 Probability framework for statistical inference 1.4 Joint distributions
Expectation and variance of specifically distributed random variables
Consider the random variables X ∼ B(n, p) and Y ∼ B(m, p) with X, Y independent. Then X + Y ∼ B(n + m, p) with E(X + Y) = (n + m)p and Var(X + Y) = (n + m)p(1 − p).
Consider the random variables X ∼ N(µX, σX²) and Y ∼ N(µY, σY²) with X, Y independent. Then X + Y ∼ N(µX + µY, σX² + σY²) with E(X + Y) = µX + µY and Var(X + Y) = σX² + σY².
1 Probability framework for statistical inference 1.5 Conditional Distributions
Definition
Consider the random variables X and Y. The conditional density of Y given X = x is
fY|X(y|X = x) = f(X,Y)(x, y) / fX(x), for fX(x) > 0.
The conditional mean or conditional expectation of Y given X = x is given by
m(x) = E(Y|X = x) = ∫_{−∞}^{∞} y · fY|X(y|X = x) dy.
The conditional variance of Y given X = x is given by
σ²(x) = Var(Y|X = x) = E[(Y − m(x))² | X = x].
Evaluated at X = x, the conditional mean m(x) and conditional variance σ²(x) are realized values of the random variables m(X) = E(Y|X) and σ²(X) = Var(Y|X).
1 Probability framework for statistical inference 1.5 Conditional Distributions
Laws of iterated expectations
Simple law of iterated expectations: E(E(Y|X)) = E(Y)
Extended law of iterated expectations: E(E(Y|X, Z)|X) = E(Y|X)
Conditioning theorem: E(g(X)Y|X) = g(X)E(Y|X)
Problem: Prove the simple law of iterated expectations for continuous random variables X and Y.
2 Fundamentals of asymptotic theory
2.1 Convergence of random variables
2.2 Laws of large numbers
2.3 Central limit theorem
2.4 Asymptotic transformations
2 Fundamentals of asymptotic theory 2.1 Convergence of random variables
Inequalities:
Jensen's inequality: If g is a convex function, then g(E(X)) ≤ E(g(X)).
Chebyshev's inequality: P(|X − E(X)| ≥ ε) ≤ Var(X)/ε².
2 Fundamentals of asymptotic theory 2.1 Convergence of random variables
2.1.1 Convergence in distribution
Definition
A sequence {Xn} of i.i.d. random variables is said to converge in distribution to a random variable X if
lim_{n→∞} FXn(x) = FX(x)
for every number x ∈ R at which FX is continuous. Convergence in distribution is denoted as Xn →d X.
Properties:
Since FX(a) = P(X ≤ a), convergence in distribution means that the probability for Xn to be in a given range is approximately equal to the probability that the value of X is in that range, provided n is sufficiently large.
2 Fundamentals of asymptotic theory 2.1 Convergence of random variables
2.1.2 Convergence in probability
Definition
A sequence {Xn} of i.i.d. random variables converges in probability towards X if for all ε > 0
lim_{n→∞} P(|Xn − X| ≥ ε) = 0.
Convergence in probability is denoted as Xn →p X.
Properties:
Let Pn be the probability that Xn is outside the ball of radius δ centered at X. Then, for Xn to converge in probability to X, for every δ > 0 and ε > 0 there must exist a number Nδ such that for all n ≥ Nδ the probability Pn is less than ε.
Convergence in probability implies convergence in distribution.
2 Fundamentals of asymptotic theory 2.1 Convergence of random variables
2.1.3 Convergence in square mean
Definition
A sequence {Xn} of i.i.d. random variables converges in square mean towards X if E|Xn|² < ∞ for all n and
lim_{n→∞} E[(Xn − X)²] = 0,
where E denotes the expected value. Convergence in square mean is denoted as Xn →L² X.
Properties:
Convergence in square mean tells us that the expectation of the square of the difference between Xn and X converges to zero.
Convergence in square mean implies convergence in probability, and hence implies convergence in distribution.
2 Fundamentals of asymptotic theory 2.2 Laws of large numbers
2.2 Law of large numbers
Definition
The weak law of large numbers states that the sample average X̄n of a sequence of i.i.d. random variables converges in probability towards the expected value µX:
X̄n →p µX as n → ∞.
Properties: The weak law states that for a specified large n, the average X̄n is likely to be near µX. Thus, it leaves open the possibility that |X̄n − µX| > ε happens an infinite number of times.
Problem: Prove the weak law of large numbers.
2 Fundamentals of asymptotic theory 2.3 Central limit theorem
2.3 Central limit theorem
Definition
Let {Xn} be a sequence of i.i.d. random variables with mean µX and variance σX². The central limit theorem states that the standardised sample average
Zn = (X̄n − µX) / (σX/√n)
converges in distribution to the standard normal distribution as n approaches infinity, that is,
Zn →d N(0, 1).
Problem: Show that the binomially distributed random variable Xn = Σ_{i=1}^{n} Xi, with Xi ∼ Be(p), converges in distribution (after standardisation) towards a normally distributed random variable.
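An illustrative simulation sketch for this problem (not part of the slides): sums of Be(p) variables, standardised with mean np and variance np(1 − p), behave like N(0, 1) draws for large n.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, p, reps = 1_000, 0.3, 50_000

# X_n = sum of n Bernoulli(p) variables, i.e. X_n ~ B(n, p).
x = rng.binomial(n, p, size=reps)

# Standardise with E(X_n) = np and Var(X_n) = np(1 - p).
z = (x - n * p) / np.sqrt(n * p * (1 - p))

# Compare simulated tail probabilities with standard normal ones.
for c in (1.0, 1.96):
    print(c, np.mean(z > c), 1 - stats.norm.cdf(c))
```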
2 Fundamentals of asymptotic theory 2.4 Asymptotic transformations
2.4.1 Continuous mapping theorem
Let {Xn} be a sequence of random variables, X a random variable, and g a continuous real-valued function. The continuous mapping theorem states that
i) Xn →d X ⇒ g(Xn) →d g(X);
ii) Xn →p X ⇒ g(Xn) →p g(X).

2.4.2 Slutsky's theorem
Let {Xn} and {Yn} be sequences of random variables. If {Xn} converges in distribution to the random variable X, and {Yn} converges in probability to the constant c, then Slutsky's theorem states that
i) Xn + Yn →d X + c;
ii) Yn Xn →d cX.
Problem: Consider the i.i.d. random variables X1, ..., Xn with mean µ and variance σ². Show that the sample variance Sn² = (1/n) Σ_{i=1}^{n} (Xi − X̄)² converges in probability to σ².
3 Point estimation
3 Point estimation 3.1 Fundamentals 3.2 Properties of an estimator
3 Point estimation 3.1 Fundamentals
3.1 Fundamentals
Starting point:
Independently and identically distributed random variables X1, ..., Xn;
Sample with sample size n: realized values x1, ..., xn with
sample mean x̄ = (1/n) Σ_{i=1}^{n} xi;
sample variance SX² = (1/n) Σ_{i=1}^{n} (xi − x̄)².
Notes:
X1, ..., Xn are randomly drawn such that sample mean and sample variance are random as well.
Population mean E(X) and variance Var(X) are unknown.
Sample mean x̄ and sample variance SX² are used to estimate the population mean E(X) and variance Var(X).
3 Point estimation 3.1 Fundamentals
Estimator and estimate
Definition
Let X1, ..., Xn be a sequence of i.i.d. random variables.
A function ϑ̂(X1, ..., Xn) is called an estimator for ϑ ∈ Θ ⊆ R.
The realized value ϑ̂(x1, ..., xn) of an estimator ϑ̂(X1, ..., Xn) is called the estimate for ϑ based on the sample x1, ..., xn.
Properties:
ϑ̂ is a random variable.
Θ is called the parameter space.
The mean Eϑ(ϑ̂) and variance Varϑ(ϑ̂) of ϑ̂ depend on the true parameter ϑ.
3 Point estimation 3.2 Properties of an estimator
3.2 Properties of an estimator
Definition
An estimator ϑ̂ for an unknown parameter ϑ is
unbiased if E(ϑ̂) = ϑ for all ϑ ∈ Θ;
asymptotically unbiased if lim_{n→∞} E(ϑ̂n) = ϑ;
consistent if ϑ̂n converges in probability to ϑ, that is, for all ε > 0 we have
lim_{n→∞} P(|ϑ̂n − ϑ| < ε) = 1;
efficient if it is unbiased and has the smallest variance of all unbiased estimators.
3 Point estimation 3.2 Properties of an estimator
Problems:
Show that an estimator ϑ̂n with
lim_{n→∞} E(ϑ̂n) = ϑ and lim_{n→∞} Var(ϑ̂n) = 0
is consistent for ϑ.
Consider the following three estimators for E(X):
a) µ̂a = (1/n) Σ_{i=1}^{n} Xi
b) µ̂b = (1/(n−2)) Σ_{i=1}^{n} Xi
c) µ̂c = (1/3) Σ_{i=1}^{3} Xi
Evaluate the properties of the estimators.
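A simulation sketch (not from the slides) that compares the three estimators for normally distributed data with illustrative parameters; the printed bias and variance show how each estimator behaves as n grows:

```python
import numpy as np

rng = np.random.default_rng(4)
mu, sigma, reps = 5.0, 2.0, 20_000           # illustrative population parameters

for n in (10, 100, 1000):
    x = rng.normal(mu, sigma, size=(reps, n))
    mu_a = x.mean(axis=1)                    # (1/n) * sum of all observations
    mu_b = x.sum(axis=1) / (n - 2)           # (1/(n-2)) * sum of all observations
    mu_c = x[:, :3].mean(axis=1)             # (1/3) * (X1 + X2 + X3)
    for name, est in (("a", mu_a), ("b", mu_b), ("c", mu_c)):
        print(n, name, "bias:", round(est.mean() - mu, 4), "var:", round(est.var(), 4))
```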
3 Point estimation 3.2 Unbiased estimators
Unbiased estimators
Consider the sequences X1, ..., Xn and Y1, ..., Yn of i.i.d. random variables with E(X) = µX, Var(X) = σX² and E(Y) = µY, Var(Y) = σY². Unbiased estimators are
X̄ = (1/n) Σ_{i=1}^{n} Xi for the population mean E(X);
S′X² = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)² for the population variance Var(X);¹
SXY = (1/(n−1)) Σ_{i=1}^{n} (Xi − X̄)(Yi − Ȳ) for the covariance between X and Y.

¹ Note that SX² = (1/n) Σ_{i=1}^{n} (Xi − X̄)² is not unbiased but asymptotically unbiased.
4 Confidence intervals
4 Confidence intervals 4.1 Fundamentals 4.2 Confidence intervals for the population mean
4 Confidence intervals 4.1 Fundamentals
4.1 Fundamentals
Starting point:
So far: point estimators ϑ̂(X1, ..., Xn); for continuous random variables P(ϑ̂(X1, ..., Xn) = ϑ) = 0.
Now: derivation of intervals [ϑ̂l, ϑ̂u] such that
P(ϑ ∈ [ϑ̂l, ϑ̂u]) = 1 − α.
The interval [ϑ̂l, ϑ̂u] is called a confidence interval and covers the true parameter ϑ with probability 1 − α, i.e., in (1 − α) · 100 % of repeated samples.
Since we are interested in confidence intervals for the population mean, we have to know the sampling distribution of the estimator.
4 Confidence intervals 4.1 Fundamentals
Sampling distributions

Definition
Consider the sequence X1, ..., Xn of random variables with Xi i.i.d. ∼ N(µ, σ²).
X̄n = (1/n) Σ_{i=1}^{n} Xi is a point estimator for µ with
X̄n ∼ N(µ, σ²/n) and Z = (X̄n − µ)/(σ/√n) ∼ N(0, 1).
Z is called the Gauss statistic.

Definition
Consider the sequence X1, ..., Xn of random variables with Xi i.i.d. ∼ N(0, 1). Then
Y = Σ_{i=1}^{n} Xi² ∼ χ²(n).
We say that Y is χ²-distributed with n degrees of freedom.
4 Confidence intervals 4.1 Fundamentals
Definition
Consider the random variables X ∼ N(0, 1) and Y ∼ χ²(n), with X, Y independent. Then the distribution of the random variable T with
T = X / √(Y/n)
is called the t-distribution with n degrees of freedom.

Definition
Consider the sequence X1, ..., Xn of random variables with Xi i.i.d. ∼ N(µ, σ²). Then
t = (X̄n − µ)/(S′/√n) = (X̄n − µ)/(S/√(n − 1)) ∼ t(n − 1).
The statistic is called the t-statistic.
4 Confidence intervals 4.1 Fundamentals
Definition
Consider the random variables X ∼ χ²(m) and Y ∼ χ²(n), with X, Y independent. Then the distribution of the random variable F with
F = (X/m) / (Y/n)
is called the F-distribution with m and n degrees of freedom.
4 Confidence intervals 4.2 Confidence intervals for the population mean
4.2 Confidence intervals for the population mean
Problem:
Consider the sequence X1, ..., Xn of random variables with Xi i.i.d. ∼ N(µ, σ²). Derive a central (1 − α)-confidence interval for µ
for σ² known;
for σ² unknown.
4 Confidence intervals 4.2 Confidence intervals for the population mean
Problem: Consider the analysis of the gas consumption X (in litres per 100 km) of a new car. Assume that X is normally distributed with E(X) = µ and Var(X) = σ². Based on n = 16 test runs, the following values are observed:
3.3; 4.1; 3.5; 4.0; 4.0; 3.6; 2.9; 3.1; 3.8; 4.1; 3.7; 4.2; 3.9; 3.5; 3.6; 3.0.
Compute a central confidence interval (α = 0.05) for µ, assuming that σ² = 0.15 is known.
Compute a central confidence interval (α = 0.05) for µ, assuming that σ² is unknown.
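A computational sketch of both intervals (not part of the slides), using the normal quantile for the known-variance case and the t(n − 1) quantile with S′ for the unknown-variance case:

```python
import numpy as np
from scipy import stats

x = np.array([3.3, 4.1, 3.5, 4.0, 4.0, 3.6, 2.9, 3.1,
              3.8, 4.1, 3.7, 4.2, 3.9, 3.5, 3.6, 3.0])
n, alpha = len(x), 0.05
xbar = x.mean()

# Case 1: sigma^2 = 0.15 known -> standard normal quantile.
sigma = np.sqrt(0.15)
z = stats.norm.ppf(1 - alpha / 2)
print(xbar - z * sigma / np.sqrt(n), xbar + z * sigma / np.sqrt(n))

# Case 2: sigma^2 unknown -> S' (ddof=1) and the t(n-1) quantile.
s = x.std(ddof=1)
t = stats.t.ppf(1 - alpha / 2, df=n - 1)
print(xbar - t * s / np.sqrt(n), xbar + t * s / np.sqrt(n))
```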
4 Confidence intervals 4.2 Confidence intervals for the population mean
Comments on the t-distribution:
If Xi i.i.d. ∼ N(µ, σ²), then the t-distribution is the finite-sample distribution of the t-statistic. Construction of exact confidence intervals for each sample size is possible.
For n → ∞, the difference between the t-distribution and the N(0, 1) quantiles is negligible. The t-distribution is only relevant when the sample size is small. (However, for the t-distribution to be correct, you must be sure that the population distribution of Xi is normal.)
5 Statistical hypothesis testing
5 Statistical hypothesis testing 5.1 Fundamentals 5.2 Tests for the population mean
5 Statistical hypothesis testing 5.1 Fundamentals
5.1 Fundamentals Starting point: Independently and identically distributed random variables X1 , . . . , Xn ; Sample with sample size n and realized values x1 , . . . , xn ; Approach: Make a decision, based on the sample X1 , ..., Xn , whether a hypothesis concerning the parameter ϑ has to be rejected or cannot be rejected.
Definition
Let ϑ be the unknown parameter, Θ the parameter space, and Θ0, Θ1 a partition of Θ. The test problem can be written as
H0 : ϑ ∈ Θ0   against   H1 : ϑ ∈ Θ1,
where H0 is the null hypothesis and H1 is the alternative hypothesis.
5 Statistical hypothesis testing 5.1 Fundamentals
Central idea of hypothesis testing:
Starting point is the unbiased estimator ϑ̂ = ϑ̂(X1, ..., Xn) with known distribution.
Based on the conditional distribution of ϑ̂ (conditional on H0), we derive a decision rule which evaluates whether the realized sample x1, ..., xn is compatible with H0 or not.
Under validity of H0, we determine the region of acceptance (A) and the region of rejection (R) for the conditional distribution of ϑ̂ such that
P(ϑ̂ ∈ region of rejection | H0 is true) = α.
α is called the level of significance.
Based on ϑ̂ we derive the test statistic V = V(X1, ..., Xn | H0), which is compared with the region of rejection. For v(x1, ..., xn) ∈ R we reject H0; for v(x1, ..., xn) ∈ A we do not reject H0.
5 Statistical hypothesis testing 5.1 Fundamentals
Construction of a test:
1. Formulate the hypotheses H0 and H1;
2. Determine the level of significance α;
3. Choose the test statistic V = V(X1, ..., Xn) with known distribution fV(v|H0);
4. Determine the region of rejection R with P(V ∈ R|H0) = α;
5. Decision rule: Reject H0 ⇔ v(x1, ..., xn) ∈ R.
5 Statistical hypothesis testing 5.1 Fundamentals
Type I and type II errors: Type I error: rejection of H0 even though H0 is true; Type II error: no rejection of H0 even though H0 is false; Probability of a type I error: α = P(V(X1 , ..., Xn ) ∈ R|H0 is true); Probability of a type II error: β = P(V(X1 , ..., Xn ) ∈ A|H0 is false). p-value The p-value is the probability of drawing a statistic that is at least as adverse to the null hypothesis as the value actually computed with the data of the sample, assuming that the null hypothesis is true.
5 Statistical hypothesis testing 5.2 Tests for the population mean
5.2 Tests for the population mean
Let X1, ..., Xn be i.i.d. N(µ, σ²).
Case 1: Test for unknown µ; σ² known. Two-sided test.
1. H0 : µ = µ0 against H1 : µ ≠ µ0;
2. e.g. α = 5 %;
3. If H0 is true, then
V = ((X̄ − µ0)/σ) · √n ∼ N(0, 1);
4. Region of rejection: Reject H0 if |X̄ − µ0| is large,
α = P(|V| > z_{1−α/2});
5. Reject H0 ⇔ |v(x1, ..., xn)| > z_{1−α/2}.
5 Statistical hypothesis testing 5.2 Tests for the population mean
Case 2: Test for unknown µ; σ² unknown. Two-sided test.
1. H0 : µ = µ0 against H1 : µ ≠ µ0;
2. e.g. α = 5 %;
3. If H0 is true, then
V = ((X̄ − µ0)/S′) · √n ∼ t(n − 1);
4. Region of rejection: Reject H0 if |X̄ − µ0| is large,
α = P(|V| > t_{1−α/2}(n − 1));
5. Reject H0 ⇔ |v(x1, ..., xn)| > t_{1−α/2}(n − 1).
5 Statistical hypothesis testing
Problem: A producer of ball bearings knows from long-standing experience that the size X of the produced balls is normally distributed. The producer guarantees a size of µ0 = 2 cm and a standard deviation of σ0 = 0.16 cm. A random sample of n = 25 balls has yielded x̄ = 1.89 cm and S = 0.2 cm. Check whether the producer's claim concerning the size of the balls can be maintained at the five percent level.
Assume that the producer's information on σ is correct.
Assume that σ is not known.
Compute the p-values of both test statistics.
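A computational sketch of both tests and their p-values (not from the slides); it treats the reported S = 0.2 cm as the estimate entering the t-statistic with n − 1 degrees of freedom:

```python
import numpy as np
from scipy import stats

n, mu0, xbar = 25, 2.0, 1.89

# Case 1: sigma = 0.16 known -> Gauss test.
sigma = 0.16
v1 = (xbar - mu0) / sigma * np.sqrt(n)
p1 = 2 * (1 - stats.norm.cdf(abs(v1)))
print(v1, p1)

# Case 2: sigma unknown -> t-test with the reported S = 0.2.
s = 0.2
v2 = (xbar - mu0) / s * np.sqrt(n)
p2 = 2 * (1 - stats.t.cdf(abs(v2), df=n - 1))
print(v2, p2)
```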
5 Statistical hypothesis testing 5.2 Tests for the population mean
Difference-in-mean test
Let X1, ..., Xn be i.i.d. N(µX, σX²) and Y1, ..., Ym be i.i.d. N(µY, σY²), X, Y independent, with
X̄ − Ȳ ∼ N(µX − µY, σX²/n + σY²/m).
Case 1: σX², σY² known.
1. H0 : µX = µY against H1 : µX ≠ µY;
2. e.g. α = 5 %;
3. If H0 is true, then
V = (X̄ − Ȳ) / √(σX²/n + σY²/m) ∼ N(0, 1);
4. Region of rejection: α = P(|V| > z_{1−α/2});
5. Reject H0 ⇔ |v(x1, ..., xn, y1, ..., ym)| > z_{1−α/2}.
5 Statistical hypothesis testing 5.2 Tests for the population mean
Case 2: σX², σY² unknown.
1. H0 : µX = µY against H1 : µX ≠ µY;
2. e.g. α = 5 %;
3. If H0 is true and n, m ≥ 40, then due to the central limit theorem
V = (X̄ − Ȳ) / √(S′X²/n + S′Y²/m) ∼ N(0, 1) approximately;
4. Region of rejection: α = P(|V| > z_{1−α/2});
5. Reject H0 ⇔ |v(x1, ..., xn, y1, ..., ym)| > z_{1−α/2}.
5 Statistical hypothesis testing 5.2 Tests for the population mean
Problem: Consider the California test score data set of Stock and Watson (2007). Based on data for 420 districts, we analyse whether the class size has a significant effect on the test score of a student. We have the variables
district average of the test score: TS
student-teacher ratio: STR
Compare districts with "small" (STR < 20) and "large" (STR ≥ 20) class sizes:

Class size i | mean TS̄i | std. dev. S_TS,i | ni
small        | 657.4     | 19.4             | 238
large        | 650.0     | 17.9             | 182

Check whether there is a significant difference between TS̄small and TS̄large for α = 0.05. Compute the p-value.
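A sketch of the large-sample difference-in-mean test based on the summary statistics in the table (not part of the slides); the reported standard deviations are treated as S′ values:

```python
import numpy as np
from scipy import stats

# Summary statistics: small classes vs. large classes.
ts_small, s_small, n_small = 657.4, 19.4, 238
ts_large, s_large, n_large = 650.0, 17.9, 182

se = np.sqrt(s_small**2 / n_small + s_large**2 / n_large)
v = (ts_small - ts_large) / se
p_value = 2 * (1 - stats.norm.cdf(abs(v)))

alpha = 0.05
print(v, p_value, abs(v) > stats.norm.ppf(1 - alpha / 2))
```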
6 Fundamentals of matrix algebra
6 Fundamentals of matrix algebra 6.1 Basic principles 6.2 Multivariate statistics
6 Fundamentals of matrix algebra 6.1 Basic principles
Basic principles
A matrix A is an m × n rectangular array of numbers, written as
A =
[ a1,1  a1,2  ···  a1,n ]
[ a2,1  a2,2  ···  a2,n ]
[  ...   ...  ...   ... ]
[ am,1  am,2  ···  am,n ]
= (ai,j)i=1,...,m, j=1,...,n.
The transpose of a matrix, denoted A′, is obtained by flipping the matrix on its diagonal. Thus A′ = (aj,i)j=1,...,n, i=1,...,m.
Example:
A =
[ 1   2  3 ]
[ 0  −6  7 ]
A′ =
[ 1   0 ]
[ 2  −6 ]
[ 3   7 ]
6 Fundamentals of matrix algebra 6.1 Basic principles
Special matrices
A matrix A is
square if m = n;
symmetric if A = A′, which requires ai,j = aj,i;
diagonal if the off-diagonal elements are all zero, so that ai,j = 0 if i ≠ j;
upper (lower) triangular if all elements below (above) the diagonal equal zero.
An important diagonal matrix is the identity matrix, which has ones on the diagonal. The k × k identity matrix is denoted as
Ik =
[ 1  0  ···  0 ]
[ 0  1  ···  0 ]
[ ...      ... ]
[ 0  0  ···  1 ]
6 Fundamentals of matrix algebra 6.1 Basic principles
Basic operations
Matrix addition
A = (ai,j)i=1,...,m, j=1,...,n, B = (bi,j)i=1,...,m, j=1,...,n.
C = A + B = (ci,j)i=1,...,m, j=1,...,n = (ai,j + bi,j)i=1,...,m, j=1,...,n.
Example:
[ 1  3  1 ]   [ 0  0  5 ]   [ 1+0  3+0  1+5 ]   [ 1  3  6 ]
[ 1  0  0 ] + [ 7  5  0 ] = [ 1+7  0+5  0+0 ] = [ 8  5  0 ]

Scalar multiplication
A = (ai,j)i=1,...,m, j=1,...,n, λ ∈ R.
λ · A = (λ · ai,j)i=1,...,m, j=1,...,n.
Example:
4 · [ 1   2  3 ] = [ 4    8  12 ]
    [ 0  −6  7 ]   [ 0  −24  28 ]
6 Fundamentals of matrix algebra 6.1 Basic principles
Matrix multiplication
A = (ai,j)i=1,...,l, j=1,...,m, B = (bi,j)i=1,...,m, j=1,...,n.
C = A · B = (ci,j)i=1,...,l, j=1,...,n with ci,j = Σ_{k=1}^{m} ai,k · bk,j.
Example:
[  1  0  2 ]   [ 3  1 ]   [ 5  1 ]
[ −1  3  1 ] · [ 2  1 ] = [ 4  2 ]
               [ 1  0 ]
with
5 = 1 · 3 + 0 · 2 + 2 · 1;  1 = 1 · 1 + 0 · 1 + 2 · 0;
4 = −1 · 3 + 3 · 2 + 1 · 1;  2 = −1 · 1 + 3 · 1 + 1 · 0.
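The matrix examples above can be reproduced with numpy (a cross-checking sketch, not part of the slides):

```python
import numpy as np

A = np.array([[1, 2, 3], [0, -6, 7]])
print(A.T)                                   # transpose

B = np.array([[1, 3, 1], [1, 0, 0]])
C = np.array([[0, 0, 5], [7, 5, 0]])
print(B + C)                                 # matrix addition -> [[1, 3, 6], [8, 5, 0]]
print(4 * A)                                 # scalar multiplication

D = np.array([[1, 0, 2], [-1, 3, 1]])
E = np.array([[3, 1], [2, 1], [1, 0]])
print(D @ E)                                 # matrix product -> [[5, 1], [4, 2]]
```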
6 Fundamentals of matrix algebra 6.1 Basic principles
Properties
Matrix addition
i) A + B = B + A
ii) (A + B) + C = A + (B + C)

Matrix multiplication
i) (A · B) · C = A · (B · C)
ii) in general A · B ≠ B · A. Example:
[ 1  2 ]   [ 0  1 ]   [ 0  1 ]         [ 0  1 ]   [ 1  2 ]   [ 3  4 ]
[ 3  4 ] · [ 0  0 ] = [ 0  3 ] ,  but  [ 0  0 ] · [ 3  4 ] = [ 0  0 ] .
iii) A · (B + C) = A · B + A · C and (B + C) · A = B · A + C · A
iv) Multiplication with the identity matrix: for any m × n matrix M, M · In = Im · M = M.
v) A matrix A is called idempotent if A · A = A.
vi) Transpose of a product: (A · B · C)′ = C′ · B′ · A′
6 Fundamentals of matrix algebra 6.1 Basic principles
Quadratic form
Consider a symmetric matrix A ∈ R^(n×n) and a vector x ∈ R^(n×1). The expression
x′Ax = Σ_{i=1}^{n} Σ_{j=1}^{n} xi ai,j xj
is called a quadratic form.
Implications:
A is positive definite if x′Ax > 0 for all x ≠ 0.
A is negative definite if x′Ax < 0 for all x ≠ 0.
Problem: Show that the matrix
A =
[  2  −1   0 ]
[ −1   2  −1 ]
[  0  −1   2 ]
is positive definite.
6 Fundamentals of matrix algebra 6.1 Basic principles
Rank and inverse of a matrix
The rank of the m × n matrix (n ≤ m)
A = (a1, ..., an)
is the number of linearly independent columns aj and is written as rank(A). A has full rank if rank(A) = n.
Properties:
A square k × k matrix A is said to be nonsingular if it has full rank, i.e. rank(A) = k. This means that there is no k × 1 vector c ≠ 0 such that Ac = 0.
If a square k × k matrix A is nonsingular, then there exists a unique k × k matrix A⁻¹, called the inverse of A, which satisfies AA⁻¹ = A⁻¹A = Ik.
If A is positive or negative definite, then A is nonsingular.
Problem: Compute the inverse of the matrix
A =
[ 8  2 ]
[ 2  1 ]
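A numerical sketch for the two problems above (not from the slides): positive definiteness of a symmetric matrix can be checked via its eigenvalues, and the inverse via numpy.linalg.inv:

```python
import numpy as np

# Positive definiteness: a symmetric matrix is positive definite
# iff all its eigenvalues are strictly positive.
A = np.array([[2, -1, 0], [-1, 2, -1], [0, -1, 2]])
print(np.linalg.eigvalsh(A))                 # all > 0

# Inverse of the 2x2 matrix from the problem.
B = np.array([[8, 2], [2, 1]])
B_inv = np.linalg.inv(B)
print(B_inv)
print(B @ B_inv)                             # ~ identity matrix I_2
```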
6 Fundamentals of matrix algebra 6.1 Basic principles
Trace of a matrix
The trace of a k × k square matrix A is defined to be the sum of the elements on the main diagonal, i.e.,
tr(A) = Σ_{i=1}^{k} ai,i.
Properties for square matrices A and B and real λ:
tr(λA) = λ tr(A);
tr(A′) = tr(A);
tr(A + B) = tr(A) + tr(B);
tr(Ik) = k;
If A is an m × n matrix and B is an n × m matrix, then tr(AB) = tr(BA).
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
6.2 Multivariate statistics Mean vector
Definition
Consider an n-dimensional random vector x′ = (x1, ..., xn). The vector µ′ = (µ1, ..., µn) with
µi = E(xi) = ∫_R xi fxi(xi) dxi
is called the mean vector of x.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Covariance matrix
Definition
Consider an n-dimensional random vector x′ = (x1, ..., xn). The covariance matrix is given by
Σ = cov(x) = E((x − µ)(x − µ)′)
=
[ E[(x1 − µ1)²]          E[(x1 − µ1)(x2 − µ2)]  ···  E[(x1 − µ1)(xn − µn)] ]
[ E[(x2 − µ2)(x1 − µ1)]  E[(x2 − µ2)²]          ···  E[(x2 − µ2)(xn − µn)] ]
[ ...                    ...                    ...  ...                   ]
[ E[(xn − µn)(x1 − µ1)]  E[(xn − µn)(x2 − µ2)]  ···  E[(xn − µn)²]         ]
Properties of Σ:
Σ = E(xx′) − µµ′;
The diagonal entries are the variances E((xi − µi)²) =: σi², i = 1, ..., n;
The off-diagonal entries are the covariances E((xi − µi)(xj − µj)) =: σij, i, j = 1, ..., n; i ≠ j.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Covariance matrix of two random vectors X and Y
Definition
Consider an n-dimensional random vector x′ = (x1, ..., xn) with E(x) = µ and an m-dimensional random vector y′ = (y1, ..., ym) with E(y) = ν. The covariance between x and y is defined as
Σxy = cov(x, y) = E((x − µ)(y − ν)′)
=
[ E[(x1 − µ1)(y1 − ν1)]  ···  E[(x1 − µ1)(ym − νm)] ]
[ ...                    ...  ...                   ]
[ E[(xn − µn)(y1 − ν1)]  ···  E[(xn − µn)(ym − νm)] ]
Properties:
cov(x, y) contains the covariances between components of x and y;
cov(x) contains the covariances between components of x;
Defining the vector z′ := (x′ y′) yields all covariances, that is,
cov(z) =
[ cov(x)     cov(x, y) ]   [ Σx    Σxy ]
[ cov(y, x)  cov(y)    ] = [ Σyx   Σy  ]
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Mean vector and covariance matrix of a linear combination
Consider an n-dimensional random vector x′ = (x1, ..., xn) with mean vector µ, covariance matrix Σ, and a vector of constants a′ = (a1, ..., an). z = a′x is a random scalar with
z = a′x = a1x1 + a2x2 + ... + anxn.
Hence the mean of z = a′x is given by
E(z) = a1E(x1) + a2E(x2) + ... + anE(xn) = a′E(x) = a′µ.
The variance is
Var(z) = E((z − E(z))²) = E((a′x − a′µ)²) = E((a′(x − µ))²).
Since a′(x − µ) is a scalar, it is identical to (x − µ)′a. Hence,
Var(z) = E(a′(x − µ)(x − µ)′a) = a′E((x − µ)(x − µ)′)a = a′Σa.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Example: Consider the 2-dimensional random vector x′ = (x1, x2) with mean vector µ, covariance matrix Σ, and a vector of constants a′ = (a1, a2). Mean and variance of the linear combination are given by
µz = a1µ1 + a2µ2
and
Var(z) = a′Σa = (a1  a2) ·
[ σ1²   σ12 ]
[ σ21   σ2² ]
· (a1  a2)′ = a1²σ1² + a2²σ2² + 2a1a2σ12.
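A simulation sketch (not from the slides) verifying E(a′x) = a′µ and Var(a′x) = a′Σa for an illustrative choice of µ, Σ and a:

```python
import numpy as np

rng = np.random.default_rng(5)

mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])       # illustrative covariance matrix
a = np.array([0.5, -1.5])                        # illustrative weights

x = rng.multivariate_normal(mu, Sigma, size=500_000)
z = x @ a

print(z.mean(), a @ mu)                          # E(a'x) = a'mu
print(z.var(ddof=1), a @ Sigma @ a)              # Var(a'x) = a'Sigma a
```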
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Mean vector and covariance matrix of a linear combination
Consider an n-dimensional random vector x′ = (x1, ..., xn) with mean vector µ, covariance matrix Σ, and an n × p matrix of constants A. z = A′x is a random vector.
The mean vector and the covariance matrix are given by
E(A′x) = A′µ,  Var(A′x) = A′ΣA.
Problem: Name the dimensions of A′µ and A′ΣA.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Correlation matrix
Definition
Consider an n-dimensional random vector x′ = (x1, ..., xn) with mean vector µ and covariance matrix Σ. The correlation matrix is given by
ρ =
[ 1     ρ12   ···  ρ1n ]
[ ρ21   1     ···  ρ2n ]
[ ...   ...   ...  ... ]
[ ρn1   ρn2   ···  1   ]
Properties of ρ:
ρij := σij / (σi σj) ∈ [−1, 1], i, j = 1, ..., n.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Relationship between Σ and ρ
Consider an n-dimensional random vector x′ = (x1, ..., xn) with mean vector µ, covariance matrix Σ and correlation matrix ρ. We define the matrices
D =
[ σ1   0    ···  0  ]
[ 0    σ2   ···  0  ]
[ ...  ...  ...  ...]
[ 0    0    ···  σn ]
and
D⁻¹ =
[ 1/σ1  0     ···  0    ]
[ 0     1/σ2  ···  0    ]
[ ...   ...   ...  ...  ]
[ 0     0     ···  1/σn ]
where σi is the standard deviation of the random variable xi.
The relationship between Σ and ρ is given by
Σ = DρD;  ρ = D⁻¹ΣD⁻¹.
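A small numpy sketch of the relation ρ = D⁻¹ΣD⁻¹ for an illustrative covariance matrix (not part of the slides):

```python
import numpy as np

Sigma = np.array([[4.0, 1.2, 0.0],
                  [1.2, 1.0, -0.3],
                  [0.0, -0.3, 2.25]])            # illustrative covariance matrix

D = np.diag(np.sqrt(np.diag(Sigma)))             # diag(sigma_1, ..., sigma_n)
D_inv = np.linalg.inv(D)

rho = D_inv @ Sigma @ D_inv                      # correlation matrix
print(rho)
print(D @ rho @ D)                               # recovers Sigma
```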
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Rank of Σ and ρ
Var(a′x) ≥ 0 for all a;
Since Var(a′x) = a′Σa, Σ is positive semi-definite;
Since D is nonsingular and ρ = D⁻¹ΣD⁻¹, ρ is positive semi-definite and rank(ρ) = rank(Σ) ≤ n;
For rank(Σ) = n, Σ is positive definite, since a′Σa > 0 for all a ≠ 0;
For rank(Σ) < n, there exists an a ≠ 0 such that a′x is a constant and hence a′Σa = 0. This indicates that Σ is positive semi-definite but not positive definite.
If rank(Σ) < n, at least one of the components of x is a linear combination of the other components. This means that the information of this variable is redundant, since it is already provided by the other variables.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Further properties E(x + y) = E(x) + E(y); E(Ax + b) = AE(x) + b; cov(Ax + b) = Acov(x)A′ ; cov(Ax + a, By + b) = Acov(x, y)B′ .
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Fundamentals of the multivariate normal distribution
Definition
A continuous random vector x′ = (x1, ..., xn) with density
f(x1,...,xn)(x1, ..., xn) = (1 / ((2π)^(n/2) √det(Σ))) · exp(−(1/2) (x − µ)′ Σ⁻¹ (x − µ)),
where Σ is a positive definite n × n matrix, Σ⁻¹ is the inverse of Σ and det(Σ) is the determinant of Σ, is called n-variate normal.
Mean vector and covariance matrix of the normal distribution are given by µ and Σ. It is conventional to write x ∼ Nn(µ, Σ).
The transformation y = Σ^(−1/2)(x − µ) is called the standardization of x. We write y ∼ Nn(0, In).
If x ∼ Nn(µ, Σ) and y = Ax + b, where A ∈ R^(m×n) and b ∈ R^(m×1), then y ∼ Nm(Aµ + b, AΣA′).
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Marginal distributions and conditional density
Consider the random vector z′ = (y′, x′) with x′ = (x1, ..., xn) and y′ = (y1, ..., ym), where
z = (y′, x′)′ ∼ Nm+n(µ, Σ) with µ = (µy′, µx′)′ and Σ =
[ Σy    Σyx ]
[ Σxy   Σx  ].
If Σx and Σy are positive definite, then
x ∼ Nn(µx, Σx), y ∼ Nm(µy, Σy),
that means, the marginal distributions are normal.
The conditional distribution of y|x is given by
y|x ∼ Nm(µy|x, Σy|x),
where
µy|x = µy + Σyx Σx⁻¹ (x − µx) = b0 + Bx;
Σy|x = Σy − Σyx Σx⁻¹ Σxy
with
b0 = µy − Bµx and B = Σyx Σx⁻¹.
6 Fundamentals of matrix algebra 6.2 Multivariate statistics
Example
Consider the random vector z′ = (y, x) where
z = (y, x)′ ∼ N2(µ, Σ) with µ = (µy, µx)′ and Σ =
[ Σy    Σyx ]
[ Σxy   Σx  ].
In this case, we have
Σy = σy², Σyx = σyx, Σxy = σxy, Σx = σx².
Hence, the distribution of y|x is given by
y|x ∼ N(µy|x, σ²y|x),
where
µy|x = µy − (σyx/σx²)·µx + (σyx/σx²)·x = β0 + β1 x,
with β0 = µy − (σyx/σx²)·µx and β1 = σyx/σx².
Remember: β0 and β1 are the coefficients of a linear regression of y on x.
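A simulation sketch (not from the slides): drawing (y, x) from a bivariate normal with illustrative parameters and fitting a least-squares line of y on x recovers β0 and β1 approximately:

```python
import numpy as np

rng = np.random.default_rng(6)

mu_y, mu_x = 2.0, 1.0
s_y2, s_x2, s_yx = 1.5, 2.0, 0.8                 # illustrative variances and covariance

mean = [mu_y, mu_x]
cov = [[s_y2, s_yx], [s_yx, s_x2]]
y, x = rng.multivariate_normal(mean, cov, size=500_000).T

beta1 = s_yx / s_x2
beta0 = mu_y - beta1 * mu_x
print(beta0, beta1)                              # theoretical coefficients

# Least-squares fit of y on x (slope and intercept).
b1_hat, b0_hat = np.polyfit(x, y, deg=1)
print(b0_hat, b1_hat)                            # close to beta0, beta1
```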