Pattern Recognition Prof. Christian Bauckhage
outline lecture 05 recap discrete Markov chains basic probability theory summary exercises
recap
basic terms and concepts from linear algebra
vector space, inner product space, normed space
linear combinations (convex, conic, affine, linear)
span, linear independence, bases
Lp norms / distances for R^m
standard simplex in R^m
discrete Markov chains
stochastic vector

q ∈ R^m is a stochastic vector if q ≥ 0 and ‖q‖₁ = 1
⇔ q ∈ R^m is stochastic if q ∈ ∆^(m−1)

[figure: a stochastic vector q in the standard simplex with corners e1, e2, e3]
stochastic matrix

P ∈ R^(m×n) is column (row) stochastic if each of its columns (rows) is a stochastic vector
P ∈ R^(m×m) is bi-stochastic if it is column- and row-stochastic

alternative terminology
column stochastic ⇔ left stochastic
row stochastic ⇔ right stochastic
bi-stochastic ⇔ doubly stochastic
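These definitions translate directly into code; the following is a minimal sketch using NumPy (the helper names are ours, not part of the lecture):

```python
import numpy as np

def is_stochastic_vector(q, tol=1e-12):
    """True if q is non-negative and its entries sum to 1 (unit L1 norm)."""
    q = np.asarray(q, dtype=float)
    return bool(np.all(q >= 0) and abs(q.sum() - 1.0) < tol)

def is_column_stochastic(P, tol=1e-12):
    """True if every column of P is a stochastic vector."""
    P = np.asarray(P, dtype=float)
    return all(is_stochastic_vector(P[:, j], tol) for j in range(P.shape[1]))

q = np.array([0.2, 0.5, 0.3])
P = np.array([[0.9, 0.1],
              [0.1, 0.9]])
print(is_stochastic_vector(q))   # → True
print(is_column_stochastic(P))   # → True
```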
note

the literature typically considers row stochastic vectors and row stochastic matrices
in this course, we will consider column stochastic vectors and column stochastic matrices
conceptually, this is no big deal . . . live with it
⇒ from now on: stochastic matrix ⇔ column stochastic matrix
Lemma If P ∈ R^(m×n) is a stochastic matrix and q ∈ R^n is a stochastic vector, then r ∈ R^m where r = P q is a stochastic vector.

Proof. Since P ≥ 0 and q ≥ 0, we have r ≥ 0; moreover,

‖r‖₁ = ∑_i r_i = ∑_i ∑_j P_ij q_j = ∑_j q_j ∑_i P_ij = ∑_j q_j = 1
Lemma If P ∈ R^(m×k) and Q ∈ R^(k×n) are stochastic matrices, then R ∈ R^(m×n) where R = P Q is a stochastic matrix.

Proof. Since Q = [q1 q2 . . . qn] and R = [r1 r2 . . . rn], we note that R = P Q ⇔ r_i = P q_i and resort to the previous Lemma.
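Both Lemmas are easy to check numerically; a quick sketch with random column stochastic matrices (the construction by column normalization is ours, chosen for convenience):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_column_stochastic(m, n):
    """Draw a random m x n column stochastic matrix by normalizing columns."""
    P = rng.random((m, n))
    return P / P.sum(axis=0)

P = random_column_stochastic(4, 3)
Q = random_column_stochastic(3, 5)
R = P @ Q

# every column of R = PQ is again non-negative and sums to 1
print(np.allclose(R.sum(axis=0), 1.0))  # → True
print(bool(np.all(R >= 0)))             # → True
```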
note
stochastic matrices and vectors play a crucial role in Markov process models
Markov chains

used to model systems that have m possible states and, at any one time, are in one and only one of their m states
the set Q = {q1, . . . , qm} of states is called the state space
state transitions happen according to certain probabilities

for instance, Markov model of an SIR epidemic

[state diagram: S stays susceptible with probability 1−i or becomes infected with probability i; I stays infected with probability 1−r or recovers with probability r; R stays recovered with probability 1]
types of Markov chains

discrete-time Markov chain ⇔ a stochastic model that has the Markov property

p(X_{t+1} = q_{i_{t+1}} | X_t = q_{i_t}, . . . , X_1 = q_{i_1}) = p(X_{t+1} = q_{i_{t+1}} | X_t = q_{i_t})

homogeneous discrete-time Markov chain ⇔ a discrete-time Markov chain such that

p(X_{t+1} = q_i | X_t = q_j) = p(X_t = q_i | X_{t−1} = q_j) = p(q_i | q_j) = p_ij
Markov processes

the dynamics of a homogeneous DTMC are governed by

q_t = P q_{t−1}

where
P ⇔ transition matrix such that p_ij ⇔ p(i ← j)
q ⇔ state vector such that q_i ⇔ p(i)
example Markov model of an SIR epidemic

[state diagram: S → I with probability i, I → R with probability r; S, I, and R are retained with probability 1−i, 1−r, and 1, respectively]

[St]   [1−i   0   0] [St−1]
[It] = [ i   1−r  0] [It−1]
[Rt]   [ 0    r   1] [Rt−1]
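The update q_t = P q_{t−1} can be iterated directly; a minimal simulation sketch of the SIR chain (the choice i = 1/4, r = 1/2 matches one of the plotted examples, the starting state is our assumption):

```python
import numpy as np

# transition matrix of the SIR chain from the slide, for i = 1/4, r = 1/2
i, r = 0.25, 0.5
P = np.array([[1 - i, 0.0, 0.0],
              [i, 1 - r, 0.0],
              [0.0, r, 1.0]])

q = np.array([1.0, 0.0, 0.0])  # assumed start: everyone susceptible
trajectory = [q]
for _ in range(20):
    q = P @ q                  # q_t = P q_{t-1}
    trajectory.append(q)

# the state vector stays stochastic; mass drains from S towards R
print(np.isclose(q.sum(), 1.0))  # → True
```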
example Markovian SIR dynamics

[plots: evolution of St, It, Rt over time t, for i = 3/4, r = 1/2 (left) and i = 1/4, r = 1/2 (right)]
example Markovian SIR dynamics

[plots: St, It, Rt over time t, and the corresponding trajectory in the simplex with corners S, I, R]
basic probability theory
probability

degree of belief in the truth of various propositions

examples of propositions
A = it will rain this afternoon
B = this is a fair coin
C = this coin will come up heads twice as likely as tails
D = this image shows a face
Yi = party i will win the upcoming election
3 requirements for consistent reasoning
1) transitivity 2) closure 3) conditional probability
transitivity

if we believe X more than Y and Y more than Z, then we must believe X more than Z
⇒ implies an ordering
⇒ assign real numbers to beliefs
⇔ the larger the value associated with a proposition, the more we believe it

prob(false) = 0 ⇔ complete disbelief
prob(true) = 1 ⇔ complete certainty
closure

if we specify how much we believe that X is true, we implicitly specify our disbelief
⇒ sum rule

prob(X) + prob(¬X) = 1
conditional probability

if we first state how much we believe that Y is true, and then state how much we believe that X is true given that Y is true, we implicitly specify how much we believe that both X and Y are true
⇒ product rule

prob(X, Y) = prob(X | Y) prob(Y)
note

the sum and product rules define the algebra of probability; all further results can be derived from them
Bayes’ theorem

prob(X | Y) = prob(Y | X) prob(X) / prob(Y)

this is because
prob(X, Y) = prob(Y, X) ⇔ prob(X | Y) prob(Y) = prob(Y | X) prob(X)
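Bayes’ theorem is easy to exercise on a toy diagnostic example; the prior, sensitivity, and false positive rate below are made-up illustration values, not from the lecture:

```python
# made-up numbers: prior p(D) = 0.01, sensitivity p(+|D) = 0.9,
# false positive rate p(+|not D) = 0.05
p_D = 0.01
p_pos_given_D = 0.9
p_pos_given_notD = 0.05

# marginalization gives the evidence p(+)
p_pos = p_pos_given_D * p_D + p_pos_given_notD * (1 - p_D)

# Bayes' theorem: p(D|+) = p(+|D) p(D) / p(+)
p_D_given_pos = p_pos_given_D * p_D / p_pos
print(round(p_D_given_pos, 3))  # → 0.154
```

Despite the accurate test, the posterior stays small because the prior is small, which is exactly what the theorem quantifies.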
marginalization

prob(X, Y) + prob(X, ¬Y) = [prob(Y | X) + prob(¬Y | X)] prob(X) = prob(X)
marginalization

let {Yi}_{i=1}^n be a set of mutually exclusive and exhaustive propositions, then

∑_{i=1}^n prob(Yi | X) = 1

and

∑_{i=1}^n prob(X, Yi) = prob(X)
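Both identities can be checked on an arbitrary discrete joint distribution; a short sketch (the random joint table is our stand-in for prob(X, Yi)):

```python
import numpy as np

rng = np.random.default_rng(1)

# a joint distribution p(X, Y_i) over X in {0,1} and mutually exclusive Y_1..Y_4
joint = rng.random((2, 4))
joint /= joint.sum()

p_X = joint.sum(axis=1)               # sum_i p(X, Y_i) = p(X)
p_Y_given_X = joint / p_X[:, None]    # p(Y_i | X) = p(X, Y_i) / p(X)

print(np.allclose(p_Y_given_X.sum(axis=1), 1.0))  # sum_i p(Y_i | X) = 1 → True
print(np.isclose(p_X.sum(), 1.0))                 # → True
```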
towards the continuum

if there are infinitely many mutually exclusive possibilities (e.g. Y = height of a person), then

∫_{−∞}^{∞} prob(Y | X) dY = 1

and

∫_{−∞}^{∞} prob(X, Y) dY = prob(X)
note

prob(X, Y) is technically a probability density function

prob(X, y1 ⩽ Y ⩽ y2) = ∫_{y1}^{y2} pdf(X, Y) dY
note

to get probabilities out of densities, we have to integrate
we will henceforth drop this distinction and simply write p(X, Y) to indicate either prob(X, Y) or pdf(X, Y)
independence

if X and Y are independent, then

p(X, Y) = p(X) p(Y)

because
p(X, Y) = p(X | Y) p(Y) and p(X | Y) = p(X)
random variable

a variable X whose value is subject to chance
it can assume different values, each according to an associated probability
to express that X ∈ R is distributed according to p(x), we write

X ∼ p(x) or pX(x)
expectation

the average value of some function f(x) under a distribution p(x) is called the expectation of f(x)
we have

E[f(x)] = ∑_x f(x) p(x)

or

E[f(x)] = ∫ f(x) p(x) dx
special case

expectation of a random variable X

E[X] = ∫ x p(x) dx ≡ µ
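For a discrete distribution the expectation is just a weighted sum, and a sample mean converges to it; a short sketch (the values and probabilities are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# discrete expectation: E[X] = sum_x x p(x)
x = np.array([1, 2, 3, 4])
p = np.array([0.1, 0.2, 0.3, 0.4])
mu = np.sum(x * p)               # = 3.0 for this distribution

# Monte Carlo check: the sample mean converges to the expectation
samples = rng.choice(x, size=100_000, p=p)
print(abs(samples.mean() - mu) < 0.05)  # → True
```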
examples

[plots of p(x) with E[x] marked: a symmetric unimodal distribution, a skewed unimodal distribution, and two multimodal distributions]
special cases

averaging a function of several variables

E[f(x, y)] = ∫ f(x, y) p(x, y) dx dy

averaging a function of several variables over one variable

E_x[f(x, y)] = ∫ f(x, y) p(x) dx

conditional expectation

E_x[f | y] = ∫ f(x) p(x | y) dx
note

E[E[f(x)]] = E[f(x)]

because

E[E[f(x)]] = ∫ (∫ f(x) p(x) dx) p(z) dz
           = ∫ p(z) dz ∫ f(x) p(x) dx
           = ∫ f(x) p(x) dx
           = E[f(x)]
variance

the variability of f(x) around the mean E[f(x)] is called the variance of f(x)
we have

var[f(x)] = E[(f(x) − E[f(x)])²]

and note that

var[f] = E[f² − 2 f E[f] + E²[f]]
       = E[f²] − 2 E[f] E[f] + E²[f]
       = E[f²] − E²[f]
special case

variance of a random variable X

var[X] = E[(X − E[X])²] = E[(X − µ)²] ≡ σ²

in integral form

var[X] = ∫ (x − µ)² p(x) dx
       = ∫ x² p(x) dx − 2µ ∫ x p(x) dx + µ² ∫ p(x) dx
       = ∫ x² p(x) dx − µ²
       ≡ σ²
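The identity var[X] = E[X²] − E[X]² can be verified numerically; a small sketch on a made-up discrete distribution:

```python
import numpy as np

# verify var[X] = E[(X - mu)^2] = E[X^2] - E[X]^2 on a toy distribution
x = np.array([0.0, 1.0, 2.0])
p = np.array([0.5, 0.25, 0.25])

mu = np.sum(x * p)                     # E[X]
var_def = np.sum((x - mu) ** 2 * p)    # definition: E[(X - mu)^2]
var_id = np.sum(x ** 2 * p) - mu ** 2  # identity:   E[X^2] - E[X]^2

print(np.isclose(var_def, var_id))  # → True
```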
note

once again
variance = expected squared deviation from the expected value
covariance

for two random variables X and Y, we have

cov[X, Y] = E_XY[(X − E[X]) (Y − E[Y])]
          = E_XY[X Y] − E[X] E[Y]
covariance matrix

for two random vectors x and y, we have

cov[x, y] = E_xy[(x − E[x]) (y − E[y])ᵀ]
          = E_xy[x yᵀ] − E[x] E[y]ᵀ ≡ C
covariance matrix

in particular, we have

cov[x, x] = E[x xᵀ] − µ µᵀ

where µ = E[x]
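The identity cov[x, x] = E[x xᵀ] − µ µᵀ also holds for sample estimates; a sketch drawing from an assumed Gaussian (the mean and covariance parameters are illustration values):

```python
import numpy as np

rng = np.random.default_rng(7)

# draw samples of a random vector x and estimate C = E[x x^T] - mu mu^T
X = rng.multivariate_normal(mean=[1.0, -2.0],
                            cov=[[2.0, 0.5], [0.5, 1.0]],
                            size=200_000)

mu = X.mean(axis=0)
C = (X.T @ X) / len(X) - np.outer(mu, mu)  # E[x x^T] - mu mu^T

# agrees with the centered estimate (1/N) sum (x - mu)(x - mu)^T
print(np.allclose(C, np.cov(X.T, bias=True), atol=1e-8))  # → True
```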
summary

we now know about

discrete Markov chains
basic terminology and concepts of probability theory
exercises

show that all of the following are indeed identical
p(X, Y, Z) = p(X | Y, Z) p(Y, Z)
           = p(X | Y, Z) p(Y | Z) p(Z)
           = p(Y | X, Z) p(Z | X) p(X)
           = p(Y, Z | X) p(X)
           ...
exercises

show that, for a constant c and a random variable X
a) E[c + X] = c + E[X]
b) E[cX] = c E[X]
c) var[c + X] = var[X]
d) var[cX] = c² var[X]

show that, for two random variables X and Y
a) E[X + Y] = E[X] + E[Y]

show that, for two independent random variables X and Y
a) E[XY] = E[X] E[Y]